Step 3.5 Flash: Fast Enough to Think. Reliable Enough to Act
Summary
Step 3.5 Flash is an open-source sparse MoE model (196B total / 11B active parameters) that averages 81.0 across eight major benchmarks, excelling at reasoning, coding, agentic tasks, tool use, and deep research, with support for edge-cloud collaboration and local deployment.
Step 3.5 Flash delivers frontier reasoning
Step 3.5 Flash launched today as a new open-source foundation model designed for high-efficiency reasoning and agentic tasks. The model utilizes a sparse Mixture of Experts (MoE) architecture to balance high-end performance with low-latency execution. It achieves a mean score of 81.0 across eight major benchmarks, placing it in direct competition with top-tier proprietary models.
The system architecture features 196 billion total parameters, but it only activates 11 billion parameters per token. This design allows the model to maintain the reasoning depth of much larger systems while remaining fast enough for real-time interactions. Developers can access the model now to build autonomous agents that require both speed and complex logic.
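For readers unfamiliar with sparse activation, the sketch below shows how top-k expert routing keeps per-token compute small even when total parameters are large. The expert count, hidden size, and top-k value are illustrative assumptions, not Step 3.5 Flash's actual configuration.

```python
# Minimal sketch of sparse top-k MoE routing (illustrative sizes, not the
# actual Step 3.5 Flash configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, hidden=1024, num_experts=64, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, num_experts)           # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                          nn.Linear(4 * hidden, hidden))
            for _ in range(num_experts)
        )

    def forward(self, x):                                      # x: [tokens, hidden]
        logits = self.router(x)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)  # pick k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e                       # tokens routed to expert e
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out                                             # only k experts run per token
```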
Internal testing shows the model can use "Parallel Thinking" to further boost its scores in reasoning-heavy scenarios. When these enhanced settings are active, Step 3.5 Flash rivals the output of models like GPT-4.5 and Gemini 3.0 Pro. The release emphasizes "intelligence density," focusing on how much logic a model can pack into a limited compute budget.
Sparse architecture optimizes intelligence density
The technical core of Step 3.5 Flash relies on a model-system co-design that prioritizes inference speed as a primary constraint. The engineering team employed a hybrid attention layout to bypass the quadratic bottlenecks typically found in long-context processing. This layout interleaves Sliding-Window Attention (SWA) and Full Attention at a 3:1 ratio.
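To make the 3:1 interleaving concrete, the sketch below builds such a layer schedule: three sliding-window layers for every full-attention layer. The layer count and window size are assumptions; only the ratio comes from the description above.

```python
# Sketch of a 3:1 interleaved attention layout: three sliding-window layers
# followed by one full-attention layer, repeated across the stack.
# The layer count and window size are illustrative assumptions.
NUM_LAYERS = 48
WINDOW = 4096

def attention_kind(layer_idx: int) -> str:
    # Every 4th layer (0-indexed: 3, 7, 11, ...) uses full attention,
    # the other three use sliding-window attention.
    return "full" if layer_idx % 4 == 3 else "sliding_window"

layout = [attention_kind(i) for i in range(NUM_LAYERS)]
assert layout.count("sliding_window") == 3 * layout.count("full")
print(layout[:8])  # ['sliding_window', 'sliding_window', 'sliding_window', 'full', ...]
```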
To accelerate output, the model integrates Multi-Token Prediction (MTP) heads that predict future tokens in parallel with the primary stream. This allows for parallel verification, effectively breaking the serial constraints that slow down standard autoregressive decoding. On NVIDIA Hopper GPUs, the model reaches a throughput of 350 tokens per second.
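The parallel-verification loop behind MTP-style decoding works much like speculative decoding: draft several future tokens, check them all in one forward pass, and accept the longest matching prefix. Below is a minimal greedy-acceptance sketch; `propose_with_mtp_heads` and `score_with_main_model` are hypothetical stand-ins for the MTP heads and the primary decoder.

```python
# Sketch of MTP-style parallel verification with greedy acceptance.
# The two callables are hypothetical stand-ins, not a published API.
def decode_step(prefix, propose_with_mtp_heads, score_with_main_model, k=4):
    draft = propose_with_mtp_heads(prefix, k)            # k future tokens, predicted in parallel
    logits = score_with_main_model(prefix + draft)       # one forward pass verifies all drafts
    accepted = []
    for i, tok in enumerate(draft):
        pos = len(prefix) + i - 1                         # logits at pos predict the token at pos+1
        if logits[pos].argmax() == tok:
            accepted.append(tok)                          # draft agrees with the main model: keep it
        else:
            accepted.append(logits[pos].argmax())         # first mismatch: substitute and stop
            break
    return prefix + accepted
```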
The model also uses several specific hardware-level optimizations to maintain stability:
- 96 query heads in SWA layers to increase representational power.
- Head-wise Gated Attention to act as an input-dependent attention sink.
- Dense layers in the initial network stages to anchor foundational knowledge.
- INT8 KVCache quantization to support context windows up to 256K tokens.
These refinements ensure that the model does not suffer from the "long-context penalty" where attention costs usually explode. By modulating information flow dynamically, the model preserves numerical stability even during massive data ingestion. This makes it a viable choice for repository-level coding and long-form document analysis.
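To make the INT8 KVCache item concrete, the sketch below applies per-tensor symmetric quantization to a cached key/value block, roughly halving memory versus FP16. Production systems typically use finer per-head or per-channel scales, and the exact scheme used here has not been published.

```python
# Illustrative INT8 KV-cache quantization (per-tensor symmetric scaling).
import torch

def quantize_kv_int8(kv: torch.Tensor):
    # One shared scale for the whole block; real systems usually scale per head or channel.
    scale = kv.abs().max().clamp(min=1e-8) / 127.0
    q = torch.round(kv / scale).clamp(-127, 127).to(torch.int8)
    return q, scale                        # int8 payload + fp scale: ~2x smaller than fp16

def dequantize_kv_int8(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float16) * scale     # restore approximate values on read
```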
Agentic capabilities transform static models
Step 3.5 Flash moves beyond simple text generation by focusing on "Think-and-Act" synergy within tool-heavy environments. The model acts as a central controller that can orchestrate over 80 MCP tools in a single integrated session. In a demonstrated stock investment scenario, the model aggregated market data, executed raw code for financial metrics, and triggered cloud storage protocols autonomously.
The model maintains intent-alignment even when navigating high-density toolsets that usually confuse smaller LLMs. It can pivot between raw code execution and specialized API protocols without losing the thread of the original user request. This reliability allows it to function as a resilient partner for complex, multi-step workflows.
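The controller behavior described above can be pictured as a think-and-act loop over a tool registry. The sketch below is a generic version of such a loop; `call_model`, the tool names, and the message format are hypothetical, and the actual MCP integration is not reproduced here.

```python
# Sketch of a think-and-act controller loop over a tool registry.
import json

TOOLS = {
    "market_data": lambda args: {"price": 42.0},        # placeholder implementations
    "run_code": lambda args: {"stdout": "ok"},
    "cloud_store": lambda args: {"saved": True},
}

def run_agent(task, call_model, max_steps=10):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_model(history, tools=list(TOOLS))    # model decides: final answer or tool call
        if step["type"] == "final":
            return step["content"]
        result = TOOLS[step["tool"]](step.get("arguments", {}))
        history.append({"role": "tool", "name": step["tool"],
                        "content": json.dumps(result)})  # feed the observation back to the model
    return "step budget exhausted"
```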
Coding performance has also shifted from simple completion to autonomous engineering. Step 3.5 Flash is fully compatible with Claude Code, serving as an efficient backend for agent-led development. It treats code as a tool to verify logic and map out cross-module dependencies across thousands of lines of code.
Competition-level math and reasoning benchmarks underscore this proficiency:
- AIME 2025: 99.8
- HMMT 2025 Nov: 98.0
- IMOAnswerBench: 86.7
- ARC-AGI-1: 56.5
Cloud and edge devices work together
A new edge-cloud collaboration framework allows Step 3.5 Flash to control physical devices through Step-GUI. In this setup, the cloud-based model acts as the "Brain" while the edge-deployed model acts as the "Hand." This hierarchy allows for complex tasks like searching arXiv papers and immediately sharing summaries via mobile messaging apps.
The cloud side handles heavy planning and data synthesis, which simplifies the execution requirements for the local device. This ensures higher success rates when the model retrieves real-time data from mobile apps like Taobao or JD.com. In testing on the AndroidDaily Hard subset, this collaborative approach significantly outperformed single-agent systems.
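A simplified view of the Brain/Hand split: the cloud model produces a step plan, the on-device model executes each step as GUI actions, and the cloud model synthesizes the result. The function names below are hypothetical stand-ins for those two roles.

```python
# Sketch of the edge-cloud split: the cloud "Brain" plans, the edge "Hand"
# turns each step into device actions. All callables are hypothetical.
def run_task(goal, cloud_plan, cloud_summarize, edge_execute):
    plan = cloud_plan(goal)                # cloud: heavy planning and data synthesis
                                           # e.g. ["open arXiv search", "collect abstracts",
                                           #       "summarize", "share via messaging app"]
    observations = [edge_execute(step) for step in plan]   # edge: on-device GUI actions
    return cloud_summarize(goal, observations)             # cloud composes the final answer
```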
For users prioritizing privacy, the model supports local deployment on high-end consumer hardware. Step 3.5 Flash is compatible with Apple M4 Max, NVIDIA DGX Spark, and AMD AI Max+ systems. The team released the model weights in GGUF format with INT4 quantization to facilitate these private execution environments.
On an NVIDIA DGX Spark, the model achieves a local generation speed of 20 tokens per second. This allows developers to run a frontier-class model without sending sensitive data to external servers. The integration with llama.cpp ensures that the model is accessible to the existing open-source ecosystem immediately.
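For local experimentation, the GGUF weights can be loaded through the existing llama.cpp ecosystem, for example via the llama-cpp-python bindings. The file name, context size, and GPU offload settings below are assumptions rather than official values.

```python
# Sketch of local inference with the llama-cpp-python bindings.
from llama_cpp import Llama

llm = Llama(
    model_path="step-3.5-flash-int4.gguf",  # hypothetical INT4 GGUF file name
    n_ctx=32768,                             # context window for this session
    n_gpu_layers=-1,                         # offload all layers to the GPU if available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize recent work on MoE routing."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```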
New algorithms stabilize large-scale training
The development of Step 3.5 Flash involved a new reinforcement learning framework designed to solve stability issues in long-horizon reasoning. Standard PPO pipelines often collapse when minor token-level discrepancies lead to extreme importance weights. To fix this, the team introduced Metropolis Independence Sampling Filtered Policy Optimization (MIS-PO).
MIS-PO replaces traditional importance weighting with a strict binary acceptance criterion. If a trajectory deviates too far from the training policy, the system simply excludes it from optimization rather than trying to scale the gradient. This design reduces variance and allows the model to learn from long reasoning sequences without crashing the training run.
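The accept-or-drop idea can be sketched as follows; the sequence-level ratio and the threshold are illustrative assumptions rather than the published MIS-PO recipe.

```python
# Sketch of binary trajectory acceptance replacing importance weighting.
import torch

def filter_trajectories(logp_new, logp_old, max_log_ratio=1.0):
    # Sequence-level log importance ratio between the current policy and the
    # policy that generated the trajectory.
    log_ratio = (logp_new - logp_old).sum(dim=-1)          # [batch]
    # Binary acceptance: keep trajectories close to the training policy,
    # drop the rest instead of scaling their gradients by extreme weights.
    return log_ratio.abs() <= max_log_ratio

def policy_loss(logp_new, logp_old, advantages, max_log_ratio=1.0):
    accept = filter_trajectories(logp_new, logp_old, max_log_ratio).float()
    per_traj = -(logp_new.sum(dim=-1) * advantages)        # REINFORCE-style surrogate
    # Accepted trajectories contribute with weight 1; rejected ones with 0.
    return (per_traj * accept).sum() / accept.sum().clamp(min=1.0)
```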
The training pipeline also includes:
- Truncation-aware value bootstrapping to prevent penalties when hitting context limits (see the sketch after this list).
- Routing confidence monitoring specifically for MoE model stability.
- ReAct architecture for iterative deep research investigations.
- Multi-agent orchestration for parallel search and verification loops.
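On the first item: a common way to make returns truncation-aware is to bootstrap from the value estimate when a rollout is cut off by the context limit, instead of treating the cutoff as a zero-reward failure. The sketch below assumes that interpretation.

```python
# Sketch of truncation-aware bootstrapping: when a rollout is cut off by the
# context limit, bootstrap the return from the value estimate instead of
# treating the truncation as a terminal failure. Illustrative only.
def compute_returns(rewards, values, truncated, gamma=1.0):
    # `rewards` and `values` are per-step lists; `truncated` marks a context-limit cutoff.
    ret = values[-1] if truncated else 0.0   # bootstrap only on truncation, not on real termination
    returns = []
    for r in reversed(rewards):
        ret = r + gamma * ret
        returns.append(ret)
    return list(reversed(returns))
```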
These components allowed the model to achieve a 65.27% score on the Scale AI Research Rubrics. This benchmark measures factual grounding and reasoning depth in long-form reports. In one case study, the model synthesized a 10,000-word research report on neuroplasticity, demonstrating its ability to maintain coherence across massive outputs.
The model is available now via a simple bash installation. Developers can configure it through the OpenClaw WebUI by adding the new provider and pointing to the official API or local GGUF files. This release marks a shift toward models that are not just fast enough to think, but reliable enough to act autonomously.
