- InferenceMAX v1 measures real-world performance and economics with reproducible nightly testing.
- NVIDIA Blackwell leads in tokens/s, cost per million tokens, and tokens per MW.
- Continuous software improvements (TensorRT-LLM, Dynamo, SGLang, vLLM) drive 5x-15x gains.
- GB200 NVL72 achieves 15x ROI and the lowest TCO on dense workloads and MoE models.
The conversation about inference performance in AI has accelerated, and with good reason: InferenceMAX v1 has brought order with verifiable, up-to-date data that looks beyond raw speed to assess real economics. In this context, NVIDIA's Blackwell platform has not only set the pace, it has swept the field with unprecedented efficiency and cost-per-token results.
In short, we are talking about a change of era: from “how fast it runs” to “how much it yields per euro and per watt in production”. Combining Blackwell hardware (B200 and GB200 NVL72), fifth-generation NVLink interconnect, NVFP4 low precision, and ongoing software optimizations (TensorRT-LLM, Dynamo, SGLang, vLLM) raises the bar in tokens/s, cost per million tokens, and effective ROI in real-world scenarios.
What is InferenceMAX v1 and why it matters
The industry's biggest complaint was that traditional benchmarks become outdated quickly and often favor unrealistic configurations. InferenceMAX v1 breaks with that: it is an open-source, automated benchmark with nightly runs under the Apache 2.0 license that re-evaluates popular frameworks and models daily to capture real software progress.
For each combination of model and hardware, the system sweeps tensor-parallelism sizes and concurrency levels, and produces performance curves that balance throughput and latency. Results are published daily and multiple frameworks are tested (SGLang, TensorRT-LLM and vLLM), allowing us to see how recent optimizations move the Pareto frontier in near real time.
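To make that methodology concrete, here is a minimal sketch of what such a sweep loop could look like. It is not InferenceMAX's actual harness: `run_benchmark`, the `Result` fields and the parameter grids are illustrative assumptions.

```python
# Illustrative sweep over tensor-parallel sizes and concurrency levels.
# run_benchmark() is a placeholder: a real harness would launch the serving
# framework (SGLang, TensorRT-LLM or vLLM), drive the requested number of
# concurrent streams and measure the resulting metrics.
from dataclasses import dataclass

@dataclass
class Result:
    tp_size: int                 # tensor-parallel degree
    concurrency: int             # simultaneous user streams
    tokens_per_s_per_gpu: float  # aggregate throughput normalized per GPU
    tps_per_user: float          # interactivity seen by each user

def run_benchmark(model: str, tp_size: int, concurrency: int) -> Result:
    raise NotImplementedError("placeholder for the actual serving + load driver")

def sweep(model: str, tp_sizes=(1, 2, 4, 8), concurrencies=(4, 8, 16, 32, 64)):
    return [run_benchmark(model, tp, c) for tp in tp_sizes for c in concurrencies]
```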
At the methodological level, the tests cover single-node and multi-node setups with Expert Parallelism (EP), and include variable input/output sequence lengths (80%-100% ISL/OSL combinations) to mimic real workloads of reasoning, document processing, summarization and chat. The result is a continuous snapshot of latency, throughput, batch sizes, and input/output ratios that represents real operating economics, not just theory.
Blackwell leads: performance, efficiency, and economies of scale
The published data leave little room for doubt: NVIDIA Blackwell sweeps InferenceMAX v1 in inference performance and efficiency across the entire load range. Compared to the Hopper generation (HGX H200), the jump to B200 and GB200 NVL72 brings order-of-magnitude improvements in compute per watt and memory bandwidth, plus a drastic drop in cost per million tokens.
In concrete terms, the GB200 NVL72 system achieves 15x ROI: an investment of 5 million dollars can generate 75 million in token revenue. This figure is not an accounting trick: it comes from the combination of native low-precision NVFP4, fifth-generation NVLink and NVLink Switch, and the maturity of TensorRT-LLM and NVIDIA Dynamo in the software stack.
History repeats itself with cost per token. On gpt-oss, B200 optimizations have reduced the cost to two cents per million tokens, a 5x decrease in just two months. This trend, supported by ongoing software improvements, completely changes the economic viability of new use cases.
Methodology that captures the reality of production
InferenceMAX v1 doesn't just measure tokens per second. It maps throughput versus latency on a Pareto frontier, which helps decide at which operating point it is worth serving, according to interactivity SLAs and TCO objectives. What matters is that Blackwell maintains its advantage across the whole range, not just in a single optimal corner.
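As a rough illustration of how such a frontier can drive that decision, the sketch below filters the sweep results from the earlier example down to non-dominated points and then picks the highest-throughput configuration that still meets an interactivity SLA. The `Result` fields are the same assumed ones as before; this is not the benchmark's own code.

```python
def pareto_frontier(results):
    """Keep configurations not strictly dominated in (per-GPU throughput, per-user TPS)."""
    def dominates(a, b):
        return (a.tokens_per_s_per_gpu >= b.tokens_per_s_per_gpu
                and a.tps_per_user >= b.tps_per_user
                and (a.tokens_per_s_per_gpu > b.tokens_per_s_per_gpu
                     or a.tps_per_user > b.tps_per_user))
    return [r for r in results if not any(dominates(o, r) for o in results)]

def pick_operating_point(results, min_tps_per_user):
    """Among frontier points meeting the interactivity SLA, maximize per-GPU
    throughput (which, for a fixed GPU price, minimizes cost per token)."""
    eligible = [r for r in pareto_frontier(results) if r.tps_per_user >= min_tps_per_user]
    return max(eligible, key=lambda r: r.tokens_per_s_per_gpu, default=None)
```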
For representativeness, the tests include concurrencies from 4 to 64 (and scenarios beyond these limits in complementary analyses), various EP and DEP settings, and community reference models, from gpt-oss 120B to Llama 3.3 70B or DeepSeek-R1. Everything comes with an open repository and reproducible recipes so anyone can validate the results.
Pure performance: tokens/s per GPU and interactivity
Blackwell B200 sets the pace with numbers that would have seemed like science fiction a year ago. With the latest NVIDIA TensorRT-LLM stack, reported figures reach 60,000 tokens per second per GPU and up to 1,000 tokens per second per user on gpt-oss, maintaining interactivity without sacrificing the user experience.
In dense models such as Llama 3.3 70B, which activate all parameters during inference, Blackwell achieves 10,000 tokens/s per GPU at 50 TPS/user in InferenceMAX v1, more than 4x the H200. This improvement rests on NVFP4, the fifth generation of Tensor Cores and 1,800 GB/s of bidirectional NVLink bandwidth, avoiding bottlenecks between GPUs.
Efficiency is also measured in tokens per watt and cost per million tokens. For AI factories with power limits, Blackwell delivers 10x more throughput per megawatt than the previous generation. In addition, it has reduced the cost per million tokens by 15x, opening the door to much more cost-effective mass deployments.
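Both efficiency metrics are simple to derive once you know sustained throughput, GPU power draw and the hourly price of the hardware. The helper below shows that arithmetic; the example numbers are placeholders, not InferenceMAX results.

```python
def cost_per_million_tokens(gpu_hour_price_usd: float, tokens_per_s_per_gpu: float) -> float:
    """Cost in USD to generate one million tokens on a single GPU."""
    tokens_per_hour = tokens_per_s_per_gpu * 3600
    return gpu_hour_price_usd / tokens_per_hour * 1_000_000

def tokens_per_megawatt_hour(tokens_per_s_per_gpu: float, gpu_power_w: float) -> float:
    """Tokens generated per MWh of GPU power (ignores cooling and host overhead)."""
    tokens_per_joule = tokens_per_s_per_gpu / gpu_power_w
    return tokens_per_joule * 3.6e9  # 1 MWh = 3.6e9 joules

# Example with placeholder values (not benchmark figures):
print(cost_per_million_tokens(gpu_hour_price_usd=4.0, tokens_per_s_per_gpu=30_000))   # ~0.037
print(f"{tokens_per_megawatt_hour(tokens_per_s_per_gpu=30_000, gpu_power_w=1000):.2e}")
```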
Software that improves every week: from 6K to 30K tokens/s per GPU
Beyond the hardware, software velocity is the moat. Following the release of gpt-oss-120b on August 5, B200 was already performing well in InferenceMAX v1 with TensorRT-LLM, but successive optimizations have doubled and then multiplied the initial numbers. At about 100 TPS/user, GPU throughput nearly doubled within a short time of launch day.
The October 9 TensorRT-LLM release added EP and DEP parallelism mappings, and performance at 100 TPS/user increased up to 5x over the initial version, going from ~6K to ~30K tokens/s per GPU. Part of this jump comes from concurrencies higher than those InferenceMAX tests as standard (4-64), which shows how much headroom is still left in advanced configurations.
The masterstroke has been enabling speculative decoding for gpt-oss-120b with the gpt-oss-120b-Eagle3-v2 model. With EAGLE, GPU throughput at 100 TPS/user triples with respect to the published results, going from 10K to 30K tokens/s. And best of all: the cost per million tokens at 100 TPS/user has dropped from $0.11 to $0.02 in two months. Even at 400 TPS/user it stays around $0.12, making multi-agent scenarios and complex reasoning viable.
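For readers unfamiliar with the technique, the sketch below shows the general draft-and-verify idea behind speculative decoding in its simplest greedy form. It is a conceptual illustration only: `draft_model` and `target_model` are assumed wrappers, and the real EAGLE/TensorRT-LLM implementation is considerably more sophisticated (tree drafting, batched verification and so on).

```python
def speculative_decode_step(target_model, draft_model, prefix, k=4):
    """One draft-and-verify step (greedy variant, for illustration only).

    The cheap draft model proposes k tokens; the expensive target model scores
    all of them in a single forward pass; we keep the longest agreeing prefix,
    plus one token from the target, so each target pass yields 1..k+1 tokens."""
    draft = draft_model.greedy_generate(prefix, num_tokens=k)        # k cheap tokens
    target_preds = target_model.greedy_next_tokens(prefix, draft)    # length k + 1
    accepted = []
    for i in range(k):
        if draft[i] == target_preds[i]:
            accepted.append(draft[i])            # target agrees: keep the draft token
        else:
            accepted.append(target_preds[i])     # first disagreement: take target's token
            break
    else:
        accepted.append(target_preds[k])         # all drafts accepted: bonus target token
    return prefix + accepted
```

The economics follow directly: when the draft model's acceptance rate is high, each expensive target pass emits several tokens instead of one, which is what pushes up tokens/s per GPU at a fixed interactivity target.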
Real economy: 15x ROI and minimum TCO with GB200 NVL72
On the reasoning model DeepSeek-R1, the InferenceMAX v1 curves show that GB200 NVL72 cuts the cost per million tokens overwhelmingly versus H200 at all levels of interactivity. At ~75 TPS/user, H200 sits at $1.56, while GB200 NVL72 falls to just over $0.10, a 15x reduction. In addition, the GB200 cost curve stays flat for longer, allowing serving above 100 TPS/user without punishing the wallet.
For mass deployments, this means "AI factories" can serve more users with better SLAs without blowing up OPEX or giving up throughput. Add the fact that an investment of 5 million can generate 75 million in token revenue and the message is clear: inference is where AI returns value every day, and Blackwell capitalizes on it with its full-stack approach.
Architecture that enables the jump: NVFP4, NVLink 5 and NVLink Switch
Blackwell's lead doesn't come out of nowhere. The stack relies on extreme hardware-software co-design: NVFP4 precision for efficiency without losing accuracy, and fifth-generation NVIDIA NVLink with an NVLink Switch that allows 72 GPUs to be treated as one macro-GPU, enabling very high concurrencies with tensor, expert and data parallelism.
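To give an intuition for block-scaled 4-bit formats, here is a toy NumPy sketch that quantizes values in blocks with one scale per block, snapping them to the FP4 (E2M1) grid. It is a simplified illustration of the idea, not the actual NVFP4 format, whose block size, FP8 scale encoding and hardware path are handled by NVIDIA's libraries.

```python
import numpy as np

# Representable magnitudes of an E2M1 (FP4) value.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_blockwise(x: np.ndarray, block: int = 16):
    """Toy block-scaled 4-bit quantization: one scale per block of `block`
    values, each value snapped to the nearest FP4-representable magnitude.
    Assumes x.size is a multiple of `block`."""
    x = x.reshape(-1, block)
    scales = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID[-1]   # per-block scale
    scales[scales == 0] = 1.0
    scaled = x / scales
    idx = np.abs(scaled[..., None] - np.sign(scaled[..., None]) * FP4_GRID).argmin(-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q, scales                                               # dequantize with q * scales

x = np.random.randn(4, 16).astype(np.float32)
q, s = quantize_fp4_blockwise(x)
print(np.abs(x - q * s).max())   # worst-case quantization error for this block
```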
This approach combines an annual hardware cadence with continuous software improvements that, on their own, have more than doubled Blackwell's performance since launch. Integration with TensorRT-LLM, NVIDIA Dynamo, SGLang and vLLM completes the picture, backed by a gigantic ecosystem of millions of GPUs, CUDA developers and hundreds of open-source projects.
MoE at full power: disaggregated serving with GB200, Dynamo, and TensorRT-LLM
Verified tests show that the combination of GB200 NVL72, Dynamo and TensorRT-LLM boosts the throughput of MoE models like DeepSeek-R1 under very different SLAs, leaving Hopper-based systems behind. The NVL72's scale-up design interconnects 72 GPUs with NVLink in a single domain, with up to 130 TB/s of aggregate GPU-to-GPU bandwidth, which is key to routing expert tokens without the bottlenecks of traditional interconnects.
Dynamo's disaggregated serving separates prefill and decode onto separate nodes, optimizing each phase with different GPU counts and EP layouts. The decode phase, which is more memory-bound, can exploit wide expert parallelism without slowing down the prefill phase, which is more compute-intensive.
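Conceptually, disaggregation is just routing the two phases of a request to different worker pools. The toy sketch below captures that split; it is not Dynamo's API, and the worker objects (with assumed `prefill`/`decode` methods) exist only for illustration.

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Request:
    prompt: str
    kv_cache: object = None          # produced by prefill, consumed by decode
    output: list = field(default_factory=list)

class DisaggregatedRouter:
    """Toy router: prefill goes to compute-optimized workers, token-by-token
    decode goes to memory/bandwidth-optimized workers (round robin)."""
    def __init__(self, prefill_workers, decode_workers):
        self.prefill_workers = deque(prefill_workers)
        self.decode_workers = deque(decode_workers)

    def handle(self, req: Request) -> Request:
        # Phase 1: prefill builds the KV cache (compute-bound, batches well).
        p = self.prefill_workers[0]; self.prefill_workers.rotate(-1)
        req.kv_cache = p.prefill(req.prompt)
        # Phase 2: decode streams tokens (memory-bound, wide EP helps).
        d = self.decode_workers[0]; self.decode_workers.rotate(-1)
        for token in d.decode(req.kv_cache):
            req.output.append(token)
        return req
```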
To avoid idle GPUs in wide EP deployments, TensorRT-LLM monitors expert load, spreads the most-used experts around and can replicate them to balance the work. The result: high and stable utilization, with net gains in effective throughput.
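A minimal sketch of that balancing idea, assuming the only input is per-expert token counts from recent traffic: replicate the hottest experts into spare slots and spread the replicas across GPUs by expected load. The real TensorRT-LLM load balancer is far more dynamic; this only illustrates the principle.

```python
from collections import Counter

def plan_expert_placement(token_counts: Counter, num_gpus: int, slots_per_gpu: int):
    """Toy planner: add replicas of the hottest experts until spare slots are
    used, then assign replicas greedily to the least-loaded GPU."""
    total_slots = num_gpus * slots_per_gpu
    assert len(token_counts) <= total_slots, "toy planner needs a slot per expert"
    replicas = list(token_counts)                                   # one copy of each expert
    for expert, _ in token_counts.most_common(total_slots - len(replicas)):
        replicas.append(expert)                                     # extra copies of hot experts
    per_replica = {e: token_counts[e] / replicas.count(e) for e in token_counts}
    load, placement = [0.0] * num_gpus, [[] for _ in range(num_gpus)]
    for expert in sorted(replicas, key=per_replica.get, reverse=True):
        g = min((i for i in range(num_gpus) if len(placement[i]) < slots_per_gpu),
                key=load.__getitem__)                               # least-loaded GPU with a free slot
        placement[g].append(expert)
        load[g] += per_replica[expert]
    return placement, load
```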
Open collaboration: SGLang, vLLM and FlashInfer
Beyond Dynamo and TensorRT-LLM, NVIDIA has co-developed kernels and optimizations for Blackwell alongside SGLang and vLLM, delivered through FlashInfer. This covers improvements in prefill and decode kernels for attention, communication, GEMM, MNNVL, MLA and MoE, as well as runtime optimizations.
SGLang has incorporated Multi-Token Prediction (MTP) and disaggregation for DeepSeek-R1. vLLM has gained asynchronous scheduling with overlap to reduce host overhead, automatic graph fusion, and performance and functionality improvements for gpt-oss, Llama 3.3 and general architectures. All of it helps Blackwell deliver its efficiency through the most widely used open-source frameworks.
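For reference, running one of these models under vLLM looks roughly like the snippet below. The model identifier and parallelism degree are placeholders to adapt to your setup, and Blackwell-specific options (quantization, attention backends) vary by vLLM version, so treat it as a starting sketch rather than a tuned configuration.

```python
# Minimal vLLM serving sketch (model name and parallelism are placeholders).
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",     # assumed Hugging Face identifier; substitute your model
    tensor_parallel_size=8,          # spread the model across 8 GPUs in one node
)
params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Summarize the InferenceMAX v1 methodology."], params)
print(outputs[0].outputs[0].text)
```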
Comparisons and additional technical details of the ecosystem
In technical analyses, the Blackwell architecture stands out as a notable advance for low-latency, high-throughput inference. Highlights include mixed FP8/FP4 execution on fifth-generation Tensor Cores, along with NVLink 5 at up to 1.8 TB/s per GPU for communication between multiple units without bottlenecks.
On DGX B200 nodes with NVSwitch, configurations reach up to eight GPUs with unified HBM3e memory totaling almost 1.44 TB, and inference pipelines that reflect actual usage: initial prefill followed by autoregressive decoding. The suite measures tokens/s, per-request latency and FLOPS efficiency, with kernel-level optimizations and specialized TensorRT-LLM engines.
Against the H100 (Hopper), Blackwell reaches up to 4x the throughput on Llama 2/3 70B on a comparable node, attributable to more Tensor Cores and improvements in memory bandwidth (up to 5 TB/s per GPU in some benchmarks). Analyses also report near-linear scalability in clusters of hundreds of GPUs, maintaining high HBM3e utilization and avoiding costly paging to host memory.
In energy efficiency, improvements of up to 2.5x versus H100 are reported, with consumption that, under high load, sits around 700 W to 1,000 W per GPU depending on the configuration, and FP4 performance peaks that clearly exceed the previous generation in FLOPS per watt. Tools like DCGM and telemetry via Prometheus/Grafana provide first-class observability.
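As a small example of that observability path, the sketch below queries a Prometheus server that scrapes dcgm-exporter for per-GPU power draw. The server address is a placeholder, and label names can differ slightly across exporter versions.

```python
# Pull per-GPU power draw from a Prometheus instance scraping dcgm-exporter.
# DCGM_FI_DEV_POWER_USAGE is the standard dcgm-exporter power metric (watts).
import requests

PROM = "http://prometheus.example.internal:9090"   # placeholder address

def gpu_power_watts() -> dict:
    resp = requests.get(f"{PROM}/api/v1/query",
                        params={"query": "DCGM_FI_DEV_POWER_USAGE"})
    resp.raise_for_status()
    samples = resp.json()["data"]["result"]
    return {s["metric"].get("gpu", "?"): float(s["value"][1]) for s in samples}

if __name__ == "__main__":
    power = gpu_power_watts()
    print(f"total board power: {sum(power.values()):.0f} W across {len(power)} GPUs")
```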
Operating economics, sustainability and compliance
InferenceMAX v1's focus on metrics such as tokens per megawatt and cost per million tokens is not posturing: it shapes capex and opex decisions. Blackwell achieves 10x more throughput per MW than the previous generation and has lowered the cost per million tokens by 15x, with direct implications for service expansion and sustainability.
The discussion also touches on renewable-energy practices for DGX systems and regulatory references such as the EU AI Act, GDPR or NIST SP 800-53. In addition, Blackwell incorporates Confidential Computing with secure enclaves and memory encryption to protect data in highly regulated sectors such as banking or healthcare.
Use cases: security, IT and even blockchain
Combining high throughput and interactivity lets teams go from pilots to real-time security systems, from log analysis to anomaly detection on petabyte-scale networks with sub-second latencies. In IT, hyperscalers are integrating Blackwell into offerings for hybrid workloads with distributed storage and 5G networks, leaning on RoCE for minimal latency at the edge, while companies like ByteDance strengthen their commitment to NVIDIA chips.
Even in blockchain, decentralized AI oracles and the acceleration of ZK proofs on networks like Ethereum or Solana are being floated, thanks to tensor parallelism. Operationally, reductions of up to 40% in inference TCO are cited, due to higher rack density and advanced liquid cooling that keeps temperatures below 85°C under sustained load.
Good practices and migration challenges
It's not all red carpet: migrating from Hopper requires recompiling CUDA kernels and can uncover bugs in legacy pipelines. NVIDIA's best-practice guides for LLM inference recommend profiling with Nsight Systems, spotting bottlenecks in attention and decoding, and applying sharding techniques with Megatron-LM to balance loads across GPUs.
For security, it is advisable to enable secure boot and runtime protections in TensorRT to prevent code injection. In decentralized deployments, latency is kept in check with sidechains and by offloading computation to dedicated GPUs, preserving integrity with cryptographic proofs.
Community, resources and transparency
InferenceMAX v1 is a community effort. The project thanks AMD (MI355X and CDNA3) for hardware and NVIDIA for access to GB200 NVL72 (via OCI) and B200, as well as the inference and Dynamo teams and compute providers such as Crusoe, CoreWeave, Nebius, TensorWave, Oracle and TogetherAI for backing open source with real resources.
The platform publishes a live dashboard at inferencemax.ai with updated results and provides containers and configurations to reproduce the benchmarks. Given the speed at which AI software evolves, nightly tests are the honest way to show where performance stands today, not months ago.
Industry voices and career opportunities
Infrastructure leaders and scientists acknowledge that the gap between theoretical peak and actual throughput is determined by systems software, distributed strategies and low-level kernels. That's why they value open, reproducible benchmarks that show how optimizations behave on different hardware and that report tokens/s, tokens per dollar and tokens per megawatt with transparency.
In addition, the project is looking for talent for a special projects team. Among the responsibilities, the following stand out:
- Design and execute large-scale benchmarks across multiple vendors (AMD, NVIDIA, TPU, Trainium, etc.).
- Build reproducible CI/CD pipelines to automate executions.
- Ensure reliability and scalability of systems shared with industry partners.
Collaborations with open models and ecosystems
NVIDIA maintains open collaborations with the community and with teams such as OpenAI (gpt-oss 120B), Meta (Llama 3.3 70B) and DeepSeek AI (DeepSeek-R1), as well as contributions to FlashInfer, SGLang and vLLM. This ensures that the latest models are optimized for the world's largest inference infrastructure and that kernel and runtime improvements are integrated at scale.
For companies, NVIDIA's Think SMART framework helps navigate the jump from pilots to AI factories, fine-tuning platform decisions, cost per token, latency and utilization SLAs against changing workloads. In a world moving from one-shot responses to multi-stage reasoning and tool use, this guidance becomes strategic.
For anyone wondering whether it's worth taking a closer look at the InferenceMAX v1 recipes, know that they are open for anyone to replicate Blackwell's leadership across very different inference scenarios. It's exactly the kind of transparency that accelerates progress across the community.
After reviewing the data, the software improvements, and the open collaborations, one key idea remains clear: inference is where AI turns performance into business on a daily basis. With flat cost curves at high levels of interactivity, tokens/s per GPU that scale elegantly and an ecosystem that never stops optimizing kernels and runtimes, Blackwell consolidates itself as the reference platform for those who want to build efficient, fast, and profitable AI factories.