vllm.utils.flashinfer ¶
Compatibility wrapper for FlashInfer API changes.
Users of vLLM should always import only these wrappers.
_flashinfer_concat_mla_k ¶
Custom op wrapper for flashinfer's concat_mla_k.
This is an in-place operation that concatenates k_nope and k_pe into k.
The kernel is optimized for DeepSeek V3 dimensions:
- num_heads=128
- nope_dim=128
- rope_dim=64

Key optimizations:
- Warp-based processing with software pipelining
- Vectorized memory access (int2 for nope, int for rope)
- L2 prefetching for next row while processing current
- Register reuse for rope values across all heads
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| k | Tensor | Output tensor, shape [num_tokens, num_heads, nope_dim + rope_dim]. Modified in-place. | required |
| k_nope | Tensor | The nope part of k, shape [num_tokens, num_heads, nope_dim]. | required |
| k_pe | Tensor | The rope part of k (shared), shape [num_tokens, 1, rope_dim]. This is broadcast to all heads. | required |
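The shapes above imply a simple reference semantics: each head's output row is its own nope slice followed by the single rope row shared across heads. The sketch below spells that out with plain nested lists standing in for tensors. The function name `concat_mla_k_reference` and the tiny sizes are illustrative assumptions, not part of vLLM or FlashInfer, and none of the kernel optimizations are modelled.

```python
# Reference semantics only: plain nested lists stand in for tensors.
def concat_mla_k_reference(k, k_nope, k_pe):
    """Fill k[t][h] with k_nope[t][h] followed by the shared k_pe[t][0]."""
    for t, (nope_rows, pe_rows) in enumerate(zip(k_nope, k_pe)):
        shared_rope = pe_rows[0]  # single rope row, broadcast to every head
        for h, nope in enumerate(nope_rows):
            k[t][h][:] = nope + shared_rope  # in-place write into the output row


# Tiny illustrative sizes (the kernel itself targets 128 heads, 128 + 64 dims).
num_tokens, num_heads, nope_dim, rope_dim = 2, 3, 4, 2
k_nope = [[[float(t * 10 + h)] * nope_dim for h in range(num_heads)]
          for t in range(num_tokens)]
k_pe = [[[-1.0 - t] * rope_dim] for t in range(num_tokens)]
k = [[[0.0] * (nope_dim + rope_dim) for _ in range(num_heads)]
     for _ in range(num_tokens)]

concat_mla_k_reference(k, k_nope, k_pe)
```

After the call, every `k[t][h]` has length `nope_dim + rope_dim`, and the last `rope_dim` entries are identical across heads of the same token.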
Source code in vllm/utils/flashinfer.py
_get_submodule ¶
Safely import a submodule and return it, or None if not available.
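A minimal sketch of the "import or return None" pattern described above, using only the standard library. The name `get_submodule_sketch` is hypothetical and the body is an assumption about how such a helper could look, not the vLLM implementation.

```python
import importlib
from types import ModuleType
from typing import Optional


def get_submodule_sketch(module_name: str) -> Optional[ModuleType]:
    """Return the imported module, or None if it cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError:  # also covers ModuleNotFoundError
        return None


json_decoder = get_submodule_sketch("json.decoder")       # a module object
missing = get_submodule_sketch("json.no_such_submodule")  # None
```

Callers can then branch on `is None` instead of wrapping every optional import in their own `try`/`except`.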
_lazy_import_wrapper ¶
_lazy_import_wrapper(
module_name: str,
attr_name: str,
fallback_fn: Callable[..., Any] = _missing,
)
Create a lazy import wrapper for a specific function.
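One way such a wrapper could work is to defer the import until the first call and fall back to a placeholder when the module or attribute is missing. The sketch below illustrates that idea under those assumptions; `lazy_import_wrapper_sketch` and `missing_sketch` are hypothetical names, and the actual vLLM helper may resolve and cache differently.

```python
import importlib
from typing import Any, Callable


def missing_sketch(*args: Any, **kwargs: Any) -> Any:
    """Fallback used when the target attribute cannot be resolved."""
    raise RuntimeError("backend is not available")


def lazy_import_wrapper_sketch(
    module_name: str,
    attr_name: str,
    fallback_fn: Callable[..., Any] = missing_sketch,
) -> Callable[..., Any]:
    """Return a callable that resolves module_name.attr_name on first use."""
    resolved = []  # one-element cache, filled on the first call

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        if not resolved:
            try:
                module = importlib.import_module(module_name)
                resolved.append(getattr(module, attr_name))
            except (ImportError, AttributeError):
                resolved.append(fallback_fn)
        return resolved[0](*args, **kwargs)

    return wrapper


lazy_sqrt = lazy_import_wrapper_sketch("math", "sqrt")
lazy_absent = lazy_import_wrapper_sketch(
    "no_such_pkg_xyz", "fn", lambda *args, **kwargs: None
)
```

Creating the wrapper never imports anything, so merely defining it is cheap; the cost (or the fallback) only appears when the function is actually called.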
Source code in vllm/utils/flashinfer.py
_missing ¶
Placeholder for unavailable FlashInfer backend.
Source code in vllm/utils/flashinfer.py
can_use_trtllm_attention ¶
Check if the current configuration supports TRTLLM attention.
Source code in vllm/utils/flashinfer.py
flashinfer_mm_mxfp8 ¶
flashinfer_mm_mxfp8(
a: Tensor,
b: Tensor,
block_scale_a: Tensor,
block_scale_b: Tensor,
out_dtype: dtype,
backend: str = "cutlass",
) -> Tensor
MXFP8 MM helper - mirrors flashinfer_scaled_fp4_mm API.
Takes non-transposed weights and handles transpose internally.
CRITICAL: mm_mxfp8 CUTLASS kernel requires SWIZZLED 1D scales for optimal performance and accuracy. Both input and weight scales should be in swizzled format from FlashInfer's mxfp8_quantize(is_sf_swizzled_layout=True).
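To make the layout convention concrete, the sketch below shows an unquantized reference for a matmul whose weight is passed non-transposed. It assumes "non-transposed" means the weight `b` is laid out as [N, K], so the result is `a @ b^T`; that reading, the name `mm_nt_reference`, and the use of plain floats are all assumptions. Quantization and block scales are deliberately left out.

```python
def mm_nt_reference(a, b):
    """a: [M][K] activations, b: [N][K] weights -> out: [M][N] = a @ b^T."""
    k_dim = len(b[0])
    assert all(len(row) == k_dim for row in a), "K dimension must match"
    # Each output element is the dot product of one row of a with one row of b.
    return [[sum(x * w for x, w in zip(a_row, b_row)) for b_row in b]
            for a_row in a]


a = [[1.0, 2.0, 3.0]]                   # M=1, K=3
b = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # N=2, K=3
out = mm_nt_reference(a, b)             # shape [1][2]
```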
Source code in vllm/utils/flashinfer.py
flashinfer_trtllm_batch_decode_sparse_mla_dsv4_raw ¶
flashinfer_trtllm_batch_decode_sparse_mla_dsv4_raw(
*,
query: Tensor,
swa_kv_cache: Tensor,
workspace_buffer: Tensor,
sparse_indices: Tensor,
compressed_kv_cache: Tensor,
sparse_topk_lens: Tensor,
seq_lens: Tensor,
out: Tensor,
bmm1_scale: float | Tensor = 1.0,
bmm2_scale: float | Tensor = 1.0,
sinks: Tensor | None = None,
cum_seq_lens_q: Tensor | None = None,
max_q_len: int | None = None,
enable_pdl: bool | None = None,
) -> Tensor
Unchecked DeepSeek V4 sparse MLA launcher for hot vLLM decode paths.
The caller must provide HND-compatible 3D/4D KV caches, contiguous INT32 metadata, a BF16 output tensor, and launcher-ready scale tensors. This skips FlashInfer's Python validation, which otherwise adds syncs and pointwise kernels on every attention layer.
Source code in vllm/utils/flashinfer.py
force_use_trtllm_attention ¶
force_use_trtllm_attention() -> bool | None
This function should only be called during the initialization stage, after the vLLM config is set. Returns None if --attention-config.use_trtllm_attention is not set, True if TRTLLM attention is forced on, and False if TRTLLM attention is forced off.
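Because the return type is tri-state, callers need to distinguish None from False explicitly; a plain truthiness test would merge "not set" with "forced off". The helper below is a hypothetical illustration of that distinction and is not part of vLLM.

```python
from typing import Optional


def describe_force_setting(force: Optional[bool]) -> str:
    """Map the tri-state value to a label, keeping None and False distinct."""
    if force is None:
        return "unset: choose automatically"
    return "forced on" if force else "forced off"


labels = [describe_force_setting(value) for value in (None, True, False)]
```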
Source code in vllm/utils/flashinfer.py
has_flashinfer cached ¶
has_flashinfer() -> bool
Return True if flashinfer-python package is available.
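The "cached" marker on these `has_*` helpers suggests the probe runs once per process. A minimal sketch of a cached availability check, assuming the check is just "can the package be found"; the name `has_package_sketch` is hypothetical, and the real `has_flashinfer` may test additional conditions.

```python
import functools
import importlib.util


@functools.cache
def has_package_sketch(name: str) -> bool:
    """Return True if a top-level package can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


first = has_package_sketch("json")
second = has_package_sketch("json")  # answered from the cache, no second lookup
```

Caching matters here because such checks are typically called from many code paths, and the answer does not change while the process is running.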
Source code in vllm/utils/flashinfer.py
has_flashinfer_comm cached ¶
has_flashinfer_comm() -> bool
Return True if FlashInfer comm module is available.
has_flashinfer_cubin cached ¶
has_flashinfer_cubin() -> bool
Return True if flashinfer-cubin package is available.
Source code in vllm/utils/flashinfer.py
has_flashinfer_cutedsl cached ¶
has_flashinfer_cutedsl() -> bool
Return True if FlashInfer cutedsl module is available.
has_flashinfer_cutedsl_grouped_gemm_nt_masked cached ¶
has_flashinfer_cutedsl_grouped_gemm_nt_masked() -> bool
Return True if FlashInfer cutedsl grouped_gemm_nt_masked is available.
Source code in vllm/utils/flashinfer.py
has_flashinfer_cutedsl_moe_nvfp4 cached ¶
has_flashinfer_cutedsl_moe_nvfp4() -> bool
Return True if FlashInfer cute_dsl_fused_moe_nvfp4 is available.
Source code in vllm/utils/flashinfer.py
has_flashinfer_cutlass_fused_moe cached ¶
has_flashinfer_cutlass_fused_moe() -> bool
Return True if FlashInfer CUTLASS fused MoE is available.
Source code in vllm/utils/flashinfer.py
has_flashinfer_fp8_blockscale_gemm cached ¶
has_flashinfer_fp8_blockscale_gemm() -> bool
Return True if FlashInfer block-scale FP8 GEMM is available.
Source code in vllm/utils/flashinfer.py
has_flashinfer_moe cached ¶
has_flashinfer_moe() -> bool
Return True if FlashInfer MoE module is available.
has_flashinfer_nvlink_one_sided cached ¶
has_flashinfer_nvlink_one_sided() -> bool
Return True if FlashInfer trtllm_moe_alltoall module is available.
Source code in vllm/utils/flashinfer.py
has_flashinfer_nvlink_two_sided cached ¶
has_flashinfer_nvlink_two_sided() -> bool
Return True if FlashInfer mnnvl all2all is available.
Source code in vllm/utils/flashinfer.py
has_flashinfer_trtllm_fused_moe cached ¶
has_flashinfer_trtllm_fused_moe() -> bool
Return True if FlashInfer TRTLLM fused MoE is available.
Source code in vllm/utils/flashinfer.py
has_nvidia_artifactory cached ¶
has_nvidia_artifactory() -> bool
Return True if NVIDIA's artifactory is accessible.
This checks connectivity to the kernel inference library artifactory, which is required for downloading certain cubin kernels such as TRTLLM FMHA.
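A connectivity probe of this kind can be sketched with the standard library alone. The function below is an assumption about the general shape of such a check, not the vLLM code: `is_reachable_sketch`, the injectable `opener` parameter, the stub classes, and the placeholder URL are all hypothetical. The stubs let the example run without touching the network.

```python
import urllib.error
import urllib.request


def is_reachable_sketch(url, timeout=5.0, opener=urllib.request.urlopen) -> bool:
    """Return True if the URL answers with a non-error status within timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError):  # URLError is a subclass of OSError
        return False


class _StubResponse:
    """Stand-in for an HTTP response so the example needs no network."""
    status = 200

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False


def _stub_ok(url, timeout):
    return _StubResponse()


def _stub_down(url, timeout):
    raise urllib.error.URLError("unreachable")


# "https://example.invalid/" is a placeholder, not the real artifactory URL.
reachable = is_reachable_sketch("https://example.invalid/", opener=_stub_ok)
unreachable = is_reachable_sketch("https://example.invalid/", opener=_stub_down)
```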
Source code in vllm/utils/flashinfer.py
is_flashinfer_cudnn_fp8_prefill_attn_supported cached ¶
is_flashinfer_cudnn_fp8_prefill_attn_supported() -> bool
Check if FP8 ViT attention is supported on this platform.
Requires native FP8 hardware support, the FlashInfer cuDNN backend, and cuDNN >= 9.17.1.
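A minimum-version requirement such as cuDNN >= 9.17.1 has to be compared numerically, component by component; comparing the version strings directly would rank "9.9.0" above "9.17.1". The sketch below shows the tuple comparison. The helper names are hypothetical, and how vLLM actually obtains and encodes the cuDNN version is not shown here.

```python
from typing import Tuple

MIN_CUDNN = (9, 17, 1)  # taken from the requirement stated above


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn 'major.minor.patch' into a tuple of ints for ordered comparison."""
    return tuple(int(part) for part in text.split("."))


def meets_min_cudnn(version_text: str) -> bool:
    return parse_version(version_text) >= MIN_CUDNN


string_compare_is_wrong = "9.9.0" >= "9.17.1"  # lexicographic, misleading
tuple_compare = meets_min_cudnn("9.9.0")       # numeric, per component
```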
Source code in vllm/utils/flashinfer.py
is_flashinfer_fp8_blockscale_gemm_supported cached ¶
is_flashinfer_fp8_blockscale_gemm_supported() -> bool
Return True if FlashInfer block-scale FP8 GEMM is supported.
Source code in vllm/utils/flashinfer.py
supports_trtllm_attention cached ¶
supports_trtllm_attention() -> bool
TRTLLM attention is supported if the platform is SM100, NVIDIA artifactory is accessible, and batch-invariant mode is not enabled.
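The rule above is a conjunction of three independent conditions, which can be written out as follows. The function and parameter names are hypothetical; how each condition is actually detected inside vLLM is not modelled.

```python
def supports_trtllm_sketch(
    is_sm100: bool,
    artifactory_accessible: bool,
    batch_invariant_enabled: bool,
) -> bool:
    """All three conditions from the description must hold at once."""
    return is_sm100 and artifactory_accessible and not batch_invariant_enabled


supported = supports_trtllm_sketch(True, True, False)
blocked_by_batch_invariant = supports_trtllm_sketch(True, True, True)
```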
Source code in vllm/utils/flashinfer.py
use_trtllm_attention ¶
use_trtllm_attention(
num_qo_heads: int,
num_kv_heads: int,
num_tokens: int,
max_seq_len: int,
dcp_world_size: int,
kv_cache_dtype: str,
q_dtype: dtype,
is_prefill: bool,
force_use_trtllm: bool | None = None,
has_sinks: bool = False,
has_spec: bool = False,
) -> bool
Return True if TRTLLM attention is used.
Source code in vllm/utils/flashinfer.py