vllm.model_executor.models.deepseek_v4
DeepseekV4FP8Config
Bases: Fp8Config
FP8 config for DeepSeek V4 with expert-dtype-aware MoE dispatch.
DeepSeek V4 checkpoints always use FP8 block quantization for linear/attention layers. The MoE expert weights vary by checkpoint:
- expert_dtype="fp4" (e.g. DeepSeek-V4-Flash): MXFP4 experts with ue8m0 (e8m0fnu) FP8 linear scales.
- expert_dtype="fp8" (e.g. DeepSeek-V4-Flash-Base): FP8 block experts with float32 FP8 linear scales.
The dispatch and the linear scale dtype are both keyed off expert_dtype from the model's hf_config; missing values default to "fp4" so existing FP4 checkpoints stay unchanged.
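The keying described above can be illustrated with a small standalone sketch. It does not use vLLM; the table `_SCHEMES`, the function `resolve_scheme`, and the string labels are hypothetical names chosen for illustration, and only the "fp4"/"fp8" values and the "fp4" default come from the description.

```python
from types import SimpleNamespace

# Hypothetical lookup table: the labels are descriptive strings for this
# sketch, not identifiers from vLLM.
_SCHEMES = {
    "fp4": {"experts": "mxfp4", "linear_scale_dtype": "e8m0fnu"},
    "fp8": {"experts": "fp8_block", "linear_scale_dtype": "float32"},
}


def resolve_scheme(hf_config):
    """Pick the expert scheme and linear scale dtype from expert_dtype."""
    # A missing attribute falls back to "fp4", so configs that predate
    # the field keep their FP4 behaviour.
    expert_dtype = getattr(hf_config, "expert_dtype", "fp4")
    if expert_dtype not in _SCHEMES:
        raise ValueError(f"unsupported expert_dtype: {expert_dtype!r}")
    return _SCHEMES[expert_dtype]


legacy = resolve_scheme(SimpleNamespace())  # no expert_dtype set
flash_base = resolve_scheme(SimpleNamespace(expert_dtype="fp8"))
```

Rejecting unknown values explicitly is a choice made in this sketch; the source does not say how the real config treats them.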
NOTE: expert_dtype is resolved lazily because this config is constructed during VllmConfig setup, before set_current_vllm_config is active. Reading hf_config eagerly in __init__ would always see the default "fp4" and silently misroute Flash-Base checkpoints.
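The following sketch shows why the timing matters, using a toy context in place of vLLM's real one. Everything in it is an assumption for illustration: `_current_hf_config`, `set_current_hf_config`, `EagerConfig`, and `LazyConfig` are hypothetical names and do not mirror the actual implementation.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the "current config" context. It starts
# empty, mimicking the state during config construction.
_current_hf_config = None


def set_current_hf_config(hf_config):
    """Toy equivalent of activating the current config."""
    global _current_hf_config
    _current_hf_config = hf_config


def _read_expert_dtype():
    # getattr on None returns the default, so reading before the
    # context is active always yields "fp4".
    return getattr(_current_hf_config, "expert_dtype", "fp4")


class EagerConfig:
    def __init__(self):
        # Read at construction time: too early, sees only the default.
        self.expert_dtype = _read_expert_dtype()


class LazyConfig:
    def __init__(self):
        self._expert_dtype = None  # resolved on first access

    @property
    def expert_dtype(self):
        if self._expert_dtype is None:
            self._expert_dtype = _read_expert_dtype()
        return self._expert_dtype


# Both configs are built before the context is active.
eager = EagerConfig()
lazy = LazyConfig()

# The context becomes active later, with an FP8-expert checkpoint.
set_current_hf_config(SimpleNamespace(expert_dtype="fp8"))

eager_value = eager.expert_dtype  # stale default
lazy_value = lazy.expert_dtype    # value from the active context
```

In this sketch the lazy value is cached on first access, so it is only correct if nothing reads `expert_dtype` before the context is active.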
Source code in vllm/model_executor/models/deepseek_v4.py
DeepseekV4ForCausalLM
Bases: Module, SupportsPP
Source code in vllm/model_executor/models/deepseek_v4.py