Moonshot’s trillion-parameter model uses a sparse mixture-of-experts (MoE) design that activates only 32 billion parameters per token, demonstrating how sparse MoE can deliver large model capacity at a fraction of the compute cost of a dense model of the same size. (The sparsity here comes from routing each token to a small subset of expert feed-forward networks, not from sparse attention.)
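
To make the mechanism concrete, below is a minimal sketch of top-k MoE routing in PyTorch: a gating network scores all experts per token, only the top-k experts actually run, and their outputs are combined by the normalized gate weights. The layer shape, expert count, and k are illustrative toy values, not Moonshot's actual configuration, and the loop-over-experts dispatch is a readability choice rather than a production kernel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sparse MoE layer: each token is routed to its top-k experts,
    so only a small fraction of total parameters is active per token."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, k: int):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts, bias=False)  # router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten to a list of tokens
        tokens = x.reshape(-1, x.shape[-1])
        scores = self.gate(tokens)                    # (tokens, experts) router logits
        weights, idx = scores.topk(self.k, dim=-1)    # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)          # normalize over the chosen experts
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                         # which tokens selected expert e
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue                              # expert idle for this batch: no compute
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(tokens[token_ids])
        return out.reshape(x.shape)

# Toy configuration: 64 experts, 2 active per token, so roughly 1/32
# of the expert parameters (and FLOPs) are used for any given token.
moe = TopKMoE(d_model=128, d_ff=512, num_experts=64, k=2)
y = moe(torch.randn(2, 16, 128))
```

The same ratio logic explains the headline numbers: total parameter count grows with the number of experts, while per-token compute tracks only the k experts that actually fire, which is how a trillion-parameter model can run at the cost of a ~32B dense model.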