DeepSeek, a Chinese AI start-up founded in 2023, has quickly made waves in the industry. With fewer than 200 employees and backed by the quant fund High-Flyer ($8 billion in assets under management), the company released its open-source model, DeepSeek R1, one day before the announcement of OpenAI’s $500 billion Stargate project.
What sets DeepSeek apart is the prospect of radical cost efficiency. The company claims to have trained its model for just $6 million using 2,000 Nvidia H800 graphics processing units (GPUs), versus the $80 million to $100 million cost of GPT-4 and the 16,000 H100 GPUs required for Meta’s LLaMA 3. While the comparisons are far from apples to apples, the possibilities are worth understanding.
DeepSeek’s rapid adoption underscores its potential impact. Within days, it became the top free app in US app stores, spawned more than 700 open-source derivatives (and growing), and was made available on Microsoft, AWS, and Nvidia AI platforms.
DeepSeek’s performance appears to be based on a series of engineering innovations that significantly reduce inference costs while also lowering training costs. Its mixture-of-experts (MoE) architecture activates only 37 billion of its 671 billion parameters to process each token, reducing computational overhead without sacrificing performance. The company has also optimized distillation techniques, allowing reasoning capabilities from larger models to be transferred to smaller ones. By using reinforcement learning, DeepSeek enhances performance without requiring extensive supervised fine-tuning. Additionally, its multi-head latent attention (MLA) mechanism reduces memory usage to 5% to 13% of that of previous methods.
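To make the mixture-of-experts idea concrete, the sketch below shows top-k expert routing in miniature: a router scores every expert for each token, but only a handful are actually run. The shapes, expert counts, and function names are illustrative stand-ins, not DeepSeek’s implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 64        # token embedding size (toy value)
num_experts = 16    # total experts available
top_k = 2           # experts actually activated per token

# Each expert is a small linear layer; only top_k of them run for any given token.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(num_experts)]
router = rng.standard_normal((d_model, num_experts)) * 0.02

def moe_forward(token: np.ndarray) -> np.ndarray:
    """Route one token through its top_k highest-scoring experts and mix the outputs."""
    logits = token @ router                  # score every expert
    chosen = np.argsort(logits)[-top_k:]     # indices of the k best-scoring experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                 # softmax over the selected experts only
    # Only the chosen expert matrices are multiplied; the rest contribute no compute.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, chosen))

token = rng.standard_normal(d_model)
print(moe_forward(token).shape)  # (64,) -- same shape as the input embedding
```

Here only 2 of the 16 expert weight matrices participate in each forward pass; the same mechanism, at vastly larger scale, is what lets DeepSeek touch only 37 billion of its 671 billion parameters per token.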
Beyond model architecture, DeepSeek has improved how it handles data. Its low-precision computation method, built on FP8 mixed precision, cuts computational costs. An optimized reward function ensures that compute is allocated to high-value training data, avoiding wasted resources on redundant information. The company has also incorporated sparsity techniques, allowing the model to predict which parameters are necessary for specific inputs, improving both speed and efficiency.

DeepSeek’s hardware and system-level optimizations further enhance performance. The company has developed memory compression and load-balancing techniques to maximize efficiency. One notable optimization was using PTX programming instead of higher-level CUDA code, giving DeepSeek engineers better control over GPU instruction execution […]
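As a rough illustration of the FP8 mixed-precision idea described above, the sketch below quantizes tensors to 8 bits for the expensive matrix multiply and accumulates the result in full precision. It uses int8 with a per-tensor scale as a stand-in, since NumPy has no native FP8 type, and the function names are hypothetical; DeepSeek’s actual FP8 pipeline is considerably more sophisticated.

```python
import numpy as np

def quantize_8bit(x: np.ndarray):
    """Map a float32 tensor to int8 plus a per-tensor scale factor."""
    scale = float(np.abs(x).max()) / 127.0
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def mixed_precision_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Multiply in 8-bit precision, accumulate and rescale in higher precision."""
    qa, sa = quantize_8bit(a)
    qb, sb = quantize_8bit(b)
    acc = qa.astype(np.int32) @ qb.astype(np.int32)  # wide accumulator avoids overflow
    return acc.astype(np.float32) * (sa * sb)        # scales restore the original range

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float32)
b = rng.standard_normal((8, 3)).astype(np.float32)

# The low-precision result tracks the full-precision one with a small quantization error.
print(np.max(np.abs(mixed_precision_matmul(a, b) - a @ b)))
```

Storing and multiplying in 8 bits roughly halves memory traffic relative to FP16 and quarters it relative to FP32, which is where much of the cost saving comes from.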
