Why was DeepSeek so cheap to make?

My summary: raw data rather than labeled or human-aligned data; curated data collections given privilege (don’t boil the ocean); train against other people’s models; “mini-models” trained by the big model are cheaper to train and cheaper to run at inference. They refine/scaffold off other models (though not specifically Llama).

Here’s a good post by a human that says roughly what you see below; the rest of this post was written by a few iterations of questions put to DeepSeek.

  • Reinforcement Learning Focus: DeepSeek-R1-Zero uses pure reinforcement learning (RL) without supervised fine-tuning (SFT), reducing the need for costly labeled data and computation.
  • Efficient Training Pipeline: DeepSeek-R1 employs a multi-stage process, including fine-tuning with high-quality reasoning data and RL stages for reasoning, safety, and usefulness, minimizing wasted computation.
  • Knowledge Transfer via Distillation: DeepSeek distills knowledge from its larger R1 model into smaller models (built on Qwen and Llama bases), reducing reliance on expensive large models for deployment.
  • High-Quality Data Curation: Focus on curated, reasoning-focused data allows DeepSeek to achieve strong performance without massive datasets, lowering data collection and processing costs.
  • Flexible Base Models: DeepSeek can use various base models for distillation, offering cost savings if cheaper alternatives are available.
  • Cost Savings in Data: Reduced reliance on large datasets decreases data acquisition and processing expenses.
  • Cost Savings in Computation: Efficient training pipelines and RL reduce computational requirements.
  • Cost Savings in Deployment: Smaller distilled models are cheaper to deploy and maintain.
  • No Dependency on Llama for Core Training: While Llama models were used in distillation, DeepSeek’s core training relies on RL and curated data, not specific base models like Llama.
  • Reduced Need for Labeled Data: DeepSeek’s RL approach eliminates the need for extensive labeled datasets, making it more cost-efficient compared to models like Llama that rely on supervised fine-tuning.

More detail

DeepSeek’s lower training costs compared to models like Llama or OpenAI’s GPT series come down to several smart strategies:

  1. Reinforcement Learning Focus: DeepSeek-R1-Zero uses reinforcement learning directly, skipping the costly step of supervised fine-tuning with labeled data. This saves on data annotation and computation (see the reward sketch after this list).
  2. Efficient Training Pipeline: DeepSeek-R1 uses a multi-stage process, including fine-tuning with high-quality reasoning data, reinforcement learning for reasoning, and a second reinforcement learning stage for safety and usefulness. This targeted approach reduces wasted computation.
  3. Knowledge Transfer: DeepSeek uses distillation to transfer knowledge from its larger models to smaller ones. Models like Llama served only as student bases in this step, not as requirements for the core training, allowing flexibility and cost savings.
  4. High-Quality Data: By focusing on curated, reasoning-focused data, DeepSeek achieves strong performance without needing massive datasets, cutting data collection and processing costs.
  5. Flexible Base Models: DeepSeek can use various base models for distillation, potentially reducing costs if cheaper options are available.
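
A concrete way to see why the RL focus cuts labeling costs: the reward can come from a simple programmatic checker rather than from human annotations. Here is a minimal sketch in Python of a rule-based reward for math-style prompts; it is my own illustration (the \boxed{...} convention and the score values are assumptions, not DeepSeek’s actual code):

```python
import re

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Score a completion with no human-labeled preference data.

    Hypothetical scheme: 1.0 for a verifiably correct final answer,
    0.1 for merely following the expected format, 0.0 otherwise.
    """
    # Look for a final answer wrapped in \boxed{...}, a common math convention.
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0  # no parsable answer at all
    if match.group(1).strip() == reference_answer.strip():
        return 1.0  # correct answer, checked mechanically
    return 0.1      # right format, wrong answer

print(rule_based_reward(r"... so the result is \boxed{42}", "42"))  # 1.0
```

Because the checker is just code, millions of training episodes can be scored essentially for free once the question bank exists.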

Savings:

  • Data Costs: Less reliance on massive datasets lowers data acquisition and processing expenses.
  • Computation Costs: The efficient pipeline and reinforcement learning reduce computational needs.
  • Deployment Costs: Smaller distilled models are cheaper to deploy and maintain.

Based on the sources, DeepSeek could have been trained without using Llama as a base model, although the Llama models were used in the distillation process. Here’s why:

  • DeepSeek-R1-Zero’s Pure RL Approach: DeepSeek-R1-Zero is trained using pure reinforcement learning (RL) directly on a base model, without supervised fine-tuning (SFT) as a preliminary step. This demonstrates that the reasoning capabilities of LLMs can be incentivized purely through RL, without needing SFT or any specific base model like Llama.
  • DeepSeek-R1’s Training Pipeline: DeepSeek-R1 is trained through a multi-stage pipeline that includes:
    • Cold Start: Fine-tuning a base model with high-quality Chain-of-Thought (CoT) data.
    • Reasoning-Oriented RL: Enhancing the model’s reasoning abilities.
    • Rejection Sampling and SFT: Creating new SFT data through rejection sampling and fine-tuning using this data along with non-reasoning datasets (sketched in code after this list).
    • Second RL Stage: A second RL process aimed at improving helpfulness and harmlessness.
    • Distillation: This step uses the outputs from DeepSeek-R1 to fine-tune smaller models, including models based on the Llama architecture.
  • Distillation as a Separate Step: While DeepSeek-R1 used Qwen and Llama models as base models for distillation, the core training of both DeepSeek-R1-Zero and DeepSeek-R1 did not fundamentally rely on the Llama architecture. The key methods for improving reasoning capabilities were the RL process and the curated training data.
  • Emphasis on Reasoning Patterns: The process of using DeepSeek-R1 to distill smaller models is important, as it transfers the reasoning patterns discovered by the larger DeepSeek models to smaller models. This does not imply that the Llama models are critical for the initial training of DeepSeek-R1, only for the smaller models generated in the distillation process.
  • Base Models for Distillation: The models used as a base for distillation included Qwen and Llama; however, other base models could serve equally well. This demonstrates that the key findings from DeepSeek’s training can be implemented in other architectures.
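
The rejection-sampling step from the pipeline above is conceptually simple: sample several completions per prompt, keep only those a checker accepts, and reuse the survivors as SFT data. A minimal sketch, where `generate` and `is_acceptable` are hypothetical callables standing in for the model and the verifier:

```python
from typing import Callable

def rejection_sample(
    prompts: list[str],
    generate: Callable[[str, int], list[str]],   # hypothetical: n samples per prompt
    is_acceptable: Callable[[str, str], bool],   # hypothetical: checker/verifier
    n_samples: int = 8,
) -> list[tuple[str, str]]:
    """Build SFT pairs by keeping only completions that pass the checker."""
    sft_pairs = []
    for prompt in prompts:
        for completion in generate(prompt, n_samples):
            if is_acceptable(prompt, completion):
                sft_pairs.append((prompt, completion))
                break  # keep the first acceptable sample for this prompt
    return sft_pairs
```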

In summary, the DeepSeek models were designed to be trained through innovative reinforcement learning strategies and a multi-stage approach. While Llama models were used for creating smaller distilled models, they were not a requirement for the core training and methodology of the DeepSeek-R1 and DeepSeek-R1-Zero models. DeepSeek’s core improvements are the RL strategies and the curated datasets, not the Llama base models.
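
The distillation step itself is also plain in outline: the large model generates reasoning traces, and a small model is fine-tuned on them with an ordinary next-token loss. A minimal sketch using Hugging Face transformers; the checkpoint name and the one-line dataset are placeholders, not DeepSeek’s actual pipeline:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "small-base-model"  # placeholder for a Qwen/Llama-class student

tok = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

# Assume `traces` holds text the large teacher model already generated offline.
traces = ["Q: What is 2+2? Let's think step by step... A: 4"]  # stand-in data

for text in traces:
    batch = tok(text, return_tensors="pt")
    out = student(**batch, labels=batch["input_ids"])  # standard causal-LM loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The design point: the student never needs the teacher’s weights or an RL loop, only its text outputs, which is why distilled models are so cheap to produce.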

Yes, Llama (Large Language Model Meta AI) uses labeled data during its training process, particularly in the supervised fine-tuning (SFT) phase. Here’s how labeled data is involved:

  1. Pretraining: Llama is first pretrained on a large corpus of unlabeled text data (e.g., books, websites, etc.) using self-supervised learning. This phase doesn’t require labeled data.
  2. Supervised Fine-Tuning (SFT): After pretraining, Llama (or similar models) often undergoes fine-tuning on labeled datasets. These datasets include examples of inputs paired with correct outputs (e.g., question-answer pairs, instructions, or annotated text). This step helps the model align better with specific tasks or improve its performance on reasoning, safety, or helpfulness (a minimal sketch follows this list).
  3. Reinforcement Learning (Optional): Some versions of Llama or similar models may also use reinforcement learning with human feedback (RLHF), which involves labeled data in the form of human preferences or rankings to further refine the model.
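
To make the SFT contrast concrete: supervised fine-tuning trains on labeled (input, output) pairs, and the loss is typically computed only on the output tokens. A minimal sketch (illustrative, not Meta’s training code; the checkpoint name is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "some-base-llm"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt, answer = "Translate to French: cheese", " fromage"  # one labeled pair
ids = tok(prompt + answer, return_tensors="pt")["input_ids"]

# Mask the prompt tokens with -100 (the ignore index for the built-in loss)
# so only the human-labeled answer tokens are penalized.
labels = ids.clone()
n_prompt = tok(prompt, return_tensors="pt")["input_ids"].shape[1]
labels[:, :n_prompt] = -100

loss = model(input_ids=ids, labels=labels).loss  # standard SFT objective
loss.backward()
```

Every such pair had to be written or checked by a person, which is exactly the cost DeepSeek-R1-Zero avoids.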

Why Labeled Data is Used:

  • Labeled data helps the model learn specific behaviors, such as following instructions, answering questions, or avoiding harmful outputs.
  • It is critical for improving the model’s alignment with human expectations and task-specific performance.

DeepSeek’s Difference:

DeepSeek-R1-Zero, for example, skips the supervised fine-tuning step and uses pure reinforcement learning directly, which reduces the need for labeled data. This is one reason DeepSeek’s training process can be more cost-efficient compared to models like Llama that rely heavily on labeled datasets for fine-tuning.
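
For the curious, DeepSeek’s reports describe the RL algorithm as GRPO (Group Relative Policy Optimization), which scores a group of sampled answers against each other instead of training a separate value model. A minimal sketch of the group-relative advantage computation, my own illustration:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantages: normalize each reward within its own group.

    `rewards` has shape (groups, samples_per_group); each row holds the
    rule-based scores for several completions of the same prompt.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)  # beats its siblings => positive

# Four completions of one prompt, scored by a rule-based checker:
print(group_relative_advantages(torch.tensor([[1.0, 0.0, 0.1, 0.0]])))
```

No human labels and no learned critic appear anywhere in the loop, which is where much of the cost saving comes from.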

Labeled data for training large language models like Llama typically comes from a variety of sources, depending on the task and the goals of the model. Here are the main sources of labeled data:

1. Human-Created Datasets

  • Crowdsourcing: Companies often hire workers (e.g., through platforms like Amazon Mechanical Turk) to create labeled datasets. For example, workers might write question-answer pairs, annotate text, or rank model outputs.
  • Expert Annotations: For specialized tasks (e.g., medical or legal domains), experts may be hired to create high-quality labeled data.
  • Open Datasets: Some labeled datasets are publicly available, such as those from academic research or open-source projects (e.g., SQuAD for question answering or CoNLL for named entity recognition).
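
As a small aside, the open datasets above are trivially accessible; for instance, pulling SQuAD’s labeled question-answer pairs with the Hugging Face datasets library:

```python
from datasets import load_dataset

squad = load_dataset("squad", split="train")  # ~88k human-labeled QA examples
example = squad[0]
print(example["question"])  # the input
print(example["answers"])   # the human-annotated answer spans
```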

2. Instruction-Tuning Datasets

  • These are datasets designed to teach models how to follow instructions. Examples include:
    • OpenAI’s InstructGPT Dataset: Contains prompts and human-written responses to train models to follow instructions.
    • Self-Instruct: A method where models generate their own instruction-response pairs, which are then pruned with automatic filters and sometimes refined by humans.
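
A rough sketch of one Self-Instruct round, heavily simplified from the paper; `model_generate` is a placeholder for any text-generation call, and the filters are toy heuristics:

```python
def self_instruct_round(model_generate, seed_tasks, n_new=100):
    """Hypothetical Self-Instruct iteration: the model proposes new
    instructions from seed examples, then cheap filters prune them."""
    few_shot = "\n".join(f"Instruction: {t}" for t in seed_tasks[:8])
    candidates = []
    for _ in range(n_new):
        text = model_generate(few_shot + "\nInstruction:")
        instruction = text.split("\n")[0].strip()
        # Toy filters: drop empty, duplicate, or very short proposals
        # before any human ever looks at them.
        if instruction and instruction not in seed_tasks and len(instruction) > 10:
            response = model_generate(f"Instruction: {instruction}\nResponse:")
            candidates.append({"instruction": instruction, "response": response})
    return candidates
```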

3. Human Feedback for Reinforcement Learning (RLHF)

  • In reinforcement learning with human feedback (RLHF), labeled data comes from humans ranking or scoring model outputs. For example:
    • Humans compare multiple model responses to the same prompt and rank them by quality.
    • This feedback is used to train a reward model, which guides the fine-tuning process.
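
The ranking data trains the reward model through a pairwise objective; the standard Bradley-Terry loss simply pushes the preferred response’s scalar score above the rejected one. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen: torch.Tensor,
                      score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) loss over reward-model scores for two
    responses to the same prompt, where humans preferred the first."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# A human ranked response A above response B:
print(reward_model_loss(torch.tensor([1.3]), torch.tensor([0.2])))
```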

4. Synthetic Data

  • Sometimes, labeled data is generated synthetically:
    • Model-Generated Data: A model creates examples, which are then verified or corrected by humans.
    • Data Augmentation: Existing labeled data is modified or expanded to create new examples.
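
A toy example of the augmentation idea: one labeled pair can be expanded into several by rephrasing the input with fixed templates (illustrative only; real pipelines use paraphrase models or back-translation):

```python
def augment_qa_pair(question: str, answer: str) -> list[tuple[str, str]]:
    """Expand one labeled QA pair into several via question templates."""
    templates = [
        "{q}",
        "Please answer: {q}",
        "Question: {q} Give a short answer.",
    ]
    return [(t.format(q=question), answer) for t in templates]

print(augment_qa_pair("What is the capital of France?", "Paris"))
```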

5. Task-Specific Datasets

  • For specific tasks (e.g., summarization, translation, or coding), labeled data is often collected from domain-specific sources:
    • Summarization: News articles paired with summaries.
    • Translation: Parallel text in multiple languages.
    • Coding: Code snippets paired with explanations or corrections.

6. Proprietary or Internal Data

  • Companies like OpenAI, Meta, or Google may use proprietary datasets collected from their own products or services (e.g., search queries, customer support interactions, or internal tools).

Challenges with Labeled Data:

  • Cost: Creating high-quality labeled data is expensive, as it often requires human effort.
  • Bias: Human-created data can introduce biases, which may affect the model’s behavior.
  • Scalability: Labeling enough data for large-scale training can be time-consuming and resource-intensive.

DeepSeek’s Approach:

DeepSeek reduces reliance on labeled data by applying reinforcement learning directly; in the R1-Zero recipe there is no supervised fine-tuning at all. This allows it to achieve strong performance with far less labeled data, making the training process more cost-efficient compared to models like Llama that depend heavily on labeled datasets.
