The current situation (Feb 2025)

  • tech companies have been competing to build the biggest, most performant model by just making everything bigger (more data, bigger models) and training them until they get better
    • the idea is that if you have 1000s of GPUs & billions of dollars, you're at an advantage because you have the power to train the largest models
  • most companies (like OpenAI) keep everything secret
    • they just expose them through web interfaces
    • they don't tell you how they trained them, the parameters, the dataset, etc.
    • but Meta has a much more open policy
  • the models are out of reach for most people - they can run/refine them, but cannot train them from scratch (not enough resources/money)
  • Suddenly, a small company in China, DeepSeek, released a model
    • you can train it with much more limited hardware and money
    • you can train the most recent variants of these models MUCH more efficiently in terms of the amount of data you need to collect (which was a pain)
    • gamechanger!

DeepSeek R1

  • V3

    • their flagship model (another large transformer)
    • lots of extra performance benefits
  • MoE

    • diagram
      • different bits of the network focus on different tasks
      • the early stages will route each input to different parts of the network and only activate a small part of it (let's say ~30B params, which is a HUGE saving)
    • MUUUUCH more efficient (see the routing sketch after this list)
    • the pre-existing approach: 1 chatbot to rule all fields (a master-of-none kind of situation)
      • training takes lots of energy (expensive)
      • inference is also super expensive: if you have a model with hundreds of billions of params, you have to store them all somewhere, and when you run an input through, ALL of them get activated, so you have to work the math through the WHOLE MODEL to get one prediction
      • and since the model predicts the next word, feeds it back in, predicts the next word again, and so on…it's too expensive
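
A minimal sketch of the routing idea in PyTorch (the expert count, top_k=2, and all sizes here are toy numbers for illustration, not DeepSeek's actual configuration):

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Toy mixture-of-experts layer: a small router scores the experts
    for each token, and only the top-k of them actually run."""

    def __init__(self, dim=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)  # per-token expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):  # x: (n_tokens, dim)
        weights, picked = self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Only the k chosen experts do any math for a given token;
        # every other expert's parameters stay idle. That is the saving.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = picked[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

layer = ToyMoELayer()
print(layer(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```
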
  • Distillation

    • take your huge model (e.g. 670B params), ask it a bunch of questions in a certain field, and use its answers to train a smaller model (e.g. 8B) to do the exact same thing (it often works; a toy sketch follows this list)
    • you still get decent performance
    • why use 670B params and all those GPUs when you can get an 8B model and run it on your 4090?
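
A toy sketch of the distillation loop in PyTorch: the "teacher" and "student" below are tiny stand-in networks rather than real 670B/8B language models, but the mechanic is the same: the student is trained to match the teacher's outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-ins for a big teacher and a small student.
teacher = nn.Sequential(nn.Linear(32, 512), nn.ReLU(), nn.Linear(512, 10))
student = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(200):
    x = torch.randn(64, 32)  # the "questions" we pose to the teacher
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(x), dim=-1)  # the teacher's "answers"
    # Train the student to reproduce the teacher's output distribution.
    loss = F.kl_div(F.log_softmax(student(x), dim=-1),
                    teacher_probs, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
```
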
  • R1

    • latest model
    • chain of thought
      • useful for problem solving
      • adds computational cost during inference but performance goes up
      • OpenAI pioneered this, but they don't tell you how to do it
      • R1's approach is fully public (OpenAI's methods are closed)
    • R1 was trained with much more limited data
      • the standard approach: create a dataset of (question, correct CoT, correct answer) triples, which is painful to collect
    • R1 was actually trained with ONLY THE ANSWERS (see the data sketch after this list)
      • requires much less data
      • (reinforcement learning) give it lots of math questions, reward it if it's right, and reward it if it wrote some kind of internal monologue in the correct format (see the reward sketch after this list)
      • over time it trains, and the monologue gets better and better
      • in the end, it effectively has a chat with itself and solves the problem
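
To make the data contrast concrete, a sketch of the two training setups (the field names and the example are illustrative):

```python
# Supervised CoT fine-tuning needs full worked solutions,
# which are expensive to collect:
sft_example = {
    "question": "What is 17 * 24?",
    "chain_of_thought": "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408",
    "answer": "408",
}

# The RL setup only needs checkable final answers; the model has to
# discover its own chain of thought:
rl_example = {
    "question": "What is 17 * 24?",
    "answer": "408",
}
```
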
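And a hedged sketch of the rule-based reward idea described above: one reward for a correct final answer, one for putting the internal monologue in the expected format. The <think>/<answer> tags and the reward values are illustrative assumptions, not DeepSeek's exact recipe.

```python
import re

def reward(completion: str, correct_answer: str) -> float:
    """Score one model completion: format reward + accuracy reward."""
    score = 0.0
    # Format reward: did the model write its monologue inside the tags?
    if re.fullmatch(r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*",
                    completion, flags=re.DOTALL):
        score += 0.5
    # Accuracy reward: did the final answer match?
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match and match.group(1).strip() == correct_answer:
        score += 1.0
    return score

print(reward("<think>17*24 = 340 + 68 = 408</think><answer>408</answer>", "408"))
# 1.5 -> right format and right answer
print(reward("The answer is 408.", "408"))
# 0.0 -> no reward without the expected structure
```

Over many RL steps, completions that score higher get reinforced, which is how the monologue improves without any human-written chains of thought.
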
  • Fireship