When Ilya Sutskever told John Carmack "If you really learn all of these, you'll know 90% of what matters today" — he wasn't talking about trends, tools, or frameworks.
He was pointing at the deep structure of intelligence itself.
This list is not random. It is a map — from physics and information theory, through neural architectures, to reasoning, memory, retrieval, and alignment. If you zoom out, something striking appears: modern LLMs are not a single invention. They are a convergence of ideas that matured over 20+ years.
This blog is a guided tour of that convergence. Not a paper-by-paper dump, but a thematic synthesis explaining why each cluster matters and how it shows up in today's models like GPT-4, Claude, Gemini, DeepSeek, and beyond.
1. Complexity, Information, and Why Intelligence Isn't Just Entropy
Key readings: Scott Aaronson – The First Law of Complexodynamics, Aaronson & Carroll – Coffee Automaton, Grünwald – Minimum Description Length, Shen & Uspensky – Kolmogorov Complexity.
These papers answer a deceptively simple question: what does it mean for a system to be "interesting"?
In a closed system, entropy only increases. But complexity, the interesting structure in between, rises, peaks, and then decays, like cream swirling into coffee.
This insight quietly underpins why overfitting is bad, why compression equals understanding, and why models that memorize fail to generalize.
MDL and Kolmogorov Complexity explain why learning is compression and why simpler internal representations generalize better.
This is the philosophical backbone behind regularization, information bottlenecks, latent variable models, and modern scaling laws.
If you don't understand this cluster, deep learning looks like magic. If you do, it looks inevitable.
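One way to feel this cluster in your hands: Kolmogorov complexity is uncomputable, but any off-the-shelf compressor gives a crude upper bound on it. A minimal sketch, using zlib as the stand-in compressor, shows why structured data is "simpler" than noise of the same length:

```python
import os
import zlib

# Assumption: zlib's compressed size is a rough, computable stand-in for
# Kolmogorov complexity, which is uncomputable in general.
structured = b"abcabcabc" * 100   # 900 bytes of repeating pattern
noise = os.urandom(900)           # 900 bytes of incompressible randomness

len_structured = len(zlib.compress(structured))
len_noise = len(zlib.compress(noise))

# The pattern has a short description; the noise does not.
print(len_structured < len_noise)  # → True
```

A model that finds the short description of its training data has, in the MDL sense, understood it; a model that stores the raw bytes has merely memorized it.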
2. Sequence, Memory, and the Birth of Neural Time
Key readings: Karpathy – The Unreasonable Effectiveness of RNNs, Olah – Understanding LSTMs, Zaremba & Sutskever – RNN Regularization, Graves – Neural Turing Machines.
Before Transformers, time itself was the problem.
These works explain how models learned to remember, forget, store symbols, and execute algorithms. LSTMs tamed vanishing gradients with gated memory.
Neural Turing Machines introduced external memory — a precursor to tools, scratchpads, and chain-of-thought.
Today's LLM behaviors like step-by-step reasoning, tool calling, and long-term planning are descendants of these ideas.
Transformers didn't replace memory — they re-implemented it differently.
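The gating machinery Olah describes fits in a few lines. A minimal sketch of one LSTM step in NumPy (the weight layout, where one matrix produces all four gate pre-activations, is a common convention, not the only one):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [x; h_prev] to the four gate pre-activations."""
    z = W @ np.concatenate([x, h_prev]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # forget, input, output gates
    g = np.tanh(g)                                # candidate cell update
    c = f * c_prev + i * g                        # gated memory: keep + write
    h = o * np.tanh(c)                            # exposed hidden state
    return h, c

# Tiny usage: input size 3, hidden size 4
rng = np.random.default_rng(0)
d_in, d_h = 3, 4
W = rng.normal(size=(4 * d_h, d_in + d_h)) * 0.1
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for _ in range(5):
    h, c = lstm_step(rng.normal(size=d_in), h, c, W, b)
print(h.shape)  # → (4,)
```

The cell state `c` is the additive highway: because it is updated by elementwise gating rather than repeated matrix multiplication, gradients can survive many timesteps.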
3. Attention, Order, and the End of Recurrence
Key readings: Bahdanau et al. – Attention & Alignment, Vaswani et al. – Attention Is All You Need, Vinyals et al. – Order Matters, Pointer Networks.
Attention solved the information routing problem.
Instead of compressing everything into a fixed vector, models learned where to look, what to copy, and what to ignore.
Pointer Networks are the missing link between attention, symbolic reasoning, and algorithmic behavior.
Transformers didn't win because they were faster. They won because they shortened the path length between any two tokens in a sequence.
That single insight unlocked parallelism, scale, and global context. Every LLM today is built on this spine.
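The core operation from Vaswani et al. is compact enough to write out. A minimal sketch of single-head scaled dot-product attention in NumPy (no masking or multi-head projections, which a real Transformer adds on top):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # every query vs. every key
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted mixture of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # → (5, 8)
```

Note what the routing buys you: every output position reads from every input position in a single step, which is exactly the shortened path length that recurrence could not offer.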
4. Vision, Depth, and Why Skip Connections Changed Everything
Key readings: Krizhevsky et al. – ImageNet, He et al. – ResNets, Yu & Koltun – Dilated Convolutions, Stanford CS231n.
Vision forced neural networks to confront depth.
ResNets proved a crucial idea: learning is easier when the default behavior is identity.
ResNets learn a residual F(x) = H(x) - x, so each layer computes output = F(x) + x.
This principle now appears everywhere: residual streams in Transformers, pre-norm architectures, and gradient stability at scale.
Dilated convolutions foreshadowed long-context reasoning — seeing wide structure without losing resolution.
Vision research didn't just help images. It taught networks how to scale safely.
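The "default behavior is identity" idea is easy to verify directly. A minimal sketch of a residual block, with F as a tiny two-layer MLP and illustrative zeroed weights to show the identity default:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, W1, W2):
    """Residual block: output = F(x) + x, where F is a small two-layer MLP.
    If W1 and W2 are near zero, the block defaults to the identity."""
    F = W2 @ relu(W1 @ x)
    return F + x  # skip connection: gradients flow through '+ x' untouched

d = 8
x = np.ones(d)
W1 = np.zeros((d, d))  # zeroed weights -> F(x) == 0 -> identity mapping
W2 = np.zeros((d, d))
print(np.allclose(residual_block(x, W1, W2), x))  # → True
```

A plain deep stack must learn to pass information through; a residual stack passes it through by construction and only learns the correction.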
5. Relational Reasoning and Structured Thought
Key readings: Santoro et al. – Relation Networks, Gilmer et al. – Message Passing Neural Networks, Relational Recurrent Neural Networks.
Intelligence isn't just pattern matching. It's about relations.
These works explain how models reason over graphs, understand object interactions, and perform symbolic-like inference.
Modern applications include molecular modeling, code analysis, reasoning over documents, and tool graphs.
This lineage leads directly to graph RAG, program synthesis, and agent planning.
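The Gilmer et al. framing reduces to a simple loop: aggregate messages from neighbors, then update each node. A minimal sketch of one round on a toy graph (the sum aggregation and shared transform are one common choice among many):

```python
import numpy as np

def message_passing_step(H, A, W):
    """One round of message passing: each node sums its neighbors'
    features (via adjacency matrix A), then applies a shared transform W."""
    messages = A @ H                 # sum features of each node's neighbors
    return np.tanh((H + messages) @ W)

# Tiny 3-node path graph: edges 0-1 and 1-2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))          # node features
W = rng.normal(size=(4, 4)) * 0.1
H = message_passing_step(H, A, W)
print(H.shape)  # → (3, 4)
```

Stacking k rounds lets information travel k hops, which is how relational structure, rather than raw feature similarity, shapes the final representation.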
6. Scaling Laws, Compression, and Why Bigger Worked (Until It Didn't)
Key readings: Kaplan et al. – Scaling Laws, Hinton – MDL & Weight Noise, Variational Lossy Autoencoders.
Scaling laws explain why brute force worked. But MDL explains why brute force eventually hits diminishing returns.
Together they say: scale buys capability, structure buys efficiency.
This tension is exactly what we see today: MoE models, sparse attention, FP8 training, and DeepSeek-style efficiency.
The future is not "bigger forever". It is better compression of intelligence.
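You can see the diminishing returns directly in the Kaplan et al. functional form. A minimal sketch of the parameter-count power law; the constants below are the illustrative values reported in that paper for one setup, and the real numbers depend on data, architecture, and what is held fixed:

```python
def loss_from_params(N, N_c=8.8e13, alpha=0.076):
    """Kaplan-style power law L(N) = (N_c / N)^alpha.
    N_c and alpha are illustrative constants from one reported fit."""
    return (N_c / N) ** alpha

# Each 10x in parameters buys a smaller and smaller loss improvement.
for N in [1e8, 1e9, 1e10, 1e11]:
    print(f"{N:.0e} params -> predicted loss {loss_from_params(N):.3f}")
```

The curve never flattens to zero, but each decade of scale buys less: exactly the gap that structural efficiency (MoE, sparsity, low-precision training) is trying to close.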
7. Retrieval, Memory, and Truth
Key readings: DPR, RAG, Lost in the Middle, HyDE.
LLMs don't know things. They reconstruct them.
Retrieval mitigates hallucination, keeps knowledge fresh, and grounds answers in verifiable sources.
But "Lost in the Middle" reveals a hard truth: long context does not equal usable context.
This is why retrieval, chunking, and relevance modeling matter more than raw context windows.
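At its core, DPR-style dense retrieval is nearest-neighbor search in embedding space. A minimal sketch with random vectors standing in for real document embeddings (a production system would use a trained encoder and an ANN index):

```python
import numpy as np

def top_k(query, docs, k=2):
    """Rank documents by cosine similarity to the query embedding."""
    q = query / np.linalg.norm(query)
    D = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = D @ q
    return np.argsort(-scores)[:k]   # indices of the k best matches

rng = np.random.default_rng(0)
docs = rng.normal(size=(10, 16))               # stand-in document embeddings
query = docs[3] + 0.01 * rng.normal(size=16)   # near-duplicate of doc 3
print(top_k(query, docs))                      # doc 3 ranks first
```

The hard problems "Lost in the Middle" exposes live downstream of this step: even perfect top-k retrieval fails if the relevant chunk lands where the model attends poorly.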
8. Alignment, Distillation, and the Road to Superintelligence
Key readings: Zephyr dDPO, Legg – Machine Superintelligence, Fact-checking with LLMs.
Raw intelligence is useless without alignment.
Zephyr shows that preference distillation scales, and that human preference labels are not strictly required when AI feedback stands in for them.
Legg's thesis reminds us: intelligence is goal achievement across environments.
The hard part isn't making models smarter. It's making them want the right things.
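Zephyr's dDPO recipe builds on the DPO objective, which turns a preference pair into a simple classification-style loss. A minimal sketch for a single pair, assuming the sequence log-probabilities are already computed as scalars and using an illustrative `beta`:

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO objective: push the policy's log-ratio for the chosen response
    above the rejected one, relative to a frozen reference model."""
    margin = beta * ((logp_chosen - ref_chosen)
                     - (logp_rejected - ref_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)

# Preferring the chosen answer more than the reference does -> lower loss
# than the inverted pair, where the policy prefers the rejected answer.
print(dpo_loss(-5.0, -9.0, -6.0, -6.0) < dpo_loss(-9.0, -5.0, -6.0, -6.0))  # → True
```

No reward model and no RL loop: the preference data itself supplies the gradient, which is what makes distilling preferences from a stronger model so cheap.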
The Big Picture — Why Ilya's List Is So Powerful
This reading list forms a complete loop: Information → Compression → Learning → Memory → Reasoning → Action → Alignment.
Every major AI system today is a composition of these ideas. Not all at once. Not perfectly. But recognizably.
Final Takeaway
If you truly absorb this list, you stop asking: "What model should I use?"
And start asking: "What inductive bias does this problem need?"
That shift is the difference between using AI — and understanding it.
