Step-by-Step LLM Engineering Projects
Most people read about LLMs. Very few actually build, break, and measure them.
This is a curated path of LLM engineering projects, where:
each project teaches one core idea
you implement it end-to-end
you visualize what’s happening
you ablate it, break it, and see why it matters
No hand-wavy theory.
No “trust the paper.”
Just code → plots → intuition.
1. Tokenization & Embeddings
The foundation everyone underestimates
Build
Implement Byte Pair Encoding (BPE) from scratch
Train your own subword vocabulary on raw text
See
Write a token visualizer → map words/chunks → token IDs
Compare one-hot vs learned embeddings
Plot cosine distances between token embeddings
Core insight
Tokenization silently decides what your model can and cannot learn.
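To make the BPE build concrete, here is a minimal sketch of the training loop in plain Python. It assumes whitespace-split words, an end-of-word marker "</w>", and a tiny made-up corpus. The function names are illustrative, not from any library.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(text, num_merges):
    # Start from single characters; "</w>" marks the end of a word.
    words = Counter(tuple(w) + ("</w>",) for w in text.split())
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent pair wins
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words

# Toy corpus, made up for illustration.
merges, vocab = train_bpe("low low low lower lowest newer newer", num_merges=5)
print(merges)
```

Change the corpus and watch the merge list change. That list is your tokenizer, which is why the training text decides what the model gets to see.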
2. Positional Embeddings
Why order isn’t “obvious” to transformers
Build
Implement:
Sinusoidal
Learned positional embeddings
RoPE
ALiBi
See
Animate a toy sequence being position-encoded in 3D
Ablate positional info and watch attention collapse
Core insight
Without positional information, self-attention is permutation-equivariant: shuffle the input tokens and the outputs shuffle with them.
Position is injected, not inherent.
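As a starting point for the sinusoidal variant, here is a small sketch. It assumes the usual base of 10000, with sine on even dimensions and cosine on odd ones. The function name and sizes are illustrative.

```python
import math

def sinusoidal_encoding(seq_len, d_model):
    """Return seq_len rows of d_model values: sin on even dims, cos on odd dims."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each (sin, cos) pair shares one frequency; frequencies fall geometrically.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_encoding(seq_len=8, d_model=16)
```

Add each row to the matching token embedding before the first attention layer. Plot the table as a heatmap and the low dimensions oscillate quickly while the high ones barely move.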
3. Self-Attention & Multi-Head Attention
The mechanism behind everything
Build
Hand-wire dot-product attention for one token
Scale it to multi-head attention
See
Plot attention weight heatmaps per head
Mask future tokens → verify causality
Core insight
Multi-head attention isn’t redundancy — it’s specialization.
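Here is a hand-wired sketch of single-head scaled dot-product attention with a causal mask, in plain Python. It assumes Q, K, and V are already projected (the toy input below reuses the same vectors for all three), so the learned weight matrices are left out.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.
    Q, K, V are lists of equal-length vectors, one per token."""
    d_k = len(K[0])
    outputs, weights = [], []
    for t, q in enumerate(Q):
        # Token t may only look at positions 0..t (no peeking at the future).
        scores = [sum(a * b for a, b in zip(q, K[s])) / math.sqrt(d_k)
                  for s in range(t + 1)]
        w = softmax(scores)
        out = [sum(w[s] * V[s][j] for s in range(t + 1)) for j in range(len(V[0]))]
        weights.append(w + [0.0] * (len(Q) - t - 1))  # pad masked positions with 0
        outputs.append(out)
    return outputs, weights

# Toy input, made up for illustration.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = causal_attention(X, X, X)
```

The `attn` rows are exactly what you plot as a heatmap. To verify causality, change the last token and check that the earlier outputs do not move.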
4. Transformers, QKV & Stacking
From components to a real model
Build
Combine attention + residuals + LayerNorm → single transformer block
Stack blocks → a mini-Transformer
Train on toy data
Break
Swap Q/K/V
Zero them out
Watch training stall or diverge
Core insight
Q, K, V are not interchangeable.
Their roles are asymmetric and fragile.
5. Sampling: Temperature, Top-K, Top-P
Generation ≠ decoding
Build
Interactive sampler dashboard
Tune temperature / k / p live
See
Plot entropy vs output diversity
Set temperature = 0 (greedy decoding) → watch repetition hell
Core insight
Sampling strategy matters as much as the model itself.
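A possible core for the sampler dashboard, sketched in plain Python. It assumes temperature is applied first, then top-k, then top-p, and it treats temperature 0 as greedy decoding. Real libraries may order or combine these filters differently.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick a token index from raw logits using temperature, top-k, and top-p."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding (argmax) to avoid dividing by zero.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    # Sort tokens from most to least probable.
    probs = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    if top_k is not None:
        probs = probs[:top_k]
    if top_p is not None:
        kept, cum = [], 0.0
        for p, i in probs:
            kept.append((p, i))
            cum += p
            if cum >= top_p:  # smallest set whose mass reaches top_p
                break
        probs = kept
    weights = [p for p, _ in probs]
    indices = [i for _, i in probs]
    # random.choices renormalizes the surviving weights for us.
    return rng.choices(indices, weights=weights, k=1)[0]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

Call `sample_next` in a loop with different settings and count distinct outputs. That count against `entropy` is the diversity plot.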
6. KV Cache (Fast Inference)
Why inference is fast… sometimes
Build
Cache key/value states across tokens
Compare inference speed with vs without cache
See
Visualizer for which keys/values are reused vs recomputed at each step
Memory cost vs sequence length
Core insight
KV cache trades memory for massive speedups.
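The trade can be sketched in a few lines. This toy version assumes each token vector doubles as its own query, key, and value (no learned projections), and the class and function names are illustrative.

```python
import math

def attend(q, keys, values):
    """Softmax attention of one query over the given keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Append this token's key/value once, then attend over everything so far.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Factor of 2: one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Toy tokens, made up for illustration.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
cached = [cache.step(x, x, x) for x in tokens]
# Without a cache, every step rebuilds keys/values for the whole prefix.
uncached = [attend(x, tokens[: t + 1], tokens[: t + 1]) for t, x in enumerate(tokens)]
```

Both paths give the same outputs. The cached one avoids redoing the prefix work, and `kv_cache_bytes` shows the bill: memory grows linearly with sequence length.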
7. Long-Context Tricks
Why context windows lie
Build
Sliding-window attention
Memory-efficient attention variants
Measure
Loss vs context length
Find the “context collapse” point, where extra context stops lowering the loss
Core insight
Long context ≠ long memory.
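Sliding-window attention comes down to a mask. A minimal sketch, assuming a causal window that includes the current token (the function name is illustrative):

```python
def sliding_window_mask(seq_len, window):
    """mask[t][s] is True when token t may attend to token s:
    s must not be in the future and must lie within the last `window` positions."""
    return [[(s <= t) and (t - s < window) for s in range(seq_len)]
            for t in range(seq_len)]

mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if ok else "." for ok in row))
# x.....
# xx....
# xxx...
# .xxx..
# ..xxx.
# ...xxx
```

Each row touches at most `window` positions, so attention cost grows linearly with sequence length instead of quadratically. The price is that anything older than the window is only reachable indirectly through deeper layers.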
8. Mixture of Experts (MoE)
Sparse models, dense thinking
Build
Simple 2-expert router
Token-level routing
See
Expert utilization histograms
FLOPs saved vs dense layers
Core insight
Parameter count can grow much faster than per-token compute, if routing works.
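A toy version of the 2-expert router, sketched under loud assumptions: top-1 routing, fixed hand-picked router weights instead of learned ones, and two lambda "experts" standing in for feed-forward networks.

```python
import math
import random
from collections import Counter

def route_top1(token, router_weights):
    """Return (expert_index, gate_probability) for one token vector."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, probs[best]

def moe_layer(token, router_weights, experts):
    # Only the selected expert runs; the gate probability scales its output.
    idx, gate = route_top1(token, router_weights)
    return [gate * y for y in experts[idx](token)], idx

# Two toy "experts": hypothetical stand-ins for feed-forward networks.
experts = [lambda x: [2.0 * v for v in x], lambda x: [-v for v in x]]
router_weights = [[1.0, 0.0], [0.0, 1.0]]  # one row of router weights per expert

rng = random.Random(0)
tokens = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(1000)]
usage = Counter(moe_layer(t, router_weights, experts)[1] for t in tokens)
print(usage)  # feed this into the expert utilization histogram
```

With top-1 routing over two experts, each token pays for one expert's compute while the layer holds two experts' worth of parameters. If `usage` is badly lopsided, that extra capacity is wasted.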
9. Grouped Query Attention
A performance trick with real tradeoffs
Build
Convert multi-head attention → GQA
Measure
Latency vs number of groups
Throughput on large batches
Core insight
Most speedups come from architectural constraints, not magic.
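The constraint in GQA is simple: several query heads share one key/value head. A small sketch of that mapping and its effect on cache size, assuming contiguous groups and illustrative function names:

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Map a query head to the key/value head its group shares."""
    assert n_query_heads % n_kv_heads == 0, "query heads must divide evenly into groups"
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

def kv_cache_values(n_kv_heads, head_dim, seq_len):
    # Number of cached values per layer: keys plus values.
    return 2 * n_kv_heads * head_dim * seq_len

n_query_heads, head_dim, seq_len = 8, 64, 1024
for n_kv_heads in (8, 2, 1):  # 8 = standard multi-head, 1 = multi-query
    mapping = [kv_head_for(h, n_query_heads, n_kv_heads) for h in range(n_query_heads)]
    print(n_kv_heads, mapping, kv_cache_values(n_kv_heads, head_dim, seq_len))
```

Fewer key/value heads means a smaller cache to store and read at every decoding step. That is the source of the speedup, and also where you should look for a quality drop.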
10. Normalization & Activations
The quiet stabilizers
Build
LayerNorm
RMSNorm
GELU
SwiGLU
Ablate
Remove each and retrain
See
Activation distributions layer by layer
Core insight
Training stability is engineered, not guaranteed.
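Here are the four pieces as minimal pure-Python sketches. The learned gain and bias of the norms are left out, and `swiglu` assumes its two inputs have already been through their linear projections.

```python
import math

def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """Divide by the root mean square; no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def gelu(v):
    # Exact GELU via the Gaussian CDF.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def silu(v):
    # Also called Swish: v * sigmoid(v).
    return v / (1.0 + math.exp(-v))

def swiglu(gate, value):
    """Elementwise SiLU(gate) * value on two pre-projected vectors."""
    return [silu(g) * u for g, u in zip(gate, value)]
```

Run the same vector through `layer_norm` and `rms_norm` and compare. RMSNorm skips the mean subtraction, so it is cheaper, and any offset in the input survives.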
11. Pretraining Objectives
What you train for shapes what you get
Build
Masked LM
Causal LM
Prefix LM
Compare
Loss curves
Generated samples
Core insight
Objectives bias behavior long before fine-tuning.
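Much of the difference between these objectives shows up in what each token is allowed to see and where the loss is applied. A simplified sketch: the masked-LM helper masks uniformly at random, which is a simplification of the schemes used in practice, and all names are illustrative.

```python
import random

def causal_mask(n):
    """Causal LM: token t sees positions 0..t."""
    return [[s <= t for s in range(n)] for t in range(n)]

def prefix_mask(n, prefix_len):
    # Prefix LM: the prefix is fully visible to every token; the rest is causal.
    return [[s <= t or s < prefix_len for s in range(n)] for t in range(n)]

def mlm_example(tokens, mask_prob, rng, mask_token="[MASK]"):
    """Masked LM: hide random tokens; the loss applies only at hidden positions."""
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            targets.append(tok)
        else:
            inputs.append(tok)
            targets.append(None)  # no loss on unmasked positions
    return inputs, targets

rng = random.Random(0)
print(mlm_example(["the", "cat", "sat", "down"], mask_prob=0.5, rng=rng))
for row in prefix_mask(4, prefix_len=2):
    print("".join("x" if ok else "." for ok in row))
```

Train the same tiny model with each setup and the loss curves are not directly comparable, since each objective predicts a different set of positions. Compare the generated samples too.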
12. Finetuning vs Instruction Tuning vs RLHF
Alignment is layered
Build
Plain finetuning
Instruction tuning (“Summarize: …”)
Tiny RLHF loop with PPO
Plot
Reward vs steps
Behavior shifts
Core insight
RLHF doesn’t teach skills — it reshapes preferences.
13. Scaling Laws & Model Capacity
When bigger actually helps
Build
Tiny → small → medium models
Measure
Loss vs size
VRAM, throughput, wall-clock time
Extrapolate
How small is too small?
Core insight
Scaling laws are empirical, not philosophical.
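Once you have loss measurements for a few model sizes, a first-pass fit is a straight line in log-log space. The sketch below assumes a pure power law, loss ≈ a * size^(-b), with no irreducible-loss term, and the data points are synthetic, generated from a known law only to check the fit.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-b) in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    x_mean, y_mean = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    a = math.exp(y_mean - slope * x_mean)
    return a, -slope

# Synthetic points from a known law (a=5, b=0.1); replace with your own measurements.
sizes = [1e5, 1e6, 1e7]
losses = [5.0 * n ** -0.1 for n in sizes]
a, b = fit_power_law(sizes, losses)
print(a, b)
```

With real runs the points will not sit on the line. How far they stray, and where the curve bends, is the measurement you are after.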
14. Quantization
Performance has a cost
Build
Post-training quantization
Quantization-aware training
Measure
Accuracy drop vs bit-width
Export to inference formats
Core insight
Compression reveals what the model truly relies on.
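The simplest post-training scheme fits in a few lines. This sketch assumes symmetric round-to-nearest quantization with one shared scale for the whole weight list, and the weights are made up for illustration.

```python
def quantize_symmetric(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

# Toy weights, made up for illustration.
weights = [0.02, -0.75, 0.33, 1.27, -1.0]
for bits in (8, 4, 2):
    q, scale = quantize_symmetric(weights, bits)
    err = max(abs(w - r) for w, r in zip(weights, dequantize(q, scale)))
    print(bits, q, round(err, 4))
```

The worst-case error is about half the scale, and the scale is set by the largest weight. One outlier therefore costs precision for everything else, which is why per-channel or per-group scales are a natural next experiment.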
15. Inference & Training Stacks
Same model, different realities
Build
Port one model across multiple stacks
Profile
Latency
Throughput
VRAM usage
Core insight
Infrastructure choices shape model behavior.
16. Synthetic Data
Data is a lever
Build
Generate toy datasets
Add noise, dedupe, split
Compare
Learning curves: real vs synthetic
Core insight
Good synthetic data beats bad real data.
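A possible skeleton for the generate → dedupe → split → noise pipeline. The single-digit addition task, the noise model, and all function names are assumptions chosen for illustration.

```python
import random

def make_dataset(n, rng):
    """Toy task (an assumption for illustration): add two single-digit integers."""
    rows = []
    for _ in range(n):
        a, b = rng.randint(0, 9), rng.randint(0, 9)
        rows.append((f"{a}+{b}=", str(a + b)))
    return rows

def add_label_noise(rows, noise_rate, rng):
    # Replace a fraction of answers with a random one to measure how noise hurts.
    return [(x, str(rng.randint(0, 18)) if rng.random() < noise_rate else y)
            for x, y in rows]

def dedupe(rows):
    # dict.fromkeys keeps the first occurrence and preserves order.
    return list(dict.fromkeys(rows))

def split(rows, test_fraction, rng):
    rows = rows[:]  # copy so the caller's list is not shuffled
    rng.shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

rng = random.Random(0)
data = dedupe(make_dataset(500, rng))
train, test = split(data, test_fraction=0.2, rng=rng)
noisy_train = add_label_noise(train, noise_rate=0.1, rng=rng)
```

Dedupe before splitting so no example lands in both sets, and add noise only to the training split so the test set stays a clean yardstick.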
Final Philosophy
One project → one insight
Build it
Plot it
Break it
Repeat
Don’t get stuck in theory.
Don’t wait for perfection.
Post what you learned, even the ugly graphs.
That’s how you actually learn LLMs.
See you in the next issue


