Step-by-Step LLM Engineering Projects

Growtechie

Jan 08, 2026

Most people read about LLMs. Very few actually build, break, and measure them.

This is a curated path of LLM engineering projects, where:

- each project teaches one core idea
- you implement it end-to-end
- you visualize what’s happening
- you ablate it, break it, and see why it matters

No hand-wavy theory.
No “trust the paper.”
Just code → plots → intuition.

1. Tokenization & Embeddings

The foundation everyone underestimates

Build

Implement Byte Pair Encoding (BPE) from scratch
Train your own subword vocabulary on raw text

See

Write a token visualizer → map words/chunks → token IDs
Compare one-hot vs learned embeddings
Plot cosine distances between tokens

Core insight

Tokenization silently decides what your model can and cannot learn.
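
To make the BPE step concrete, here is a minimal sketch of the merge loop. It assumes a toy corpus that is already split into words and starts from single characters; the names `train_bpe`, `merge_pair`, and `encode` are illustrative, not from any library.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy version)."""
    # Each word starts as a tuple of characters, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself so runs are repeatable.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges

def encode(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Try encoding a word the corpus never saw: it falls apart into smaller pieces, which is exactly the behavior worth plotting in the token visualizer.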

2. Positional Embeddings

Why order isn’t “obvious” to transformers

Build

Implement:
- Sinusoidal
- Learned positional embeddings
- RoPE
- ALiBi

See

Animate a toy sequence being position-encoded in 3D
Ablate positional info and watch attention collapse

Core insight

Self-attention is permutation-equivariant by default: shuffle the input tokens and the outputs shuffle with them.
Position is injected, not inherent.
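
As a starting point for the sinusoidal variant, here is a small sketch in plain Python. The function name `sinusoidal_positions` is made up for this example, and it uses the common base of 10000.

```python
import math

def sinusoidal_positions(seq_len, d_model):
    """Even dimensions get sin(pos / 10000^(i/d_model)), odd dimensions get the matching cos."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):  # i is the even dimension index
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe
```

One property worth checking yourself: for even `d_model`, the dot product between two position vectors depends only on the distance between the positions, not on where they sit in the sequence.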

3. Self-Attention & Multi-Head Attention

The mechanism behind everything

Build

Hand-wire dot-product attention for one token
Scale it to multi-head attention

See

Plot attention weight heatmaps per head
Mask future tokens → verify causality

Core insight

Multi-head attention isn’t redundancy — it’s specialization.
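
Here is a hand-wired sketch of single-head, causally masked dot-product attention. It assumes the query, key, and value vectors are already computed (no learned projections) and uses plain lists so every step is visible.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Scaled dot-product attention where token i only sees tokens 0..i.

    q, k, v are lists of vectors, one per token.
    Returns (outputs, weights); weights[i][j] is how much token i attends to token j.
    """
    d = len(q[0])
    outputs, weights = [], []
    for i, qi in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d)
            for j in range(i + 1)  # future positions are never scored
        ]
        w = softmax(scores)
        weights.append(w + [0.0] * (len(k) - i - 1))  # pad future slots with zero weight
        outputs.append(
            [sum(w[j] * v[j][c] for j in range(i + 1)) for c in range(len(v[0]))]
        )
    return outputs, weights
```

To verify causality, change the last token's key and value and confirm that every earlier output stays the same. The `weights` matrix is what goes into the heatmap.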

4. Transformers, QKV & Stacking

From components to a real model

Build

Combine attention + residuals + LayerNorm → single transformer block
Stack blocks → a mini-Transformer
Train on toy data

Break

Swap Q/K/V
Zero them out
Watch training stall or diverge

Core insight

Q, K, V are not interchangeable.
Their roles are asymmetric and fragile.
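
One quick way to see the asymmetry before training anything: swapping Q and K transposes the raw score matrix, so "who attends to whom" flips. The helper name `score_matrix` is just for this sketch.

```python
def score_matrix(q, k):
    """Raw (unscaled, unmasked) attention scores: entry [i][j] is q[i] . k[j]."""
    return [[sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]

q = [[1, 0], [0, 1]]
k = [[1, 2], [3, 4]]

normal = score_matrix(q, k)
swapped = score_matrix(k, q)  # Q and K trade places
```

Under a causal mask the transposed scores keep a different triangle of the matrix, so the swapped model is genuinely a different model, not a relabeling.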

5. Sampling: Temperature, Top-K, Top-P

Generation ≠ decoding

Build

Interactive sampler dashboard
Tune temperature / k / p live

See

Plot entropy vs output diversity
Push temperature toward 0 (greedy decoding) → watch repetition hell

Core insight

Sampling strategy matters as much as the model itself.
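
A minimal sampler that a dashboard could wrap might look like the sketch below. The function name and argument names are illustrative; it treats `temperature <= 0` as greedy decoding, since dividing logits by zero is undefined.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick a token index from raw logits using temperature, then top-k, then top-p."""
    if temperature <= 0:
        # Greedy decoding: the limit of temperature going to zero.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Candidates sorted from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:  # smallest set whose mass reaches top_p
                break
        order = kept
    # random.choices renormalizes the surviving weights for us.
    return rng.choices(order, weights=[probs[i] for i in order], k=1)[0]
```

Pass a seeded `random.Random` as `rng` when you want repeatable plots.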

6. KV Cache (Fast Inference)

Why inference is fast… sometimes

Build

Cache key/value states across tokens
Compare inference speed with vs without cache

See

Cache hit/miss visualizer
Memory cost vs sequence length

Core insight

KV cache trades memory for massive speedups.
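
The sketch below shows the trade on a toy single-head layer: with the cache, each token's key and value are projected once; without it, the whole prefix is re-projected at every step. The weight matrices and the `decode` helper are assumptions made up for this example.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Softmax attention from one query over all available keys/values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e / z * v[c] for e, v in zip(exps, values)) for c in range(len(values[0]))]

def decode(xs, Wq, Wk, Wv, use_cache):
    """Process tokens one at a time; return outputs and how many tokens had K/V projected."""
    outputs, projections = [], 0
    keys, values = [], []
    for t, x in enumerate(xs):
        if use_cache:
            # Only the new token is projected; older keys/values are reused.
            keys.append(matvec(Wk, x))
            values.append(matvec(Wv, x))
            projections += 1
        else:
            # Recompute keys/values for the entire prefix at every step.
            keys = [matvec(Wk, xj) for xj in xs[: t + 1]]
            values = [matvec(Wv, xj) for xj in xs[: t + 1]]
            projections += t + 1
        outputs.append(attend(matvec(Wq, x), keys, values))
    return outputs, projections
```

The outputs match either way; the projection count grows linearly with the cache and quadratically without it, while the cached lists are the memory you pay.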

7. Long-Context Tricks

Why context windows lie

Build

Sliding-window attention
Memory-efficient attention variants

Measure

Loss vs context length
Find the “context collapse” point

Core insight

Long context ≠ long memory.
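
Sliding-window attention boils down to a mask. A small sketch, with a hypothetical helper name:

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when token i may attend to token j.

    A token sees itself and at most `window - 1` tokens before it, never the future.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]

for row in sliding_window_mask(5, 2):
    print("".join("x" if allowed else "." for allowed in row))
```

Each layer only reaches `window - 1` tokens back, so information from further away has to hop through several layers. That is one reason a long window on paper does not guarantee long memory in practice.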

8. Mixture of Experts (MoE)

Sparse models, dense thinking

Build

Simple 2-expert router
Token-level routing

See

Expert utilization histograms
FLOPs saved vs dense layers

Core insight

Capacity scales better than compute — if routing works.
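
A token-level top-1 router can be sketched in a few lines. Everything here is a toy assumption: the router is a plain weight matrix, the experts are arbitrary Python callables, and there is no load-balancing loss.

```python
import math

def route_top1(token, router_weights):
    """Return (expert_index, gate_probability) for one token vector."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, probs[best]

def moe_layer(tokens, router_weights, experts):
    """Send each token to its top expert and scale the result by the gate probability."""
    outputs, counts = [], [0] * len(experts)
    for tok in tokens:
        idx, gate = route_top1(tok, router_weights)
        counts[idx] += 1  # utilization, ready for a histogram
        outputs.append([gate * y for y in experts[idx](tok)])
    return outputs, counts
```

Only one expert runs per token, which is where the FLOPs saving comes from. The `counts` list shows whether routing is balanced or collapsing onto one expert.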

9. Grouped Query Attention

A performance trick with real tradeoffs

Build

Convert multi-head attention → GQA

Measure

Latency vs number of groups
Throughput on large batches

Core insight

Most speedups come from architectural constraints, not magic.
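
The constraint in GQA is that several query heads share one key/value head. A rough sketch of the head mapping and the resulting KV cache size follows; the helper names and the default of 2 bytes per value (for example fp16) are assumptions for illustration.

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Which shared key/value head a given query head reads from."""
    assert n_query_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Per-sequence KV cache size; the leading 2 covers keys plus values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# 8 query heads sharing 2 KV heads -> groups of 4
print([kv_head_for(h, 8, 2) for h in range(8)])
```

Setting `n_kv_heads` equal to `n_query_heads` gives ordinary multi-head attention, and setting it to 1 gives multi-query attention, so sweeping the number of groups moves between those two ends.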

10. Normalization & Activations

The quiet stabilizers

Build

LayerNorm
RMSNorm
GELU
SwiGLU

Ablate

Remove each and retrain

See

Activation distributions layer by layer

Core insight

Training stability is engineered, not guaranteed.
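
For reference while building, here are bare-bones versions of the four pieces on plain lists. They leave out the learned gain and bias that real layers carry, and `swiglu` assumes you pass in the two projections (`gate` and `up`) yourself.

```python
import math

def layer_norm(x, eps=1e-5):
    """Subtract the mean, divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """Divide by the root mean square; no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def gelu(x):
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    return x / (1.0 + math.exp(-x))

def swiglu(gate, up):
    """SiLU(gate) * up, elementwise; gate and up are two projections of the same input."""
    return [silu(g) * u for g, u in zip(gate, up)]
```

On a zero-mean input, `layer_norm` and `rms_norm` give the same result, which is a handy sanity check before the ablation runs.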

11. Pretraining Objectives

What you train for shapes what you get

Build

Masked LM
Causal LM
Prefix LM

Compare

Loss curves
Generated samples

Core insight

Objectives bias behavior long before fine-tuning.
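
Much of the difference between the three objectives is which positions can see which, and which positions get predicted. A small sketch with made-up helper names:

```python
def causal_mask(n):
    """Token i sees tokens 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def prefix_mask(n, prefix_len):
    """The prefix is visible to every token; everything after it is causal."""
    return [[j <= i or j < prefix_len for j in range(n)] for i in range(n)]

def causal_example(tokens):
    """Inputs and next-token targets for causal LM."""
    return tokens[:-1], tokens[1:]

def mlm_example(tokens, masked_positions, mask_token="[MASK]"):
    """Hide the chosen positions; the targets are the original tokens at those positions."""
    inputs = [mask_token if i in masked_positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked_positions}
    return inputs, targets
```

Masked LM uses full bidirectional visibility and predicts only the hidden tokens, causal LM predicts every next token, and prefix LM sits between the two.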

12. Finetuning vs Instruction Tuning vs RLHF

Alignment is layered

Build

Plain finetuning
Instruction tuning (“Summarize: …”)
Tiny RLHF loop with PPO

Plot

Reward vs steps
Behavior shifts

Core insight

RLHF doesn’t teach skills — it reshapes preferences.

13. Scaling Laws & Model Capacity

When bigger actually helps

Build

Tiny → small → medium models

Measure

Loss vs size
VRAM, throughput, wall-clock time

Extrapolate

How small is too small?

Core insight

Scaling laws are empirical, not philosophical.
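
Once you have (size, loss) pairs from your own runs, a simple way to extrapolate is a straight-line fit in log-log space. The sketch below assumes a pure power law, loss = a * size^(-b), and ignores any irreducible-loss offset; the numbers in the usage lines are synthetic, generated from a known curve only to check the fit.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(size). Returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope

# Synthetic check: points generated from a known curve, not measured results.
sizes = [1e5, 1e6, 1e7]
losses = [5.0 * n ** -0.1 for n in sizes]
a, b = fit_power_law(sizes, losses)
```

With real measurements the points will not sit exactly on a line, and how far they stray is part of what the project is meant to show.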

14. Quantization

Performance has a cost

Build

Post-training quantization
Quantization-aware training

Measure

Accuracy drop vs bit-width
Export to inference formats

Core insight

Compression reveals what the model truly relies on.
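
The simplest post-training scheme to start from is symmetric, per-tensor rounding. This is a sketch under that assumption (one shared scale, at least 2 bits); production schemes are usually per-channel or per-group.

```python
def quantize(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] with a single shared scale."""
    qmax = 2 ** (bits - 1) - 1  # assumes bits >= 2
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

def max_error(weights, bits):
    """Worst-case round-trip error for a given bit-width."""
    q, scale = quantize(weights, bits)
    return max(abs(w - d) for w, d in zip(weights, dequantize(q, scale)))
```

Sweeping `bits` with `max_error` on one weight tensor gives a first error curve before any accuracy measurements on the model itself.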

15. Inference & Training Stacks

Same model, different realities

Build

Port one model across multiple stacks

Profile

Latency
Throughput
VRAM usage

Core insight

Infrastructure choices shape how a model behaves in practice: its latency, throughput, and memory footprint.

16. Synthetic Data

Data is a lever

Build

Generate toy datasets
Add noise, dedupe, split

Compare

Learning curves: real vs synthetic

Core insight

Good synthetic data beats bad real data.
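
A toy pipeline for this project could look like the sketch below. The task (is `a + b` even?) and all function names are invented for illustration; `noise_rate` flips a fraction of labels so you can watch learning curves degrade.

```python
import random

def make_dataset(n, noise_rate, seed=0):
    """Rows of (a, b, label), where label is 1 when a + b is even, with optional label noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a, b = rng.randint(0, 9), rng.randint(0, 9)
        label = int((a + b) % 2 == 0)
        if rng.random() < noise_rate:
            label = 1 - label  # inject label noise
        rows.append((a, b, label))
    return rows

def dedupe(rows):
    """Keep the first row for each distinct input pair."""
    seen, out = set(), []
    for a, b, label in rows:
        if (a, b) not in seen:
            seen.add((a, b))
            out.append((a, b, label))
    return out

def split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy and cut it into train and test parts."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]
```

Deduplicating before splitting matters: otherwise the same input can land in both train and test and make the curves look better than they are.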

Final Philosophy

One project → one insight

Build it
Plot it
Break it
Repeat

Don’t get stuck in theory.
Don’t wait for perfection.
Post what you learned, even the ugly graphs.

That’s how you actually learn LLMs.

See you in the next issue
