Markus Dezelak

Tiny Transformer From Scratch

A small transformer implementation built to internalize how modern LLMs work under the hood. The project focuses on the attention mechanism, multi-head attention, and the training loop on a toy dataset.

Components Implemented

  • Token and positional embeddings.
  • Scaled dot-product attention.
  • Multi-head attention with residual connections.
  • Feed-forward network with LayerNorm.
  • A simple language-modeling objective.
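The core of the list above is scaled dot-product attention. A minimal NumPy sketch of that operation with a causal mask is shown below; the function name and shapes are illustrative, not taken from the project's actual code:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """q, k: (seq, d_k); v: (seq, d_v). Returns (output, attention weights)."""
    d_k = q.shape[-1]
    # Similarity scores, scaled by sqrt(d_k) to keep softmax gradients stable.
    scores = q @ k.T / np.sqrt(d_k)
    # Causal mask: position i may only attend to positions <= i.
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    # Numerically stable softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out, w = scaled_dot_product_attention(q, k, v)
```

Multi-head attention repeats this computation in parallel over several learned projections of q, k, and v, then concatenates the per-head outputs.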

Why It Matters

Rather than treating foundation models as black boxes, this project clarifies what actually happens inside an attention block and how sequence modeling emerges from it. That understanding feeds back into better prompt design, debugging, and model selection for real systems.