Before neural language models there was a much simpler idea: estimate the probability of a word given the words that came before it, using nothing more than counts from a training corpus. This project rebuilds that idea from scratch.
The model
An n-gram model approximates the joint probability of a sentence as the product of conditional probabilities of each token given its n − 1 predecessors:
P(w₁, …, w_T) ≈ ∏_{t=1}^{T} P(w_t | w_{t-n+1}, …, w_{t-1})
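As a rough illustration of the product above, here is a minimal Python sketch that scores a sentence by summing log-probabilities, which avoids underflow on long sentences. The function name `sentence_logprob`, the `<s>` padding token, and the toy probability table are all assumptions made for this example, not taken from the project's code.

```python
import math

def sentence_logprob(tokens, cond_prob, n):
    """Sum log P(w_t | previous n-1 tokens) over the sentence.

    cond_prob(word, context) is any callable returning a conditional
    probability; context is a tuple of the n-1 preceding tokens.
    """
    # Assumption: pad the start with "<s>" so the first words have a full context.
    padded = ["<s>"] * (n - 1) + list(tokens)
    total = 0.0
    for t in range(n - 1, len(padded)):
        context = tuple(padded[t - n + 1:t])
        p = cond_prob(padded[t], context)
        if p == 0.0:
            # One zero factor makes the whole product zero.
            return float("-inf")
        total += math.log(p)
    return total

# Hypothetical toy bigram table, for illustration only.
toy = {(("<s>",), "the"): 0.5, (("the",), "cat"): 0.25}

def toy_prob(word, context):
    return toy.get((context, word), 0.0)

lp = sentence_logprob(["the", "cat"], toy_prob, n=2)
unseen = sentence_logprob(["cat"], toy_prob, n=2)
```

Here `lp` is log(0.5 × 0.25), and `unseen` is negative infinity because the table has no entry for "cat" at the start of a sentence.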
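Maximum-likelihood estimates of those conditional probabilities are just normalized counts from a training corpus: the number of times the full n-gram occurs, divided by the number of times its (n − 1)-token context occurs. The interesting question, and most of the fun, is what to do when a particular n-gram never appears in the training data. The maximum-likelihood estimate for it is zero, and a single zero factor drives the probability of the whole sentence to zero.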
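The sketch below shows normalized counts on a toy corpus, along with one common way to handle unseen n-grams: add-k smoothing. It is only an illustration. The source does not say which smoothing method the project uses, and the names `train_ngram_counts` and `prob` are hypothetical.

```python
from collections import Counter

def train_ngram_counts(tokens, n):
    """Count every n-gram and the (n-1)-token context that precedes its last word."""
    ngrams = Counter()
    contexts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        ngrams[gram] += 1
        contexts[gram[:-1]] += 1
    return ngrams, contexts

def prob(word, context, ngrams, contexts, vocab_size, k=0.0):
    """P(word | context) with add-k smoothing; k=0 is the plain MLE."""
    context = tuple(context)
    numerator = ngrams[context + (word,)] + k
    denominator = contexts[context] + k * vocab_size
    return numerator / denominator if denominator > 0 else 0.0

tokens = "the cat sat on the mat".split()
ngrams, contexts = train_ngram_counts(tokens, n=2)
vocab_size = len(set(tokens))

p_mle = prob("cat", ["the"], ngrams, contexts, vocab_size)            # seen bigram
p_unseen = prob("sat", ["the"], ngrams, contexts, vocab_size)         # unseen, MLE
p_smooth = prob("sat", ["the"], ngrams, contexts, vocab_size, k=1.0)  # unseen, add-one
```

On this corpus "the" is followed once by "cat" and once by "mat", so `p_mle` is 1/2 and `p_unseen` is 0. With k = 1 and a five-word vocabulary, `p_smooth` becomes (0 + 1) / (2 + 5) = 1/7. Add-k is the simplest choice; backoff and interpolation schemes are common alternatives.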
What was interesting
Bigram models trained on a small corpus already produce locally coherent prose. The failure mode is exactly what theory predicts: every two-word window looks fluent, but the sentence has no plan and no memory beyond the previous word. It's a useful reminder of what a modern language model has to be doing implicitly when it generates text that holds together across hundreds of tokens.
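To make the "no memory" point concrete, here is a small Python sketch of bigram generation. Each step looks only at the last word and samples a follower in proportion to its count. The corpus and the function names `build_bigram_table` and `generate` are invented for illustration and may not match the repo.

```python
import random
from collections import Counter, defaultdict

def build_bigram_table(tokens):
    """Map each word to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table

def generate(table, start, max_len, rng):
    """Sample up to max_len tokens; each choice depends only on the previous token."""
    out = [start]
    while len(out) < max_len:
        followers = table.get(out[-1])
        if not followers:
            break  # dead end: this word was never followed by anything in training
        words = list(followers)
        weights = [followers[w] for w in words]
        out.append(rng.choices(words, weights=weights, k=1)[0])
    return out

corpus = "the cat sat on the mat and the dog sat on the rug".split()
table = build_bigram_table(corpus)
sample = generate(table, "the", 12, random.Random(0))
print(" ".join(sample))
```

Every adjacent pair in `sample` appeared somewhere in the corpus, which is why the output reads smoothly two words at a time, yet nothing stops it from wandering from the cat to the rug and back.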
Built as a study project. Repo: github.com/gabrielchasukjin/LanguageModel.