An n-gram language model from first principles

Before neural language models there was a much simpler idea: estimate the probability of a word given the words that came before it, using nothing more than counts from a training corpus. This project rebuilds that idea from scratch.

The model

An n-gram model approximates the joint probability of a sentence as the product of conditional probabilities of each token given its n − 1 predecessors:

P(w₁, …, w_T) ≈ ∏_{t=1}^{T} P(w_t | w_{t-n+1}, …, w_{t-1})

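Maximum-likelihood estimates of those conditional probabilities are just normalized counts from a training corpus: the number of times the full n-gram occurs, divided by the number of times its (n − 1)-token context occurs. The interesting question, and most of the fun, is what to do when a particular n-gram never appears in the training data. Its estimate is then zero, and a single zero factor drives the probability of the whole sentence to zero.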
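
The count-based estimate can be sketched for the bigram case (n = 2) as follows. This is an illustration under stated assumptions, not the code from the project's repository: the function names, the `<s>` and `</s>` boundary markers, and the toy corpus are all hypothetical. Add-k smoothing appears only as one common way to handle unseen n-grams; the write-up does not say which technique the project uses.

```python
from collections import Counter

# Hypothetical sentence-boundary markers (an assumption, not from the project).
BOS, EOS = "<s>", "</s>"

def train_bigram_counts(sentences):
    """Count bigrams and their left contexts over tokenized sentences."""
    bigram_counts = Counter()
    context_counts = Counter()
    for tokens in sentences:
        padded = [BOS] + tokens + [EOS]
        for prev, curr in zip(padded, padded[1:]):
            bigram_counts[(prev, curr)] += 1
            context_counts[prev] += 1
    return bigram_counts, context_counts

def mle_prob(prev, curr, bigram_counts, context_counts):
    """Maximum-likelihood estimate: count(prev, curr) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0  # context never seen; the estimate is undefined, report 0
    return bigram_counts[(prev, curr)] / context_counts[prev]

def add_k_prob(prev, curr, bigram_counts, context_counts, vocab_size, k=1.0):
    """Add-k smoothed estimate; every bigram gets a nonzero probability."""
    numerator = bigram_counts[(prev, curr)] + k
    denominator = context_counts[prev] + k * vocab_size
    return numerator / denominator

# Toy corpus, invented for illustration.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
bigram_counts, context_counts = train_bigram_counts(corpus)

# Tokens that can be predicted: every word plus the end marker.
vocab = {word for sentence in corpus for word in sentence} | {EOS}

seen = mle_prob("the", "cat", bigram_counts, context_counts)
unseen = mle_prob("cat", "dog", bigram_counts, context_counts)
smoothed = add_k_prob("cat", "dog", bigram_counts, context_counts, len(vocab))
print(seen, unseen, smoothed)
```

Here "the cat" occurs in one of the two sentences that contain "the", so its unsmoothed estimate is 1/2, while "cat dog" never occurs and gets exactly zero. With k = 1 and a five-token vocabulary, the smoothed estimate for "cat dog" becomes 1/6, and the smoothed probabilities for any one context still sum to one.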

Figure: Sampled output from the trained n-gram model.

What was interesting

Bigram models trained on a small corpus already produce locally coherent prose. The failure mode is exactly what theory predicts: every two-word window looks fluent, but the sentence has no plan and no memory. It's a useful reminder of what a modern language model has to be doing implicitly when it generates text that holds together across hundreds of tokens.
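
The "no memory" behavior is easy to see in a sampling loop. The sketch below is a hypothetical illustration, not the project's code: `train_successors`, `sample_sentence`, the boundary markers, and the toy corpus are all assumed names and data. Each step looks only at the previous token, so nothing earlier in the sentence can influence the next choice.

```python
import random
from collections import Counter, defaultdict

# Hypothetical sentence-boundary markers (an assumption, not from the project).
BOS, EOS = "<s>", "</s>"

def train_successors(sentences):
    """Map each token to a Counter of the tokens that followed it."""
    successors = defaultdict(Counter)
    for tokens in sentences:
        padded = [BOS] + tokens + [EOS]
        for prev, curr in zip(padded, padded[1:]):
            successors[prev][curr] += 1
    return successors

def sample_sentence(successors, rng, max_len=20):
    """Sample tokens one at a time, conditioning only on the previous token."""
    tokens = []
    prev = BOS
    for _ in range(max_len):
        options = successors[prev]
        words = list(options.keys())
        weights = list(options.values())
        # Draw the next token in proportion to its observed count after `prev`.
        curr = rng.choices(words, weights=weights, k=1)[0]
        if curr == EOS:
            break
        tokens.append(curr)
        prev = curr  # the only state carried forward
    return tokens

# Toy corpus, invented for illustration.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]
successors = train_successors(corpus)
rng = random.Random(0)  # seeded so runs are repeatable
print(" ".join(sample_sentence(successors, rng)))
```

Because "the" can be followed by "cat", "dog", "mat", or "rug" regardless of what came earlier, the sampler can wander through sequences such as "the cat sat on the dog sat on the mat": every adjacent pair was seen in training, but the sentence as a whole was not.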

Built as a study project. Repo: github.com/gabrielchasukjin/LanguageModel.