Created by FESB student Luka Dorić for his Master's degree (ldoric.dev).
Solving Word Ladder Using BERT and LLMs
A Master's thesis built around a bilingual Word Ladder game, a BERT model that learns distances on the word graph, and a head-to-head with large language models on the same puzzles.
The project in short
We start with English and Croatian dictionaries (4 and 5 letters) and turn them into word graphs. BERT is fine-tuned to estimate how far apart two words are on those graphs, and the solver uses that guess with A* (falling back to BFS when it has to) to build short, valid ladders. The same neighbor scoring drives the hints you see while playing.
The catch with large language models is that they write in free text, while Word Ladder needs each step to change exactly one letter and stay in the dictionary. Prompted models tend to slip on one or both, so the thesis pits this graph-grounded BERT-and-search setup against GPT- and Gemini-style generation under the same strict checks.
Word lists
For English we merged several public lists and normalized them by length and casing. For Croatian we combined Rijecalica, kkrypt0nn, and HR_Txt, keeping č, ć, đ, š, ž. Strict builds only keep words that show up in multiple sources; looser builds union more in to add edges. Either way, play happens on the largest connected piece of the graph so random ladders almost always have a path.
Training
Each model learns to guess the number of steps between two words on its graph. We generate hundreds of thousands of training pairs per mode with BFS and lean the sampling toward short distances, where good ranking matters most. The backbone is English BERT-base, with a separate regression head trained for every language and word length.
Solver stress test — 200 random puzzles
Random start and target words about 3–10 steps apart, solved with A* guided by the BERT heuristic and only falling back to BFS when it has to. Every mode lands above 90% on the headline numbers.
| Mode | A* finished alone | Needed BFS backup | Shortest length matched |
|---|---|---|---|
| English 5 | 94.0% | 6.0% | 95.0% |
| English 4 | 97.0% | 3.0% | 100.0% |
| Croatian 5 | 98.5% | 1.5% | 98.0% |
| Croatian 4 | 98.0% | 2.0% | 100.0% |
When BFS did kick in, it still returned shortest-length ladders. Four-letter graphs are denser, which makes English 5 the trickiest slice here.
Distance prediction (held-out tests)
The same models, evaluated on word pairs they never saw during training (~600k examples, six epochs, English BERT-base):
| Mode | Test MAE | Within ±1 of true distance |
|---|---|---|
| English 5 | 0.587 | 83.6% |
| English 4 | 0.308 | 97.3% |
| Croatian 5 | 0.450 | 91.0% |
| Croatian 4 | 0.230 | 99.1% |
The English 5-letter graph is sparser than the Croatian one here, which is why its errors are a bit higher. "Within ±1" just means the model's guess was at most one step off the true length—it isn't a hint failure rate, since hints only have to rank the valid neighbors in the right order.
Hints in the app
A valid next word is anything one letter away from your current one. The model scores each of those neighbors by how close it looks to the goal—lower is better—and the app surfaces the top pick from the backend. It's the same scoring the solver runs on, just stopped after one move instead of building the whole ladder.