What large language models actually do, from probabilities and tokens to context windows and pattern prediction, and why so many people interpret the output as genuine understanding
By Martin-Patrick Larouche
Before we can say a system is not intelligent, we need to agree on what the word was promising in the first place.
A large language model produces fluent, useful text without understanding any of it. It does one mechanical thing extremely well, and that thing is not thinking.
Give it full credit. These systems draft, translate, summarize, and write code at a level that was science fiction a decade ago.
A calculator is useful and nobody calls it intelligent. Capability and understanding are separate questions.
Knowing how the tool works makes you better at using it. The goal here is an accurate mental model, and accuracy pays off.
By intelligent, this talk means the human sense of grounded understanding, intentional reasoning, and awareness, rather than raw capability. It focuses on text language models, though much of it carries over to image and audio systems built the same way.
When we call something intelligent, three things ride along with the word. A language model holds a thin, ungrounded version of each, which is why the output can look like the real thing.
We assume words connect to real things. A text-only model learns how words relate to each other from text, with no senses and no lived contact with what they describe.
We assume a goal sits behind the words. Behind a model's words there is only a process selecting likely text, with no aim of its own.
We assume it tracks what is true. It builds internal abstractions that behave like a partial map of the world, with no way to check them against reality.
A language model predicts the next chunk of text, over and over, based on patterns it absorbed from enormous amounts of writing. Everything else it appears to do grows out of that single loop.
Predicting that much text well forces the model to absorb grammar, facts, styles, and habits of reasoning.
It optimizes for plausible continuations. Plausible and true overlap often, and not always.
Fluent prediction looks exactly like understanding from the outside. The rest of the talk pulls the two apart.
The model does not work with words or letters. It works with tokens, and that single fact explains a surprising number of its quirks.
Before any prediction happens, your text is chopped into tokens. A token is a common chunk of characters, often a word, sometimes part of one, sometimes just a space and a letter.
Tokens come from how often character sequences appear in training text, so " the" and "ing" become tokens because they are everywhere.
Each token maps to an integer. The model only ever reads and writes lists of these numbers, then they turn back into text for you.
In English a token averages about four characters, so 1,000 tokens is around 750 words. Other languages can cost far more.
"Tokenization isn't intelligence."
┌──────┬─────────┬─────┬────┬───────────────┬───┐
│ Token│ ization │ isn │ 't │ intelligence │ . │
└──────┴─────────┴─────┴────┴───────────────┴───┘
6 tokens, with "Tokenization" and "isn't" each split into two pieces
the model sees only their IDs, never the letters:
24038 2065 6315 956 11478 13
Notice that "Tokenization" became two tokens while "intelligence" stayed whole, purely because of how often each string appears in training. The model never sees the letters inside a token as separate things.
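The round trip from text to integers and back can be sketched in a few lines. This is a toy under stated assumptions: the vocabulary and its IDs are invented for illustration, and the greedy longest-match rule is a stand-in for the merge rules that production tokenizers learn from data.

```python
# Hypothetical toy vocabulary: these pieces and IDs are made up for
# illustration and do not come from any real tokenizer.
VOCAB = {
    "Token": 0,
    "ization": 1,
    " isn": 2,
    "'t": 3,
    " intelligence": 4,
    ".": 5,
}
ID_TO_PIECE = {token_id: piece for piece, token_id in VOCAB.items()}


def encode(text: str) -> list[int]:
    """Greedy longest-match tokenization against VOCAB (a simplification)."""
    ids = []
    pos = 0
    while pos < len(text):
        match = None
        # Try the longest vocabulary piece first at the current position.
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece, pos):
                match = piece
                break
        if match is None:
            raise ValueError(f"no token covers {text[pos:]!r}")
        ids.append(VOCAB[match])
        pos += len(match)
    return ids


def decode(ids: list[int]) -> str:
    """Turn a list of token IDs back into text."""
    return "".join(ID_TO_PIECE[token_id] for token_id in ids)


ids = encode("Tokenization isn't intelligence.")
print(ids)          # the only thing a model would ever see
print(decode(ids))  # turned back into text for the reader
```

Note that the leading space belongs to the token, which is why " isn" and " intelligence" carry it in this sketch.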
Once you know the model reads tokens and not letters, several famous failures stop being mysterious.
Ask how many r's are in "strawberry" and it can stumble. The word is a couple of tokens, and the individual letters were never visible.
Reversing a string or spelling a word backwards is hard for the same reason. It shuffles token chunks, not characters.
Arithmetic goes wrong partly because numbers fracture into awkward token pieces. The deeper reason is that the model predicts what an answer looks like instead of running a calculation.
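The difference between working on characters and working on token chunks is easy to see in ordinary code. The three-way split of "strawberry" below is hypothetical; a real tokenizer may cut the word differently.

```python
# Hypothetical split of "strawberry" into token pieces (an assumption,
# not the output of any real tokenizer).
pieces = ["str", "aw", "berry"]
word = "".join(pieces)

# Character-level operations need access to individual letters.
char_reversed = word[::-1]
r_count = word.count("r")

# Working only with whole chunks gives a different, wrong-looking result.
chunk_reversed = "".join(reversed(pieces))

print(char_reversed)   # true reversal, letter by letter
print(chunk_reversed)  # what shuffling chunks produces instead
print(r_count)         # easy with letters, invisible inside chunks
```

Ordinary code gets these right because it sees the letters. A model that only receives three IDs has to have learned the spelling of each chunk indirectly.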
At its core the model does one thing on a loop. It looks at everything so far and guesses what comes next.
Nothing in this loop checks whether the result is true. It optimizes for what is likely to come next, and likely is not the same as correct.
Open up the Model box from the last slide. The design that made these systems powerful is called the transformer, and it rests on one idea called attention.
Earlier networks processed text one word at a time and tended to forget the start of a long passage by the time they reached the end.
With attention, every token can weigh every other token in the input at once, so the model links a pronoun to its noun or a question to its earlier setup directly.
Doing this in parallel made training on enormous data practical. More data and bigger models kept paying off, which is why capability jumped.
The same attention-based design now underpins image, audio, and video models, so much of what follows applies well beyond text.
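For readers who want to see the weighing step itself, here is a minimal sketch of scaled dot-product attention in plain Python. It is a simplification under stated assumptions: the two-number vectors are invented, and the learned projections, multiple heads, and masking used in real transformers are left out.

```python
import math


def softmax(xs):
    """Turn a list of scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token against every token in the input.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        # Blend the value vectors according to those weights.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical token vectors: the first and third point the same way.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
outputs, weights = attention(vectors, vectors, vectors)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token attends to every token, itself included. Similar vectors get more weight, which is the mechanism behind linking a pronoun to its noun.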
Given the unfinished phrase "I poured myself a cup of", the model assigns a probability to every possible next token. These are the front-runners.
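The step that turns raw scores into probabilities is usually a softmax. The sketch below uses a handful of made-up candidate tokens and made-up scores; a real model would score every token in its vocabulary.

```python
import math

# Hypothetical raw scores (logits) for a few candidate next tokens.
# Both the candidates and the numbers are invented for illustration.
logits = {" coffee": 4.0, " tea": 3.2, " water": 1.5, " sand": -2.0}


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores.values())
    exps = {tok: math.exp(s - m) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


probs = softmax(logits)
for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tok!r}: {p:.3f}")
```

Every candidate gets some probability, even the implausible one. Nothing is ruled out, only ranked.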
If it always grabbed the single highest bar, it would be repetitive and dull. So it rolls weighted dice over the top candidates, and a few settings control how adventurous that roll is.
Temperature is a dial on randomness. Low temperature sticks to the safest token, high temperature spreads the odds and invites surprise.
Settings such as top-k and top-p trim the candidate pool to the most likely tokens before the dice roll, so the model stays on the rails while still varying.
Run the same prompt twice and you can get two different answers. Both are plausible continuations. Neither was looked up or verified.
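The weighted dice roll can be sketched as follows. The candidate scores are invented, and the trimming shown is top-k, one common way to cut the pool; the function name `sample_next` and its parameters are illustrative, not any library's API.

```python
import math
import random

# Hypothetical raw scores for a few candidate next tokens.
LOGITS = {" coffee": 4.0, " tea": 3.2, " water": 1.5, " sand": -2.0}


def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Pick one token by weighted chance after optional top-k trimming."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    ranked = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most likely candidates
    # Dividing by temperature sharpens (low) or flattens (high) the odds.
    scaled = [score / temperature for _, score in ranked]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    tokens = [tok for tok, _ in ranked]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random()
print([sample_next(LOGITS, temperature=0.2, rng=rng) for _ in range(5)])
print([sample_next(LOGITS, temperature=2.0, top_k=3, rng=rng) for _ in range(5)])
```

Run it a few times. The low-temperature line barely changes, while the high-temperature line varies from run to run, which is the same effect as getting two different answers to one prompt.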
The model writes one token, adds it to the text, then predicts again with that token now part of the input. Feeding output back into input is called autoregression.
step 1  The
step 2  The cat
step 3  The cat sat
step 4  The cat sat on
step 5  The cat sat on the
step 6  The cat sat on the mat
Because each token depends on the ones before it, an early wrong turn gets built upon rather than corrected. The model commits to its own mistake and keeps elaborating confidently.
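The feed-it-back loop fits in a dozen lines. In this sketch the "model" is a hypothetical lookup table that only looks at the last token; a real model conditions on the whole context and samples from probabilities, but the loop around it has the same shape.

```python
# Hypothetical stand-in for a model: maps the last token to the next one.
NEXT = {
    "The": "cat",
    "cat": "sat",
    "sat": "on",
    "on": "the",
    "the": "mat",
    "mat": "<end>",
}


def predict_next(tokens):
    """Toy predictor: looks only at the most recent token."""
    return NEXT.get(tokens[-1], "<end>")


def generate(prompt, max_new_tokens=10):
    """Autoregression: predict, append, and predict again."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        nxt = predict_next(tokens)
        if nxt == "<end>":
            break
        tokens.append(nxt)  # the output becomes part of the next input
    return tokens


print(" ".join(generate(["The"])))
```

There is no step in `generate` that goes back and revises an earlier token, which is why an early wrong turn stays in the text and gets built upon.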
The model's apparent knowledge was baked in during training and then frozen. That explains both its breadth and its blind spots.
Training shows the model staggering amounts of text and asks it, again and again, to predict the next token. Each miss nudges billions of internal numbers, the weights, a little closer.
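One nudge can be shown in miniature. The sketch below is an assumption-heavy toy: a "model" with just three adjustable numbers, one per candidate token, trained with plain gradient descent on the usual next-token loss. Real training does the same kind of update across billions of weights and many contexts.

```python
import math


def softmax(logits):
    """Convert scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_grad(logits, target):
    """Cross-entropy loss for the correct token, and its gradient."""
    probs = softmax(logits)
    loss = -math.log(probs[target])
    # For softmax with cross-entropy, the gradient is probs minus one-hot.
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return loss, grad


weights = [0.0, 0.0, 0.0]  # hypothetical: one score per candidate token
target = 2                 # index of the token that actually came next
learning_rate = 0.5        # illustrative step size

losses = []
for step in range(5):
    loss, grad = loss_and_grad(weights, target)
    losses.append(loss)
    # Each miss nudges the weights a little toward the right answer.
    weights = [w - learning_rate * g for w, g in zip(weights, grad)]

print([round(x, 3) for x in losses])
```

The printed losses shrink step by step. Nothing about the text is stored in the process; only the numbers move.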
Books, code, articles, forums, and far more. Hundreds of billions to trillions of words of human writing.
A fixed set of weights. No copy of the text is stored, only the statistical patterns squeezed out of it.
Predicting all that text well requires absorbing grammar, facts, and reasoning habits. Knowledge is a side effect of the guessing game.
People picture a search engine with a model bolted on top. The reality is closer to a musician improvising in a style they have absorbed.
A lookup system finds the exact stored record and returns it word for word. Right or wrong, it is repeating something specific that exists.
A language model rebuilds a likely answer from overlapping patterns every time. Common facts come out reliably because the patterns are strong and consistent.
This is why it nails well-documented facts and invents obscure ones with equal confidence. Both answers are generated the same way, and only the strength of the underlying pattern differs.
Predicting the next token sounds trivial. Doing it well across the whole internet is extremely hard, and the result is far richer than a lookup table.
To predict well, the model builds internal structure. Interpretability research finds features that track sentiment, position, and even rough maps of places it has only read about.
Skills like translation and step-by-step problem solving were never programmed in. They appeared as models grew, as a by-product of better prediction.
The rule is easy to state. The behavior it produces is genuinely sophisticated, and worth taking seriously rather than waving off as mere autocomplete.
Here is the nuance that matters. The model has no grounded, human understanding, and it is also far more than a parlor trick. Both are true at the same time.
Raw from pretraining, the model just continues text. The helpful chat assistant you talk to is a second layer of training on top.
In pretraining, the model learns language and knowledge by predicting next tokens across the whole corpus. This produces raw capability with no manners.
Next, it is trained on examples of following requests, so it answers questions instead of merely continuing them.
Finally, people rank responses, and the model is nudged toward the preferred ones. This shapes tone, helpfulness, and refusals.
All of this happens before you ever type a word. Once deployed, the weights are frozen. The model does not learn from your conversation, and it forgets everything the moment the window clears.
Everything the model can use right now has to fit in its context window. Outside that window, for the model, nothing exists.
When people say a chatbot "remembers" the conversation, the application is pasting the transcript back into this window every turn. The model itself holds nothing between requests.
Each request starts cold. The model has no diary, no notes from yesterday, no sense that you have spoken before.
The whole relevant transcript is fed in again each time. Continuity is the application replaying text, not the model recalling it.
Products that "remember" you save facts in a database and quietly inject them into the window. Useful, and entirely outside the model.
Nothing you said persists in the model. It cannot be reminded of a past chat it was never holding.
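The division of labor between the application and the model can be sketched like this. Everything here is hypothetical: `fake_model` stands in for a real model call, and `ChatApp` is an illustrative class, not any product's actual design.

```python
def fake_model(prompt: str) -> str:
    """Stand-in for a model call: stateless, sees only `prompt`."""
    return f"(reply based on {len(prompt)} characters of context)"


class ChatApp:
    """The application, not the model, holds the conversation."""

    def __init__(self, saved_facts=None):
        self.transcript = []                        # kept by the app
        self.saved_facts = list(saved_facts or [])  # e.g. from a database

    def build_prompt(self, user_message: str) -> str:
        # Saved facts are injected, then the whole transcript is replayed.
        lines = [f"Known fact: {fact}" for fact in self.saved_facts]
        lines += self.transcript
        lines.append(f"User: {user_message}")
        return "\n".join(lines)

    def send(self, user_message: str) -> str:
        prompt = self.build_prompt(user_message)
        reply = fake_model(prompt)  # a cold start on every request
        self.transcript += [f"User: {user_message}", f"Assistant: {reply}"]
        return prompt               # returned so the prompt can be inspected


app = ChatApp(saved_facts=["prefers tea"])
app.send("Hi")
print(app.send("What did I say?"))
```

The second prompt contains the first exchange only because `ChatApp` pasted it back in. Delete the transcript and the "memory" is gone, with no change to the model at all.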
A bigger window helps, and it is neither free nor uniform. Where you put information changes how well it lands.
Models tend to use the start and end of a long context well and skim the middle. A key fact buried mid-document risks being ignored.
System rules, history, documents, and your question all compete for the same token limit. Add more of one and you squeeze the rest.
Go over the limit and the oldest tokens drop off, often without warning. The model then answers as if they were never there.
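The shared budget and the silent drop-off can be sketched together. This is one possible policy under stated assumptions: tokens are represented as list items, the system rules and the question are always kept, and the oldest history is dropped first. Real applications may summarize, reject the request, or trim differently.

```python
def fit_to_window(system, history, question, limit):
    """Fit token lists into a window of `limit` tokens, oldest history first out."""
    budget = limit - len(system) - len(question)
    if budget < 0:
        raise ValueError("system rules and question alone exceed the window")
    # history[-0:] would return the whole list, so handle zero explicitly.
    kept = history[-budget:] if budget > 0 else []
    return system + kept + question


# Hypothetical token lists for illustration.
system = ["S1", "S2"]
history = ["h1", "h2", "h3", "h4", "h5", "h6"]
question = ["q1", "q2"]

window = fit_to_window(system, history, question, limit=7)
print(window)
```

With a limit of 7, only three history tokens survive and the earliest ones vanish without any error. Anything the model says next is based on the window as printed.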
A made-up citation is not the system breaking. It is the prediction engine doing exactly what it always does.
A hallucination is a confident, fluent, wrong answer, produced by the same process that produces correct answers. The model is always generating the most plausible continuation, and sometimes the most plausible text simply is not true.
Nothing in the loop compares the output against reality. Plausibility is the only target it has.
The model carries some signal of its own uncertainty, and it is weak and unreliable. Training for confident, helpful answers tends to bury the doubt that is there.
Confidence in the output is mostly a property of the writing style. Authoritative prose fills the training data, and training for helpfulness can reward confident wording, so the model leans that way by default.
A realistic-sounding fake reference scores higher than an honest "I am not sure", because hedging is rarer in the text it learned from.
Ask for something it half knows and it completes the pattern with invented specifics that fit the shape of a real answer.
Phrase a question as if a fact exists and the most likely continuation is to supply that fact, true or not.
A concrete case. In 2023 a lawyer filed a court brief built on precedents a chatbot had invented, with realistic names, reporters, and quotes, none of which existed. The wording read like genuine case law, which is exactly why it slipped through.
| Trigger | Why it happens | What you see |
|---|---|---|
| Obscure facts | Weak, thin patterns in training | Confident, specific, wrong details |
| Recent events | After the training cutoff, no data | Plausible guesses stated as current fact |
| Quotes & citations | It reconstructs the shape of a real one | Real-looking sources that do not exist |
| Niche code APIs | It blends several similar libraries | Functions and flags that were never real |
| "Are you sure?" | Agreement is a common pattern | It flips its answer either way |
None of these are random. Each row is a place where the strongest available pattern points away from the truth.
If it is only predicting text, why is it so easy to believe there is a mind behind it? The answer is partly about the model and largely about us.
For our entire history, fluent and coherent language was a reliable sign of a thinking mind. The model breaks that link, and our instincts have not caught up.
Smooth, correct sentences used to guarantee a human author. We still read competence as comprehension.
When ideas connect across paragraphs, we infer a reasoning process. The model produces those connections statistically.
Warmth, hesitation, and apology in the text read as emotion. They are learned stylistic patterns with nothing behind them.
A lot of the intelligence we perceive is contributed by the reader. We are pattern-matchers too, primed to find minds everywhere.
In the 1960s people opened up to a trivial script that just rephrased their sentences. The urge to see a mind in responsive text runs deep.
We name our cars and apologize to furniture. A system that says "I think" gets a personality assigned to it instantly.
Given fluent output, we generate the charitable reading, smooth over errors, and credit the model with our own inference.
It helps to separate the surface from the mechanism. The same response can be described two ways, and both are accurate.
Both columns describe the same event. The gap between them is where the word "intelligent" quietly slips in.
Newer reasoning models seem to think before answering, and they are genuinely better at hard problems. The mechanism is still next-token prediction, given more room to run.
They generate a long chain of intermediate tokens before the final answer. Working through steps in text really does raise accuracy.
Each step in that chain is predicted the same way as any other token. There is no separate logic engine switched on.
The visible reasoning is itself generated text. A model can produce a tidy explanation for an answer it reached for other reasons.
None of this makes the tool less valuable. It makes it predictable. Here is how the mental model pays off in practice.
Reframe the model as a fast, fluent pattern engine and its strengths line up cleanly. These are the jobs where plausible and useful are the same thing.
Summarizing, rephrasing, translating, reformatting. The source is in the window, so it has little to invent.
First drafts, variations, and ideas to react to. You are the editor, and plausible is exactly what you want.
Boilerplate, conversions, and well-trodden snippets. Strong, common patterns are where it is most reliable.
The single habit that protects you is to read output as a confident draft from a brilliant, unreliable intern. Use it, then check it.
A real reference can be verified. Treat any citation as unconfirmed until you have seen it yourself.
The higher the stakes, the more independent the confirmation needs to be. Low stakes, lighter touch.
The model can inform a judgment. It should not be the one making it where the cost of wrong is real.
This is the correct way to operate a tool that optimizes for plausible text, and it costs you very little once it becomes a habit.
One mechanism, a handful of consequences, and a mental model you can carry out of the room.
It reads and writes chunks of text as numbers, not whole words.
It scores the next token and samples one, then repeats the loop.
Its abilities are frozen patterns squeezed from a huge body of text.
It only sees what fits in the window right now, and nothing else.
Output is the likely continuation, with no check against truth.
Wrong answers arrive in the same fluent voice as right ones.
Each session starts cold unless an app replays the text for it.
Fluency plus our instincts manufacture the impression of understanding.
A fast pattern engine that drafts beautifully and cannot vouch for a word of it.
Everything it produces aims at sounding right. You supply the part that checks whether it is right.
Hand it pattern work, keep the judgment, and verify anything that matters. Then it is genuinely powerful.
It predicts extremely well. Understanding is something you still bring to the table.
A language model is a remarkable pattern engine that turns probability into fluent text. Treat the output as a probable draft, verify what matters, and it earns its place. The fluency is real. The understanding is yours to add.