How LLM Training Actually Works: From Random Weights to Geometric Understanding

When people hear that LLMs "learn from data," they often imagine something like clustering algorithms grouping similar words together, or statistical correlation finding patterns. But LLM training is fundamentally different. It's not about finding similarities. It's about learning a representation that makes prediction efficient.

This distinction matters. Clustering would give you groups of similar words. But it wouldn't give you the geometric structure where E(king) - E(man) + E(woman) ≈ E(queen). That emerges from something deeper: optimizing a probabilistic objective that forces the model to discover compositional structure.

1. The Training Objective: Next-Word Prediction

At its core, LLM training is simple: given some words, predict the next word.

For example, given the sequence "The king sat on his", the model should predict "throne" with high probability, "chair" with moderate probability, and "bicycle" with very low probability.

Mathematically, the model learns a probability distribution:

P(next_word | previous_words)

The training process adjusts the model's parameters (weights in the neural network) to maximize the probability it assigns to the actual next word in the training data. This is called maximum likelihood estimation.

The Loss Function

More precisely, the model minimizes cross-entropy loss:

Loss = -log P(actual_next_word | context)

When the model assigns high probability to the correct word, the loss is low (good). When it assigns low probability, the loss is high (bad). Training adjusts parameters to reduce this loss across billions of examples.
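The loss above can be computed in a few lines of plain Python. The three-word vocabulary and logit values below are made up for illustration; real models produce logits over tens of thousands of tokens.

```python
import math

def cross_entropy_loss(logits, target_index):
    """Cross-entropy for one prediction: -log P(target | context).

    `logits` are the model's raw scores over the vocabulary;
    softmax turns them into a probability distribution.
    """
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    prob_target = exps[target_index] / total
    return -math.log(prob_target)

# Toy vocabulary: ["throne", "chair", "bicycle"]
logits = [4.0, 2.0, -1.0]             # model strongly favors "throne"
print(cross_entropy_loss(logits, 0))  # low loss: correct word got high probability
print(cross_entropy_loss(logits, 2))  # high loss: "bicycle" got low probability
```

Training sums this quantity over billions of (context, next word) pairs and nudges the parameters to shrink it.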

The Infinite to Finite Compression

Here's the fundamental problem: the conditional probability space is infinite, but the geometric representation must be finite.

In Bayesian terms, P(next_word | previous_words) conditions on every possible sequence of previous words. For a vocabulary of 50,000 words and contexts of length 10, that's 50,000^10 possible conditioning contexts; for length 100, it's 50,000^100. Because contexts can be arbitrarily long, the space of possible contexts is combinatorially explosive and, for all practical purposes, infinite.

But the model only has finite parameters: perhaps 7 billion weights, and embeddings in 4096 dimensions. How can a finite geometric space represent an infinite conditional probability distribution?

The answer: compression through shared structure. The model cannot memorize P(next | context) for every possible context. Instead, it learns a finite geometric mapping that approximates the infinite probability distribution by exploiting regularities.

The key insight: most of those infinite contexts produce similar probability distributions. Consider:

  • "The king walked to the..."
  • "The queen walked to the..."
  • "The emperor walked to the..."
  • "The pharaoh walked to the..."

These are four different contexts in the infinite space, but they all predict similar next words: "throne," "palace," "castle," etc. The model doesn't need separate representations for each. It can map all four to nearby points in embedding space and compute P(next | context) as a function of geometric proximity.

This is lossy compression. The model trades perfect accuracy on every possible context for efficient generalization. By embedding contexts into finite-dimensional space, it forces contexts with similar predictive distributions to occupy similar geometric regions. The geometry becomes a compressed index into the infinite probability space.
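This lookup can be sketched in plain Python. The 2-D embeddings and context vectors below are hand-picked toy values, not from a real model: two nearby context vectors yield nearly identical next-word distributions.

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_word_dist(context_vec, output_embeddings):
    """P(next | context) as a softmax over context-embedding dot products."""
    scores = [sum(c * w for c, w in zip(context_vec, vec))
              for vec in output_embeddings.values()]
    return dict(zip(output_embeddings, softmax(scores)))

# Hypothetical 2-D output embeddings for candidate next words.
vocab = {"throne": [2.0, 0.5], "palace": [1.8, 0.6], "bicycle": [-1.5, 0.2]}

# "The king walked to the..." and "The queen walked to the..." land nearby.
king_ctx  = [1.00, 0.30]
queen_ctx = [0.95, 0.35]

print(next_word_dist(king_ctx, vocab))
print(next_word_dist(queen_ctx, vocab))  # nearly identical distribution
```

Because both contexts sit in the same region of the space, the geometry answers "what comes next?" once for the whole region rather than once per context.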

The transformer's attention mechanism is the function that performs this mapping: context → finite vector. No matter how long or complex the context, attention compresses it into a fixed-size representation (the hidden state). This compression is learned to preserve the information most relevant for prediction.

Regularities as Symmetries

In physics and mathematics, many regularities can be described as symmetries. A symmetry is a transformation that leaves something unchanged. And symmetries enable compression.

Consider the linguistic symmetry: "king is to queen as man is to woman." This is a substitution symmetry. If you replace (king, man) with (queen, woman) in certain contexts, the grammatical structure and semantic relationships remain valid:

  • "The king/queen ruled the kingdom"
  • "The man/woman walked home"
  • "The king/queen and his/her spouse"

This symmetry means the model doesn't need to learn separate rules for "king" contexts and "queen" contexts. It can learn one rule parameterized by a direction in embedding space (the gender vector). The symmetry compresses two distinct cases into a single geometric transformation.

Language is full of these symmetries:

  • Conjugation symmetry: run/running/ran follow the same pattern as walk/walking/walked
  • Plural symmetry: cat/cats behaves like dog/dogs
  • Comparative symmetry: good/better/best parallels bad/worse/worst
  • Analogy symmetry: Paris:France :: Rome:Italy
  • Substitution symmetry: Any sentence with "blue car" can swap to "red car" with minimal structural change

Each symmetry is a dimension of invariance: a way that contexts can differ while producing similar predictions. The embedding space learns to encode these symmetries as geometric transformations (rotations, translations, scalings). This is why vector arithmetic works: linguistic symmetries become geometric symmetries.
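A toy check of one such symmetry, with hand-built embeddings (the coordinates are invented for illustration, not taken from a trained model): the singular-to-plural offsets of two word pairs point in almost the same direction.

```python
import math

def diff(a, b):
    return [x - y for x, y in zip(a, b)]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = lambda w: math.sqrt(sum(x * x for x in w))
    return dot / (norm(u) * norm(v))

# Hand-built toy embeddings: each singular/plural pair differs by
# roughly the same offset along the third axis.
E = {
    "cat": [1.00, 0.2, 0.1], "cats": [1.05, 0.2, 1.1],
    "dog": [0.80, 0.9, 0.2], "dogs": [0.80, 0.9, 1.2],
}

plural_cat = diff(E["cats"], E["cat"])  # [0.05, 0.0, 1.0]
plural_dog = diff(E["dogs"], E["dog"])  # [0.0, 0.0, 1.0]
print(cosine(plural_cat, plural_dog))   # close to 1.0: one shared "plural" direction
```

In a real model the shared direction is never exact, but cosine similarities between such difference vectors are consistently high, which is what makes the symmetry reusable.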

In physics, Noether's theorem states that every continuous symmetry corresponds to a conservation law. In LLMs, every linguistic symmetry corresponds to a reusable geometric structure. The model conserves predictive patterns across symmetric transformations, allowing it to generalize from finite training data to the effectively infinite space of possible contexts.

2. Why This Creates Geometric Structure

Here's the key insight: to predict well, the model must discover compositional structure.

Consider these training examples:

  • "The king ruled wisely"
  • "The queen ruled wisely"
  • "The emperor ruled wisely"
  • "The pharaoh ruled wisely"

The naive approach: memorize that "ruled" follows each of these words individually. But this doesn't scale. There are thousands of ruler-related words, and millions of possible continuations.

The efficient approach: learn that there's a category of "rulers" that all share certain patterns. If the model can represent "king," "queen," "emperor," and "pharaoh" as points that are geometrically close, it can generalize: anything in that region of space is likely to be followed by "ruled," "commanded," "decreed," etc.

The geometry emerges because it's the efficient way to compress the prediction function. Instead of learning separate parameters for each word, the model learns a shared structure: a region of embedding space corresponds to a type of entity, and proximity in that space means similar predictive behavior.

3. How Directions Encode Relationships

Now consider these pairs:

  • "The king and his wife" / "The queen and her husband"
  • "The man walked home" / "The woman walked home"
  • "The boy played games" / "The girl played games"

The model sees a pattern: certain pairs of words differ in a consistent way. They appear in parallel grammatical structures, substitute for each other in gendered contexts, and trigger different pronoun agreements.

The efficient way to encode this? Make the vector difference between paired words point in the same direction. If:

  • E(king) - E(queen) = v_gender
  • E(man) - E(woman) = v_gender
  • E(boy) - E(girl) = v_gender

Then the model can learn a single parameter set that handles gender agreement for all gendered word pairs. The direction v_gender becomes a reusable component. This is massively more efficient than learning separate rules for each pair.

This is why E(king) - E(man) + E(woman) ≈ E(queen) works. It's not a coincidence or a clever trick. It's the optimal solution to the prediction task. By arranging these four words in a parallelogram:

  • king and queen share the "royalty" component
  • man and woman share the "common person" component
  • king and man share the "male" component
  • queen and woman share the "female" component

The model can now make predictions efficiently: "If I see 'king,' I should predict similar things to 'queen' (royalty context) but with male pronouns instead of female ones."
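The parallelogram can be sketched with made-up 2-D coordinates, one "royalty" axis and one "gender" axis (real models use hundreds of dimensions, and the relation holds only approximately):

```python
import math

# Toy embeddings arranged as a parallelogram.
E = {
    "king":  [1.0, 1.0],   # royal,  male
    "queen": [1.0, 0.0],   # royal,  female
    "man":   [0.0, 1.0],   # common, male
    "woman": [0.0, 0.0],   # common, female
}

def nearest(vec, embeddings):
    """Word whose embedding is closest (Euclidean) to `vec`."""
    return min(embeddings, key=lambda w: math.dist(embeddings[w], vec))

# E(king) - E(man) + E(woman) lands on E(queen).
target = [k - m + w for k, m, w in zip(E["king"], E["man"], E["woman"])]
print(nearest(target, E))  # "queen"
```

In practice, analogy queries on trained embeddings exclude the three input words from the nearest-neighbor search, since the query vector often sits closest to one of them.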

4. Gradient Descent: How the Geometry Forms

The model doesn't design this structure. It discovers it through gradient descent.

Initially, the embeddings are random. The model makes terrible predictions. But then:

  1. Compute loss: How wrong were the predictions?
  2. Compute gradients: Which direction should each embedding move to reduce loss?
  3. Update embeddings: Move them slightly in that direction
  4. Repeat billions of times

After seeing "The king ruled" many times, the gradient pushes the embedding for "king" closer to other words that appear before "ruled." After seeing "The queen ruled," the gradient pushes "queen" in the same direction.

But the model also sees "The king and his wife" and "The queen and her husband." These examples create gradients that push "king" and "queen" apart in a specific direction (gender). Meanwhile, "The man and his wife" and "The woman and her husband" create the same directional separation between "man" and "woman."

The parallelogram structure emerges because it satisfies both constraints simultaneously. Gradient descent finds the geometric arrangement that minimizes total loss across all training examples.
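A drastically simplified sketch of this loop, using a squared-error surrogate loss rather than a real language-model loss: two embeddings start random, and repeated gradient steps pull both toward the output vector of their shared next word, leaving them close together.

```python
import random

random.seed(0)
dim = 4
# Random initial embeddings: the model starts out knowing nothing.
E = {w: [random.uniform(-1, 1) for _ in range(dim)] for w in ("king", "queen")}
ruled = [0.5, -0.2, 0.8, 0.1]  # fixed output vector for the observed next word

def loss(e):
    """Surrogate loss: squared distance to the next word's output vector."""
    return sum((a - b) ** 2 for a, b in zip(e, ruled))

lr = 0.1
for step in range(200):
    for w in E:                                           # one example per word
        grad = [2 * (a - b) for a, b in zip(E[w], ruled)]  # d(loss)/d(embedding)
        E[w] = [a - lr * g for a, g in zip(E[w], grad)]    # gradient step

print(loss(E["king"]), loss(E["queen"]))  # both near zero
```

After training, "king" and "queen" occupy the same region because both were repeatedly pulled toward the same prediction target; in a real model, the competing "gender" examples simultaneously push them apart along one direction.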

5. Why Not Just Clustering?

Clustering algorithms like k-means would group similar words together, but they wouldn't create the compositional structure. Here's why:

  • Clustering optimizes for similarity: Put similar items in the same cluster
  • LLM training optimizes for prediction: Arrange embeddings so probability calculations are accurate and efficient

Clustering might put "king" and "queen" near each other (they're similar). But it has no reason to ensure that the direction from "man" to "woman" matches the direction from "king" to "queen." That structure only emerges when you're trying to predict next words and you discover that reusable directional components reduce loss.

Furthermore, clustering operates on a fixed notion of similarity (usually cosine or Euclidean distance). LLM training learns what similarity means by adjusting embeddings to make the prediction task easier. The geometry is learned end-to-end, not imposed.

6. The Role of Context: Beyond Static Embeddings

Early word embeddings (Word2Vec, GloVe) assigned one vector per word. But this fails for words with multiple meanings:

  • "I went to the bank to deposit money" (financial institution)
  • "I sat on the river bank" (edge of water)

Modern LLMs use contextual embeddings where the same word gets different vectors depending on surrounding words. This is what the transformer architecture does with its attention mechanism.

Attention as Geometric Mixing

The attention mechanism computes how much each word in the context should influence the embedding of the current word. Mathematically:

  1. Each word starts with an initial embedding
  2. The model computes attention weights: which other words are relevant?
  3. The final embedding is a weighted combination of all words' contributions
  4. These combined embeddings are used to predict the next word

For "bank" in a financial context, the model attends to words like "deposit," "money," "account." These pull the embedding toward the financial region of space. For "bank" in a geographical context, words like "river," "water," "shore" pull it toward the geographical region.

Attention is geometric mixing. The model learns which directions in embedding space to emphasize based on context. The final contextual embedding is a point in space that has been shifted according to the semantic environment.
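A minimal sketch of this mixing, with plain dot-product scores and invented 2-D vectors (real attention also applies learned query/key/value projections, a scaling factor, and multiple heads):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(query, keys, values):
    """One attention step: output = sum_i softmax(q . k_i) * v_i."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Hypothetical axes: dimension 0 ~ "financial", dimension 1 ~ "geographical".
bank_query = [1.0, 1.0]  # ambiguous "bank"
context = {
    "deposit": [2.0, 0.0],
    "money":   [1.5, 0.0],
    "the":     [0.1, 0.1],
}
keys = values = list(context.values())
mixed = attend(bank_query, keys, values)
print(mixed)  # pulled toward the "financial" axis by "deposit" and "money"
```

Swap the context for "river" and "shore" vectors and the same mechanism pulls the embedding toward the geographical region instead.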

7. Why Higher Dimensions?

You might wonder: why 768, 1024, or 4096 dimensions? Why not 10 or 50?

The answer is capacity. Human language has thousands of overlapping, cross-cutting distinctions:

  • concrete vs. abstract
  • positive vs. negative sentiment
  • formal vs. informal register
  • animate vs. inanimate
  • past vs. present vs. future
  • agent vs. patient vs. instrument
  • literal vs. metaphorical
  • and thousands more...

Each distinction can be thought of as a direction of meaning. To represent all these distinctions simultaneously, you need enough dimensions that the directions don't interfere with each other.

In low dimensions, you're forced to make trade-offs: represent sentiment accurately but lose temporal information, or vice versa. In high dimensions, you can represent everything at once. Each dimension can capture a different aspect of meaning, and the geometry becomes rich enough to encode the full complexity of language.
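This is easy to check numerically: random directions in high-dimensional space are nearly orthogonal, so many distinctions can coexist with little interference. A small Monte Carlo sketch in plain Python:

```python
import math
import random

random.seed(1)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda w: math.sqrt(sum(x * x for x in w))
    return dot / (norm(u) * norm(v))

def avg_abs_cosine(dim, trials=200):
    """Average |cosine| between random direction pairs in `dim` dimensions."""
    total = 0.0
    for _ in range(trials):
        u = [random.gauss(0, 1) for _ in range(dim)]
        v = [random.gauss(0, 1) for _ in range(dim)]
        total += abs(cosine(u, v))
    return total / trials

print(avg_abs_cosine(3))    # substantial overlap in low dimensions
print(avg_abs_cosine(768))  # nearly orthogonal: room for many distinctions
```

In 3 dimensions, random directions overlap heavily; in 768, a typical pair is almost perpendicular, which is why a model can pack thousands of cross-cutting distinctions into one space.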

8. The Homomorphism Emerges

Recall from Language as Geometry that training creates an approximate homomorphism: linguistic operations map to geometric operations.

This happens because:

  1. The training objective is compositional: Predict words based on combinations of context
  2. Vector spaces are naturally compositional: Vectors combine through addition/subtraction
  3. Gradient descent finds the mapping: Align linguistic composition with vector composition to minimize loss

The homomorphism isn't perfect. Language has quirks, exceptions, and non-compositional idioms. But statistically, across billions of examples, there's enough compositional structure that the geometric approach works remarkably well.

9. What the Model Learns vs. What It Memorizes

A common misconception: "The model just memorizes the training data."

This is false. The model learns a compressed representation of the statistical patterns. Consider:

  • Training data: trillions of tokens
  • Model parameters: billions (much smaller than the data)
  • Embeddings: tens of thousands of words × ~1000 dimensions

There's no way to memorize all the data. Instead, the model must compress: extract the regularities and discard the noise. The geometric structure is this compression.

When you see E(king) - E(man) + E(woman) ≈ E(queen), you're seeing the model's compressed representation of millions of examples where gendered pairs appeared in parallel contexts. The model didn't memorize each example. It abstracted the pattern and encoded it as a geometric relationship.

10. Why This Approach Works for Language

The success of geometric embeddings reveals something deep about language itself: language has statistical regularities that align with compositional geometry.

This isn't guaranteed. We could imagine a communication system where:

  • Every word's meaning depends entirely on arbitrary context
  • No compositional rules exist
  • Relationships don't transfer across examples
  • Past patterns don't predict future usage

Such a language would be unlearnable by neural networks (and probably by humans too). The fact that LLMs work is evidence that human language is compositional, regular, and geometric.

Linguists have long theorized about compositionality (the principle that the meaning of a phrase is a function of the meanings of its parts). LLMs provide empirical confirmation: compositionality isn't just a theoretical ideal, it's statistically real, and it can be exploited through geometric representations.

11. Limitations and Failure Modes

This geometric approach has limits:

  • Discrete logic: Concepts like negation, quantification, and formal implication don't map cleanly to continuous geometry. The model approximates them, but imperfectly.
  • Counting and arithmetic: These are symbolic operations, not geometric ones. LLMs struggle with exact numerical reasoning.
  • Novel combinations: If a concept never appeared in training (even implicitly), there's no guarantee the geometric interpolation will make sense.
  • Long-range dependencies: Predictions based on information very far back in context degrade because the geometric signal gets diffused across many attention steps.

These failures reveal the boundary between what geometry can and can't represent. Language is mostly compositional and continuous, but it has discrete and symbolic aspects that resist geometric encoding.

Conclusion: Learning the Geometry of Meaning

LLM training isn't clustering. It's not pattern matching. It's learning a geometric representation that makes probabilistic prediction efficient.

Through billions of gradient descent steps, the model discovers that:

  • Semantic similarity should become spatial proximity
  • Consistent relationships should become consistent directions
  • Compositional meaning should emerge from vector arithmetic
  • Context should geometrically shift embeddings toward relevant regions

None of this is designed. It emerges from optimizing a simple objective: predict the next word. The geometry is the model's solution to that problem.

When you use an LLM, you're querying this learned geometry. You're injecting a prompt into the space, letting the model compute geometric transformations through attention layers, and reading out the result as a probability distribution over words.

The map is learned. The territory is language. The geometry is the connection between them.
