Language as Geometry

The central question: Why does language—a discrete, symbolic system of words—map so effectively onto continuous vector geometry? Why does a mathematical operation like King - Man + Woman result in Queen?

To understand why this is possible, we have to leave the world of text and enter the world of dimensions.

Imagine being a Circle trapped inside a Square jail in Flatland. You slide along the 2D plane, but the walls are impenetrable. There is no escape. Then you dream of a third dimension—and suddenly you rise above the plane, look down, and simply step over the walls. What was an inescapable prison is now just lines on the floor.

This is the key insight: problems that are impossible to solve in a lower dimension become trivial in a higher one. By lifting into 3D, the inescapable 2D prison becomes something you simply step out of. The higher dimension offers degrees of freedom that do not exist below.

Mathematicians call this kind of structure-preserving lift a homomorphism. The relationships that existed in 2D are preserved in 3D—the Circle is still a Circle, the Square jail still has the same shape—but now there is room to move in ways that were previously impossible.

This is exactly what we do to language. But what is the Flatland we are escaping, and how do we construct the lift?

The Ground Truth: Bayesian Probability

The Flatland of language is Bayesian Probability. It is the statistical reality of natural language—a massive, fixed space defined by how humans actually speak. In this world, words are discrete symbols trapped in probability tables, and relationships are defined by rigid conditional ratios:

  • How likely is "bank" to appear given the context "river"? P(bank | river)
  • How likely is "bank" to appear given the context "money"? P(bank | money)

In this flat world, you cannot subtract "Man" from "King." The word "King" is just a symbol in a table, unable to mathematically interact with other words.
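To make the flatness concrete, here is a minimal sketch of what the probability world gives you, using a hypothetical toy corpus: counts and conditional ratios, nothing more.

```python
# Hypothetical toy corpus: each "sentence" is a small bag of words.
corpus = [
    ["river", "bank", "water"],
    ["river", "bank", "fish"],
    ["money", "bank", "loan"],
    ["money", "bank", "vault"],
    ["money", "cash", "loan"],
]

def cond_prob(word, context):
    """Estimate P(word | context) by counting co-occurrences."""
    context_sents = [s for s in corpus if context in s]
    hits = sum(1 for s in context_sents if word in s)
    return hits / len(context_sents)

print(cond_prob("bank", "river"))  # 1.0 -- "bank" appears in every "river" sentence
print(cond_prob("bank", "money"))  # 0.666... -- 2 of the 3 "money" sentences
```

Notice what is missing: there is no operation here that could take "bank" minus "river". The symbols only index rows in a table.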

The Lift: Embedding into Vector Space

The higher dimension we escape into is Vector Space. By embedding words into high-dimensional space—often thousands of dimensions1—we give them freedom of movement. The word "King" becomes a point in space, a Vector. Now you have room to perform complex movements: take the concept of "King," subtract the "Man" component, add "Woman," and land in the immediate neighborhood of "Queen."
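A minimal sketch of that freedom, using hand-crafted 3-dimensional vectors (hypothetical numbers chosen so the analogy works exactly; real embeddings learn thousands of dimensions from data):

```python
import math

# Hand-crafted 3-D "embeddings" (hypothetical numbers; real models learn them).
# Axes: roughly [royalty, maleness, other].
vecs = {
    "king":  [0.9,  0.8, 0.1],
    "queen": [0.9, -0.8, 0.1],
    "man":   [0.1,  0.8, 0.2],
    "woman": [0.1, -0.8, 0.2],
    "apple": [0.0,  0.0, 0.9],
}

def cos(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

# King - Man + Woman, component by component:
target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]

# Nearest neighbor by cosine similarity, excluding the query words themselves
# (standard practice for analogy queries).
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cos(target, vecs[w]))
print(best)  # prints "queen"
```

In real embedding spaces the arithmetic is approximate—the result lands near "Queen" rather than on it—which is why analogy lookups use nearest-neighbor search instead of exact equality.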

We solve problems in geometry because the higher-dimensional space offers symmetries and operations that are impossible in the lower-dimensional space of discrete symbols.

The Bridge: The Logarithmic Homomorphism

But the lift alone is not enough. We need a homomorphism—a structure-preserving map—that translates the rules of the Probability World into the rules of the Vector World. We demand that the Dot Product (our measure of vector similarity) aligns with the Logarithm of the Probability:

u · v ≈ log(P(context | word))

This is why the vector arithmetic works:

  • In Probability (Flatland): Relationships are ratios (B / A)
  • In Geometry (Vector Space): Relationships are differences (B - A)
  • The Homomorphism: The Logarithm converts ratios into subtraction: log(B / A) = log(B) - log(A)
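The identity is elementary, but it carries the whole bridge. A quick numeric check, with made-up probabilities:

```python
import math

# Hypothetical corpus probabilities, purely for illustration.
p_king = 0.004
p_man = 0.020

ratio = p_king / p_man                      # the relationship in Flatland
diff = math.log(p_king) - math.log(p_man)   # the same relationship after the lift

# The logarithm carries the ratio to a subtraction, exactly:
assert abs(math.log(ratio) - diff) < 1e-12
```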

The Mechanism: Cross-Entropy Forcing

This geometric structure isn't magic; it is forced. The vectors don't start out knowing this map. We have to teach them.

We use Cross-Entropy Loss as the objective function—the "judge"—that trains the model. The training process works like a pressure cooker:

  1. The Guess: The model calculates a dot product.
  2. The Check: It compares that guess to the word that actually appears next in the text.
  3. The Force: If the guess is wrong, Cross-Entropy penalizes the model.

To minimize this penalty, the model has no choice but to adjust its vectors until their dot products closely mirror the log-likelihoods of the text.

So why does King - Man + Woman = Queen? Because the logarithm turns probability ratios into vector subtraction. The relationship between "King" and "Man" exists as a ratio in Bayesian space; in vector space, that same relationship becomes a direction you can subtract and add.

We treat language as geometry not because words are actually shapes, but because we have mathematically coerced the vector space to behave like a logarithmic map of the Bayesian territory. We lift the data out of Flatland and into a space where we can finally move freely between ideas.

This dimensional lift is an instance of a more general pattern. See Computation as Dimensional Expansion and Compression for how this principle applies beyond language models.


1 Dimensions and parameters are different. Dimensions are the size of each word's vector—"King" might be a list of 4,096 numbers. Parameters count every trainable number in the entire model. A 4,096-dimensional embedding for 50,000 words is already 200 million parameters, before counting anything else. Models have billions of parameters; the space words move through has thousands of dimensions.