Meaning in Numbers: How AI Understands Similarity

In the previous article, we explored how language models read and generate text. We saw that words are broken into tokens, tokens become numbers, and probabilities determine what comes next. At that stage, everything felt mechanical: structured, numerical, procedural.

But a deeper question naturally follows.

If models only see numbers, how do they understand meaning?

How does a system recognize that "doctor" and "hospital" are related, or that "car" and "automobile" refer to nearly the same thing?

The answer is not vocabulary in the human sense. The answer is geometry.

Modern AI systems represent meaning as positions in a high-dimensional space. Words, sentences, and even entire documents become coordinates. Distance becomes similarity. Closeness becomes relevance.

Let's decode the vocabulary behind that idea.


1. Embedding

Core Idea: An embedding is a numerical representation of meaning.

Unlike the token IDs discussed in Part 1, which simply identify pieces of text, embeddings attempt to capture semantic relationships. When you pass a sentence into an embedding model, it produces a long list of numbers. That list does not represent the text's spelling or grammar. It represents its meaning in a compressed mathematical form.

Two pieces of text with similar meaning will produce embeddings that are numerically close to each other. The model is not storing definitions like a dictionary. It is placing concepts near one another in a structured space. Meaning becomes location.
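
Here is a minimal sketch of what that looks like in practice, assuming the open-source sentence-transformers library; the model name is just one example, and any embedding model works the same way conceptually:

    # A minimal sketch: turning sentences into embedding vectors.
    # Assumes the sentence-transformers package; the model name below
    # is one example among many.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    sentences = [
        "The doctor works at the hospital.",
        "A physician treats patients in a clinic.",
        "I parked the car in the garage.",
    ]

    vectors = model.encode(sentences)   # one vector of numbers per sentence
    print(vectors.shape)                # e.g. (3, 384): 3 texts, 384 dimensions each

The first two sentences share no exact wording, yet their vectors end up much closer to each other than either is to the third. That closeness is what the rest of this article is about.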

This is one of the most important conceptual shifts in modern AI: language stops being sequence and becomes geometry.


2. Vector

Core Idea: A vector is simply an ordered list of numbers.

In AI discussions, the word vector can sound intimidating, but it's just a container for numeric values. An embedding is a vector. Each number inside it represents a learned feature of the input text.

Think of it like coordinates on a map, except instead of two dimensions (latitude and longitude), you might have hundreds or thousands of dimensions. Each dimension captures some abstract feature the model learned during training.

When engineers say "store the vectors," they mean storing these numerical representations so that meaning can later be compared mathematically.
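
To make that concrete, here is a toy sketch. The three-dimensional vectors below are invented for illustration; real embeddings have far more dimensions, but the structure is identical:

    import numpy as np

    # A toy 3-dimensional "embedding". Real models produce hundreds or
    # thousands of dimensions, but the idea is the same: an ordered list
    # of numbers, where each position is a learned feature.
    cat = np.array([0.8, 0.1, 0.3])
    dog = np.array([0.7, 0.2, 0.35])

    print(cat.shape)   # (3,) -- a vector with three components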


3. Similarity (Cosine Similarity)

Core Idea: Cosine similarity measures how close two vectors are in space.

Once text has been converted into vectors, we need a way to compare them.

Cosine similarity is one of the most common methods used to measure how close two vectors are. Instead of comparing raw values directly, it measures the angle between the vectors in space. If the angle is small, the meanings are similar. If the angle is wide, they are unrelated.

You do not need the formula to understand the idea. Imagine two arrows pointing in nearly the same direction: they represent similar concepts. If they point in completely different directions, the meanings diverge.
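
For readers who do want to see it, here is a short sketch in plain NumPy. The vectors are toy values chosen purely for illustration:

    import numpy as np

    def cosine_similarity(a, b):
        # Cosine of the angle between a and b: the dot product divided
        # by the product of the two vector lengths.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    doctor   = np.array([0.9, 0.1, 0.4])
    hospital = np.array([0.8, 0.2, 0.5])
    banana   = np.array([0.1, 0.9, 0.0])

    print(cosine_similarity(doctor, hospital))  # close to 1.0 -> similar direction
    print(cosine_similarity(doctor, banana))    # much lower   -> unrelated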

Similarity becomes something you can calculate.


4. Similarity Search

Core Idea: Similarity search means finding vectors that are closest to a given vector.

Suppose you embed the question "How do I reset my password?" The system converts it into a vector and then searches through stored vectors to find the ones closest to it in space. The results might include documentation about account recovery or password policies, even if those documents do not use the exact same wording.

This is why modern AI retrieval systems are not limited to keyword matching. They search by meaning, not just by text overlap.
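
In its simplest form, this is just a ranking over stored vectors. The sketch below uses invented document titles and toy embeddings; real systems replace the brute-force loop with the indexing structures described in the next section:

    import numpy as np

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Hypothetical stored chunks and their (toy) embeddings.
    documents = {
        "Resetting your account password": np.array([0.9, 0.2, 0.1]),
        "Configuring two-factor login":    np.array([0.7, 0.4, 0.2]),
        "Quarterly sales report template": np.array([0.1, 0.1, 0.9]),
    }

    query = np.array([0.85, 0.25, 0.15])  # pretend embedding of the question

    # Rank every stored vector by similarity to the query, best first.
    ranked = sorted(documents.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    print(ranked[0][0])   # -> "Resetting your account password"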

Similarity search is the backbone of systems like Retrieval-Augmented Generation, which we will explore in the next article.


5. Vector Database

Core Idea: A vector database is a system optimized to store and search large numbers of embeddings efficiently.

Traditional databases are built for exact matches. They are excellent at answering questions like "find rows where ID equals 42." Vector databases are designed for a different task: "find items that are closest in meaning to this input."

Because embeddings can be large and numerous, specialized indexing techniques are used to make similarity search fast at scale. Without vector databases, large AI knowledge systems would be slow and impractical.
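
As one concrete illustration, here is a minimal sketch using FAISS, a widely used open-source vector index. The dimensions and data are toy values, and the flat index shown here is the simplest (exact) variant; production systems typically use approximate indexes for speed:

    import faiss
    import numpy as np

    dim = 4                                     # real embeddings have hundreds of dimensions
    vectors = np.random.rand(1000, dim).astype("float32")

    index = faiss.IndexFlatL2(dim)              # exact (brute-force) index over L2 distance
    index.add(vectors)                          # store the embeddings

    query = np.random.rand(1, dim).astype("float32")
    distances, ids = index.search(query, 5)     # the 5 nearest stored vectors
    print(ids)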

In short, a vector database is infrastructure for geometric search.


6. Latent Space

Core Idea: Latent space is the abstract space where embeddings exist.

You cannot visualize it easily because it may contain hundreds of dimensions. But conceptually, imagine a vast map where every idea has a coordinate. Related ideas cluster together. Unrelated ideas are far apart.

The model does not manually design this space. It learns it during training by adjusting internal parameters until similar inputs consistently end up near each other.

When people say a model understands something, what they really mean is that the concept occupies a meaningful position in this learned space.


7. Chunking

Core Idea: Chunking is breaking large documents into smaller pieces before creating embeddings.

If you embed an entire 100-page document as a single vector, the representation becomes too broad and retrieval becomes imprecise. Instead, documents are divided into manageable segments, often a few hundred tokens each. Each chunk receives its own embedding.
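
A simple sketch of the idea is shown below. It splits on words rather than tokens for readability, and the chunk size and overlap are arbitrary example values:

    # A simple word-based chunker. Production systems usually split on tokens
    # and tune the sizes; the numbers here are arbitrary examples.
    def chunk_text(text, chunk_size=200, overlap=30):
        words = text.split()
        step = chunk_size - overlap            # overlap preserves context across boundaries
        chunks = []
        for start in range(0, len(words), step):
            chunks.append(" ".join(words[start:start + chunk_size]))
        return chunks

    document = " ".join(f"word{i}" for i in range(1000))   # stand-in for a long report
    chunks = chunk_text(document)
    print(len(chunks))        # each chunk would then get its own embedding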

This improves retrieval accuracy. When a user asks a question, the system can locate the specific portion of a document that is relevant, rather than retrieving an entire file.

Chunking may sound simple, but it plays a crucial role in building effective AI knowledge systems.


Bringing It Together

In Part 1, we saw how language becomes tokens and tokens become numbers. That was about structure.

In this article, we saw how numbers become meaning. That is about position.

  • Embeddings convert text into vectors.
  • Vectors live in latent space.
  • Similarity measures distance.
  • Vector databases make geometric search practical.
  • Chunking improves precision.

Once you see language as geometry, modern AI systems stop feeling mystical. They become structured, measurable systems operating on spatial relationships.

And that geometric foundation is what enables the next layer: connecting models to real-world knowledge through retrieval systems.