
Let's talk about Embeddings

Elia Weiss

Embeddings are the building blocks of LLM algorithms. They are vectors (lists of numbers) that encode the semantic meaning of a word or sentence.

This definition gets thrown around casually in AI talks nowadays, but to me, the idea that a list of numbers can represent meaning is straight-up mind-blowing.

When I first came across this idea a few years ago, I was already a solid software engineer with at least a basic grasp of how most algorithms work. But in this case, I couldn't wrap my head around what kind of algorithm could take a list of numbers and somehow encode "semantic meaning." Like, how do you even define that in the first place?

There's no "regular" algorithm that does this. By that, I mean something you could follow step-by-step on paper and get the same result as a computer. Technically, you could run an LLM manually—forward pass, backprop, all that—but not only would it take you a few lifetimes, you'd also come out of it with zero intuition about what's really going on.

That's because the semantic meaning in an embedding is actually an emergent property of training. In other words—no one fully understands how it happens.

  • We can prove that meaning gets encoded, since vector distances line up with how we perceive similarity.
  • We can fine-tune training to make embeddings better reflect those perceptions.
  • But we can't describe the exact steps that generate them, or directly control the process.

And that's a crucial difference between regular algorithms and machine learning. In a traditional algorithm, you can pinpoint a bug, tweak one line of code, and the rest stays intact. In ML, your only option is to retrain with different hyperparameters and hope for better benchmark results. It's slow, expensive, and often, improving one thing breaks another.

To summarize so far: Embeddings are lists of numbers that, somehow, capture the semantic meaning of text as an emergent property of ML training.


To build on that, I want to dig into three things:

  • What "emergent properties" really are, and why they matter in ML
  • How embeddings actually capture semantic meaning
  • How LLMs use embeddings to generate the next token

What "emergent properties" really are, and why they matter in ML

My take on "emergent properties" is that they're qualities that show up in a system as a whole but can't be directly traced to any single part of it.

This ties back to the famous Aristotle quote: "The whole is greater than the sum of its parts." A classic example:

  • Consciousness arises from interactions between neurons, even though no individual neuron is conscious.

People often toss around the term emergent property, but for me, it's a fundamental concept in machine learning. We don't fully understand how these properties arise, but we can nudge training processes to produce the ones we want.

So when folks argue about whether LLMs are "conscious," the real question should be: Do they exhibit an emergent property that resembles consciousness? It may sound like a semantic distinction, but it's crucial—it acknowledges that:

  1. We don't truly understand how ML models work internally
  2. We still don't have a clear definition of what consciousness even is

I used "consciousness" as an example—which might be a hot topic—but this logic applies to any property we assign to ML, like whether embeddings really capture semantic meaning.

How embeddings actually capture semantic meaning

Embeddings are points in a multi-dimensional vector space, where the closer two points are, the more semantically similar they are—at least according to human perception.

At this point, I could try to give a simplified explanation of "multi-dimensional vector space"—like comparing it to a 2D x-y graph where closeness means similarity. But I won't, for two reasons:

  1. There are already tons of YouTube explainers that do exactly that.
  2. As we've already established, this is just one interpretation of an emergent property we don't fully understand.

So let's just say:

When two embeddings are close together, it usually means the model has learned that they appear in similar contexts or share related meanings.
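To make "close together" concrete, here is a minimal sketch of how similarity between embeddings is commonly measured, using cosine similarity. The vectors are invented toy values for illustration, not output from a real model, which would use hundreds or thousands of dimensions:

```python
import numpy as np

# Toy 4-dimensional "embeddings"; the numbers are made up for illustration.
cat    = np.array([0.9, 0.1, 0.3, 0.0])
kitten = np.array([0.85, 0.15, 0.35, 0.05])
car    = np.array([0.1, 0.9, 0.0, 0.4])

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means 'pointing the same way'."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(cat, kitten))  # high: the two points are "close" in the space
print(cosine_similarity(cat, car))     # lower: the points are further apart
```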

But embeddings have another fascinating trait:

Mathematical operations on vectors often reflect how we expect those concepts to behave semantically in language.

A classic example:

King - Queen ≈ Man - Woman, or equivalently, King - Man + Woman ≈ Queen

In other words, vector math on embeddings can mimic real-world relationships in a way that aligns with human intuition.
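Here is a minimal sketch of how that analogy is usually tested in code: add and subtract vectors, then look for the nearest word in the vocabulary. The 3-dimensional vectors below are invented so the arithmetic works out cleanly; real trained embeddings are high-dimensional and far messier:

```python
import numpy as np

# Invented vectors where one axis loosely tracks "royalty" and another "gender".
vocab = {
    "king":  np.array([0.9, 0.9, 0.1]),
    "queen": np.array([0.9, 0.1, 0.1]),
    "man":   np.array([0.1, 0.9, 0.2]),
    "woman": np.array([0.1, 0.1, 0.2]),
    "apple": np.array([0.0, 0.5, 0.9]),
}

def nearest(vector, exclude=()):
    """Find the vocabulary word whose embedding points most nearly in the same direction."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vocab if w not in exclude), key=lambda w: cos(vector, vocab[w]))

analogy = vocab["king"] - vocab["man"] + vocab["woman"]
print(nearest(analogy, exclude={"king", "man", "woman"}))  # -> "queen"
```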

Those kinds of vector arithmetic analogies were one of the early signs that embeddings captured something deeper than surface-level statistics. When relationships like king → queen or Paris → France show up as roughly consistent vector offsets, it suggests that the model has internalized abstract relationships — gender, geography, roles, etc.

That said, it's also worth noting that this effect is not perfect and tends to appear most clearly in simpler word embeddings. In larger, contextual models (like GPT), the representations are more dynamic and depend on surrounding context, so the clean linear relationships are fuzzier — but the underlying idea still holds: semantic structure emerges in vector space in a way that mirrors how we think about meaning.

How LLMs use embeddings to generate the next token

This might feel almost redundant, since the whole premise boils down to:

LLMs are essentially mathematical operators that act on embeddings, with the emergent property that their output can be mapped to the statistically most likely next token.

But still, I want to zoom in on two of those "operators":

  • Attention
  • Linear forward layer

Why? Because the fact that they actually work the way they do never stops blowing my mind.

Attention

When people talk about attention, they often describe how the Query, Key, and Value interact using matrix multiplication to produce a context-aware version of the original embedding.

But in our context, that's just an implementation detail—one researchers found effective for manipulating embeddings in a way that leads to the desired emergent properties. So instead, I want to focus on a more interesting question: What emergent property do we want the Attention operator to exhibit?

Consider the word bank in these two contexts:

  1. I need to go to the bank to deposit a check. → financial institution
  2. They had a picnic by the bank of the river. → riverbank

In both cases, the initial embedding for bank starts out the same. But what we want is for the Attention operator to adjust that embedding based on the surrounding words—so it ends up capturing the contextual meaning of bank in each sentence.
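You can observe this yourself with an off-the-shelf contextual model. Here's a rough sketch using the Hugging Face transformers library with bert-base-uncased (chosen here purely as a convenient example): it extracts the vector for "bank" in each sentence and compares them.

```python
# pip install torch transformers
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(word: str, sentence: str) -> torch.Tensor:
    """Return the contextual embedding of `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v1 = embedding_of("bank", "I need to go to the bank to deposit a check.")
v2 = embedding_of("bank", "They had a picnic by the bank of the river.")

# Same word, different contexts -> noticeably different vectors.
print(torch.nn.functional.cosine_similarity(v1, v2, dim=0).item())
```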

At a high level, the Attention operator adjusts a word's embedding by mathematically interacting with the embeddings of its surrounding words—essentially letting each word "attend" to its neighbors and reshape its meaning based on context.

In effect, each token's representation becomes a weighted blend of the others, allowing it to capture subtle contextual nuances.
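To show the shape of that computation, here is a bare-bones, single-head sketch of scaled dot-product attention in NumPy. The learned projection matrices are replaced by random placeholders; a real model learns them during training and stacks many heads and layers:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(embeddings, d_k=16, rng=np.random.default_rng(0)):
    """embeddings: (seq_len, d_model) -> context-aware vectors of the same shape."""
    d_model = embeddings.shape[1]
    # In a trained model these projections are learned; here they are random placeholders.
    W_q = rng.normal(size=(d_model, d_k))
    W_k = rng.normal(size=(d_model, d_k))
    W_v = rng.normal(size=(d_model, d_model))

    Q, K, V = embeddings @ W_q, embeddings @ W_k, embeddings @ W_v
    # How much each token should "pay attention" to every other token.
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (seq_len, seq_len)
    # Each output vector is a weighted blend of all the value vectors.
    return weights @ V                          # (seq_len, d_model)

sentence = np.random.default_rng(1).normal(size=(6, 32))  # 6 tokens, 32-dim embeddings
print(single_head_attention(sentence).shape)              # (6, 32)
```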

And the emergent property we want from the Attention operator is:

how much each word should "pay attention" to every other word in the sequence.

That's an oversimplification, but it's accurate enough if the goal is to capture the essence of the model's emergent behavior rather than the math tricks that make it work.

With two key caveats:

  1. Attention isn't a single, well-defined operator. It consists of multiple attention layers, each with multiple heads. Each head learns to extract a different perspective or representation, and together they form a rich, context-aware embedding of the sentence—one that can then be passed into a linear layer to predict the next token.
  2. Attention is specifically designed to be parallelized on GPUs. The emergent properties only surface when training on massive datasets, and without GPU-level parallelization, the whole thing just wouldn't scale.

Linear forward layer

The linear (forward) layer simply maps one semantic vector space to another. In the context of LLMs, after all the layers and math have shaped a context-aware embedding of the sentence, this layer's job is straightforward:

Just map that final vector to the next most likely token.

In technical terms:

translating the model's internal, high-dimensional representation into a probability distribution over the vocabulary.

In other words, after all the attention layers have distilled contextual meaning into a single embedding, the linear layer maps that vector onto the token space. The resulting emergent property is

"Given this context vector, which token is most likely to come next?"

A linear (forward) layer is one of the most basic building blocks in machine learning—you'll find it everywhere. Mathematically, it's just:

output = (weights × input) + bias

Simple on its own, but when you stack many of these layers together with non-linear functions between them, they become incredibly powerful—capable of learning almost any pattern from data. The rest is mostly clever optimization and scaling.
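As a toy sketch of that last step, with a made-up five-word vocabulary and random weights (so the "prediction" is meaningless), the final projection and the jump from scores to probabilities look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]   # a toy vocabulary
d_model = 8                                  # toy hidden size

# The final context-aware embedding produced by all the attention layers.
context_vector = rng.normal(size=d_model)

# The linear layer: output = (weights x input) + bias,
# projecting the hidden space onto one score (logit) per vocabulary token.
weights = rng.normal(size=(len(vocab), d_model))
bias = rng.normal(size=len(vocab))
logits = weights @ context_vector + bias

# Softmax turns the logits into a probability distribution over the vocabulary.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

next_token = vocab[int(np.argmax(probs))]
print(dict(zip(vocab, probs.round(3))), "->", next_token)
```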


And that's it. That's the mystery, and the beauty, of emergent behavior: the model just transforms embeddings through layers until the next most likely token pops out.

Wait... but what?!

So far, I've focused on what embeddings are, what they represent, and how they're used to predict the next token.

But I've kind of glossed over the biggest question of all—how are they actually generated?

At this point, I'm supposed to throw something like this at you:

Embeddings (and the model itself) are trained through a process called backpropagation, where we repeatedly run training data through the model, compare its output to the expected result using a loss function, compute the derivative for each weight, and then tweak those weights slightly toward the "correct" answer.

But that always left me with a nagging "chicken vs egg" feeling. To run a forward pass, you need embeddings. But to get embeddings, you need training—which itself requires forward passes…

Until I realized—that's actually a pretty good analogy.

We start with a completely deformed egg, and the model iteratively hatches better and better versions of it… Until eventually, we end up with something that actually looks and behaves like a real egg.
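In code, the "deformed egg" is just a randomly initialized embedding table. Here is a heavily simplified PyTorch-style sketch (random fake data, toy sizes, no attention) of how backpropagation reshapes those embeddings a little on every pass:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32  # toy sizes

# The "deformed egg": the embedding table starts out as random numbers with no meaning.
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Flatten(start_dim=1),
    nn.Linear(d_model * 4, vocab_size),  # predicts the next token from 4 context tokens
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Fake training data: batches of 4-token contexts and the token that follows each one.
contexts = torch.randint(0, vocab_size, (64, 4))
targets = torch.randint(0, vocab_size, (64,))

for step in range(100):
    logits = model(contexts)         # forward pass uses the current (imperfect) embeddings
    loss = loss_fn(logits, targets)  # how wrong was the prediction?
    optimizer.zero_grad()
    loss.backward()                  # backpropagation: gradients for every weight...
    optimizer.step()                 # ...including the embedding table itself
```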


Summary

My goal here was to offer a more intuitive way to understand how LLMs work—by focusing on the magical concepts that often get brushed aside in favor of implementation details, which, to me, don't offer much insight.

I also wanted to explain why those technical details don't help much: machine learning algorithms aren't driven by a logical series of steps like traditional algorithms. Instead, they rely on mathematical mechanisms that encourage certain emergent properties. While we might have some intuition about why they work, we can't break them down into step-by-step reasoning the way we can with regular code.

I started with embeddings because they're usually mentioned briefly before jumping straight into the Q/K/V math of attention—which can be confusing. Without first understanding what embeddings are and what properties they carry, the math feels arbitrary. And in some ways, it is—the math is just a means to encourage the behaviors we want.

In hindsight, it seems obvious that multiplying Q and K could help form a contextual representation. But that clarity only came after generations of researchers experimented with different ideas and slowly zeroed in on the concept of attention. While attention does come with some elegant intuitions, the real reason it matters is simple: it works—it encourages the right emergent properties in the model.

Key Takeaways

  • Embeddings are the foundation of how LLMs understand meaning. They represent words or sentences as vectors that capture context and similarity—but how they form remains partly mysterious.
  • Attention enables context-awareness. The math isn't the magic—the behavior it creates is. These technical mechanisms matter only because they help shape the emergent properties we want.
  • Emergent properties are the heart of machine learning. The most remarkable behaviors aren't explicitly programmed; they arise naturally from the training process itself.
  • Machine Learning ≠ Traditional Algorithms. Unlike step-by-step logical code, ML models depend on mathematical structures that encourage emergent behaviors—patterns that come from training, not explicit design.