Converting tokens into meaning. Where numbers become concepts.
After tokenization, we have numbers. But "153" doesn't mean anything on its own. An embedding turns each token into a list of numbers (a "vector") that captures its meaning.
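In code, this is nothing more than a table lookup. A minimal sketch with made-up sizes (real models use vocabularies of tens of thousands of tokens and thousands of dimensions):

```python
import numpy as np

# Toy setup: 5 tokens in the vocabulary, 8 numbers per embedding.
vocab_size, embed_dim = 5, 8
rng = np.random.default_rng(0)

# The embedding table: one learned vector per token id.
# (Random here for illustration; a trained model learns these values.)
embedding_table = rng.normal(size=(vocab_size, embed_dim))

# "Embedding" a token is just looking up its row by id.
token_id = 3                    # e.g. the id the tokenizer gave "cat"
vector = embedding_table[token_id]

print(vector.shape)             # (8,) — one vector per token
```

During training, the numbers in this table are adjusted so that tokens used in similar contexts end up with similar vectors.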
Each position in the vector captures a different facet of meaning. Together, the numbers encode something like "cat-ness" spread across many directions.
This is a real, measurable property: models learn relationships like "king is to man as queen is to woman," and it shows up as consistent arithmetic between the vectors.
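You can see that arithmetic directly: king − man + woman should land near queen. A sketch with hand-made 3-dimensional toy vectors (illustrative values, not learned by any real model):

```python
import numpy as np

# Hand-made toy vectors; dimensions loosely mean [royalty, male, female].
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.9, 0.1, 0.8]),
}

def cosine(a, b):
    """Similarity of direction: 1.0 means the vectors point the same way."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The classic analogy as vector arithmetic.
result = vecs["king"] - vecs["man"] + vecs["woman"]
closest = max(vecs, key=lambda w: cosine(result, vecs[w]))
print(closest)  # → queen
```

In a trained model the same arithmetic works in thousands of dimensions, which is what makes embeddings so useful.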
A size of 4096 divides evenly across 32 attention heads (4096 / 32 = 128 numbers per head). That clean split is why embedding sizes are usually chosen to be divisible by the head count.
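The arithmetic above can be sketched in a few lines; the reshape is how a token's vector gets divided among the heads in multi-head attention (dimensions here match the 4096/32 example, not any one model's exact code):

```python
import numpy as np

embed_dim, n_heads = 4096, 32
head_dim = embed_dim // n_heads
print(head_dim)              # 128 numbers per head

# Each head sees its own 128-dimensional slice of the token's vector.
token_vector = np.zeros(embed_dim)
per_head = token_vector.reshape(n_heads, head_dim)
print(per_head.shape)        # (32, 128)
```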
| Model | Size | Why |
|---|---|---|
| Llama 3 8B | 4096 | Good balance for 8B model |
| Llama 3 70B | 8192 | Larger model = more nuance |
| GPT-4 | ~8192 (estimate) | Architecture not publicly disclosed |