Converting tokens into meaning. Where numbers become concepts.
After tokenization, we have numbers. But "153" doesn't mean anything on its own. An embedding turns each token into a list of numbers (a "vector") that captures its meaning.
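In code, this is nothing more than a table lookup. A minimal sketch with made-up sizes (real models use vocabularies of tens of thousands of tokens and thousands of dimensions):

```python
import numpy as np

# Toy setup: 5 tokens in the vocabulary, 8 numbers per embedding.
vocab_size, embed_dim = 5, 8
rng = np.random.default_rng(0)

# The embedding table: one learned vector per token id.
# (Random here for illustration; a trained model learns these values.)
embedding_table = rng.normal(size=(vocab_size, embed_dim))

# "Embedding" a token is just looking up its row by id.
token_id = 3                    # e.g. the id the tokenizer gave "cat"
vector = embedding_table[token_id]

print(vector.shape)             # (8,) — one vector per token
```

During training, the numbers in this table are adjusted so that tokens used in similar contexts end up with similar vectors.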
Each position in the vector captures a different facet of meaning. Together, the numbers encode something like "cat-ness" spread across many directions.
This is a real, measurable property: models learn relationships like "king is to man as queen is to woman," and it shows up as consistent arithmetic between the vectors.
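You can see that arithmetic directly: king − man + woman should land near queen. A sketch with hand-made 3-dimensional toy vectors (illustrative values, not learned by any real model):

```python
import numpy as np

# Hand-made toy vectors; dimensions loosely mean [royalty, male, female].
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.9, 0.1, 0.8]),
}

def cosine(a, b):
    """Similarity of direction: 1.0 means the vectors point the same way."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The classic analogy as vector arithmetic.
result = vecs["king"] - vecs["man"] + vecs["woman"]
closest = max(vecs, key=lambda w: cosine(result, vecs[w]))
print(closest)  # → queen
```

In a trained model the same arithmetic works in thousands of dimensions, which is what makes embeddings so useful.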
A size of 4096 divides evenly across 32 attention heads (4096 / 32 = 128 numbers per head). That clean split is why embedding sizes are usually chosen to be divisible by the head count.
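The arithmetic above can be sketched in a few lines; the reshape is how a token's vector gets divided among the heads in multi-head attention (dimensions here match the 4096/32 example, not any one model's exact code):

```python
import numpy as np

embed_dim, n_heads = 4096, 32
head_dim = embed_dim // n_heads
print(head_dim)              # 128 numbers per head

# Each head sees its own 128-dimensional slice of the token's vector.
token_vector = np.zeros(embed_dim)
per_head = token_vector.reshape(n_heads, head_dim)
print(per_head.shape)        # (32, 128)
```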
| Model | Size | Why |
|---|---|---|
| Llama 3 8B | 4096 | Good balance for 8B model |
| Llama 3 70B | 8192 | Larger model = more nuance |
| GPT-4 | ~8192 (estimate) | Architecture not publicly disclosed |