Tokens: The input words or word pieces (the text so far)
Multi-head Attention
Linear + Softmax Head: Scores every word in the vocabulary and turns the scores into next-word probabilities
by sam
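
The Linear + Softmax Head step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the model's actual implementation: the vocabulary, the hidden vector, the weights, and the function names (`softmax`, `linear_softmax_head`) are all made up for the example, and the hidden vector stands in for whatever the attention layers would produce for the last token.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear_softmax_head(hidden, weights, bias):
    """Project one hidden vector to one score per vocabulary word, then softmax.

    hidden:  list of d floats (assumed output of the attention layers)
    weights: vocab_size rows, each a list of d floats
    bias:    list of vocab_size floats
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

# Hypothetical toy values: a 3-word vocabulary and a 2-dimensional hidden state.
vocab = ["cat", "dog", "fish"]
hidden = [1.0, 0.5]
weights = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]
bias = [0.0, 0.0, 0.0]

probs = linear_softmax_head(hidden, weights, bias)

# Greedy choice: pick the word with the highest probability.
next_word = vocab[probs.index(max(probs))]
```

Here the head is a single matrix multiply followed by softmax, so `probs` has one entry per vocabulary word. A real model would use a much larger vocabulary and learned weights, and often samples from `probs` instead of always taking the top word.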