[Figure: encoder-decoder Transformer for caption translation]
Input: English caption from OCR, with the en_XX source-language token inserted
Encoder: Multi-Head Attention, Feed-Forward Network (Residual + LayerNorm applied at every block)
Decoder (ta_IN / te_IN / ml_IN target-language token during decoding): Masked Self-Attention, Cross-Attention over the encoder output, Feed-Forward Network, Residual Connections + Layer Normalization
Output: next-token probabilities, generating the Tamil / Telugu / Malayalam caption
by Tanuja