YouTube Shorts / TikTok / Facebook Reels
DATA CURATION PIPELINE
Crawling & Collection
(7-class Taxonomy)
Train/Val/Test Split (70/15/15)
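A 70/15/15 split over a 7-class taxonomy is usually done per class so that each subset keeps the same label proportions. The sketch below is one minimal way to do that with the standard library only. The function name `stratified_split`, the seed, and the toy labels are assumptions for illustration; the figure does not say whether the split is stratified.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.70, 0.15, 0.15), seed=0):
    """Return (train, val, test) lists of sample indices, split class by class."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(ratios[0] * len(idxs))
        n_val = round(ratios[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy data (assumption): 7 classes with 100 samples each.
labels = [c for c in range(7) for _ in range(100)]
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 490 105 105
```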
VIDEO PROCESSING PIPELINE
1. Frame Sampling (F=16)
2. Face Detection & Alignment
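The frame sampling step (F=16) can be sketched as picking one frame index from the centre of each of 16 equal-length segments of the clip. Uniform segment-centre sampling is an assumption here, since the figure only gives the frame count; `sample_frame_indices` is a hypothetical helper name.

```python
def sample_frame_indices(num_frames, num_samples=16):
    """Pick num_samples frame indices spread uniformly over a clip.

    The clip is cut into num_samples equal segments and the centre frame
    of each segment is taken. Clips shorter than num_samples repeat frames.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    return [
        min(num_frames - 1, int((i + 0.5) * num_frames / num_samples))
        for i in range(num_samples)
    ]


# A 160-frame clip yields one index per 10-frame segment.
print(sample_frame_indices(160))
```

The returned indices would then be passed to whatever video decoder is in use before face detection and alignment.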
1. Noise Suppression (RNNoise)
2. Loudness Normalization
3. Hybrid Feature Extraction (Wav2Vec2.0 / OpenSMILE LLDs)
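For the loudness normalization step, a very simple stand-in is RMS normalization: scale the waveform so its root-mean-square level hits a target in dBFS, while capping the gain so no sample clips. This is only a sketch under assumptions. The figure does not name a method or a target level, the -23 dBFS default is arbitrary, and perceptual loudness normalization (for example ITU-R BS.1770) additionally applies frequency weighting and gating that are not modelled here.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of float samples in [-1, 1]."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_rms(samples, target_dbfs=-23.0, peak_limit=1.0):
    """Scale samples to a target RMS level (dBFS) without exceeding peak_limit."""
    if not samples:
        return []
    level = rms(samples)
    if level == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = (10 ** (target_dbfs / 20)) / level
    peak = max(abs(s) for s in samples)
    gain = min(gain, peak_limit / peak)  # cap the gain to avoid clipping
    return [s * gain for s in samples]


# Toy input (assumption): one second of a quiet 440 Hz tone at 16 kHz.
quiet = [0.01 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
normalized = normalize_rms(quiet)
print(round(20 * math.log10(rms(normalized)), 2))  # -23.0
```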
1. ASR (Whisper + Wav2Vec2-BN)
2. Text Normalization
3. Hybrid Text Encoder (BanglaBERT / BiLSTM)
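The text normalization step is not specified in the figure. A minimal sketch, assuming it covers at least Unicode canonicalization and whitespace cleanup of the ASR transcript, might look like the following. The helper name `normalize_transcript` is hypothetical, and a real pipeline for Bangla would likely add punctuation and numeral handling on top.

```python
import re
import unicodedata


def normalize_transcript(text):
    """Canonicalize Unicode and collapse whitespace in an ASR transcript."""
    # NFC makes visually identical Bangla sequences compare equal,
    # e.g. a two-part vowel sign typed as one code point or as two.
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of spaces, tabs, and newlines into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(normalize_transcript("  আমি   ভালো\nআছি "))
```

Normalizing before tokenization keeps the same word from mapping to different token sequences depending on how the ASR system happened to encode it.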
V ∈ ℝ⁷⁶⁸
A ∈ ℝ⁷⁶⁸
T ∈ ℝ⁷⁶⁸
(Hierarchical Cross-Modal Fusion)
● ↔ ● ↔ ●
Fused Representation F
F ∈ ℝ⁴⁶⁰⁸
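The fused dimension follows from the modality dimension: 4608 = 6 × 768. One decomposition consistent with these numbers is the three unimodal vectors V, A, T concatenated with three pairwise cross-modal vectors (V-A, V-T, A-T). That layout is an assumption, because the figure states only the dimensions. In the sketch below, an element-wise product is a placeholder for whatever learned cross-modal block the model actually uses.

```python
D = 768  # per-modality embedding size, as given in the figure


def interact(x, y):
    """Placeholder pairwise interaction (assumption): element-wise product."""
    return [xi * yi for xi, yi in zip(x, y)]


def fuse(v, a, t):
    """Concatenate 3 unimodal and 3 pairwise vectors into one fused vector."""
    if not (len(v) == len(a) == len(t)):
        raise ValueError("all modality vectors must have the same length")
    return v + a + t + interact(v, a) + interact(v, t) + interact(a, t)


# Dummy embeddings (assumption) standing in for the real encoder outputs.
v, a, t = [0.1] * D, [0.2] * D, [0.3] * D
fused = fuse(v, a, t)
print(len(fused))  # 4608
```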
1. LN + GELU (4608→1024)
2. Dropout (0.3)
3. LN + GELU (1024→512)
4. FC (512→7)
+ Cross-Modal Regularization
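To get a feel for the size of the classification head, the layer widths above can be turned into a parameter count. The sketch assumes each "LN + GELU (a→b)" block is a fully connected layer with bias followed by LayerNorm over the b outputs and a GELU activation; the figure does not spell out that ordering. Dropout and GELU carry no trainable parameters, and the cross-modal regularization term is a loss component, so it is not counted.

```python
NUM_CLASSES = 7

# (name, input width, output width, followed by LayerNorm?)
HEAD_LAYERS = [
    ("block1", 4608, 1024, True),
    ("block2", 1024, 512, True),
    ("fc", 512, NUM_CLASSES, False),
]


def check_shapes(layers):
    """Verify consecutive layers agree on width; return (input, output) dims."""
    for (_, _, out_dim, _), (_, in_dim, _, _) in zip(layers, layers[1:]):
        if out_dim != in_dim:
            raise ValueError(f"width mismatch: {out_dim} -> {in_dim}")
    return layers[0][1], layers[-1][2]


def count_parameters(layers):
    """Count trainable parameters: weights, biases, and LayerNorm scale/shift."""
    total = 0
    for _, d_in, d_out, has_layer_norm in layers:
        total += d_in * d_out + d_out  # weight matrix + bias vector
        if has_layer_norm:
            total += 2 * d_out  # per-feature scale and shift
    return total


print(check_shapes(HEAD_LAYERS))      # (4608, 7)
print(count_parameters(HEAD_LAYERS))  # 5251079
```

Under these assumptions, roughly 90% of the head's parameters sit in the first 4608→1024 projection.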
Raw Text (ASR Transcript)
Extractor Components
by nahida