End-to-end workflow of the proposed SentMixNet framework for Bangla short video sentiment analysis
Data Sources


YouTube Shorts / TikTok / Facebook Reels

DATA CURATION PIPELINE

Crawling & Collection

Manual Annotation (7-class Taxonomy)

Train/Val/Test Split (70/15/15)
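A minimal sketch of the 70/15/15 split, for illustration only. The ratios come from the figure; the helper name `split_dataset`, the fixed seed, and the plain (unstratified) shuffle are assumptions, since the figure does not say how the split is drawn.

```python
import random

def split_dataset(items, ratios=(0.70, 0.15, 0.15), seed=42):
    """Shuffle items and cut them into train/val/test partitions.

    Hypothetical helper: the ratios are taken from the figure; the seed and
    the unstratified shuffle are assumptions made for this sketch.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, so global state is untouched
    n = len(shuffled)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder absorbs any rounding
    return train, val, test
```

With a heavily imbalanced 7-class label set, a per-class (stratified) split may be the safer choice; that variant is not shown here.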

VIDEO PROCESSING PIPELINE


1. Frame Sampling (F=16)

2. Face Detection & Alignment
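The frame sampling step (F=16) might look like the sketch below. The figure gives only the frame count; uniform sampling at segment midpoints and the function name `sample_frame_indices` are assumptions, not the authors' stated method.

```python
def sample_frame_indices(num_frames, num_samples=16):
    """Return num_samples frame indices spread evenly across a clip.

    Assumed strategy: split the clip into num_samples equal segments and
    take the frame at the midpoint of each segment.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    segment = num_frames / num_samples
    # Clips shorter than num_samples frames simply repeat some indices.
    return [min(num_frames - 1, int(segment * (i + 0.5)))
            for i in range(num_samples)]
```

For a 160-frame clip this selects frames 5, 15, ..., 155; the selected frames would then be passed to face detection and alignment.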

AUDIO PROCESSING PIPELINE


1. Noise Suppression (RNNoise)

2. Loudness Normalization

3. Hybrid Feature Extraction (Wav2Vec2.0 / OpenSMILE LLDs)

TEXT PROCESSING PIPELINE


1. ASR (Whisper + Wav2Vec2-BN)

2. Text Normalization

3. Hybrid Text Encoder (BanglaBERT / BiLSTM)

Visual Feature V ∈ ℝ⁷⁶⁸

Audio Feature A ∈ ℝ⁷⁶⁸

Textual Feature T ∈ ℝ⁷⁶⁸

HCF MODULE (Hierarchical Cross-Modal Fusion)

● ↔ ● ↔ ●

Fused Representation F ∈ ℝ⁴⁶⁰⁸

CLASSIFICATION HEAD


1. LN + GELU (4608→1024)

2. Dropout (0.3)

3. LN + GELU (1024→512)

4. FC (512→7)
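The classification head can be sketched in plain Python as below. Only the layer widths (4608→1024→512→7) and the dropout rate come from the figure. This sketch assumes that each "LN + GELU (a→b)" block is a linear projection followed by layer normalization and GELU, uses layer normalization without learnable scale and shift, and uses random placeholder weights. The helper names are hypothetical.

```python
import math
import random

HEAD_DIMS = (4608, 1024, 512, 7)  # layer widths read from the figure

def gelu(v):
    """Exact GELU: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no affine terms)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(x, weight, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def make_head(dims=HEAD_DIMS, seed=0):
    """Build randomly initialised (weight, bias) pairs; placeholder weights only."""
    rng = random.Random(seed)
    layers = []
    for d_in, d_out in zip(dims, dims[1:]):
        scale = 1.0 / math.sqrt(d_in)
        weight = [[rng.uniform(-scale, scale) for _ in range(d_in)]
                  for _ in range(d_out)]
        layers.append((weight, [0.0] * d_out))
    return layers

def head_forward(layers, fused):
    """Inference-time pass; Dropout(0.3) is inactive at inference, so it is omitted."""
    x = fused
    for weight, bias in layers[:-1]:
        x = [gelu(v) for v in layer_norm(linear(x, weight, bias))]
    weight, bias = layers[-1]
    return linear(x, weight, bias)  # raw logits, one per class
```

Building the full-size head this way allocates several million Python floats, so smaller `dims` are more practical for experimenting; a tensor library would be the natural choice in practice.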

Softmax Output
🔴 Angry
🟡 Bullying
🟢 Sad
🔵 Fun
⚫ Neutral
🟣 Mockery
🟠 Disgust
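A small sketch of how the 7 logits could be turned into a predicted label. The class names come from the figure; the index order (taken here as the order in which the figure lists them) and the helper names are assumptions.

```python
import math

# Assumed index order: the order in which the figure lists the classes.
LABELS = ("Angry", "Bullying", "Sad", "Fun", "Neutral", "Mockery", "Disgust")

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]
```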
Dynamic Focal Loss (DFL) + Cross-Modal Regularization
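For orientation, the standard focal loss on which a "dynamic" variant would presumably build is sketched below. The figure does not specify how DFL adapts its parameters or what form the cross-modal regularization takes, so neither is reproduced here. The fixed `gamma` and `alpha` arguments are illustrative assumptions.

```python
import math

def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Standard focal loss for one sample: -alpha * (1 - p_t)**gamma * log(p_t).

    probs  : predicted class probabilities (e.g. a softmax output)
    target : index of the true class
    gamma  : focusing parameter; gamma = 0 recovers cross-entropy
    alpha  : class weight (a single scalar in this sketch)
    """
    p_t = min(max(probs[target], 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=2`, a confidently correct prediction (p_t = 0.9) contributes about 1% of its cross-entropy loss, which shifts training weight toward hard or minority-class examples.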

Raw Video
Raw Audio

Raw Text (ASR Transcript)

Extractor Components

ViT
AU Detection
Optical Flow

fig1

by nahida
