<strong>Self-Adaptive Vision Transformer Model</strong>

<p>Raw image I ∈ ℝ^(H×W×C), where H = W = 512</p>

Normalization
I_norm = (I − μ) / σ
where μ = [0.485, 0.456, 0.406] and
σ = [0.229, 0.224, 0.225] are the per-channel mean and standard deviation
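
The normalization step can be sketched in plain Python as follows. This is a minimal illustration, assuming pixel values are already scaled to [0, 1] (the text does not state the input range); the function names are hypothetical.

```python
# Per-channel statistics quoted in the text (the standard ImageNet values).
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    """Apply I_norm = (I - mean) / std to one RGB pixel scaled to [0, 1] (assumed range)."""
    return tuple((v - m) / s for v, m, s in zip(rgb, MEAN, STD))

def normalize_image(image):
    """Normalize an H x W image stored as nested lists of RGB tuples."""
    return [[normalize_pixel(px) for px in row] for row in image]
```

A pixel equal to the channel means maps to zero in every channel, which is a quick sanity check on the formula.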

CNN Stem
Conv 7×7, stride = 2, padding = 3
Kernel W ∈ ℝ^(64×3×7×7)
BatchNorm + ReLU
MaxPool 3×3, stride = 2
Output: F₁ ∈ ℝ^(256×256×64)
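
The spatial sizes in the stem can be checked with the usual output-size formula. The sketch below is only shape arithmetic; the pooling padding is not given in the text, so padding = 1 is an assumption. Under that assumption the 256×256 size listed for F₁ matches the feature map after the 7×7 convolution, while a stride-2 pooling layer would halve it again.

```python
def conv_output_size(n, kernel, stride, padding):
    """Spatial size after a convolution or pooling layer (floor mode)."""
    return (n + 2 * padding - kernel) // stride + 1

# 7x7 convolution, stride 2, padding 3 on a 512-pixel side.
size_after_conv = conv_output_size(512, kernel=7, stride=2, padding=3)

# 3x3 max pooling, stride 2; padding=1 is an assumption, not stated in the text.
size_after_pool = conv_output_size(size_after_conv, kernel=3, stride=2, padding=1)

print(size_after_conv, size_after_pool)  # 256 128
```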

Patch Embedding
Patch Extraction
P = reshape(F₁)
Position Encoding
Z₀ = P + E_pos
Output
Z₀ ∈ ℝ^(256×768)
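
Patch extraction and the position-encoding sum can be illustrated on a toy feature map. This is a sketch under stated assumptions: the patch size of 16 (which yields the 256 tokens listed above from a 256×256 map) is inferred, not given in the text, and the learned projection that would produce 768-wide tokens is omitted. All names and position values are hypothetical.

```python
def extract_patches(feature_map, patch):
    """Split an H x W grid (nested lists) into flattened patch x patch tokens, row-major."""
    h, w = len(feature_map), len(feature_map[0])
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([feature_map[top + i][left + j]
                           for i in range(patch) for j in range(patch)])
    return tokens

def add_position(tokens, e_pos):
    """Z0 = P + E_pos, element-wise, with one position vector per token."""
    return [[p + e for p, e in zip(tok, pos)] for tok, pos in zip(tokens, e_pos)]

# Toy 4 x 4 single-channel map with 2 x 2 patches -> 4 tokens of length 4.
f1 = [[r * 4 + c for c in range(4)] for r in range(4)]
P = extract_patches(f1, patch=2)
E_pos = [[0.1 * k] * 4 for k in range(len(P))]  # hypothetical position values
Z0 = add_position(P, E_pos)

# Token count for the sizes in the text, assuming 16 x 16 patches.
num_tokens = (256 // 16) ** 2
```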

Transformer Block
Multi-Head Attention
Q, K, V = Z₀W_q, Z₀W_k, Z₀W_v
Attention = Softmax(QKᵀ / √d_k) V
Add & Norm
Z₁ = LN(Z₀ + Attention)
MLP: FFN(x) = max(0, xW₁ + b₁) W₂ + b₂
Repeat L times
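
The block's equations can be traced with a small pure-Python sketch. It assumes a single attention head and a LayerNorm without learned scale and shift, neither of which is specified in the text; a multi-head version would split the feature dimension across heads and concatenate the results. Helper names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Softmax(Q K^T / sqrt(d_k)) V for a single head (assumption: one head)."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def layer_norm(row, eps=1e-5):
    """LayerNorm over one token, without learned scale and shift (assumption)."""
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]

def ffn(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a batch of token rows."""
    hidden = [[max(0.0, h + b) for h, b in zip(row, b1)] for row in matmul(x, w1)]
    return [[o + b for o, b in zip(row, b2)] for row in matmul(hidden, w2)]

def block(z0, w_q, w_k, w_v):
    """Z1 = LN(Z0 + Attention(Z0 W_q, Z0 W_k, Z0 W_v))."""
    att = attention(matmul(z0, w_q), matmul(z0, w_k), matmul(z0, w_v))
    return [layer_norm([a + b for a, b in zip(zr, ar)]) for zr, ar in zip(z0, att)]

# Toy usage with identity projections on two 2-dimensional tokens.
identity = [[1.0, 0.0], [0.0, 1.0]]
z0 = [[1.0, 2.0], [3.0, 5.0]]
z1 = block(z0, identity, identity, identity)
```

Repeating the block L times amounts to feeding each output back in as the next input, with separate weights per layer.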

ASPP module
Parallel convolutions
Global average pooling + upsampling
Feature concatenation
Output: F_aspp ∈ ℝ^(256×256×256)
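
The three ASPP steps can be shown in one dimension. This sketch assumes the parallel convolutions are dilated (atrous), as in the usual ASPP design; the dilation rates 1, 2, 3 and the all-ones kernel are hypothetical, since the text gives neither rates nor weights.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Same-length 1D convolution with zero padding and the given dilation."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - half) * dilation
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out

def aspp_1d(signal, kernel, dilations):
    """Run parallel dilated branches plus a pooled branch, then concatenate per position."""
    branches = [dilated_conv1d(signal, kernel, d) for d in dilations]
    pooled = sum(signal) / len(signal)        # global average pooling
    branches.append([pooled] * len(signal))   # upsample by repeating the pooled value
    # Concatenate along the channel axis: one feature vector per position.
    return [list(features) for features in zip(*branches)]

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
features = aspp_1d(signal, kernel=[1.0, 1.0, 1.0], dilations=[1, 2, 3])
```

Each position ends up with one value per branch, so larger dilation rates contribute context from farther away without changing the output length.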

Boundary Refinement
Reverse Attention
Progressive upsampling
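
The text gives no formula for reverse attention, so the sketch below uses a common formulation as an assumption: features are weighted by 1 − sigmoid(coarse prediction), which suppresses regions already predicted as foreground and leaves the uncertain boundary regions to be refined. Nearest-neighbour upsampling by a factor of 2 stands in for one progressive upsampling step; all names and values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def upsample_nearest(grid, factor):
    """Nearest-neighbour upsampling of a 2D grid by an integer factor."""
    return [[grid[r // factor][c // factor]
             for c in range(len(grid[0]) * factor)]
            for r in range(len(grid) * factor)]

def reverse_attention(features, coarse_logits):
    """Weight features by 1 - sigmoid(coarse prediction) (assumed formulation)."""
    return [[f * (1.0 - sigmoid(p)) for f, p in zip(f_row, p_row)]
            for f_row, p_row in zip(features, coarse_logits)]

coarse = [[4.0, -4.0]]                      # hypothetical 1 x 2 coarse logits
coarse_up = upsample_nearest(coarse, 2)     # 2 x 4 after one upsampling step
features = [[1.0] * 4 for _ in range(2)]    # hypothetical feature map
refined = reverse_attention(features, coarse_up)
```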

Output
Segmentation Mask (Head)
Y_seg = σ(Conv1×1(d_final)), where σ here is the sigmoid function
Boundary Detection
Y_boundary = σ(Conv1×1(Bdy_refined))
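
Both heads share the same shape: a convolution reducing the features to one channel, followed by a sigmoid. The sketch below assumes that convolution is 1×1, so it reduces to a per-pixel weighted sum over channels; the weights, bias, and the 0.5 threshold used to binarize the mask are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def conv1x1_head(feature_map, weights, bias):
    """1x1 convolution to one channel followed by a sigmoid, applied per pixel."""
    return [[sigmoid(sum(w * v for w, v in zip(weights, pixel)) + bias)
             for pixel in row]
            for row in feature_map]

# Hypothetical 1 x 2 feature map with 3 channels per pixel.
d_final = [[(1.0, 0.0, -1.0), (0.0, 0.0, 0.0)]]
y_seg = conv1x1_head(d_final, weights=(2.0, 1.0, -2.0), bias=0.0)

# Binarize with an assumed threshold of 0.5.
mask = [[1 if p > 0.5 else 0 for p in row] for row in y_seg]
```

The boundary head would use the same function on the refined boundary features with its own weights.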

Loss Function
Total Loss
Boundary Loss
BCE Loss
Dice Loss
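
One way the three terms could combine is a weighted sum, sketched below. The text does not give the weights or the exact form of the boundary loss, so equal weights and a binary cross-entropy on the boundary map are assumptions made for illustration.

```python
import math

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy over flat lists of probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)

def dice_loss(pred, target, smooth=1.0):
    """1 - Dice coefficient, with a smoothing term for empty masks."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + smooth) / (sum(pred) + sum(target) + smooth)

def total_loss(seg_pred, seg_true, bdy_pred, bdy_true,
               w_bce=1.0, w_dice=1.0, w_bdy=1.0):
    """Weighted sum of BCE, Dice, and a boundary term (BCE on the boundary map, assumed)."""
    return (w_bce * bce_loss(seg_pred, seg_true)
            + w_dice * dice_loss(seg_pred, seg_true)
            + w_bdy * bce_loss(bdy_pred, bdy_true))
```

Dice complements BCE because it is computed over the whole mask rather than per pixel, which helps when the foreground occupies a small fraction of the image.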

Self-adaptive vision transformer model for burst polyp detection and diagnosis