Santosh Kompella PRO

Sathya77

AI & ML interests

LLMs, Natural Language Processing (NLP), Transformers, Deep Learning, Machine Learning

Recent Activity

updated a model 4 days ago

Sathya77/ViT-ISR-Tiny-LSDIR

published a model 4 days ago

Sathya77/ViT-ISR-Tiny-LSDIR

Organizations

None yet

posted an update 3 days ago

Built a ViT for ×4 image super-resolution from scratch in PyTorch — sharing the model.

No pretrained weights. Every component implemented from scratch: strided Conv2d patch embedding, multi-head self-attention across 1,024 tokens, 6 pre-norm transformer blocks, and a PixelShuffle reconstruction head for learned upsampling.
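
To make the component list concrete, here is a minimal sketch of how those pieces could fit together. It is not the author's code: it swaps in PyTorch's built-in `nn.TransformerEncoderLayer` (with `norm_first=True`) for the hand-written blocks, and the class name `TinyViTSR`, the 64×64 input tile (inferred from 1,024 tokens at patch size 2, assuming square tiles), the learned positional embedding, the MLP width, and the single ×8 PixelShuffle step are all assumptions. The hyperparameter defaults follow the architecture summary in the same post.

```python
import torch
import torch.nn as nn


class TinyViTSR(nn.Module):
    """Illustrative x4 super-resolution ViT (hypothetical, not the posted model)."""

    def __init__(self, tile=64, patch=2, dim=64, heads=4, depth=6, scale=4):
        super().__init__()
        self.grid = tile // patch  # 32 tokens per side -> 1,024 tokens
        # Strided convolution acts as the patch embedding.
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Learned positional embedding (assumption; the post does not say).
        self.pos = nn.Parameter(torch.zeros(1, self.grid * self.grid, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            dropout=0.0, batch_first=True, norm_first=True,  # pre-norm blocks
        )
        self.blocks = nn.TransformerEncoder(
            layer, num_layers=depth, enable_nested_tensor=False
        )
        # The token grid is tile/patch wide, so reaching scale*tile needs
        # an upsampling factor of patch*scale (8 here) in the head.
        up = patch * scale
        self.head = nn.Sequential(
            nn.Conv2d(dim, 3 * up * up, kernel_size=3, padding=1),
            nn.PixelShuffle(up),
        )

    def forward(self, x):
        b = x.shape[0]
        tokens = self.embed(x).flatten(2).transpose(1, 2)  # (B, 1024, dim)
        tokens = self.blocks(tokens + self.pos)
        feat = tokens.transpose(1, 2).reshape(b, -1, self.grid, self.grid)
        return self.head(feat)  # (B, 3, scale*tile, scale*tile)
```

Calling `TinyViTSR()(torch.randn(1, 3, 64, 64))` returns a tensor of shape `(1, 3, 256, 256)`. The parameter count of this sketch is not expected to match the ~786K reported for the real model.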

Trained on real images from the LSDIR dataset with fp16 AMP on a laptop GPU. Tiled inference handles arbitrary input sizes.
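
One common way to do tiled inference is to slide a fixed-size window over the image and clamp the last tile to the border so every pixel is covered. The sketch below only computes tile origins; the function names, the 64-pixel tile, and the 48-pixel stride are illustrative assumptions, and the post does not say how tiles are overlapped or blended.

```python
def tile_starts(length, tile=64, stride=48):
    """Start offsets of tiles covering [0, length) along one axis."""
    if stride <= 0 or stride > tile:
        raise ValueError("stride must be in 1..tile so tiles leave no gaps")
    if length <= tile:
        return [0]  # a smaller input would need padding up to the tile size
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)  # clamp the final tile to the border
    return starts


def tile_boxes(height, width, tile=64, stride=48):
    """(top, left) origins of every tile for a height x width image."""
    return [(t, l)
            for t in tile_starts(height, tile, stride)
            for l in tile_starts(width, tile, stride)]
```

Each low-resolution tile at `(top, left)` would then map to the output region starting at `(4 * top, 4 * left)`, with overlapping regions averaged or cropped.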

Current architecture: patch size 2, embed dim 64, 4 attention heads, 6 transformer blocks, ~786K parameters — test PSNR 23.30 dB.
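
For reference, PSNR is 10·log10(MAX² / MSE). The helper below is a minimal sketch on flat lists of pixel values. The post does not say whether PSNR was computed on RGB or on the luminance channel, or what value range was used, so the [0, 1] range here is an assumption.

```python
import math


def psnr(reference, estimate, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length sequences."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_val ** 2 / mse)
```

Under that [0, 1] assumption, 23.30 dB corresponds to a mean squared error of roughly 0.0047.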

The model handles broad structure well — fine textures and sharp edges need more capacity. Working on a larger configuration next.

🤗 Space: https://huggingface.co/spaces/Sathya77/ViT-ISR-Tiny-LSDIR

Feedback welcome — especially on the architecture choices.

posted an update about 1 month ago

Trained a Swin-T from scratch on NWPU-RESISC45 — no pretrained weights, no fine-tuning.

Every component hand-coded in PyTorch: window partitioning, shifted window attention with relative positional bias, patch merging across 4 stages, ~28M parameters.
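
As an illustration of the window mechanics mentioned above, here is a minimal sketch of window partitioning, its inverse, and the cyclic shift used for shifted windows. It follows the widely used channels-last formulation and is not taken from the author's repository; the function names are assumptions, and attention masking and the relative position bias are left out.

```python
import torch


def window_partition(x, ws):
    """(B, H, W, C) -> (B * num_windows, ws, ws, C); H and W divisible by ws."""
    b, h, w, c = x.shape
    x = x.view(b, h // ws, ws, w // ws, ws, c)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, ws, ws, c)


def window_reverse(windows, ws, h, w):
    """Inverse of window_partition: back to (B, H, W, C)."""
    b = windows.shape[0] // ((h // ws) * (w // ws))
    x = windows.view(b, h // ws, w // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(b, h, w, -1)


def cyclic_shift(x, ws):
    """Roll the feature map by half a window before partitioning."""
    shift = ws // 2
    return torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
```

With a 56×56 feature map and `ws=7`, `window_partition` yields 64 windows per image, each holding 49 tokens.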

Architecture:

embed_dim=96, window_size=7, depths=[2, 2, 6, 2]
heads=[3, 6, 12, 24] across stages
Patch embed via Conv2d (4×4 kernel, stride 4) → 56×56 feature map (implies 224×224 inputs)
PatchMerging downsamples by concatenating 2×2 neighbors + linear projection
Global average pooling → linear classifier
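
The configuration above can be sanity-checked by walking the feature map through the four stages. The sketch assumes 224×224 inputs (implied by the 56×56 map after the 4×4 patch embedding) and the standard Swin convention that each PatchMerging step halves the resolution and doubles the channel width; the function name is hypothetical.

```python
def swin_stage_shapes(img=224, patch=4, embed_dim=96,
                      depths=(2, 2, 6, 2), heads=(3, 6, 12, 24), window=7):
    """Per-stage resolution, width, and window count for a Swin-style stack."""
    res, dim = img // patch, embed_dim
    stages = []
    for i, (depth, n_heads) in enumerate(zip(depths, heads)):
        stages.append({
            "stage": i + 1,
            "resolution": res,               # feature map is res x res
            "dim": dim,                      # channel width
            "blocks": depth,
            "heads": n_heads,
            "head_dim": dim // n_heads,
            "windows": (res // window) ** 2,  # attention windows per image
        })
        if i < len(depths) - 1:
            res //= 2   # PatchMerging: 2x2 neighbours concatenated
            dim *= 2    # 4*dim concatenated channels projected to 2*dim
    return stages


for row in swin_stage_shapes():
    print(row)
```

Under these assumptions the stages run 56×56×96, 28×28×192, 14×14×384, and 7×7×768, with 32 channels per attention head throughout and a single 7×7 window in the last stage.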

Training:

AdamW (lr=3e-4, weight_decay=0.05)
20 epochs total: 3-epoch linear warmup, then cosine annealing
Mixed precision (autocast + GradScaler)
Gradient clipping (max_norm=1.0)
Label smoothing (0.1)
ImageNet normalization, batch size 32
80/20 train/test split, seed=42
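
The learning-rate schedule in this recipe can be sketched as a plain function of the epoch index. Several details are assumptions, since the post does not specify them: the rate is updated once per epoch, warmup ramps linearly up to the base rate over the first 3 epochs, and the cosine phase decays toward zero over the remaining 17.

```python
import math


def lr_at(epoch, base_lr=3e-4, warmup_epochs=3, total_epochs=20):
    """Learning rate for a 0-indexed epoch: linear warmup, then cosine decay."""
    if epoch < warmup_epochs:
        # Linear ramp that reaches base_lr on the last warmup epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at(e) for e in range(20)]
```

In a PyTorch training loop the same curve could be applied by passing a function like this, divided by the base rate, to `torch.optim.lr_scheduler.LambdaLR`.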

Result: 82% test accuracy on 45 land-use categories, 31,500 images.
🔗 Sathya77/swin-transformer-satellite

What accuracy do you think is achievable on NWPU-RESISC45 with Swin-T trained from scratch, without any pretraining?
