I went down an audio classification rabbit hole. What started as “how hard can ESC-50 be?” turned into a weekend of building CNNs, Transformers, and finally catching up on self-supervised learning, only a few years behind everyone else.
This post documents everything I tried, what worked, what didn’t, and the lessons learned.
The ESC-50 dataset contains:
- 2,000 labeled environmental audio clips
- 50 classes (dog barks, rain, chainsaws, and so on)
- 40 samples per class
- 5-second clips
With only 40 samples per class, this is a challenging dataset that rewards good representations over brute-force memorization.
Classic approach: convert audio to mel spectrograms and treat them as images for a CNN.
batch_size = 32
learning_rate = 1e-3
optimizer = AdamW(weight_decay=0.01)
scheduler = OneCycleLR
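A minimal sketch of this setup, for reference; the optimizer settings match the config above, while the mel parameters and schedule length are placeholders rather than the exact values I tuned:

```python
import torch
import torchaudio
from torch import nn
from torchvision.models import resnet34

# Log-mel front end (parameter values here are illustrative)
to_mel = nn.Sequential(
    torchaudio.transforms.MelSpectrogram(sample_rate=44100, n_mels=128),
    torchaudio.transforms.AmplitudeToDB(),
)

# ResNet-34 from scratch, adapted to 1-channel spectrogram input and 50 classes
model = resnet34(num_classes=50)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, epochs=50, steps_per_epoch=50  # illustrative schedule length
)
```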
Result: 80.5% accuracy
I noticed validation performance was inconsistent, so I added test-time augmentation (TTA): averaging predictions across multiple augmented versions of each test sample.
import torch

def predict_with_tta(model, spectrogram, n_augments=5):
    """Average the model's predictions over several augmented views of one test sample."""
    model.eval()
    predictions = []
    with torch.no_grad():
        for _ in range(n_augments):
            aug_spec = apply_augmentation(spectrogram)  # randomly augment the spectrogram
            predictions.append(model(aug_spec))
    return torch.stack(predictions).mean(dim=0)         # mean over the augmented views
Result: 83.5% accuracy (+3% from TTA alone!)
Before diving into SSL, I tried transfer learning with ImageNet-pretrained EfficientNet-B0.
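Roughly, the adaptation looks like this; repeating the single spectrogram channel to fake an RGB image and swapping the classifier head is the standard recipe, and the shapes are placeholders:

```python
import torch
from torch import nn
from torchvision.models import efficientnet_b0, EfficientNet_B0_Weights

# Load ImageNet weights and swap the classifier head for 50 classes
model = efficientnet_b0(weights=EfficientNet_B0_Weights.IMAGENET1K_V1)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 50)

# Spectrograms are single-channel, so repeat them to fake RGB input
spec = torch.randn(8, 1, 128, 431)       # (batch, 1, n_mels, time) stand-in
logits = model(spec.repeat(1, 3, 1, 1))  # (8, 50)
```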
Result: 81.5% accuracy
This performed worse than my from-scratch ResNet-34 with TTA. ImageNet features (edges, textures, objects) don’t transfer perfectly to spectrograms. Good to know.
This is where things got interesting. I wanted to understand why and how SSL works, not just use pretrained models.
The idea: learn representations by pulling augmented views of the same audio together while pushing different audios apart.
Audio → Mel Spectrogram → ResNet Encoder → Projection Head → Contrastive Loss
Key components:
- Two randomly augmented views of each clip (the positive pair)
- A ResNet encoder shared between the views
- A small MLP projection head on top of the encoder
- A contrastive (NT-Xent) loss that treats the rest of the batch as negatives
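For concreteness, here is a minimal NT-Xent sketch; the temperature is a typical default, not necessarily the value I used:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """SimCLR's NT-Xent loss. z1, z2: (N, D) projections of two augmented
    views of the same N clips; every other sample in the batch is a negative."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D), unit length
    sim = (z @ z.T) / temperature                        # pairwise cosine similarities
    eye = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, -1e9)                     # ignore self-similarity
    # The positive for row i is its counterpart from the other view
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```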
Result: 82.5% accuracy
But here’s the catch: the contrastive task itself reached 98% accuracy during pretraining.
The model was solving a nearly trivial task.
Environmental sounds in ESC-50 are already highly separable in the spectral domain (dog bark vs chainsaw vs rain, etc.). With relatively mild augmentations, the identity of each sound barely changes. That means the encoder can solve the contrastive objective using coarse cues (energy bands, temporal envelope) instead of learning robust, invariant representations.
In other words, the model learned:
“these sounds are different”
instead of:
“these two transformed versions of the same sound are meaningfully the same”
Lesson: Contrastive learning only works when the task is hard enough. You need augmentations and negatives that force the model to learn invariances, not shortcuts.
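For example, SpecAugment-style masking is one way to make the two views genuinely harder to match; the mask widths below are arbitrary:

```python
import torch
import torchaudio.transforms as T

mel_spec = torch.randn(1, 128, 431)          # stand-in (channels, n_mels, time) log-mel

augment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=24),  # zero out a random band of mel bins
    T.TimeMasking(time_mask_param=48),       # zero out a random span of time frames
)
harder_view = augment(mel_spec)              # one of the two views fed to the encoder
```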
Even with this limitation, the learned representation still reached 82.5% accuracy, which is competitive with standard supervised CNN baselines on ESC-50.
Inspired by BERT and BEATs: hide patches of the spectrogram and predict discrete tokens.
Note: This was a simplified, educational implementation to understand the BEATs framework, not a full reproduction. The real BEATs includes iterative tokenizer refinement, larger models (90M+ params), and training on millions of samples. My goal was to grasp the core concepts: patch embeddings, masking, and discrete token prediction.
Spectrogram → Patch Embedding → Encoder → Predict Masked Tokens
The approach:
- Split the mel spectrogram into patches and embed each one
- Mask a large fraction of the patch embeddings
- Run the (partially masked) sequence through the encoder
- Predict the discrete token ID of every masked patch
The tokenizer (codebook):
# Tokenizing a patch: `patch` is a flattened (D,) tensor, `codebook` is (K, D)
distances = ((codebook - patch) ** 2).sum(dim=1)  # squared distance to each centroid
token = distances.argmin()                        # closest centroid = token ID
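Putting the pieces together, the training objective looks roughly like this; the module names, shapes, and mask ratio are illustrative, not my exact setup:

```python
import torch
from torch import nn
import torch.nn.functional as F

def masked_prediction_loss(encoder, head, patch_emb, tokens, mask_ratio=0.75):
    """patch_emb: (batch, patches, dim) embeddings; tokens: (batch, patches) codebook IDs."""
    b, n, d = patch_emb.shape
    mask = torch.rand(b, n, device=patch_emb.device) < mask_ratio   # which patches to hide
    mask_token = torch.zeros(d, device=patch_emb.device)            # a learned vector in practice
    x = torch.where(mask.unsqueeze(-1), mask_token, patch_emb)      # blank out the masked patches
    logits = head(encoder(x))                                       # (batch, patches, codebook_size)
    return F.cross_entropy(logits[mask], tokens[mask])              # loss only on masked positions

# Toy usage with stand-in modules, just to show the shapes
enc = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))
head = nn.Linear(256, 512)
loss = masked_prediction_loss(enc, head, torch.randn(4, 100, 256), torch.randint(0, 512, (4, 100)))
```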
Result: 74.5% accuracy (with CNN encoder)
Underperformed the supervised baseline. Hmm.
I tried a full Transformer encoder instead of CNN.
Architecture:
- Spectrogram patch embeddings feeding a stack of Transformer encoder blocks
- The same masked-token prediction objective as the CNN version
- Trained entirely from scratch, no pretraining
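Roughly what that looks like; the depth, width, and head counts here are arbitrary stand-ins, not my exact config:

```python
import torch
from torch import nn

class PatchTransformer(nn.Module):
    def __init__(self, patch_dim=256, dim=256, depth=6, heads=8, vocab_size=512, max_patches=512):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)                     # patch embedding
        self.pos = nn.Parameter(torch.zeros(1, max_patches, dim))  # learned positions
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, vocab_size)                     # masked-token logits

    def forward(self, patches):                                    # (batch, patches, patch_dim)
        x = self.embed(patches) + self.pos[:, : patches.size(1)]
        return self.head(self.encoder(x))

logits = PatchTransformer()(torch.randn(4, 100, 256))              # (4, 100, 512) token logits
```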
Result: 40% accuracy (a total collapse, FAAAHH!!)
This was humbling. Transformers need massive amounts of data. With only 1,600 training samples, the model couldn’t learn meaningful patterns. CNNs have stronger inductive biases (locality, weight sharing) that help when data is this limited.
After all my experiments, I tried Microsoft’s BEATs model, pretrained on AudioSet (2 million+ clips).
I tested two approaches:
Frozen encoder: Only train a new classifier head on top.
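Sketched out below, with a stand-in module in place of the pretrained BEATs encoder so the snippet runs on its own; the 768-dim feature size matches the BEATs base model:

```python
import torch
from torch import nn

# Stand-in for the pretrained BEATs encoder: in the real experiment this is the
# AudioSet-pretrained network, assumed to map inputs to (batch, frames, 768) features.
encoder = nn.Linear(128, 768)
for p in encoder.parameters():
    p.requires_grad = False                 # freeze every pretrained weight

classifier = nn.Linear(768, 50)             # the only trainable part: a 50-class head
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-3)

frames = torch.randn(8, 250, 128)           # dummy batch of mel frames
with torch.no_grad():                       # no gradients through the frozen encoder
    feats = encoder(frames)                 # (batch, frames, 768)
logits = classifier(feats.mean(dim=1))      # mean-pool over time, then classify
```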
Result: 94.50% accuracy
Fine-tuned with differential learning rates:
param_groups = [
    {'params': model.encoder.parameters(), 'lr': 1e-5},     # slow, to preserve the pretrained features
    {'params': model.classifier.parameters(), 'lr': 1e-3},  # fast, so the new head actually learns
]
optimizer = torch.optim.AdamW(param_groups)
Result: 95.25% accuracy 🎯
The frozen approach gets you 94.5% with minimal compute. The AudioSet pretraining is that good. Fine-tuning squeezes out another 0.75 points, which matters if you’re chasing leaderboards. But honestly? Frozen is probably fine for most use cases.
Even though I knew I probably wouldn’t make a dent against the pretrained model, I had to try it.
What are “tokens” in audio? In BEATs, tokens are discrete IDs representing audio patterns. K-means clustering groups similar spectrogram patches, and each patch gets a cluster ID. It’s like creating a vocabulary of audio building blocks.
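A toy version of that vocabulary-building step; the patch and codebook sizes here are arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster flattened spectrogram patches; the index of a patch's nearest centroid is its token.
patches = np.random.randn(10_000, 256).astype(np.float32)        # stand-in for real flattened patches

tokenizer = KMeans(n_clusters=512, random_state=0).fit(patches)  # 512-token "vocabulary"
tokens = tokenizer.predict(patches)                              # each patch -> ID of its nearest centroid
```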
Why iterate on the tokenizer? Clever trick: the tokenizer and model improve each other. Each iteration creates more meaningful tokens, forcing the model to learn finer distinctions.
Why two learning rates? Differential rates balance preservation and adaptation. Without this, you either destroy pretrained features (high LR) or the classifier never converges (low LR).
| Method | Accuracy | Approach |
|---|---|---|
| BEATs (fine-tuned) | 95.25% | Pretrained + differential LR |
| BEATs (frozen) | 94.50% | Pretrained, classifier only |
| ResNet-34 + TTA | 83.50% | Supervised baseline |
| SimCLR | 82.50% | Contrastive SSL |
| EfficientNet-B0 | 81.50% | ImageNet transfer |
| CNN-BEATs | 74.50% | Masked prediction |
| Transformer-BEATs | 40.00% | Tomfoolery 💀 |
The gap between my from-scratch SSL attempts and the pretrained BEATs tells the whole story.
This is why I love jumping between domains: contrastive learning from NLP, masked prediction from BERT, spectrogram tricks from signal processing. The dots connect in unexpected ways.
The real takeaway? Running these experiments taught me more than any tutorial ever could. Sometimes you have to build the thing yourself to understand why it works, and why it doesn’t.
What surprised me most is that self-supervised learning doesn’t automatically produce meaningful representations. It will happily learn shortcuts if the task allows it. Designing the right objective turns out to matter just as much as the model itself.