Representation-Based Data Quality Audits for Audio

2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)

Alvaro Gonzalez-Jimenez*,1,3 · Fabian Gröger*,1,2 · Linda Wermelinger1,2 · Andrin Bürli4 · Iason Kastanis4 · Simone Lionetti1 · Marc Pouly1

*Equal contribution
1Lucerne University of Applied Sciences and Arts · 2University of Basel 3University Hospital of Basel 4CSEM

Unified audit targets: OT · ND · LE (Off-Topic, Near-Duplicates, Label Errors)

Backbones evaluated: BEATs · M2D · EAT (strong “out-of-the-box” embeddings)

Operational benefit: up to 34× annotation review speed-up (ND, α=0.05)

Abstract

Data quality issues such as off-topic samples, near duplicates, and label errors often limit the performance of audio-based systems. This paper addresses these issues by adapting SelfClean, a representation-to-rank data auditing framework, from the image to the audio domain. The approach leverages self-supervised audio representations to identify common data quality issues and produce ranked review lists within a single unified process. We benchmark the method on ESC-50, GTZAN, and a proprietary industrial dataset, using both synthetic and naturally occurring corruptions. Results show state-of-the-art ranking performance that often surpasses issue-specific baselines, and the ranked lists yield significant annotation savings by efficiently guiding human review.

Figure: SelfClean-Audio overview — encoder → latent space → ranked lists for off-topic, near-duplicate, and label-error detection.
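To make the unified process concrete, below is a minimal sketch of how representation-based rankings for the three issue types can be derived, assuming precomputed clip embeddings `emb` (an N×D array, e.g. from BEATs) and integer class labels `labels`. The nearest-neighbour scoring rules are simplified proxies for illustration, not the paper's exact SelfClean criteria.

import numpy as np

def pairwise_cosine_dist(emb: np.ndarray) -> np.ndarray:
    """Cosine distance matrix between all clip embeddings."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return 1.0 - emb @ emb.T

def rank_near_duplicates(emb: np.ndarray):
    """ND: pairs with the smallest embedding distance are reviewed first."""
    d = pairwise_cosine_dist(emb)
    i, j = np.triu_indices(len(emb), k=1)
    order = np.argsort(d[i, j])
    return list(zip(i[order], j[order]))  # most-suspicious pairs first

def rank_off_topic(emb: np.ndarray, k: int = 10):
    """OT: samples far from their k nearest neighbours rank as outliers."""
    d = pairwise_cosine_dist(emb)
    np.fill_diagonal(d, np.inf)
    knn_dist = np.sort(d, axis=1)[:, :k].mean(axis=1)
    return np.argsort(-knn_dist)  # most isolated samples first

def rank_label_errors(emb: np.ndarray, labels: np.ndarray):
    """LE: samples closer to another class than to their own rank first."""
    d = pairwise_cosine_dist(emb)
    np.fill_diagonal(d, np.inf)
    same = labels[:, None] == labels[None, :]
    intra = np.where(same, d, np.inf).min(axis=1)   # nearest same-class clip
    extra = np.where(~same, d, np.inf).min(axis=1)  # nearest other-class clip
    return np.argsort(-(intra - extra))

Each function returns a review order rather than a hard decision, which is what allows a single embedding space to serve all three audit targets.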

Results

Synthetic evaluation on ESC-50 under three contamination rates (α ∈ {0.05, 0.1, 0.2}). The tables below report ranking performance, AUROC (area under the ROC curve) and AP (average precision), for off-topic (OT), near-duplicate (ND), and label-error (LE) detection; higher is better for both metrics.
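Both metrics can be computed directly from per-sample suspicion scores. A minimal runnable sketch with toy data (the names `scores` and `is_corrupted` are illustrative, not from the paper's code):

import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
is_corrupted = rng.random(1000) < 0.05             # e.g. α = 0.05 contamination
scores = is_corrupted + rng.normal(0, 0.5, 1000)   # toy suspicion scores

# AUROC: probability that a corrupted sample outranks a clean one.
auroc = roc_auc_score(is_corrupted, scores)
# AP: emphasises precision at the top of the ranked review list.
ap = average_precision_score(is_corrupted, scores)
print(f"AUROC={auroc:.3f}  AP={ap:.3f}")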

Table 1 — Pre-trained representations on ESC-50 (synthetic noise)

Performance of different audio embeddings for detecting OT / ND / LE across contamination rates.

Issue  Model         α=0.05 AUROC  α=0.05 AP  α=0.1 AUROC  α=0.1 AP  α=0.2 AUROC  α=0.2 AP
OT     CLMR          0.506         0.050      0.502        0.098     0.497        0.196
OT     CAV-MAE       0.309         0.049      0.260        0.075     0.273        0.161
OT     M2D           0.689         0.074      0.510        0.095     0.373        0.159
OT     EAT           0.591         0.070      0.596        0.138     0.544        0.222
OT     BEATs         0.766         0.253      0.745        0.316     0.673        0.341
OT     CLMR (SSL)    0.222         0.031      0.175        0.058     0.163        0.118
OT     BEATs (LoRA)  0.724         0.202      0.743        0.330     0.653        0.313
ND     CLMR          0.740         0.001      0.747        0.001     0.744        0.001
ND     CAV-MAE       0.744         0.032      0.724        0.017     0.730        0.018
ND     M2D           0.992         0.606      0.993        0.587     0.993        0.617
ND     EAT           0.930         0.482      0.922        0.468     0.931        0.476
ND     BEATs         0.972         0.606      0.978        0.595     0.978        0.625
ND     CLMR (SSL)    0.911         0.400      0.888        0.393     0.898        0.384
ND     BEATs (LoRA)  0.970         0.608      0.975        0.588     0.977        0.619
LE     CLMR          0.477         0.049      0.484        0.094     0.492        0.197
LE     CAV-MAE       0.721         0.222      0.693        0.299     0.658        0.387
LE     M2D           0.998         0.970      0.995        0.950     0.986        0.943
LE     EAT           0.969         0.668      0.969        0.759     0.954        0.793
LE     BEATs         0.996         0.927      0.992        0.908     0.980        0.903
LE     CLMR (SSL)    0.957         0.586      0.959        0.723     0.942        0.792
LE     BEATs (LoRA)  0.997         0.932      0.992        0.915     0.978        0.903

Table 2 — SelfClean vs issue-specific baselines (ESC-50 synthetic)

SelfClean (with BEATs embeddings) compared against Isolation Forest (OT), Dejavu fingerprinting (ND), and Confident Learning (LE); a hedged sketch of the learning-based baselines follows the table.

Issue  Model         α=0.05 AUROC  α=0.05 AP  α=0.1 AUROC  α=0.1 AP  α=0.2 AUROC  α=0.2 AP
OT     IForest       0.791         0.212      0.676        0.177     0.406        0.188
OT     SelfClean     0.766         0.253      0.745        0.316     0.673        0.341
ND     Dejavu        0.862         0.017      0.835        0.033     0.845        0.068
ND     SelfClean     0.972         0.606      0.978        0.595     0.978        0.625
LE     CLearning     0.994         0.884      0.994        0.951     0.993        0.973
LE     SelfClean     0.996         0.927      0.992        0.908     0.980        0.903
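For context, here is a hedged sketch of how two of these baselines are commonly run. It reuses the assumed `emb` and `labels` from the earlier sketch, additionally assumes out-of-fold classifier probabilities `pred_probs`, and omits the Dejavu baseline, which fingerprints raw audio rather than embeddings; hyper-parameters are illustrative, not the paper's.

from sklearn.ensemble import IsolationForest
from cleanlab.filter import find_label_issues  # Confident Learning

# Off-topic baseline: Isolation Forest anomaly scores over the embeddings.
iforest = IsolationForest(random_state=0).fit(emb)
ot_scores = -iforest.score_samples(emb)  # higher = more anomalous

# Label-error baseline: Confident Learning on out-of-fold class probabilities.
le_ranking = find_label_issues(
    labels=labels,           # observed (possibly noisy) labels
    pred_probs=pred_probs,   # (N, C) predicted class probabilities
    return_indices_ranked_by="self_confidence",
)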


BibTeX

@inproceedings{gonzalezjimenez2025representation,
  title     = {Representation-Based Data Quality Audits for Audio},
  author    = {Gonzalez-Jimenez, Alvaro and Gr{\"o}ger, Fabian and Wermelinger, Linda and B{\"u}rli, Andrin and Kastanis, Iason and Lionetti, Simone and Pouly, Marc},
  booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
  year      = {2026}
}