Multilingual Audio DeepFake Detection

Dec 2024

Motivation

The overwhelming majority of audio deepfake (DF) detection research is built on English-language datasets such as ASVspoof. Yet synthetic speech technology is language-agnostic — TTS and voice-conversion systems produce convincing fakes in any language. This raises an uncomfortable question: are our best detectors actually polyglots, or are they just good at English?

This project — which forms the core of my Master’s thesis — set out to answer that question systematically.

What We Did

We evaluated a range of state-of-the-art DF detection models on speech spanning 8 languages: English, German, Polish, Ukrainian, Russian, Spanish, French, and Italian. For each language we studied three settings:

Zero-shot — model trained only on English, evaluated on a target language.
Intra-linguistic adaptation — model fine-tuned and evaluated within the same non-English language.
Cross-linguistic adaptation — model fine-tuned on one language and evaluated on another.

We also analysed which acoustic features survive the language shift and which become noise.

Key Findings

Zero-shot transfer degrades sharply for most languages, with equal-error rates sometimes tripling compared to English performance.
Intra-linguistic fine-tuning recovers much of the performance, confirming that target-language data is the most important factor.
Cross-linguistic transfer is inconsistent: languages with shared phonology can partially share knowledge, but there is no reliable free lunch.
Models that encode language-agnostic artefacts of neural vocoders are more robust; models that latch onto language-specific prosody are brittle.

Takeaways for Practitioners

If you are deploying audio deepfake detection in a non-English environment, you should not assume English-trained baselines will protect you. Collecting even a small amount of target-language data for adaptation can make a decisive difference.

The research is described in detail in Are Audio DeepFake Detection Models Polyglots? (arXiv 2024).

DF ML

Multilingual Audio DeepFake Detection

Motivation

What We Did

Key Findings

Takeaways for Practitioners

Bartłomiej Marek

PhD Candidate

Multilingual Audio DeepFake Detection

Motivation

What We Did

Key Findings

Takeaways for Practitioners

Related Publication

Bartłomiej Marek

PhD Candidate