As generative AI advances, malicious actors can now clone a voice with alarming accuracy from only a few seconds of source audio.
TruTone is an adaptive audio deepfake detection system that uses machine learning to distinguish between genuine human speech and artificially synthesized voice recordings. The system addresses the growing challenge posed by AI-generated audio, which has become increasingly realistic and difficult to detect using conventional methods.
The system is implemented in Python around a forensic-grade feature extraction pipeline that computes a 98-point numerical vector from each audio file. This vector captures Mel-Frequency Cepstral Coefficients (MFCCs), their first- and second-order derivatives, spectral centroid, spectral rolloff, spectral bandwidth, zero-crossing rate, RMS energy, chroma features, and spectral flatness.
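To make the idea of a fixed-length feature vector concrete, here is a minimal sketch in plain Python. It covers only two of the descriptors listed above (zero-crossing rate and RMS energy) and summarizes each by its mean and standard deviation across frames. The frame length, hop size, summary statistics, and every function name are assumptions for illustration; the actual composition of TruTone's 98-point vector is not specified here.

```python
import math
from statistics import mean, pstdev


def frame_signal(samples, frame_len=1024, hop=512):
    """Split samples into overlapping frames (any ragged tail is dropped)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def rms_energy(frame):
    """Root-mean-square amplitude of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def summarize(samples, frame_len=1024, hop=512):
    """Return [zcr_mean, zcr_std, rms_mean, rms_std] for a signal.

    Hypothetical layout: per-frame descriptors are collapsed into
    summary statistics so that audio of any duration maps to a
    vector of the same length.
    """
    frames = frame_signal(samples, frame_len, hop)
    if not frames:
        raise ValueError("signal is shorter than one frame")
    vector = []
    for descriptor in (zero_crossing_rate, rms_energy):
        values = [descriptor(f) for f in frames]
        vector.extend([mean(values), pstdev(values)])
    return vector


# Usage: one second of a 440 Hz sine tone sampled at 16 kHz.
sr = 16000
tone = [math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
features = summarize(tone)
```

Spectral descriptors such as MFCCs or the spectral centroid would be appended to the same vector in the same way, each contributing its own summary statistics.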
A Weighted Soft-Voting Ensemble combining a Random Forest classifier and a Logistic Regression model is trained on a labeled dataset of 841 audio samples. The ensemble achieves an accuracy of 96.9%, a ROC-AUC score of 0.982, and a false positive rate of 2.1%. Deployed with a high-performance REST API, TruTone allows users to upload audio files and receive a forensic verdict in under 300 milliseconds.
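Weighted soft voting averages the class probabilities produced by each model, with each model's contribution scaled by a weight. The sketch below shows the arithmetic in plain Python. The 2:1 weighting and the example probabilities are assumptions chosen for illustration, not TruTone's trained values. In scikit-learn, the same scheme is available through `VotingClassifier` with `voting="soft"` and a `weights` argument, though this page does not state which library TruTone uses.

```python
def weighted_soft_vote(probas, weights):
    """Combine per-model class-probability lists into one weighted average.

    probas  -- one probability list per model, all of the same length
    weights -- one non-negative weight per model
    """
    if not probas or len(probas) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_classes = len(probas[0])
    return [
        sum(w * p[c] for p, w in zip(probas, weights)) / total
        for c in range(n_classes)
    ]


# Hypothetical outputs for one clip, ordered as [P(genuine), P(synthetic)].
rf_proba = [0.10, 0.90]   # Random Forest
lr_proba = [0.30, 0.70]   # Logistic Regression

combined = weighted_soft_vote([rf_proba, lr_proba], weights=[2.0, 1.0])
label = max(range(len(combined)), key=combined.__getitem__)  # index of top class
```

Because the output is a probability rather than a hard vote, the combined score can be thresholded later, which is what makes a borderline verdict tier possible.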
Designing an accurate and accessible defense mechanism.
To develop a forensic feature extraction pipeline that captures a comprehensive set of spectral, temporal, and harmonic properties from audio recordings.
To train and evaluate a Weighted Soft-Voting Ensemble combining Random Forest and Logistic Regression models on a balanced dataset of genuine and synthetic speech.
To implement a multi-tiered verdict scheme that distinguishes between definitive and borderline detection results, reducing the risk of false accusations.
To deploy the detection engine as a web application with a REST API, enabling both interactive and programmatic access for seamless integration.
To evaluate the system's usability with real users and incorporate their feedback into the interface design. The evaluation yielded a System Usability Scale (SUS) score of 81.2 out of 100.
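The multi-tiered verdict objective can be sketched as a mapping from the ensemble's synthetic-speech probability to one of three outcomes, with a middle band reserved for borderline cases. The threshold values (0.35 and 0.65), the tier labels, and the function name are all hypothetical; TruTone's actual cut-offs and wording are not given on this page.

```python
def verdict(p_synthetic, low=0.35, high=0.65):
    """Map a synthetic-speech probability to a three-tier verdict.

    Scores inside the (low, high) band are reported as inconclusive
    rather than forced into a definitive label. Thresholds here are
    illustrative assumptions.
    """
    if not 0.0 <= p_synthetic <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if not low < high:
        raise ValueError("low threshold must be below high threshold")
    if p_synthetic >= high:
        return "likely synthetic"
    if p_synthetic <= low:
        return "likely genuine"
    return "inconclusive"


# Usage with three hypothetical ensemble scores.
results = [verdict(p) for p in (0.92, 0.50, 0.08)]
```

Widening the middle band trades coverage for caution: fewer clips receive a definitive label, but fewer genuine recordings are flagged as synthetic.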
A robust, scalable pipeline from audio ingestion to forensic verdict.
The architecture is designed for speed and reliability. Uploaded audio files bypass permanent storage, moving directly into the memory-resident extraction pipeline.
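As an illustration of keeping an upload memory-resident, the sketch below decodes WAV bytes directly from a buffer using only the Python standard library, so nothing is written to disk. It assumes 16-bit PCM input and keeps only the first channel. The function names are hypothetical, and TruTone's actual decoder and supported formats may differ.

```python
import io
import math
import struct
import wave


def decode_wav_bytes(payload):
    """Decode 16-bit PCM WAV bytes held in memory.

    Returns (sample_rate, samples) with samples scaled to [-1, 1).
    Only the first channel is kept in this sketch.
    """
    with wave.open(io.BytesIO(payload), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        n_channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    ints = struct.unpack("<%dh" % (len(raw) // 2), raw)
    first_channel = ints[::n_channels]  # samples are interleaved by channel
    return sample_rate, [s / 32768.0 for s in first_channel]


def make_test_wav(sample_rate=16000, n_samples=1600):
    """Build a short mono 440 Hz tone as WAV bytes, standing in for an upload."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        pcm = [int(32767 * 0.5 * math.sin(2 * math.pi * 440 * i / sample_rate))
               for i in range(n_samples)]
        wav.writeframes(struct.pack("<%dh" % n_samples, *pcm))
    return buf.getvalue()


# Usage: the payload never touches the filesystem.
payload = make_test_wav()
sample_rate, samples = decode_wav_bytes(payload)
```

The decoded sample list can be handed straight to feature extraction, and the buffer is released once the request completes.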
Optimized for near-real-time forensic analysis.
The system separates concerns across Presentation, Application, Service, and Data layers.
Watch TruTone analyze and classify audio deepfakes in real time.
Have questions about the TruTone architecture or interested in research collaboration? Reach out to the development team.