Audio deepfake detection is emerging as a crucial field in digital media, as distinguishing real audio from deepfakes becomes increasingly challenging with the advancement of generation technologies. These manipulations threaten information authenticity and pose serious security risks. Addressing this challenge, we propose a novel architecture that combines Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory (BiLSTM) for effective deepfake audio detection. Our approach is distinguished by the concatenation of a comprehensive set of acoustic features: Mel Frequency Cepstral Coefficients (MFCC), Mel spectrograms, Constant Q Cepstral Coefficients (CQCC), and Constant-Q Transform (CQT) vectors. In the proposed architecture, the features are first processed by a CNN, concatenated into two multi-dimensional representations for comprehensive analysis, and then fed to a BiLSTM network to capture temporal dynamics and contextual dependencies in the audio data. This combination lets the model capture both spatial and sequential audio characteristics. We validate our model on the ASVspoof 2019 and FoR datasets, using accuracy and Equal Error Rate (EER) as evaluation metrics.
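As a concrete illustration of the pipeline described in the abstract, the minimal sketch below extracts the four acoustic features with librosa and passes them through per-feature CNN branches followed by a BiLSTM classifier in PyTorch. The layer sizes, fixed frame count, single concatenation point, and the DCT-of-log-CQT approximation of CQCC are illustrative assumptions, not the configuration reported in the paper; the file name and helper names are hypothetical.

import librosa
import numpy as np
import torch
import torch.nn as nn
from scipy.fftpack import dct

def extract_features(path, sr=16000, n_frames=400):
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)                 # (20, T)
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64))        # (64, T)
    cqt = np.abs(librosa.cqt(y=y, sr=sr))                              # (84, T)
    # CQCC approximated here as the DCT of the log CQT power spectrum (assumption)
    cqcc = dct(np.log(cqt ** 2 + 1e-10), axis=0, norm='ortho')[:20]    # (20, T)
    def pad(x):  # crop or zero-pad every feature map to a fixed number of frames
        return np.pad(x, ((0, 0), (0, max(0, n_frames - x.shape[1]))))[:, :n_frames]
    return [pad(f).astype(np.float32) for f in (mfcc, mel, cqcc, cqt)]

class CNNBiLSTM(nn.Module):
    def __init__(self, n_feats=4, hidden=128):
        super().__init__()
        # one small CNN branch per acoustic feature, treated as a 1-channel image (bins x frames)
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d((8, 100)), nn.Flatten(1, 2))
            for _ in range(n_feats)])
        # BiLSTM over the concatenated per-frame CNN embeddings
        self.bilstm = nn.LSTM(input_size=16 * 8 * n_feats, hidden_size=hidden,
                              batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)  # real vs. fake logits

    def forward(self, feats):                 # feats: list of (B, bins, frames) tensors
        outs = [b(f.unsqueeze(1)) for b, f in zip(self.branches, feats)]   # each (B, 128, 100)
        x = torch.cat(outs, dim=1).transpose(1, 2)                          # (B, 100, 128*n_feats)
        _, (h, _) = self.bilstm(x)
        return self.head(torch.cat([h[-2], h[-1]], dim=1))                  # final fwd+bwd states

Usage, assuming a hypothetical file "sample.wav": feats = [torch.from_numpy(f).unsqueeze(0) for f in extract_features("sample.wav")]; logits = CNNBiLSTM()(feats). The adaptive pooling lets branches accept features with different bin counts (20, 64, or 84) without changing the downstream dimensions.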
Publication details
2024, IH&MMSec '24: Proceedings of the 2024 ACM Workshop on Information Hiding and Multimedia Security, Pages 271-276
Detecting audio deepfakes: integrating CNN and BiLSTM with multi-feature concatenation (04b Conference paper in proceedings volume)
Taiba Majid, Qadri Syed Asif Ahmad, Comminiello Danilo, Amerini Irene
ISBN: 979-8-4007-0637-0