AFL-Net: INTEGRATING AUDIO, FACIAL, AND LIP MODALITIES WITH CROSS-ATTENTION FOR ROBUST SPEAKER DIARIZATION IN THE WILD




Abstract

Speaker diarization in real-world videos presents significant challenges due to varying acoustic conditions, diverse scenes, and the presence of off-screen speakers, among other factors. This paper builds on a previous study (AVR-Net) and introduces a novel multi-modal speaker diarization system, AFL-Net. Unlike AVR-Net, which extracts high-level representations from each modality independently, AFL-Net employs a multi-modal cross-attention mechanism that generates each modality's high-level representation while conditioning on the others, ensuring more comprehensive information fusion across modalities to enhance identity discrimination. Furthermore, AFL-Net incorporates dynamic lip movement as an additional modality to help distinguish each segment's identity. We also introduce a training-time masking strategy that randomly obscures the face and lip movement modalities, increasing the influence of the audio modality on system outputs. Experimental results demonstrate that our proposed model achieves state-of-the-art diarization error rates (DERs) of 23.65% and 19.76% on the AVA-AVD benchmark when trained on AVA-AVD alone and on a combination of VoxCeleb1, VoxCeleb2, and AVA-AVD, respectively. These results correspond to relative DER reductions of 13.8% and 7.0% over AVR-Net. Moreover, our experiments confirm the effectiveness of the proposed system even under varying missing rates of visual features.

Method

Code will be released upon acceptance.
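Until the official implementation is released, the snippet below is a minimal PyTorch sketch of the multi-modal cross-attention fusion described in the abstract: each modality's representation is computed while attending to the other two. The embedding dimension, head count, concatenation of the two context streams, and mean-pooling are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """A minimal cross-attention block: queries come from one modality,
    keys/values from another, so the updated representation is
    conditioned on the other modality (layer sizes are assumptions)."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats, context_feats):
        # query_feats:   (B, Tq, dim) features of the modality being updated
        # context_feats: (B, Tk, dim) features of the conditioning modality
        attended, _ = self.attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + attended)  # residual connection


class AudioFaceLipFusion(nn.Module):
    """Sketch of fusing audio, face, and lip features with cross-attention:
    each modality attends to the concatenation of the other two."""

    def __init__(self, dim=256):
        super().__init__()
        self.audio_attn = CrossModalAttention(dim)
        self.face_attn = CrossModalAttention(dim)
        self.lip_attn = CrossModalAttention(dim)

    def forward(self, audio, face, lip):
        # All inputs: (B, T, dim) outputs of per-modality encoders.
        audio_out = self.audio_attn(audio, torch.cat([face, lip], dim=1))
        face_out = self.face_attn(face, torch.cat([audio, lip], dim=1))
        lip_out = self.lip_attn(lip, torch.cat([audio, face], dim=1))
        # Pool over time and concatenate into one identity embedding.
        fused = torch.cat(
            [audio_out.mean(1), face_out.mean(1), lip_out.mean(1)], dim=-1
        )
        return fused  # (B, 3 * dim)
```

Conditioning each stream on the others in this way, rather than encoding them independently as in AVR-Net, is what allows the fused embedding to capture cross-modal identity cues.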
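The abstract also mentions a training-time masking strategy that randomly obscures the visual modalities so the system leans more heavily on audio. Below is a hedged sketch of one way such masking could be implemented; whether the face and lip streams are dropped jointly or independently, and the drop probability `p_mask`, are assumptions not confirmed by the paper.

```python
import torch


def mask_visual_modalities(face_feats, lip_feats, p_mask=0.5):
    """Randomly zero out the face and lip streams for some samples in a
    batch (a sketch; p_mask and per-modality independence are assumed).

    face_feats, lip_feats: (B, T, dim) visual feature tensors.
    """
    def draw_keep_mask(feats):
        # Per-sample Bernoulli draw; 0 drops that sample's visual stream.
        keep = (torch.rand(feats.size(0), device=feats.device) > p_mask)
        return keep.float().view(-1, 1, 1)  # broadcast over time, channels

    return face_feats * draw_keep_mask(face_feats), \
           lip_feats * draw_keep_mask(lip_feats)
```

At inference time the features would pass through unmasked; applying the masking only during training encourages robustness to off-screen speakers and missing visual features.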

Demo

Each demo below compares AFL-Net (ours) against AVR-Net (baseline), accompanied by a short analysis.
Demo 1: In this video, a conversation unfolds between two individuals. Both models accurately identify the first two sentences. However, the baseline model stumbles on the third sentence, potentially confused by both individuals' faces appearing on screen simultaneously. Conversely, AFL-Net recognizes the sentence correctly, a success that could be attributed to the additional lip movement modality.
Demo 2: In this video segment, AVR-Net mistakenly identifies the second individual as the first throughout the entire segment, potentially due to their similar visual appearances. The proposed AFL-Net accurately recognizes the second individual, validating the effectiveness of the fusion strategy across modalities.
Demo 3: In this video, the second speaker is off-screen for part of the segment. AVR-Net incorrectly identifies this speaker as the first one, whereas the proposed AFL-Net recognizes the speaker accurately. This could be because the proposed model is trained to rely more heavily on the audio modality, which is directly linked to the speaker's identity.