Multimodal Fusion

Cross-attention, embeddings, and audio-visual learning
