Task: Video-Audio Modality Learning and Deepfake Classification into a 0-1 Probability
<aside> 💡 The project is published on GitHub.
</aside>
The input is a video with an .mp4 suffix paired with a ground-truth 0/1 label, so the task can be framed as a 0-1 classification problem. Each video can be split into distinct video and audio modalities, with the audio track converted to a mel spectrogram, enabling separate video feature extraction and audio feature extraction. There are several models I am acquainted with for extracting video embeddings; the audio extractor, however, remains to be thoroughly evaluated. Beyond that, it is necessary to study how to perform alignment between the different modalities.
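To make that concrete, here is a minimal sketch of the audio side, assuming ffmpeg is available on PATH and picking torchaudio for the mel spectrogram; the sample rate and mel-bin count are illustrative defaults of mine, not values fixed by the project:

```python
import subprocess

import torch
import torchaudio

def extract_log_mel(video_path: str, wav_path: str = "tmp.wav",
                    sample_rate: int = 16000, n_mels: int = 128) -> torch.Tensor:
    """Strip the audio track from an .mp4 and turn it into a log-mel spectrogram."""
    # Re-encode the audio track as mono PCM at the target sample rate
    # (assumes ffmpeg is installed and on PATH).
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-ac", "1",
         "-ar", str(sample_rate), wav_path],
        check=True, capture_output=True,
    )
    waveform, sr = torchaudio.load(wav_path)  # (1, num_samples)
    mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_mels=n_mels)(waveform)
    # Log scaling compresses the dynamic range, which CNNs tend to handle better.
    return torch.log(mel + 1e-6)  # (1, n_mels, time)
```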
The structure of the phase-1 task dataset is as follows:
```
ffdv-phase1-sample-10k/
│
├── trainset/
│   ├── file1.mp4
│   ├── file2.mp4
│   └── ...
│
├── valset/
│   ├── file3.mp4
│   ├── file4.mp4
│   └── ...
│
├── trainset_label.txt
└── valset_label.txt
```
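Given that layout, pairing each clip with its label might look like the sketch below; I am assuming each line of the label file is a `filename,label` pair, which should be checked against the real trainset_label.txt:

```python
import os
from typing import List, Tuple

def load_split(root: str, split: str) -> List[Tuple[str, int]]:
    """Pair each .mp4 under <root>/<split>/ with its 0/1 label.

    Hypothetical label format: one `filename,label` per line; adjust the
    parsing (e.g., skip a header row) to match the actual file.
    """
    samples = []
    with open(os.path.join(root, f"{split}_label.txt")) as f:
        for line in f:
            name, label = line.strip().split(",")
            samples.append((os.path.join(root, split, name), int(label)))
    return samples

train_samples = load_split("ffdv-phase1-sample-10k", "trainset")
val_samples = load_split("ffdv-phase1-sample-10k", "valset")
```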
The baseline model considers only the audio feature (as explained on the competition web page, analyzing the audio spectrogram helps to identify anomalies) and employs a ResNet18 to extract features. It is a classical model but still leaves room for improvement.
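As a sketch of what such a baseline might look like (not necessarily its exact configuration), the stem can be narrowed to one channel for the spectrogram and the head reduced to a single logit for the 0-1 probability:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class AudioBaseline(nn.Module):
    """ResNet18 over a log-mel spectrogram, squashed to a fake probability.

    The 1-channel stem and 1-logit head are my assumptions for a minimal
    audio-only classifier, not details confirmed by the baseline code.
    """

    def __init__(self):
        super().__init__()
        self.backbone = resnet18(weights=None)
        # Spectrograms are single-channel, so replace the 3-channel stem.
        self.backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7,
                                        stride=2, padding=3, bias=False)
        # One logit for real-vs-fake classification.
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 1)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, 1, n_mels, time) -> probability in [0, 1]
        return torch.sigmoid(self.backbone(mel)).squeeze(1)
```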
I don't have much understanding of video data processing yet, so for now I've tentatively decided to simply extract 32 frames at fixed intervals from each video and normalize them, pending further improvements.
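A straightforward OpenCV implementation of that sampling scheme could look like this; resizing to 224x224 is an added assumption to give the frames a CNN-friendly shape:

```python
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 32) -> np.ndarray:
    """Grab num_frames frames at fixed intervals and normalize them to [0, 1]."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip.
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (224, 224))
        frames.append(frame.astype(np.float32) / 255.0)
    cap.release()
    if not frames:
        raise RuntimeError(f"could not decode any frames from {video_path}")
    return np.stack(frames)  # (num_frames, 224, 224, 3), BGR channel order
```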