Task: Video-Audio Modality Learning and Deepfake Classification into a 0-1 Probability
<aside> 💡 The project is published on GitHub.
</aside>
The input is a video with an .mp4 suffix paired with a ground-truth 0/1 label, so the task can be framed as a 0-1 classification problem. Each video can be split into distinct video and audio modalities, with the audio track converted to a mel spectrogram, enabling separate video feature extraction and audio feature extraction. There are several models I am acquainted with for extracting video embeddings; the audio extractor, however, remains to be thoroughly evaluated. Beyond that, it is necessary to study how to perform alignment between the different modalities.
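To make that concrete, here is a minimal sketch of the audio side, assuming ffmpeg is available on PATH and picking torchaudio for the mel spectrogram; the sample rate and mel-bin count are illustrative defaults of mine, not values fixed by the project:

```python
import subprocess

import torch
import torchaudio

def extract_log_mel(video_path: str, wav_path: str = "tmp.wav",
                    sample_rate: int = 16000, n_mels: int = 128) -> torch.Tensor:
    """Strip the audio track from an .mp4 and turn it into a log-mel spectrogram."""
    # Re-encode the audio track as mono PCM at the target sample rate
    # (assumes ffmpeg is installed and on PATH).
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-ac", "1",
         "-ar", str(sample_rate), wav_path],
        check=True, capture_output=True,
    )
    waveform, sr = torchaudio.load(wav_path)  # (1, num_samples)
    mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_mels=n_mels)(waveform)
    # Log scaling compresses the dynamic range, which CNNs tend to handle better.
    return torch.log(mel + 1e-6)  # (1, n_mels, time)
```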
The structure of the phase-1 task dataset is as follows:
```
ffdv-phase1-sample-10k/
│
├── trainset/
│   ├── file1.mp4
│   ├── file2.mp4
│   └── ...
│
├── valset/
│   ├── file3.mp4
│   ├── file4.mp4
│   └── ...
│
├── trainset_label.txt
└── valset_label.txt
```
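Given that layout, pairing each clip with its label might look like the sketch below; I am assuming each line of the label file is a `filename,label` pair, which should be checked against the real trainset_label.txt:

```python
import os
from typing import List, Tuple

def load_split(root: str, split: str) -> List[Tuple[str, int]]:
    """Pair each .mp4 under <root>/<split>/ with its 0/1 label.

    Hypothetical label format: one `filename,label` per line; adjust the
    parsing (e.g., skip a header row) to match the actual file.
    """
    samples = []
    with open(os.path.join(root, f"{split}_label.txt")) as f:
        for line in f:
            name, label = line.strip().split(",")
            samples.append((os.path.join(root, split, name), int(label)))
    return samples

train_samples = load_split("ffdv-phase1-sample-10k", "trainset")
val_samples = load_split("ffdv-phase1-sample-10k", "valset")
```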
The baseline model considers only the audio feature (as explained on the competition web page, analyzing the audio spectrogram helps to identify anomalies) and employs a ResNet18 to extract features. It is a classical model but still leaves room for improvement.
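As a sketch of what such a baseline might look like (not necessarily its exact configuration), the stem can be narrowed to one channel for the spectrogram and the head reduced to a single logit for the 0-1 probability:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class AudioBaseline(nn.Module):
    """ResNet18 over a log-mel spectrogram, squashed to a fake probability.

    The 1-channel stem and 1-logit head are my assumptions for a minimal
    audio-only classifier, not details confirmed by the baseline code.
    """

    def __init__(self):
        super().__init__()
        self.backbone = resnet18(weights=None)
        # Spectrograms are single-channel, so replace the 3-channel stem.
        self.backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7,
                                        stride=2, padding=3, bias=False)
        # One logit for real-vs-fake classification.
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 1)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, 1, n_mels, time) -> probability in [0, 1]
        return torch.sigmoid(self.backbone(mel)).squeeze(1)
```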
I don't have much understanding of video data processing yet, so for now I've tentatively decided to simply extract 32 frames at fixed intervals from each video and normalize them, pending further improvements.
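A straightforward OpenCV implementation of that sampling scheme could look like this; resizing to 224x224 is an added assumption to give the frames a CNN-friendly shape:

```python
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 32) -> np.ndarray:
    """Grab num_frames frames at fixed intervals and normalize them to [0, 1]."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip.
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (224, 224))
        frames.append(frame.astype(np.float32) / 255.0)
    cap.release()
    if not frames:
        raise RuntimeError(f"could not decode any frames from {video_path}")
    return np.stack(frames)  # (num_frames, 224, 224, 3), BGR channel order
```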