🎭 DeepFakeNews Dataset: A Comprehensive Resource for Misinformation Detection

The DeepFakeNews dataset is a novel and comprehensive dataset designed for the detection of both deepfakes and fake news. It extends and enhances the existing Fakeddit fake news dataset, with significant modifications to address the complexities of modern misinformation. To better understand this dataset, I strongly suggest reading the related paper from the authors HERE.

🚀 Enhancements

Derived from the Fakeddit fake news dataset, the DeepFakeNews dataset comprises a total of 509,916 images, 254,958 of which are deepfake images generated using three different generative models:

Stable Diffusion 2
Dreamlike
GLIDE

⚖️ Balance and Composition

Balanced Dataset: Contains an equal number of pristine (authentic) and generated (deepfake) images.
Removal of Hand-Modified Content: The original "manipulated content" category from Fakeddit, which consisted of images altered or modified by hand, has been removed. These have been replaced with deepfakes to provide a more relevant and challenging set of synthetic images.
Cleaning and Quality Control: The Fakeddit dataset was thoroughly cleaned, removing any images that were not found, contained only logos, or were otherwise unsuitable for deepfake detection. This cleaning process ensures a higher quality and more reliable dataset for training and evaluation.

🛠️ Application

The DeepFakeNews dataset is suitable for both deepfake detection and fake news detection. Its diverse and balanced nature makes it an excellent benchmark for evaluating multimodal detection systems that analyze both visual and textual content.

📁 Dataset Structure

The dataset is publicly available on Zenodo HERE and comes with three CSV files for training, testing, and validation sets, along with corresponding zip files containing the split images for each set. The deepfake images are named in both the CSV files and the image filenames following a specific format based on the generative model used: "SD_fake_imageid" for Stable Diffusion, "GL_fake_imageid" for GLIDE, and "DL_fake_imageid" for Dreamlike.
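
The naming convention above makes it possible to recover an image's origin from its identifier alone. The following is a minimal sketch, not part of the repository code: the function names and sample ids are hypothetical, and it assumes that any identifier without one of the three documented prefixes belongs to a pristine image.

```python
from collections import Counter

# Prefixes documented above. Treating every other id as pristine is an assumption.
GENERATOR_PREFIXES = {
    "SD_fake_": "Stable Diffusion",
    "GL_fake_": "GLIDE",
    "DL_fake_": "Dreamlike",
}


def image_source(image_id: str) -> str:
    """Return the generator name for a deepfake id, or 'pristine' otherwise."""
    for prefix, generator in GENERATOR_PREFIXES.items():
        if image_id.startswith(prefix):
            return generator
    return "pristine"


def source_counts(image_ids):
    """Count how many ids come from each source (generators and pristine)."""
    return Counter(image_source(image_id) for image_id in image_ids)


# Hypothetical ids, for illustration only.
sample_ids = [
    "SD_fake_abc123", "GL_fake_def456", "DL_fake_ghi789",
    "abc123", "def456", "ghi789",
]
counts = source_counts(sample_ids)
fake_total = sum(n for source, n in counts.items() if source != "pristine")
print(counts)
print("balanced:", fake_total == counts["pristine"])
```

Run over the identifier column of a split's CSV file, the same counts give a quick check that pristine and generated images are equally represented.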

🔄 Deepfake Generation Pipeline

The Deepfake Generation Pipeline involves a two-step approach:

Caption Generation: A captioning model first generates a caption for a pristine image.
Image Generation: The caption is then fed into a generative model to create a new synthetic image.
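
The two steps above can be sketched as a small function that chains a captioner and a generator. This is only an illustration of the control flow, not the pipeline used to build the dataset: `generate_deepfakes`, `caption_fn`, and `generate_fn` are hypothetical names, and the stand-in lambdas below take the place of a real captioning model and a real text-to-image model.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def generate_deepfakes(
    pristine_images: Dict[str, Any],
    caption_fn: Callable[[Any], str],
    generate_fn: Callable[[str], Any],
) -> Dict[str, Tuple[str, Any]]:
    """Map each pristine image id to (caption, synthetic image).

    caption_fn and generate_fn are hypothetical callables wrapping a
    captioning model and a text-to-image model respectively.
    """
    results = {}
    for image_id, image in pristine_images.items():
        caption = caption_fn(image)        # Step 1: caption the pristine image
        synthetic = generate_fn(caption)   # Step 2: generate an image from the caption
        results[image_id] = (caption, synthetic)
    return results


# Stand-in models so the sketch runs without any ML dependency.
fake_captioner = lambda image: f"a photo of {image}"
fake_generator = lambda caption: f"<synthetic image for '{caption}'>"

outputs = generate_deepfakes({"abc123": "a dog"}, fake_captioner, fake_generator)
print(outputs["abc123"])
```

Passing the models in as callables keeps the two steps independent, so the same loop can be reused with each of the generative models.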

Generated images make up half of the dataset. By incorporating images from multiple generative technologies, the dataset is designed to prevent any bias towards a single generation method when training detection models. This choice aims to improve the generalization of models trained on this dataset, so that they can recognize and flag deepfake content produced by a variety of methods, not just the ones they were exposed to during training. The other half consists of pristine, unaltered images, which keeps the dataset balanced for unbiased training and evaluation of detection models.
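
One way to probe generalization to unseen generators is to hold out all images from one generative model during training and evaluate on them afterwards. The sketch below shows such a split over image identifiers. It is an assumption about how one might use the dataset, not the evaluation protocol of the thesis. The function name is hypothetical, and identifiers without a known prefix (assumed to be pristine) are skipped and left for the caller to split separately.

```python
KNOWN_PREFIXES = ("SD_fake_", "GL_fake_", "DL_fake_")


def split_by_held_out_generator(image_ids, held_out_prefix):
    """Partition deepfake ids into (seen, unseen) by generator prefix.

    Ids that match no known prefix are ignored here; they are assumed
    to be pristine images that the caller splits on its own terms.
    """
    if held_out_prefix not in KNOWN_PREFIXES:
        raise ValueError(f"unknown generator prefix: {held_out_prefix!r}")
    seen, unseen = [], []
    for image_id in image_ids:
        if image_id.startswith(held_out_prefix):
            unseen.append(image_id)
        elif image_id.startswith(KNOWN_PREFIXES):
            seen.append(image_id)
    return seen, unseen


# Hypothetical ids, for illustration only.
ids = ["SD_fake_a", "GL_fake_b", "DL_fake_c", "a"]
seen, unseen = split_by_held_out_generator(ids, "GL_fake_")
print(seen, unseen)
```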

🔙 Retrocompatibility with Fakeddit

The dataset has been structured to maintain retrocompatibility with the original Fakeddit dataset. All samples have retained their original Fakeddit class labels (6_way_label), allowing for fine-grained fake news detection across the five remaining original categories: True, Satire/Parody, False Connection, Imposter Content, and Misleading Content (the sixth Fakeddit category, manipulated content, was removed as described above). As a result, the DeepFakeNews dataset can be used not only for multimodal and unimodal deepfake detection but also for traditional fake news detection tasks, making it a versatile resource for a wide range of research scenarios in digital misinformation detection.
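
Because the `6_way_label` column is retained, the label distribution of a split can be inspected directly from its CSV file. The following is a minimal sketch using only the standard library. The `id` column, the label values, and the comma delimiter in the sample are placeholders, since the exact CSV schema and label encoding are not described here.

```python
import csv
import io
from collections import Counter


def label_distribution(csv_file, label_column="6_way_label", delimiter=","):
    """Count rows per value of the label column in an open CSV file."""
    reader = csv.DictReader(csv_file, delimiter=delimiter)
    if label_column not in (reader.fieldnames or []):
        raise KeyError(f"column {label_column!r} not found in CSV header")
    return Counter(row[label_column] for row in reader)


# In-memory stand-in for one of the split CSV files (placeholder values).
sample_csv = io.StringIO(
    "id,6_way_label\n"
    "abc123,0\n"
    "SD_fake_abc123,0\n"
    "def456,2\n"
)
distribution = label_distribution(sample_csv)
print(distribution)
```

To use it on a real split, open the CSV file with `open(path, newline="")` and pass the file object in place of `sample_csv`.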

For full details about the dataset creation, cleaning pipeline, composition, and generation process, please refer to my Master's thesis.

