# Fake_News_Detection

This project is conducted with the goal of gaining a better understanding of fake news on social media. Using data about social media statements and their speakers, it builds and optimizes a machine learning model that predicts the truth value of political statements posted on social media. Detecting fake news matters because its rapid growth is a cause of concern for all aspects of life, from politics to financial markets to environmental protection. Election results, for example, could be manipulated by untruthful claims and deliberate lies from political candidates that sway the opinions of voters. As social media gradually becomes the primary news source for the public, social media platforms have a responsibility to examine the quality of news content and identify problematic statements before they reach wide audiences.

Social media platforms, especially Twitter and Facebook, should therefore pay attention to this project: a machine learning model for predicting fake news can increase the efficiency of detecting fake news in a large volume of political statements, a task that cannot be done manually at scale. Fact-checking organizations, such as The Washington Post's Fact Checker and the University of Pennsylvania's FactCheck.org, should also be interested, because the model can help them verify the factual accuracy of politicians' statements faster, flag suspicious tweets, and debunk more misinformation. Politicians, in turn, should care about the results, because an enhanced ability to detect fake news will force them to be more cautious about making false claims and untruthful allegations on social media. Last but not least, social media users stand to benefit: the model can help them reduce misinformation and confusion, understand issues more clearly, and make better judgments in elections.

Researchers have conducted similar studies on fake news detection in recent years. Some used supervised models. “Natural Language Processing based Hybrid Model for Detecting Fake News Using Content-Based Features and Social Features”, for example, built a machine learning model based on NLP techniques and achieved an average accuracy of 90.62% with an F1 score of 90.33% on a standard dataset. “Weakly Supervised Learning for Fake News Detection on Twitter” achieved an F1 score of up to 0.9 with a weakly supervised approach. Other researchers concentrated on deep learning models. “Fake News Detection on Social Media using Geometric Deep Learning”, for instance, achieved 92.7% ROC AUC, and “Fake News Identification on Twitter with Hybrid CNN and RNN Models” used a hybrid of CNN and long short-term memory RNN models to reach 82% accuracy. Overall, the accuracy of most of these models, supervised or not, exceeds 90%. Research in this area has also attracted interest from social media platforms: Twitter, for instance, acquired Fabula AI, a London-based startup that uses machine learning to detect the spread of misinformation online.

This project plans to train and optimize a random forest classification model to predict the truth value of social media statements. Features include the length of each statement, its subject and context, and the speaker's name, job title, state, party affiliation, and total credit-history counts. Two statements will be dropped because they have no values for any of the features. In addition, there are 3,565 missing values for the “title” feature, 2,747 for the “state” feature, and 129 for the “context” feature. These missing values will be replaced with the string “missing”, because dropping every row with a missing value would remove about 4,000 statements and leave only 8,438 statements for training, validation, and testing.
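A minimal sketch of this cleaning step, assuming the data is loaded into a pandas DataFrame; the file path and the exact column names (“title”, “state”, “context”) are illustrative, not the project's actual identifiers:

```python
import pandas as pd

# Hypothetical file path and column names; the real dataset may differ.
df = pd.read_csv("statements.csv")

feature_cols = ["title", "state", "context"]

# Drop statements that have no values for any of the features.
df = df.dropna(how="all", subset=feature_cols)

# Replace the remaining missing values with the string "missing"
# rather than dropping ~4,000 rows.
df[feature_cols] = df[feature_cols].fillna("missing")
```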

Labels are the six truth values for statements: “true”, “mostly-true”, “half-true”, “barely-true”, “false”, and “pants-fire”. However, given the difficulty of predicting six labels, this project will also collapse them into two categories to observe the change in model performance: a statement is labeled 1 if its truth value is “true”, “mostly-true”, or “half-true”, since these are closer to the truth, and 0 if it is “barely-true”, “false”, or “pants-fire”.
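A short sketch of this binarization, again assuming a pandas DataFrame with a hypothetical “label” column holding the six truth values:

```python
# Map the six truth values to a binary label: 1 for the three values
# closer to truth, 0 for the rest. "label" is an assumed column name.
truthful = {"true", "mostly-true", "half-true"}
df["binary_label"] = df["label"].isin(truthful).astype(int)
```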

This project intends to use a random hyperparameter grid with scikit-learn's RandomizedSearchCV, which samples parameter settings from the grid during fitting. Hyperparameters include the number of trees in the forest (n_estimators), the maximum number of features considered when splitting a node (max_features), the maximum depth of each decision tree (max_depth), the minimum number of data points required to split a node (min_samples_split), and the minimum number of data points allowed in a leaf node (min_samples_leaf). The project will also consider natural language processing for textual analysis of the statements, such as a unigram or n-gram feature space, if necessary. For the classifier to be considered successful, its accuracy should exceed 90%, since most models in past research reach that level; a classifier with lower accuracy would make more wrong predictions and have a negative impact on the stakeholders.
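A sketch of the planned search over the hyperparameters listed above; the value ranges, n_iter, and the X_train/y_train variables are illustrative assumptions, not the project's final settings:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Illustrative value ranges -- not the project's final grid.
param_grid = {
    "n_estimators": [100, 200, 400, 800],
    "max_features": ["sqrt", "log2", None],
    "max_depth": [10, 20, 40, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_grid,
    n_iter=50,           # number of parameter settings sampled from the grid
    cv=5,                # 5-fold cross-validation
    scoring="accuracy",
    random_state=42,
    n_jobs=-1,
)

# X_train/y_train stand for the encoded feature matrix and binary labels.
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```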

This project also plans a fairness audit: separating statements by political party to measure and describe the classifier's bias by party. This matters because it measures the disparate impact of the classification system on different parties. If the classifier turns out to be biased against certain parties, for instance by having a higher false positive rate or false negative rate for them, then it is unfair and should not be put into professional use. Politicians and their parties will care about the fairness of the classifier because they do not want to be unfairly discredited during campaigns. Twitter and fact-checking websites will likewise avoid a biased classification system, because they do not want to be accused of partisanship and risk losing a portion of their users. Last but not least, social media users should expect the classifier to be fair, because they want accurate information about issues and correct warnings when there is fake news.

To conduct the fairness audit, this project will compute the false positive rate and false negative rate by party and check for gaps in these indicators across parties. This error analysis helps identify the category of errors: large discrepancies between parties' false positive or false negative rates indicate that the classifier is biased by party. Politicians and their parties are most affected by the fairness audit, which should be conducted in collaboration with the social media platforms.
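A short sketch of that computation, assuming a DataFrame with hypothetical “party”, “binary_label” (true label), and “pred” (model prediction) columns:

```python
import numpy as np

# Compute false positive and false negative rates per party.
for party, group in df.groupby("party"):
    y_true = group["binary_label"].to_numpy()
    y_pred = group["pred"].to_numpy()
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    print(f"{party}: FPR={fpr:.3f}  FNR={fnr:.3f}")
```

Large gaps in FPR or FNR between parties would indicate party-level bias.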
