Uncovering The ‘Headless Headlines’
Before the internet era, news was delivered primarily via newspapers, journals, and radio broadcasts. These sources had to adhere to strict guidelines on the language and integrity of the information they published or broadcast. With the advent of social media, information sharing accelerated exponentially and in an uncontrolled manner. At this rate of exchange, it becomes difficult to verify the validity of every piece of news published, and given the anonymous nature of social media, it is often impossible to trace the source of an article. As a result, anyone with access to social media can write an unverified or fake news piece that goes viral within hours.
Fake news is written to mislead readers, to damage an agency, entity, or person, or to profit financially or politically. There were many instances in 2017 when fake news was widely shared across social media platforms, from Twitter to Facebook and WhatsApp, to the extent that “fake news” was named word of the year by Collins Dictionary due to its widespread use around the world.
Some Of The Hoax Stories That Caught Everybody’s Attention
A photoshopped picture of a shark on a street in Houston, Texas, circulated in August 2017 when Hurricane Harvey hit the US
Fraudulent stories during the 2016 U.S. presidential election included a viral post popularized on Facebook claiming that Pope Francis had endorsed Trump, and another that actor Denzel Washington “backs Trump in the most epic way possible”
In India, a post claimed that Swami Vivekananda’s statue had been beheaded in Uttar Pradesh. The news spread quickly and spiraled into a communal disturbance after a tweet claiming that Muslims had beheaded it went viral. Later, an anti-social element was arrested for spreading the false news
On November 8, 2016, India introduced a 2,000-rupee currency note on the same day it demonetized the 500- and 1,000-rupee notes. Fake news went viral over WhatsApp claiming that the new note came equipped with spying technology that could track bills even 120 meters below the earth. Finance Minister Arun Jaitley refuted the falsities, but not before they had spread to the country’s mainstream news outlets
As undesirable as it may seem, the fake news problem is genuine. In one example of how severe the situation can get, seven people lost their lives in two separate incidents in Jharkhand, in mob violence that was sparked on social media and based on falsified information the killers received over WhatsApp messenger. With time this problem has only grown graver. The situations created by these false stories carry even more weight when pictures or videos back the fallacy; people tend to believe them more readily and react faster in such cases.
Some attributes of fake news that can help identify such stories are the tone of the article, the facts it cites, whether multiple sources tell the same version of the story, and the reputation of the publisher, backed by cross-checking. However, the entire process of verifying a news item is quite cumbersome for one person. Also, with time, sources have learned what their readers look for in an article, based on comments and likes/shares, and they tweak the original content in such a manner that it appears genuine and is consumed by most readers.
Technology can be of great help here, as advanced algorithms can be trained on a significant amount of data to distinguish real from fake by identifying even subtle changes in the critical elements of a news item. We have done an exploratory data analysis of targeted data to identify patterns separating fake from real news, and developed a machine learning classifier for the same problem.
Data Availability And Analysis
There are a couple more datasets available for India, but they are either unstructured, untagged, or in a regional language. Due to this unavailability of data and the limitations of regional-language parsing, we have considered data for the USA only.
There are three datasets that are hand tagged, accurate, and have been worked on by many researchers. These open-source datasets are listed below:
The Kaggle fake news dataset: This has 13,000 rows (entries)
- The dataset contains specific categories of fake news articles collected over a month. It is well structured and includes 15 useful features.
- Dataset features: ID, author, published timestamp, title, content, language, site URL, country, domain rank, spam score, main image URL, comments, shares, type, etc.
A second hand-tagged dataset: title, text, and a binary label only
- The data is clean and has been hand tagged by many researchers, but it contains only the title and text of each article, with a binary label.
- Dataset features: ID, title, content, and binary label (fake/real)
As the datasets are structured, minimal pre-processing/data cleaning is needed. It is essential to select the relevant features for training the models to get the most out of the available data; this helps the models perform well and yield accurate insights.
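As an illustration of that minimal cleaning step, the sketch below selects a few features and drops incomplete rows from a tiny hand-made CSV shaped like the Kaggle data (the column names kept and the rows themselves are illustrative assumptions, not the real file):

```python
import csv
from io import StringIO

# Hypothetical miniature CSV in the shape of the Kaggle fake news data;
# the real file has 13,000 rows and 15 columns.
RAW = """title,text,language,country,type
Shark on the highway,A photoshopped picture went viral,english,US,fake
,Row with a missing title that should be dropped,english,US,bias
Pope endorses candidate,A viral post claimed an endorsement,english,US,fake
"""

KEEP = ("title", "text", "type")  # features selected for training

def clean(rows):
    """Keep only the selected columns and drop rows with empty values."""
    out = []
    for row in rows:
        slim = {k: row[k].strip() for k in KEEP}
        if all(slim.values()):  # drop rows missing any selected feature
            out.append(slim)
    return out

records = clean(csv.DictReader(StringIO(RAW)))
print(records[0]["title"])  # Shark on the highway
```

The same pattern scales to the full file: read, project onto the chosen features, filter incomplete rows.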
The datasets contain location-specific features such as country, language, author name, and publishing source, so the algorithms will perform better if the testing location matches that of the training dataset.
Insights From Analysis
The dataset comprises 13,000 newsgroup posts on more than 50 topics. It contains a sufficient number of data points and covers enough dimensions for an exploratory data analysis to yield useful insights. Below is a brief EDA, followed by the observations from each graph.
Count Of Data Points For Each Category Of A News Article
Bar graph of Article Category vs. Number of data samples
The above graph conveys that the data is biased, with most articles falling into the “BS (non-sense)” category. The BS category refers to articles that are entirely different from the original news or facts and make no sense to readers.
The “bias” category contains articles that have been altered in favor of a specific topic, religion, person, political party, etc. “Junksci” articles promise or showcase incredible technical advancements in order to promote products, cause panic, or gain traction.
The “fake” category contains articles that are entirely unrelated to the original news circulating at a specific time. This type of news is scarce: sources rarely publish wholly forged news, preferring to make small changes to genuine articles. Completely fake articles fail to get people’s attention most of the time, as we’ll see in the further analysis, and publishing entirely fraudulent content carries a high chance of legal action.
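Category counts like those in the bar graph can be tallied in a few lines; the labels below are a made-up sample in the shape of the dataset’s “type” column, not the real distribution:

```python
from collections import Counter

# Made-up sample of article labels in the shape of the dataset's "type"
# column; the real data is heavily skewed toward the "bs" class.
labels = ["bs", "bs", "bias", "bs", "junksci", "fake", "bs", "bias"]

counts = Counter(labels)
for category, n in counts.most_common():  # most to least frequent
    print(category, n)
```

Plotting `counts.most_common()` as a bar chart reproduces the graph above.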
Bar graph of Top occurring words and the corresponding article count
Median Sentiment Analysis Of The Dataset
We see that very few articles carry a joyous sentiment, and those that do belong to the fake class. The graph, in general, reflects the fact that the media gains its traction by preying on people’s fear and trust, which shows across multiple categories of news.
Example: the titles ‘The world is going to end in 2022’ and ‘Global warming is getting lower these days’ are both false, but the former gets more traction than the latter for the same reason: it appeals to the ‘fear’ sentiment.
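That kind of emotion tagging can be sketched with a small keyword lexicon; the lexicon below is an illustrative assumption, not the lexicon actually used for the median-sentiment graph:

```python
from collections import Counter

# Toy emotion lexicon; an illustrative assumption, not the real
# lexicon behind the sentiment analysis above.
LEXICON = {
    "end": "fear", "war": "fear", "die": "fear",
    "win": "joy", "celebrate": "joy",
    "warming": "anticipation", "lower": "anticipation",
}

def dominant_sentiment(title):
    """Tag a headline with the most frequent emotion among its words."""
    hits = Counter(LEXICON[w] for w in title.lower().split() if w in LEXICON)
    return hits.most_common(1)[0][0] if hits else "neutral"

print(dominant_sentiment("The world is going to end in 2022"))           # fear
print(dominant_sentiment("Global warming is getting lower these days"))  # anticipation
```

A production tagger would use a full emotion lexicon and handle negation, but the aggregation over headlines works the same way.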
Spam Score vs. Likes Point Plot
As mentioned earlier, most of the articles belonged to the ‘hate’ category and carried ‘trust’ and ‘anger’ sentiments. These articles were published during the 2016 U.S. election and mention ‘Donald Trump’ and ‘Hillary Clinton’ in large numbers. Figure 7 shows that posts with widely varying spam scores still received an adequate number of likes; a plausible reason could be that people prefer news with a ‘surprise’ element.
Word Frequency Cloud Of The Dataset
The above figure reflects that the most talked-about words are ‘Trump,’ ‘elections,’ ‘Philippines’ and ‘Chronicles.’ Combining this with the earlier observations, we can conclude that the articles that talked most about Trump and the U.S. elections were more than 50% spam and received the most likes.
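A word cloud is essentially a rendering of token frequencies. A minimal sketch of the underlying counting, over a made-up corpus (the text and the trimmed stop-word list are illustrative assumptions):

```python
import re
from collections import Counter

# Made-up concatenation of article text; a word cloud is just a
# drawing of these frequency counts.
corpus = ("Trump wins elections Trump elections Philippines "
          "Chronicles Trump elections Philippines")

STOPWORDS = {"wins"}  # a real run would use a full stop-word list
tokens = [t for t in re.findall(r"[a-z]+", corpus.lower())
          if t not in STOPWORDS]
top_words = Counter(tokens).most_common(3)
print(top_words)
```

Feeding the full frequency dictionary to a word-cloud renderer produces the figure above, with font size proportional to count.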
In 2016, the prominence of political misinformation circulating in America received substantial attention, following the surprise victory of President Trump. Such posts are explicitly published when elections are around the corner, mostly propagating incorrect information and misleading articles to gain page views or to impose individual opinions on the public. The sources use trust, fear, and anticipation as their weapons, altering the facts in the news to make them more believable.
The Signal Media Fake News Dataset
The data is first analyzed, and then the performance of different machine learning models in classifying fake and legitimate news is compared.
The dataset contains both labels, fake and legitimate, with neither class oversampled. Along with the fake/legit tag, each article has its news source attached.
Most Reliable And Unreliable Sources Of News
Comparison Of Different ML Models In News Articles Binary Classification Problem
Both of the datasets analyzed show that the U.S. elections were the most talked-about topic. The EDA reflects that it is possible to extract useful information from recent public news and articles and to identify the sentiment and category of an article. Classifying news as fake or legitimate is also possible with good accuracy.
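To illustrate the kind of comparison made above, here is a minimal sketch that pits a unigram naive Bayes classifier against a majority-class baseline on a toy hand-labeled corpus (the headlines, labels, and split below are illustrative assumptions, not the Signal Media data or the models actually benchmarked):

```python
import math
from collections import Counter, defaultdict

# Toy labeled headlines standing in for the tagged dataset; a real
# comparison would use thousands of articles and cross-validation.
train = [
    ("pope endorses candidate in shock post", "fake"),
    ("shark photographed swimming on highway", "fake"),
    ("currency note tracks citizens with chip", "fake"),
    ("senate passes budget after long debate", "legit"),
    ("court rules on trade dispute appeal", "legit"),
    ("minister announces new rail project", "legit"),
]
test_set = [("shock photo of shark on highway", "fake"),
            ("senate debate on budget continues", "legit")]

def majority_baseline(train, docs):
    """Predict the most common training label for every document."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return [label] * len(docs)

def naive_bayes(train, docs):
    """Unigram naive Bayes with Laplace smoothing."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, y in train:
        class_counts[y] += 1
        for w in text.split():
            word_counts[y][w] += 1
            vocab.add(w)
    def log_score(text, y):
        total = sum(word_counts[y].values())
        s = math.log(class_counts[y] / len(train))
        for w in text.split():
            # Laplace smoothing so unseen words do not zero the score.
            s += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
        return s
    return [max(class_counts, key=lambda y: log_score(text, y))
            for text, _ in docs]

def accuracy(preds, docs):
    return sum(p == y for p, (_, y) in zip(preds, docs)) / len(docs)

print(accuracy(naive_bayes(train, test_set), test_set))        # 1.0
print(accuracy(majority_baseline(train, test_set), test_set))  # 0.5
```

Even this tiny model beats the baseline by learning which words co-occur with each label, which is the essence of what the compared classifiers do at scale.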
The following traits of the data resulted in more accurate models:
- Data that is well sampled over the predefined classes, with articles accurately tagged, homogeneous in length, drawn from a predefined time frame, and consistent in their manner of delivery (humor/sensitive/satire, etc.).
- TF-IDF features, which showed promising results when trained with the given classifiers; combined with a couple of additional useful features and the right dataset, they can yield a highly accurate model.
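A minimal pure-Python sketch of the TF-IDF weighting mentioned above, over a toy three-document corpus (libraries such as scikit-learn compute a smoothed variant of the same idea):

```python
import math

# Toy three-document corpus; illustrative only.
docs = [
    "trump wins election",
    "trump holds rally",
    "shark seen on highway",
]

def tfidf(term, doc, corpus):
    """Term frequency times inverse document frequency (no smoothing).

    Assumes the term occurs in at least one corpus document."""
    words = doc.split()
    tf = words.count(term) / len(words)                 # within-document frequency
    df = sum(term in d.split() for d in corpus)         # documents containing term
    return tf * math.log(len(corpus) / df)

# "trump" appears in two of three documents, so it is down-weighted
# relative to "shark", which is unique to one document.
print(tfidf("shark", docs[2], docs) > tfidf("trump", docs[0], docs))  # True
```

Words that appear everywhere get weights near zero, while distinctive words dominate; classifiers trained on these vectors pick up exactly the distinctive vocabulary of fake articles.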
References
- UNESCO declaration of ‘Jana Gana Mana’ as the world’s best national anthem
- The Modern News Consumer, Pew Research Center
- Al Jazeera News: Social media fake news kills several people
- Kaggle fake news dataset
- Signal Media news articles dataset
- India news headlines dataset
- Scikit-learn newsgroups text dataset
- Signal Media fake news data analysis
- Advanced Machine Learning for Public Policy by Alden C. Golab