Uncovering The ‘Headless Headlines’

AUGUST 27, 2018
Gaurav Tolani

Problem Statement

“UNESCO declares India’s ‘Jana Gana Mana’ the World’s Best National Anthem.” This was one of the most eye-catching headlines of November 2017. It garnered so much attention that almost everybody in India started believing it, until a news channel (India Today) wrote to Sue Williams (Chief Editorial, Bureau of Public Information, UNESCO) to confirm the news. In the end, the story was proven to be fake.

Before the internet era, news was delivered primarily via newspapers, journals, and radio broadcasts. These sources had to adhere to strict guidelines for the language and integrity of the information they broadcast or published. With the advent of social media, information sharing accelerated at an exponential rate and in an uncontrolled manner. At this rate of exchange, it becomes difficult to verify the validity of every piece of news published. Given the opaque nature of social media, it is also difficult to trace the source of an article. As a result, anyone with access to social media can write an unverified or fake news piece that goes viral in hours.
Percentage Of People Who Prefer Getting Their News On Each Platform
(Source: The Modern News Consumer, Pew Research Center; survey conducted Jan. 12 – Feb. 8, 2016)
The table above shows that the segment of people accessing news online prefers reading to watching or listening. In the era of social networking, the primary sources of news are social networking websites and mobile applications. Unlike radio and print, social media does not have any regulations in place. Soroush Vosoughi, a data scientist at MIT who has been studying fake news since 2013, concluded, ‘It seems to be pretty clear that false information outperforms true information.’
Percentage Of The Age Group Who Often Get News On Each Platform
(Source: The Modern News Consumer, Pew Research Center; survey conducted Jan. 12 – Feb. 8, 2016)
The table above shows that for the 18-29 and 30-49 age groups, the primary source of news is online, followed by television. While regulation exists for news transmitted over TV, and news channels are cautious about what gets reported, online sources can publish news without any scrutiny. A few well-known outlets like WSJ, Forbes, CNN, and the BBC do act once a story faces backlash; most sources, however, can get away with publishing incorrect information.

Fake news intends to mislead or damage an agency, entity, or person, and to gain financially or politically. There were many instances in 2017 when fake news was widely shared across social media platforms, from Twitter to Facebook and WhatsApp, to the extent that “fake news” was named word of the year by Collins Dictionary due to its widespread use around the world.

Some Of The Hoax Stories That Caught Everybody’s Attention

A photoshopped picture of a shark on a street in Houston, Texas, from August 2017, when Hurricane Harvey hit the US

Fraudulent stories during the 2016 U.S. presidential election included a viral post popularized on Facebook claiming that Pope Francis had endorsed Trump, and another that actor Denzel Washington “backs Trump in the most epic way possible”

In India, a post claimed that Swami Vivekananda’s statue had been beheaded in Uttar Pradesh. The news spread quickly and spiraled into a communal disturbance after a tweet claiming that Muslims had beheaded it went viral. Later, an anti-social element was arrested for spreading the false news

On November 8, 2016, India introduced a 2,000-rupee currency note on the same day the 500- and 1,000-rupee notes were demonetized. Fake news went viral over WhatsApp claiming that the new note came equipped with spying technology that could track bills 120 meters below the earth. Finance Minister Arun Jaitley refuted the falsities, but not before they had spread to the country’s mainstream news outlets

As undesirable as it may seem, the fake news problem is genuine. In one example of how severe the situation can get, seven people lost their lives in two separate incidents in Jharkhand, in a fury that was born on social media and based on falsified information the killers received over WhatsApp. With time, this problem has only grown graver. The situations created by these false stories gain weight when pictures or videos back the fallacy; people tend to believe such stories more readily and react faster.

Some attributes of fake news that can help identify such stories are the tone of the article, the facts presented, multiple sources telling their versions of the story, and the reputation of the publisher, backed by cross-checking. However, the entire process of verifying the news is quite cumbersome for one person. Also, with time, sources have learned what their readers look for in an article, based on comments and likes/shares, and they tweak the original content so that it appears genuine and gets consumed by most readers.

Technology can be of great help here: advanced algorithms can be trained on a significant amount of data to interpret what is real and what is fake by identifying even subtle changes in the critical elements of a news story that help determine its authenticity. We have done an exploratory data analysis of a targeted dataset to identify patterns separating fake from real news and have developed a machine learning classifier for the same problem.

Data Availability And Analysis

Fake news can be detected using many references and features, including the author’s past data, the credibility of the data source, writing style, writing pattern, and background checks on the data. It can also be detected from the time and conditions in which the news was sourced, and the place and medium of sharing the story. The font type, too, can play a role in deciding whether the news is fake. Some of the attributes required to analyze a news article are:
[Figure: attributes required to analyze a news article]

Data Availability

Sufficient data specific to the Indian scenario is not available, and therefore we have analyzed and worked on news data particular to the US region from various sources. Indian fake news datasets are not available in an accurate and structured format. Some of the datasets considered are:

  1. Kaggle news headlines dataset: This dataset contains only the headlines and not the content. While headlines alone can be analyzed, it is the content that forms the basis for comprehending whether an article is fake.
  2. Scikit-learn newsgroups dataset: The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics. It contains an adequate number of data points with the required features, but the tagging does not match the problem statement (fake/legitimate); see the sketch below.
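
For reference, the 20 newsgroups dataset can be inspected with scikit-learn’s built-in loader; the topic labels it prints make it clear why its tags do not fit a fake/legitimate scheme:

```python
# Loading scikit-learn's 20 newsgroups dataset; its labels are topic names,
# not fake/legitimate tags, which is why it does not fit this problem.
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset="all")
print(len(newsgroups.data))         # ~18,846 posts across the 20 topics
print(newsgroups.target_names[:3])  # ['alt.atheism', 'comp.graphics', ...]
```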

There are a few more datasets available for India, but they are either unstructured, untagged, or in regional languages. Due to the unavailability of data and language-parsing limitations, we have considered data for the USA only.

There are three datasets which are hand-tagged, accurate, and have been worked on by many researchers. The open-source datasets are listed below:

The Kaggle fake news dataset: This has 13,000 rows (entries)

  1. The dataset contains specific categories of fake news articles and was collected over a month. It is well structured and includes 15 useful features.
  2. Dataset features: ID, author, published timestamp, title, content, language, site URL, country, domain rank, spam score, main image URL, comments, shares, type, etc.
George McIntire dataset (https://github.com/GeorgeMcIntire/fake_real_news_dataset): This dataset includes fake and real news in a 1:1 ratio and has 10,558 rows (entries)
  1. The data is clean and has been hand-tagged by many researchers, but it contains only the title and text, with a binary label.
  2. Dataset features: ID, title, content, and binary label (fake/real)
Dataset published by Signal Media: Dataset features: ID, title, content, source, published timestamp, media type, etc.

As the datasets are structured, minimal pre-processing/data cleaning is needed. It is essential to select the relevant features for training the models to get the most out of the available data; this helps the models perform well and give accurate insights.
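
As a minimal sketch of this step, assuming the Kaggle CSV is saved locally as fake.csv and uses column names matching the feature list above:

```python
# Load the Kaggle dataset and keep the relevant features; the file name
# 'fake.csv' and the column names are assumptions based on the feature
# list described above.
import pandas as pd

df = pd.read_csv("fake.csv")
features = ["author", "published", "title", "text", "language",
            "site_url", "country", "domain_rank", "spam_score", "type"]
df = df[features].dropna(subset=["title", "text"])  # drop unusable rows
print(df.shape)
```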

The datasets contain location-specific features like country, language, author name, and publishing source, and therefore the algorithms will perform better if the testing location is the same as that of the training dataset.

Insights From Analysis

The Kaggle fake news dataset

The dataset comprises 13,000 news posts on more than 50 topics. It contains a sufficient number of data points and covers enough dimensions to support an exploratory data analysis and yield useful insights. Below is a brief EDA, followed by the observations from each graph.

Count Of Data Points For Each Category Of A News Article

Bar graph of Article Category vs. Number of data samples

The graph above shows that the data is imbalanced and contains mostly articles of the “BS (nonsense)” category. The BS category refers to articles which are entirely different from the original news or facts and do not make sense to people.
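
The counts behind this bar chart can be reproduced in one line; the file name and the ‘type’ column are assumptions based on the Kaggle dataset’s feature list:

```python
# Category counts per article; 'fake.csv' and the 'type' column name are
# assumptions based on the dataset's listed features.
import pandas as pd

df = pd.read_csv("fake.csv")
print(df["type"].value_counts())  # expected to be dominated by 'bs'
```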

The “Bias” category contains articles which have been altered in favor of a specific topic, religion, person, political party, etc. “Junksci” articles promise or showcase incredible technical advancements to promote products, cause panic, or gain traction.

The “Fake” category contains articles which are entirely unrelated to the original news circulating at a given time. This type of news is scarce: sources do not publish wholly forged news but instead make small changes to real articles. Publishing completely fake articles doesn’t attract people’s attention most of the time, as we’ll see in the further analysis, and there is a high chance of legal action when publishing completely fraudulent content.

Bar graph of Top occurring words and the corresponding article count

From the graph, we notice that the most frequent words in the dataset are ‘Trump,’ ‘Hillary,’ ‘Clinton,’ ‘election’ etc. It is easy to tell that the data was collected around the U.S. elections.
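
A minimal sketch of the frequency count behind such a graph (the stop-word list and sample titles are illustrative only):

```python
# Count the top-occurring words across article texts; the stop-word list
# here is deliberately tiny and illustrative.
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for"}

def top_words(texts, n=15):
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(n)

print(top_words(["Trump wins the election",
                 "Hillary Clinton concedes the election"]))
```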

Bar graph of Top occurring topics and the corresponding article count

‘Adobo Chronicles’ is a news source in the Philippines which was also abuzz during the 2016 Philippine presidential election, and therefore ‘Adobo’ appears in 4 of the 15 most-talked-about topics. From the above analysis, we observe that the dataset contains a high percentage of articles about the U.S. elections and the Adobo Chronicles, and that most belong to the BS, hate, and conspiracy categories.

Median Sentiment Analysis Of The Dataset

We then analyzed the sentiment of each article in the dataset and examined the median. The sentiment counts contained an outlier (the hate category), so using the median as the parameter makes sense.

We see that very few articles in the fake category have a joyous sentiment. The graph, in general, reflects the fact that the media gets its traction by preying on people’s fear and trust, which shows across multiple categories of news.

Example: the titles ‘The world is going to end in 2022’ and ‘Global warming is getting lower these days’ are both false, but the former gets more traction than the latter because it belongs to the ‘fear’ sentiment.
The image above carries elements of ‘surprise’ as well as ‘hate’ and ‘anger.’ News sources play with the lines, sometimes adding false data to make the news believable. “The US should know that the button for nuclear weapons is on my table” was a statement made by Kim Jong-un during a speech, and it was aired across news channels with a twist. Careful observation shows that the tweet above is only partially accurate and has been altered to make the news captivating.

The combination of figures 4, 5, and 6 shows that the articles and posts belonging to the ‘hate’ category mostly had ‘trust’ and ‘anger’ sentiments. This reflects how sources try to play with people’s minds and create content that is believable for the category of news. The ‘trust’ sentiment is lowest in the ‘satire’ and ‘conspiracy’ categories because people tend to believe news of these categories easily, so there is no need to write or present such articles in any specific manner.
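
The emotion tagging behind these charts can be approximated with a lexicon lookup. Below is a toy, hand-rolled stand-in; the lexicon and words are illustrative only, not the lexicon used in this analysis:

```python
# A toy stand-in for lexicon-based emotion tagging (NRC-style); the
# word-to-emotion mapping below is illustrative only.
from collections import Counter

TOY_LEXICON = {
    "end": "fear", "war": "fear", "attack": "anger", "fraud": "anger",
    "official": "trust", "confirmed": "trust", "celebrate": "joy",
    "shocking": "surprise",
}

def emotion_counts(text):
    counts = Counter()
    for word in text.lower().split():
        emotion = TOY_LEXICON.get(word.strip(".,!?'\""))
        if emotion:
            counts[emotion] += 1
    return counts

print(emotion_counts("The world is going to end in 2022"))  # Counter({'fear': 1})
```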

Spam Score vs. Likes Point Plot

We know the kind of tweaking that helps sources gain greater traction. This helps us identify the extent to which these articles are forged for traction.

The chart above suggests that posts with a spam score above 50% get 500-750 likes (a substantial number of thumbs-up, or agreement from the public). The reason could be that news sources tamper with the original article to make it more fascinating, and therefore the spam score is high for the posts getting the most likes.

As mentioned earlier, most of the articles belonged to the ‘hate’ category and carried ‘trust’ and ‘anger’ sentiments. These articles were published during the 2016 U.S. election and contain words like ‘Donald Trump’ and ‘Hillary Clinton’ in large numbers. Figure 7 shows that posts whose content varied considerably from the original received a good number of likes. A plausible reason could be that people prefer news with a ‘surprise’ element.
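
A rough sketch of how the spam-score vs. likes comparison could be reproduced, assuming numeric ‘spam_score’ and ‘likes’ columns as in the Kaggle feature list; the values below are made up for illustration:

```python
# Bucket articles by spam score and compare median likes; the sample
# values are fabricated purely for illustration.
import pandas as pd

df = pd.DataFrame({
    "spam_score": [0.10, 0.30, 0.45, 0.55, 0.70, 0.90],
    "likes":      [120,  240,  310,  610,  700,  520],
})
df["spam_bucket"] = pd.cut(df["spam_score"], bins=[0.0, 0.5, 1.0],
                           labels=["<=50%", ">50%"])
print(df.groupby("spam_bucket", observed=True)["likes"].median())
```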

Word Frequency Cloud Of The Dataset

We now know that the articles, collected around the U.S. elections, are mostly of the ‘hate’ category, carry ‘trust’ and ‘anger’ sentiments, and that articles with a spam score above 50% got the highest number of likes. We will visualize what they talked about using a word cloud of the dataset.

The figure above shows that the most talked-about words are ‘Trump,’ ‘elections,’ ‘Philippines,’ and ‘Chronicles.’ Combining the observations above, we can conclude that the articles that talked most about Trump and the U.S. elections had spam scores above 50% and received the most likes.
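
A minimal sketch of generating such a word cloud, assuming the third-party wordcloud package (pip install wordcloud) and matplotlib are available; the two sentences stand in for the full corpus:

```python
# Generate a word cloud from article text; requires 'pip install wordcloud'
# and matplotlib. The two toy strings stand in for the full corpus.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

corpus = ["Trump dominates election coverage",
          "Adobo Chronicles trends in the Philippines"]
cloud = WordCloud(width=800, height=400,
                  background_color="white").generate(" ".join(corpus))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```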

In 2016, following the surprise victory of President Trump, the prominence of political misinformation circulating in America became the subject of substantial attention. Generic posts are published precisely when elections are around the corner, largely containing incorrect information and misleading articles, to get more page views or to impose individual opinions on the public. The sources use trust, fear, and anticipation as their weapons and alter the facts in the news to make them more believable.

The Signal Media fake news dataset

The data is first analyzed, and then the performance of different machine learning models at classifying fake and legitimate news is compared.

The dataset contains both labels, fake and legitimate, with neither class oversampled. Along with the fake/legitimate tag, each article carries its news source.

Most Reliable And Unreliable Sources Of News

11,051 articles were sampled over two classes: ‘Reuters’ articles and ‘Before It’s News’ articles. In particular, our fake news examples are drawn heavily from ‘Before It’s News’. This is not an ideal approach; we would like a more authoritative, vetted source for comparing fake vs. legitimate articles. During pre-processing, named entities were stripped out to normalize the data and build a general classifier, and duplicate articles were removed from the dataset. spaCy 2.0 was used for tokenization and entity recognition, and scikit-learn for training and building the classification models.
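
A minimal sketch of the entity-stripping step, assuming spaCy’s small English model is installed (python -m spacy download en_core_web_sm); the original work used spaCy 2.0:

```python
# Replace named entities with their labels so a classifier cannot key on
# specific people, places, or organizations.
import spacy

nlp = spacy.load("en_core_web_sm")

def strip_entities(text: str) -> str:
    doc = nlp(text)
    return " ".join(tok.ent_type_ if tok.ent_type_ else tok.text
                    for tok in doc)

print(strip_entities("Donald Trump spoke in Houston on Tuesday."))
# -> e.g. "PERSON PERSON spoke in GPE on DATE ."
```

Stripping entities this way keeps the model from simply memorizing names like ‘Trump’ or ‘Reuters’ and forces it to learn more general cues of fakeness.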

Comparison Of Different ML Models In News Articles Binary Classification Problem

Bigram term frequency-inverse document frequency (tf-idf) was used as the primary feature to classify with three models, and their performances were compared. A sample of 500 articles was used, split 90:10 for training and testing respectively. Table 2 shows that even though a simple ‘bigram tf-idf’ feature was used, performance was satisfactory: SVM gave an accuracy of 76.2% with just 450 training data points. Adding more relevant features should improve the accuracy further.
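
A minimal sketch of the bigram tf-idf + SVM setup, with four toy articles standing in for the 500-article sample; LinearSVC is an assumption, as the text does not name the exact SVM variant:

```python
# Bigram tf-idf features feeding an SVM classifier; the toy corpus and
# labels are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

texts = [
    "secret cure hidden by doctors goes viral",
    "aliens endorse candidate says anonymous blog",
    "central bank raises interest rates by a quarter point",
    "court upholds ruling in long-running patent case",
]
labels = [1, 1, 0, 0]  # 1 = fake, 0 = legitimate

model = Pipeline([
    # ngram_range=(2, 2) restricts the features to bigrams, matching the
    # 'bigram tf-idf' feature used in the comparison.
    ("tfidf", TfidfVectorizer(ngram_range=(2, 2))),
    ("svm", LinearSVC()),
])
model.fit(texts, labels)
print(model.predict(["anonymous blog says doctors hidden cure"]))
```

On the real data, the same pipeline would be fit on the 450 training articles and scored on the 50 held-out ones.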

Conclusion

Both datasets show that the U.S. elections were the most talked-about topic. The EDA demonstrates that it is possible to extract useful information from recent public news and articles and to identify the sentiment and category of an article. Classification of news is also possible with good accuracy.

The following traits of the data resulted in more accurate models:

  1. The data is well sampled over the predefined classes, the articles are accurately tagged, homogeneous in length, drawn from a predefined time frame, and consistent in their manner of delivery (humor/sensitive/satire, etc.).
  2. TF-IDF showed promising results when used with the given classifiers; combined with a couple of additional useful features and the right dataset, a highly accurate model can be generated.
