Real and Fake News Classification with the TensorFlow Framework and a Neural Network Algorithm

Yasmin Lukman
5 min read · Oct 25, 2020


Description

Nowadays, with advances in the internet and technology, the rate of information dissemination is increasing rapidly. With the many platforms available today, everyone can freely access, share, and create news. Unfortunately, this strong appetite for news is not accompanied by a proper news selection process. Without any selection or confirmation of the truth of the news, fraud occurs, causing material and immaterial losses such as division and social conflict.

News that contains false information is very dangerous. Information can influence the emotions, feelings, thoughts, or even actions of an individual or group. It is unfortunate when the information is inaccurate or even outright false (a hoax) with a provocative title that leads readers and recipients toward negative opinions (Abner et al., 2017).

With the advancement of Machine Learning and Artificial Intelligence, technology has a great opportunity to help counter the spread of fake news, one approach being Text Mining. Text Mining is a variation of data mining that extracts useful information by identifying and exploring interesting patterns in a set of unstructured textual data sources (Feldman & Sanger, 2006). Here, the Text Mining method is implemented using Natural Language Processing (NLP).

Natural Language Processing (NLP) is a field of computer science, at the intersection of artificial intelligence and linguistics, that deals with the interaction between computers and natural human languages such as Indonesian or English. The main goal of NLP is to build machines capable of understanding the meaning of human language. In this research, the NLP method is applied using a Neural Network algorithm.

Using the principles of text mining, the author intends to build a model that can distinguish real news from fake news. Going forward, we hope this model can serve as a fake-news detector accessible to everyone who wants to verify a story, so that people become smarter in choosing which news to read.

Scope

The scope of this project is:

  1. Data obtained from the Kaggle website. The data is metadata about a number of news articles collected from 2016 to 2017.
  2. The implementation uses the TensorFlow framework (version 2 and above) with the Keras library and the Python programming language.

Benefits

The model applies a Neural Network algorithm, and its output can be used as a starting point when determining the truth of a news story, helping to prevent the spread of fake news and hoaxes.

Data

The data we used consists of two datasets of news articles: the first contains only fake news and the second contains only true news. The data was taken from the Kaggle website and was collected by Ahmed H, Traore I, and Saad S, covering the period from January 1, 2016 to January 1, 2017. The fake news dataset consists of 23,502 records, while the true news dataset consists of 21,417 records. Each dataset has 4 attributes, as explained by the table below.
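Labeling the two datasets and merging them into one frame can be sketched with pandas as follows. The tiny in-memory frames below stand in for the actual Kaggle CSVs (commonly distributed as Fake.csv and True.csv), so the snippet is self-contained; in practice you would load each file with `pd.read_csv`.

```python
import pandas as pd

# Stand-ins for the two Kaggle files; in practice:
#   fake = pd.read_csv("Fake.csv"); true = pd.read_csv("True.csv")
fake = pd.DataFrame({"title": ["Shocking claim!"], "text": ["..."],
                     "subject": ["News"], "date": ["January 1, 2016"]})
true = pd.DataFrame({"title": ["Senate passes bill"], "text": ["..."],
                     "subject": ["politicsNews"], "date": ["January 1, 2016"]})

# Label each record (0 = fake, 1 = true) and merge into one shuffled frame.
fake["label"] = 0
true["label"] = 1
news = pd.concat([fake, true], ignore_index=True)
news = news.sample(frac=1, random_state=42).reset_index(drop=True)

print(news[["title", "label"]])
```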

Stages of Activities

Data Exploration

The first stage is to look at the data to get an overview of its distribution and irregularities. By knowing the distribution, we can determine whether the data is sufficient to be used as training material. Good data will be distributed evenly across the classes.
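A quick way to check that even distribution is to count the records per class. The counts below come from the dataset sizes stated earlier; the toy frame merely stands in for the combined data.

```python
import pandas as pd

# Toy frame standing in for the combined dataset (23,502 fake + 21,417 true).
news = pd.DataFrame({"label": [0] * 23502 + [1] * 21417})

counts = news["label"].value_counts()
ratio = counts.min() / counts.max()   # 1.0 would mean perfectly balanced
print(counts.to_dict())               # {0: 23502, 1: 21417}
print(f"balance ratio: {ratio:.2f}")  # ~0.91, reasonably balanced
```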

Preprocessing Data

After data exploration, the characteristics of the data become clear and pre-processing, i.e. cleaning the data, is carried out. In our case, the data is normalized by converting all letters to lowercase, removing punctuation marks, and removing unnecessary stopwords such as ‘I’, ‘you’, and ‘and’. We then apply stemming and lemmatization, which reduce each word to its base form, for example ‘running’ to ‘run’. This is done to make the data format uniform before it is fed into the model.
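The normalization steps above can be sketched in plain Python. The article's actual pipeline is not shown and likely used a library such as NLTK; the stopword list and suffix rules here are illustrative only.

```python
import re
import string

# Tiny illustrative stopword list; a real pipeline would use a full one.
STOPWORDS = {"i", "you", "and", "the", "a", "is", "to"}

def crude_stem(word: str) -> str:
    # Very naive stemming: strip a common suffix, e.g. 'running' -> 'run'.
    for suffix in ("ning", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list[str]:
    text = text.lower()                                               # 1. lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # 2. strip punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]          # 3. drop stopwords
    return [crude_stem(t) for t in tokens]                            # 4. stem

print(preprocess("I was Running to the store, and you KNOW it!"))
# -> ['was', 'run', 'store', 'know', 'it']
```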

Figure 1. Illustration of data preprocessing workflow
Figure 2. Data after preprocessing, ready to be used by the model

Classification Model Making

We designed a text classification model based on a method commonly used for text data analysis: the Recurrent Neural Network (RNN). RNNs have proven efficient and accurate for building text analysis and speech recognition models. In NLP, RNNs are particularly strong at word-level prediction: the RNN carries important information from each word in a sentence forward to the next word, enabling predictions over long sentences.
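A plausible Keras sketch of this kind of RNN classifier follows. The article does not show its exact architecture; the vocabulary size, embedding dimension, and LSTM width below are illustrative assumptions, not the author's actual values.

```python
import tensorflow as tf

# Illustrative hyperparameters; the article does not state its actual values.
vocab_size, embedding_dim, max_len = 10000, 64, 120

model = tf.keras.Sequential([
    tf.keras.Input(shape=(max_len,)),                         # integer-encoded tokens
    tf.keras.layers.Embedding(vocab_size, embedding_dim),     # word vectors
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),  # carries context across words
    tf.keras.layers.Dense(24, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),           # 1 = true news, 0 = fake
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()
```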

Figure 3. Words that are most common in news titles

For model training we set the number of epochs to 10; the model already reaches good accuracy after the first epoch.

Figure 4. Model Training
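The training call itself might look like the following. The dummy integer sequences and the small stand-in model below are only there to make the snippet self-contained; of the settings shown, only `epochs=10` comes from the article.

```python
import numpy as np
import tensorflow as tf

# Dummy integer-encoded sequences standing in for the tokenized headlines;
# the real run trains on the full dataset.
x_train = np.random.randint(0, 1000, size=(32, 20))
y_train = np.random.randint(0, 2, size=(32,))

# A small stand-in model so the fit call runs quickly.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Embedding(1000, 16),
    tf.keras.layers.LSTM(8),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# The article trains for 10 epochs and reports good accuracy from epoch 1.
history = model.fit(x_train, y_train, epochs=10, validation_split=0.25, verbose=0)
print(f"final training accuracy: {history.history['accuracy'][-1]:.2f}")
```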

Model Evaluation

To ensure that the predictor model is optimal and does not overfit, an evaluation is carried out to obtain the best results. The parameter used as the reference for evaluation is accuracy. For each model iteration, we record the accuracy and then compare the results.

To do this, we test the model on a set of data that was not used for training. The result is a total accuracy of 0.94, or 94%, which is high enough to be trusted as a first step in identifying the authenticity of a news article.

Figure 5. Model Evaluation
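Conceptually, this evaluation reduces to thresholding the model's sigmoid outputs at 0.5 and comparing the predictions with the held-out labels. The numbers below are made up for illustration and are not the article's actual predictions.

```python
import numpy as np

# Held-out labels and hypothetical sigmoid outputs from the model.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_prob = np.array([0.91, 0.08, 0.77, 0.65, 0.40, 0.12, 0.88, 0.55, 0.93, 0.02])

y_pred = (y_prob >= 0.5).astype(int)   # threshold the sigmoid output
accuracy = (y_pred == y_true).mean()   # fraction of correct predictions
print(f"accuracy: {accuracy:.2f}")     # accuracy: 0.90
```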

In this experiment, a classification system was developed to distinguish fake news from true news. Based on the results presented, it can be concluded that this study succeeded in classifying fake and true news. The system consists of pre-processing modules, feature extraction, feature selection, the learning process, and the testing or classification process itself. Pre-processing is carried out mainly to transform the data so that it is easier to use for prediction and analysis. Part of the pre-processed data is used for model training; the next step is testing the model on data that was not used for training. For the classification process, whether training or testing, an experiment was conducted comparing the classification attributes used. The first experiment used the title attribute and achieved an accuracy of 94%. The second experiment used the text attribute (the content of the news), achieving an accuracy of 99%.

Click here for source code.


Written by Yasmin Lukman

Currently a final-year computer science student. Passionate about learning. Interested in the data science field.
