Machine Learning Model to Classify One’s Political Orientation
Data: Reddit Posts
Objective: Liberal VS Conservative
The goal is to identify whether a given text or sentence should be classified as liberal or conservative.
Header row of the Dataset:
The dataset is provided in JSON format and contains a text column and a label column. The label marks each post as liberal or conservative, and this is what we will train our model on.
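As a minimal sketch of this step, the snippet below reads a JSON dataset into a pandas DataFrame and prints its header row. The column names `text` and `label` follow the description above but are assumptions about the real file, and `sample_json` is a made-up two-record stand-in so the example runs without the actual Reddit data. If the real file stores one JSON object per line, `pd.read_json` would also need `lines=True`.

```python
import io

import pandas as pd

# Two made-up records standing in for the real Reddit JSON file (hypothetical data).
sample_json = io.StringIO(
    '[{"text": "first example post", "label": "liberal"},'
    ' {"text": "second example post", "label": "conservative"}]'
)

# With the real dataset, a file path would be passed here instead of sample_json.
df = pd.read_json(sample_json)

print(df.head())             # header row plus the first records
print(df["label"].unique())  # the distinct label values
```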
Vectorizing the Dataset:
We will vectorize the text and encode the labels, setting liberal to 1 and conservative to 0.
We will use a TF-IDF vectorizer (TfidfVectorizer) with max_features set to 10000.
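The vectorizing step can be sketched roughly as follows, assuming scikit-learn's `TfidfVectorizer`. The `texts` and `labels` lists are toy placeholders rather than the real posts. The sketch also shows that text arriving later should go through `transform()`, so that it is mapped onto the vocabulary learned from the training data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy placeholders for the Reddit posts and their labels (not the real data).
texts = [
    "example post number one",
    "example post number two",
    "another short example",
    "one more short post",
]
labels = ["liberal", "conservative", "liberal", "conservative"]

# Encode the target: liberal -> 1, conservative -> 0.
y = np.array([1 if label == "liberal" else 0 for label in labels])

# Keep at most the 10,000 most frequent terms in the vocabulary.
vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(texts)  # learns the vocabulary, returns a sparse matrix

# New text is only transformed, never re-fitted, so the columns stay the same.
X_new = vectorizer.transform(["a brand new post"])

print(X.shape, X_new.shape, y)
```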
Converting Vectorized Data to NumPy Variables:
The vectorizer returns a sparse matrix, so we convert the vectorized data to a NumPy array before feeding it into our TensorFlow Keras model.
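A small sketch of the conversion, again on toy text: `toarray()` turns the sparse TF-IDF matrix into a dense NumPy array. Casting to `float32` is an optional choice made here, not something stated in the text.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["example post one", "example post two"]  # toy placeholder data
X_sparse = TfidfVectorizer(max_features=10000).fit_transform(texts)

# fit_transform returns a SciPy sparse matrix; toarray() makes it a dense ndarray.
X_dense = X_sparse.toarray().astype(np.float32)

print(type(X_sparse).__name__, "->", type(X_dense).__name__, X_dense.shape)
```

Keep in mind that a dense array stores every zero explicitly, so memory use grows with the number of rows times the number of features.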
Visualizing the Model Layers:
As the summary below shows, the network has three dense layers: two hidden layers followed by the output layer. The first hidden layer has 5000 units and the second has 2500 units. Both hidden layers use the ReLU activation. The output layer uses the sigmoid (logistic) function, because we want a value between 0 and 1 that can be rounded to a 1 or 0 label.
Layer (type)                 Output Shape        Param #
input_layer (InputLayer)     [(None, 10000)]     0
hidden_layer1 (Dense)        (None, 5000)        50005000
hidden_layer2 (Dense)        (None, 2500)        12502500
output_layer (Dense)         (None, 1)           2501
Total params: 62,510,001
Trainable params: 62,510,001
Non-trainable params: 0
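A model with this summary could be defined roughly as below, using the Keras functional API. The layer names and sizes come from the summary, but the optimizer and loss are assumptions, since the text does not state them. The helper `dense_params` is a hypothetical function added only to check the parameter counts: a dense layer has one weight per input-output pair plus one bias per output unit.

```python
from tensorflow import keras


def build_model(n_features=10000, units1=5000, units2=2500):
    """Two ReLU hidden layers followed by a single sigmoid output unit."""
    inputs = keras.Input(shape=(n_features,), name="input_layer")
    x = keras.layers.Dense(units1, activation="relu", name="hidden_layer1")(inputs)
    x = keras.layers.Dense(units2, activation="relu", name="hidden_layer2")(x)
    outputs = keras.layers.Dense(1, activation="sigmoid", name="output_layer")(x)
    model = keras.Model(inputs, outputs)
    # Optimizer and loss are assumptions; the text does not name them.
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


def dense_params(n_in, n_out):
    # weights (n_in * n_out) plus one bias per output unit
    return n_in * n_out + n_out


# These match the per-layer counts in the summary above.
assert dense_params(10000, 5000) == 50_005_000
assert dense_params(5000, 2500) == 12_502_500
assert dense_params(2500, 1) == 2_501

model = build_model()
model.summary()
```

Most of the 62.5 million parameters sit in the first hidden layer, so reducing `max_features` or the first layer's width is the quickest way to shrink the model.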
Fitting the Model:
We will fit the model on a GPU runtime in Google Colab. During training we save only the best model, meaning the one with the highest validation accuracy seen so far.
We will run the model for only 30 epochs with a batch size of 8 and a 75%/25% train/validation split. These values can be changed depending on the requirements.
We notice that the validation accuracy improves gradually.
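One way to set up this kind of training run is sketched below, assuming Keras's `ModelCheckpoint` callback with `save_best_only=True` and `validation_split=0.25`. The original code is not shown in the text, so the callback, the file name `best_model.keras`, and the helper `fit_with_checkpoint` are all assumptions. The data and model are tiny random stand-ins so the sketch runs in seconds.

```python
import os
import tempfile

import numpy as np
from tensorflow import keras


def fit_with_checkpoint(model, X, y, path, epochs=30, batch_size=8):
    # Write the model to disk only when validation accuracy improves.
    checkpoint = keras.callbacks.ModelCheckpoint(
        path, monitor="val_accuracy", mode="max", save_best_only=True
    )
    # validation_split=0.25 holds out the last 25% of the rows for validation.
    return model.fit(
        X, y,
        epochs=epochs,
        batch_size=batch_size,
        validation_split=0.25,
        callbacks=[checkpoint],
        verbose=0,
    )


# Random stand-in data and a small model (not the real dataset or network).
rng = np.random.default_rng(0)
X = rng.random((40, 20)).astype("float32")
y = rng.integers(0, 2, size=40).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

best_path = os.path.join(tempfile.mkdtemp(), "best_model.keras")
history = fit_with_checkpoint(model, X, y, best_path, epochs=3)  # 30 in the article
print(sorted(history.history))
```

Note that `validation_split` takes the held-out rows from the end of the arrays without shuffling, so the data should be shuffled beforehand if it is ordered by label.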
Saving the Model:
We save the model with a simple model.save() call. Saving is important because we can load the model again later and test it on any new data points we get.
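A minimal save-and-reload round trip might look like this. The small model, the temporary directory, and the file name `political_model.keras` are placeholders chosen for the sketch. The final check confirms that the reloaded model gives the same predictions as the original.

```python
import os
import tempfile

import numpy as np
from tensorflow import keras

# Small stand-in model; in the article this would be the trained classifier.
model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

# Hypothetical location; on Colab this could be a mounted Drive folder.
path = os.path.join(tempfile.mkdtemp(), "political_model.keras")
model.save(path)

# Later, possibly in another session, the model is loaded back from disk.
restored = keras.models.load_model(path)

X = np.random.default_rng(0).random((4, 20)).astype("float32")
same = np.allclose(model.predict(X, verbose=0), restored.predict(X, verbose=0))
print("predictions match after reload:", same)
```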
Testing the model on new data:
We introduce two texts to the model:
Data1: "Anarcho-capitalism, in my opinion, is a doctrinal system which, if ever implemented, would lead to forms of tyranny and oppression that have few counterparts in human history."
Data2: "Biden's Response to Putin's Invasion of Ukraine Has Been His Finest Moment"
We will pass the new data as a list to the TensorFlow Keras model that we saved.
New_Data = ["Anarcho-capitalism, in my opinion, is a doctrinal system which, if ever implemented, would lead to forms of tyranny and oppression that have few counterparts in human history.","Biden's Response to Putin's Invasion of Ukraine Has Been His Finest Moment"]
Loading the saved Model:
We load the saved model from our drive storage location. The loaded model can now be used to classify any text passed to it as one of the two labels, liberal (1) or conservative (0).
We will now vectorize the new data, using the same vectorizer that was fitted on the training data, and run the model's prediction on it:
array([[0.7673081 ],[0.04765433]], dtype=float32)
The results show that the first text gets a score of about 0.767 and is therefore predicted as liberal, while the second text gets a score of about 0.048 and is predicted as conservative. In our model, any score above 0.5 is labelled liberal and any score below 0.5 is labelled conservative.
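The thresholding rule can be written out in a few lines of NumPy. The `probs` array holds the two scores reported above. The variable names are illustrative.

```python
import numpy as np

# Sigmoid outputs reported above for the two new texts.
probs = np.array([[0.7673081], [0.04765433]], dtype=np.float32)

# Above 0.5 -> liberal (1), otherwise conservative (0).
pred = (probs.ravel() > 0.5).astype(int)
names = np.where(pred == 1, "liberal", "conservative")

for p, name in zip(probs.ravel(), names):
    print(f"{p:.3f} -> {name}")
```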
Conclusions and Visualizations:
This is how we can create a model based on labelled text data. There are, of course, ways to improve this model, such as adding part-of-speech (POS) tags as features alongside the vectorized text. We can also plot the model's accuracy and loss over the training epochs.
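The accuracy and loss curves can be drawn from the history object that Keras returns from `fit()`. Below is a minimal sketch using matplotlib. `plot_history` is a hypothetical helper, and it assumes the model was compiled with `metrics=["accuracy"]` and trained with validation data, so that the keys `accuracy`, `val_accuracy`, `loss` and `val_loss` exist.

```python
import matplotlib.pyplot as plt


def plot_history(history_dict):
    """Plot training and validation accuracy and loss per epoch."""
    fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))
    for key, ax in (("accuracy", ax_acc), ("loss", ax_loss)):
        ax.plot(history_dict[key], label="train")
        ax.plot(history_dict["val_" + key], label="validation")
        ax.set_xlabel("epoch")
        ax.set_ylabel(key)
        ax.legend()
    fig.tight_layout()
    return fig
```

After training, it would be called as `plot_history(history.history)`, where `history` is the value returned by `model.fit()`.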
For more details visit my GitHub page: