
Simple Sentiment Analysis With NLP

Look at a simple application of sentiment analysis using Natural Language Processing techniques.

By Emrah Mete · Dec. 04, 18 · Tutorial

In this article, I will develop a simple application of sentiment analysis using natural language processing techniques.

Following the developments in artificial intelligence, the number of AI applications built for natural language processing (NLP) is increasing day by day. NLP applications let us replace manual effort in many jobs with systems that work faster and more accurately. Common examples of NLP applications include the following:

  • Text Classification (e.g., spam detection)
  • Sentiment Analysis
  • Author Recognition
  • Machine Translation
  • Chatbots

Sentiment analysis is one of the most common applications of natural language processing. With sentiment analysis, we can determine the emotion with which a text was written.

With the widespread use of social media, the need to analyze the content people share keeps growing. Given the volume of data flowing through social media, doing this with human power alone is quite difficult, so demand is increasing for applications that can quickly detect and respond to the positive or negative comments people write. In this article, we will develop a baseline model for simple sentiment analysis.

First, here is some information about the data set on which we will perform sentiment analysis.

Data Set Name: Sentiment Labelled Sentences Data Set

Data Set Source: UCI Machine Learning Repository

Data Set Info: The dataset consists of user reviews collected from three different websites (Amazon, Yelp, and IMDb). The comments are restaurant, film, and product reviews. Each record in the data set is labeled with one of two sentiment classes: 1 (positive) or 0 (negative).
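
For reference, each line in the raw files holds one sentence and its label, separated by a tab. A hypothetical line (illustrative, not an actual record from the dataset) looks like this:

The screen is bright and the battery lasts all day.	1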

We will create a sentiment analysis model using the data set described above.

We will build the machine learning model in the Python programming language using the sklearn and nltk libraries.

Now let's move on to writing the code.

First, let's import the libraries we will use.

import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
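
Note: word_tokenize and the stop word list rely on NLTK data packages; on a fresh environment they will raise a LookupError until the packages are downloaded once (a one-time setup step, shown here as an assumption about your environment):

import nltk
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # the English stop word list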

Now let's load and view our data set.

#Amazon Data
input_file = "../data/amazon_cells_labelled.txt"
amazon = pd.read_csv(input_file,delimiter='\t',header=None)
amazon.columns = ['Sentence','Class']

#Yelp Data
input_file = "../data/yelp_labelled.txt"
yelp = pd.read_csv(input_file,delimiter='\t',header=None)
yelp.columns = ['Sentence','Class']

#Imdb Data
input_file = "../data/imdb_labelled.txt"
imdb = pd.read_csv(input_file,delimiter='\t',header=None)
imdb.columns = ['Sentence','Class']


#combine all data sets (ignore_index gives every row a unique index,
#so the per-class counts below are computed correctly)
data = pd.concat([amazon, yelp, imdb], ignore_index=True)
data['index'] = data.index

data

We have loaded the data and viewed it. Now let's look at some statistics about the data.

#Total Count of Each Category
pd.set_option('display.width', 4000)
pd.set_option('display.max_rows', 1000)
distOfDetails = data.groupby(by='Class', as_index=False).agg({'index': pd.Series.nunique}).sort_values(by='index', ascending=False)
distOfDetails.columns =['Class', 'COUNT']
print(distOfDetails)

#Distribution of All Categories
plt.pie(distOfDetails['COUNT'],autopct='%1.0f%%',shadow=True, startangle=360)
plt.show()

As you can see, the data set is very balanced. There are almost equal numbers of positive and negative classes.

Now, before feeding the data set to the model, let's do a few things to clean the text (preprocessing).

#Text Preprocessing
columns = ['index','Class', 'Sentence']

#lowercase the text
data['Sentence'] = data['Sentence'].str.lower()

#remove email addresses
data['Sentence'] = data['Sentence'].replace(r'[a-zA-Z0-9-_.]+@[a-zA-Z0-9-_.]+', '', regex=True)

#remove IP addresses
data['Sentence'] = data['Sentence'].replace(r'((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(\.|$)){4}', '', regex=True)

#remove punctuation and special characters
data['Sentence'] = data['Sentence'].str.replace(r'[^\w\s]', '', regex=True)

#remove numbers
data['Sentence'] = data['Sentence'].replace(r'\d', '', regex=True)

#remove stop words (build the stop word set once for speed)
stop_words = set(stopwords.words('english'))
rows = []
for index, row in data.iterrows():
    word_tokens = word_tokenize(row['Sentence'])
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    rows.append({"index": row['index'], "Class": row['Class'], "Sentence": " ".join(filtered_sentence)})
df_ = pd.DataFrame(rows, columns=columns)

data = df_
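
As a quick sanity check, here is what these steps do to a hypothetical review (the sentence is illustrative, not taken from the dataset):

"Loved it... the screen is AMAZING!!"
→ lowercased: "loved it... the screen is amazing!!"
→ punctuation and numbers removed: "loved it the screen is amazing"
→ stop words removed: "loved screen amazing"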

We have cleaned the data and made it ready for use in the model. Now, before we build our model, let's split the dataset into test (10%) and training (90%) sets.

X_train, X_test, y_train, y_test = train_test_split(data['Sentence'].values.astype('U'),data['Class'].values.astype('int32'), test_size=0.10, random_state=0)
classes  = data['Class'].unique()

Now we can create our model using the training data. I will use TF-IDF as the vectorizer and the Stochastic Gradient Descent algorithm as the classifier. These methods and their parameters were chosen using grid search (grid search itself is beyond the scope of this article, but a short sketch follows).
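
For the curious, here is a minimal sketch of how such a search could be run with scikit-learn's GridSearchCV on a Pipeline. The parameter grid below is a small illustrative assumption, not the full grid used for this article:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

pipeline = Pipeline([('tfidf', TfidfVectorizer()), ('clf', SGDClassifier())])
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],   # unigrams vs. unigrams + bigrams
    'clf__alpha': [1e-4, 1e-5],               # regularization strength
    'clf__penalty': ['l2', 'elasticnet'],     # regularization type
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)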

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier


#grid search result
vectorizer = TfidfVectorizer(analyzer='word',ngram_range=(1,2), max_features=50000,max_df=0.5,use_idf=True, norm='l2') 
counts = vectorizer.fit_transform(X_train)
vocab = vectorizer.vocabulary_
classifier = SGDClassifier(alpha=1e-05,max_iter=50,penalty='elasticnet')
targets = y_train
classifier = classifier.fit(counts, targets)
example_counts = vectorizer.transform(X_test)
predictions = classifier.predict(example_counts)
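
The fitted vectorizer/classifier pair can also score brand-new text directly. A quick illustrative check (the sentences below are made up):

new_sentences = ["the battery life is amazing", "worst purchase i have ever made"]
new_counts = vectorizer.transform(new_sentences)  # reuse the fitted TF-IDF vocabulary
print(classifier.predict(new_counts))             # likely [1 0], i.e. positive then negative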

Our model is now trained. Let's test it with the test data and examine the accuracy, precision, recall, and F1 results.

from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report

#Model Evaluation
acc = accuracy_score(y_test, predictions, normalize=True)
hit = precision_score(y_test, predictions, average=None,labels=classes)
capture = recall_score(y_test, predictions, average=None,labels=classes)

print('Model Accuracy:%.2f'%acc)
print(classification_report(y_test, predictions))


Model Accuracy:0.83
             precision    recall  f1-score   support

          0       0.83      0.84      0.84       139
          1       0.84      0.82      0.83       136

avg / total       0.83      0.83      0.83       275
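
As a reminder, these per-class scores follow the standard definitions (TP, FP, and FN are the true positives, false positives, and false negatives for the class in question):

precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * (precision * recall) / (precision + recall)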



As we have seen, our model achieved 83% accuracy. Now let's look at the confusion matrix, where we can see more clearly how accurate our predictions are.

#source: https://www.kaggle.com/grfiv4/plot-a-confusion-matrix
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        #print("Normalized confusion matrix")
    else:
        print()

    plt.imshow(cm, interpolation='nearest', cmap=cmap, aspect='auto')
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, predictions, labels=classes)
np.set_printoptions(precision=2)

class_names = range(1,classes.size+1)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,title='Confusion matrix, without normalization')

#map each class label to the index shown on the plot axes
classInfo = pd.DataFrame({'Category': classes, 'Index': range(1, classes.size + 1)})
classInfo
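
Since pickle was imported at the start, it is worth noting how the fitted vectorizer and classifier could be persisted for reuse. The file names below are arbitrary assumptions:

with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)   # saves the fitted TF-IDF vocabulary and weights
with open('classifier.pkl', 'wb') as f:
    pickle.dump(classifier, f)   # saves the trained SGD model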

With this study, we have developed a simple natural language processing project. As I said at the beginning of the article, our model is a baseline model; the aim was to build an application that can serve as an introduction to natural language processing. I hope you have found it useful.
