Text Classification with Naive Bayes

In this assignment, you will implement the Naive Bayes classification method and use it for sentiment classification of customer reviews. Write a report containing your answers, including the visualizations. Submit your report and your Python code/notebook.


Preliminaries

Read up on the Naive Bayes classifier: how to apply the Naive Bayes rule, and how to estimate the probabilities you need.

If you wish, you may also have a look at the following classic paper:

· Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan: Thumbs up? Sentiment Classification using Machine Learning Techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002).

The dataset we are using was originally created for the experiments described in the following paper. That paper addresses the problem of domain adaptation, such as adapting a classifier trained on book reviews so that it works on camera reviews.

· John Blitzer, Mark Dredze, and Fernando Pereira: Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007).

Preparatory remarks

Frequency-counting in Python. The Counter class from the collections module is a special type of Python dictionary that is adapted for computing frequencies. In the following example, we compute the frequencies of words in a collection of two short documents. We use Counter in three different ways, but the end results are the same (freqs1, freqs2, and freqs3 are identical at the end). Unlike a plain dictionary, a Counter will not raise a KeyError if you look up a word that you haven't seen before.

from collections import Counter

example_documents = ['the first document'.split(), 'the second document'.split()]

# first way: increment the count for each word explicitly
freqs1 = Counter()
for doc in example_documents:
    for w in doc:
        freqs1[w] += 1

# second way: update the counter with one document at a time
freqs2 = Counter()
for doc in example_documents:
    freqs2.update(doc)

# third way: build the counter from a generator over all words
freqs3 = Counter(w for doc in example_documents for w in doc)

print(freqs1)
print(freqs1['the'])
print(freqs1['neverseen'])  # prints 0 instead of raising a KeyError

Logarithmic probabilities. If you multiply many small probabilities, you may run into problems with numeric precision: the product underflows to zero. To avoid this problem, I recommend that you compute the logarithms of the probabilities instead of the probabilities themselves. To compute the logarithm in Python, use the function log in the numpy library.

Logarithms have the mathematical property that np.log(P1 * P2) = np.log(P1) + np.log(P2). So if you use log probabilities, all multiplications (for instance, in the Naive Bayes probability formula) are replaced by sums.

If you'd like to convert log probabilities back into normal probabilities, you can apply the exponential function, which is the inverse of the logarithm: prob = np.exp(logprob). (However, if the log probability is too small, exp will just return zero.)
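A minimal sketch of both effects, with made-up numbers just for illustration:

import numpy as np

probs = np.full(1000, 1e-3)   # a thousand word probabilities of 0.001

# multiplying directly underflows to zero...
print(np.prod(probs))         # 0.0

# ...but summing the logarithms is numerically stable
logprob = np.sum(np.log(probs))
print(logprob)                # about -6907.8

# exp() converts back, but underflows again for very small log probabilities
print(np.exp(logprob))        # 0.0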

Reading the review data

Download the data file from the following URL:

http://www.cse.chalmers.se/~richajo/dit862/data/all_sentiment_shuffled.txt

This is a collection of customer reviews from six of the review topics used in the paper by Blitzer et al. (2007) mentioned above. The data has been formatted so that there is one review per line, and the texts have been split into separate words ("tokens") and lowercased. Here is an example of a line.

music neg 544.txt i was misled and thought i was buying the entire cd and it contains one song

A line in the file is organized in columns:

· 0: topic category label (books, camera, dvd, health, music, or software)

· 1: sentiment polarity label (pos or neg)

· 2: document identifier

· 3 and on: the words in the document

Here is some Python code to read the entire collection.

def read_documents(doc_file):
    docs = []
    labels = []
    with open(doc_file, encoding='utf-8') as f:
        for line in f:
            words = line.strip().split()
            docs.append(words[3:])   # columns 3 and on: the words of the document
            labels.append(words[1])  # column 1: the sentiment polarity label
    return docs, labels

We first remove the document identifier, and also the topic label, which you don’t need unless you solve the first optional task. Then, we split the data into a training and an evaluation part. For instance, we may use 80% for training and the remainder for evaluation.

all_docs, all_labels = read_documents('all_sentiment_shuffled.txt')

split_point = int(0.80 * len(all_docs))
train_docs = all_docs[:split_point]
train_labels = all_labels[:split_point]
eval_docs = all_docs[split_point:]
eval_labels = all_labels[split_point:]
Estimating parameters for the Naive Bayes classifier
Write a Python function that uses a training set of documents to estimate the probabilities in the Naive Bayes model. Return some data structure containing the probabilities or log probabilities. The input parameters of this function should be a list of documents and another list with the corresponding polarity labels. It could look something like this:
def train_nb(documents, labels):
    (return the data you need to classify new instances)
Hint 1. In this assignment, it is acceptable if you assume that we will always use the pos and neg categories.
Hint 2. Some sort of smoothing will probably improve your results. You can implement the smoothing either in train_nb or in score_doc_label that we discuss below.
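To make the hints concrete, here is one possible sketch of train_nb, using add-one (Laplace) smoothing and log probabilities. The structure of the returned model is just one choice among many, not the required solution:

import numpy as np
from collections import Counter

def train_nb(documents, labels):
    # count documents per class and word occurrences per class
    label_counts = Counter(labels)
    word_counts = {label: Counter() for label in label_counts}
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc)

    vocabulary = set(w for counts in word_counts.values() for w in counts)
    n_docs = len(documents)

    model = {}
    for label in label_counts:
        # log prior: fraction of training documents with this label
        log_prior = np.log(label_counts[label] / n_docs)
        total = sum(word_counts[label].values())
        # add-one smoothed log probability for every word in the vocabulary
        log_probs = {w: np.log((word_counts[label][w] + 1) / (total + len(vocabulary)))
                     for w in vocabulary}
        # smoothed fallback for words never seen with this label
        log_unseen = np.log(1 / (total + len(vocabulary)))
        model[label] = (log_prior, log_probs, log_unseen)
    return model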
Classifying new documents
Write a function that applies the Naive Bayes formula to compute the logarithm of the probability of observing the words in a document together with a sentiment polarity label. The third parameter (here called model) refers to what you returned in train_nb.
def score_doc_label(document, label, model):
    (return the log probability)
Sanity check 1. Try to apply score_doc_label to a few very short documents; to convert the log probability back into a probability, apply np.exp or math.exp. For instance, let's consider small documents of length 1. The probability of a positive document containing just the word "great" should be a small number; depending on your choice of smoothing parameter, it will probably be around 0.001–0.002. In any case, it should be higher than the probability of a negative document with the same word. Conversely, if you try the word "bad" instead, the negative score should be higher than the positive one.
Sanity check 2. Your function score_doc_label should not crash for the document ['a', 'top-quality', 'performance'].
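A sketch that matches the model structure from the train_nb example above; the unseen-word fallback is what keeps sanity check 2 from crashing:

def score_doc_label(document, label, model):
    # log P(label) plus the sum of log P(word | label) over the document
    log_prior, log_probs, log_unseen = model[label]
    score = log_prior
    for w in document:
        # words never seen with this label fall back to the smoothed estimate
        score += log_probs.get(w, log_unseen)
    return score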
Next, based on the function you just wrote, write another function that classifies a new document.
def classify_nb(document, model):
    (return the guess of the classifier)
Again, apply this function to a few very small documents and make sure that you get the output you’d expect.
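For instance, classify_nb can simply return the label whose score is highest; a sketch, assuming the model dictionary from the examples above:

def classify_nb(document, model):
    # pick the label with the highest log probability score
    return max(model, key=lambda label: score_doc_label(document, label, model))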
Evaluating the classifier
Write a function that classifies each document in the test set and returns the list of predicted sentiment labels.
def classify_documents(docs, model):
    (return the classifier's predictions for all documents in the collection)
Next, we compute the accuracy, i.e. the number of correctly classified documents divided by the total number of documents.
def accuracy(true_labels, guessed_labels):
    (return the accuracy)
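Both functions are short; one possible sketch, reusing classify_nb from above:

def classify_documents(docs, model):
    # classify every document in the collection
    return [classify_nb(doc, model) for doc in docs]

def accuracy(true_labels, guessed_labels):
    # fraction of documents where the guess matches the true label
    correct = sum(t == g for t, g in zip(true_labels, guessed_labels))
    return correct / len(true_labels)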
What accuracy do you get when evaluating the classifier on the test set?
Error analysis
Find a few misclassified documents and comment on why you think they were hard to classify. For instance, you may select a few short documents where the probabilities were particularly high in the wrong direction.
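One way to find such documents is to rank the misclassified ones by how strongly the model preferred the wrong label. A rough sketch, assuming model = train_nb(train_docs, train_labels) and guessed_labels = classify_documents(eval_docs, model) from the steps above:

# margin in favour of the wrong guess: larger means a more confident error
errors = []
for doc, true_label, guess in zip(eval_docs, eval_labels, guessed_labels):
    if guess != true_label:
        margin = (score_doc_label(doc, guess, model)
                  - score_doc_label(doc, true_label, model))
        errors.append((margin, true_label, guess, doc))

# inspect the five most confidently misclassified documents
for margin, true_label, guess, doc in sorted(errors, reverse=True)[:5]:
    print(true_label, guess, ' '.join(doc[:30]))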
Cross-validation
Since our estimate of the accuracy is based on a fairly small test set, the interval around it was quite wide. We will now use a trick to get a more reliable estimate and a tighter interval.

In cross-validation, we divide the data into N parts (folds) of equal size. We then carry out N evaluations: each fold in turn becomes the test set, while the remaining folds form the training set. Finally, we combine the results of the N evaluations. This trick gives us results for the whole dataset, not just a small test set.
Here is a code stub that shows the idea:
for fold_nbr in range(N):
    split_point_1 = int(float(fold_nbr) / N * len(all_docs))
    split_point_2 = int(float(fold_nbr + 1) / N * len(all_docs))
    train_docs_fold = all_docs[:split_point_1] + all_docs[split_point_2:]
    train_labels_fold = all_labels[:split_point_1] + all_labels[split_point_2:]
    eval_docs_fold = all_docs[split_point_1:split_point_2]
    eval_labels_fold = all_labels[split_point_1:split_point_2]

    (train a classifier on train_docs_fold and train_labels_fold)
    (apply the classifier to eval_docs_fold)

(combine the outputs of the classifiers in all folds)
Implement the cross-validation method. Then estimate the accuracy and compute a new interval estimate. A typical value of N would be between 4 and 10.
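One way to complete the stub is to collect the predictions from all folds in order and compare them against all the true labels at the end; a sketch, assuming the functions defined earlier:

N = 10
all_predictions = []
for fold_nbr in range(N):
    split_point_1 = int(float(fold_nbr) / N * len(all_docs))
    split_point_2 = int(float(fold_nbr + 1) / N * len(all_docs))
    train_docs_fold = all_docs[:split_point_1] + all_docs[split_point_2:]
    train_labels_fold = all_labels[:split_point_1] + all_labels[split_point_2:]
    eval_docs_fold = all_docs[split_point_1:split_point_2]

    # train on everything except the current fold, then classify the fold
    model_fold = train_nb(train_docs_fold, train_labels_fold)
    all_predictions += classify_documents(eval_docs_fold, model_fold)

# every document is classified exactly once, so we can score the whole set
print(accuracy(all_labels, all_predictions))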
Reference: this homework is adapted from the following page: http://www.cse.chalmers.se/~richajo/dit862/assignment2.html. You may check the original materials for details.
