Text Classification with Naive Bayes
In this assignment, you will implement the Naive Bayes classification method and use it for sentiment classification of customer reviews. Write a report containing your answers, including the visualizations. Submit your report and your Python code/notebook.
Read up on the Naive Bayes classifier: how to apply the Naive Bayes rule, and how to estimate the probabilities you need.
If you wish, you may also have a look at the following classic paper:
· Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan: Thumbs up? Sentiment Classification using Machine Learning Techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002).
The dataset we are using was originally created for the experiments described in the following paper. The research described there addresses the problem of domain adaptation, such as adapting a classifier trained on book reviews to work with camera reviews.
· John Blitzer, Mark Dredze, and Fernando Pereira: Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007).
Frequency-counting in Python. The data structure Counter, from the collections module in the standard library, is a special type of Python dictionary adapted for computing frequencies. In the following example, we compute the frequencies of words in a collection of two short documents. We use Counter in three different ways, but the end results are the same (freqs1, freqs2, and freqs3 are identical at the end). A Counter will not raise a KeyError if you look up a word you didn't see before; it simply returns a count of 0.
from collections import Counter

example_documents = ['the first document'.split(), 'the second document'.split()]

# First way: increment the count of one word at a time.
freqs1 = Counter()
for doc in example_documents:
    for w in doc:
        freqs1[w] += 1

# Second way: update with a whole document at once.
freqs2 = Counter()
for doc in example_documents:
    freqs2.update(doc)

# Third way: build the Counter from a generator in one step.
freqs3 = Counter(w for doc in example_documents for w in doc)

print(freqs1)
print(freqs1['the'])
print(freqs1['neverseen'])   # prints 0, no KeyError
Logarithmic probabilities. If you multiply many small probabilities, you may run into problems with numeric precision: the product underflows and becomes zero. To avoid this problem, I recommend that you compute the logarithms of the probabilities instead of the probabilities themselves. To compute the logarithm in Python, use the function log in the numpy library.
The logarithms have the mathematical property that np.log(P1 * P2) = np.log(P1) + np.log(P2). So if you use log probabilities, all multiplications (for instance, in the Naive Bayes probability formula) will be replaced by sums.
If you’d like to come back from log probabilities to normal probabilities, you can apply the exponential function, which is the inverse of the logarithm: prob = np.exp(logprob). (However, if the log probability is too small, exp will just return zero.)
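As a quick illustration of why this matters (the numbers here are arbitrary, chosen only to force underflow):

```python
import numpy as np

# Multiplying 100 probabilities of 1e-5 underflows to zero ...
probs = [1e-5] * 100
product = 1.0
for p in probs:
    product *= p
print(product)   # 0.0: the true value 1e-500 is too small for a float

# ... but the sum of the log probabilities is perfectly representable.
log_sum = sum(np.log(p) for p in probs)
print(log_sum)   # ≈ -1151.3
print(np.exp(log_sum))   # back to 0.0: still too small for a float
```

Since the log is monotonic, comparing log probabilities gives the same winner as comparing the probabilities themselves, so for classification you never actually need to convert back.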
Download this file (URL: http://www.cse.chalmers.se/~richajo/dit862/data/all_sentiment_shuffled.txt). This is a collection of customer reviews from six of the review topics used in the paper by Blitzer et al. (2007) mentioned above. The data has been formatted so that there is one review per line, and the texts have been split into separate words (“tokens”) and lowercased. Here is an example of a line.
music neg 544.txt i was misled and thought i was buying the entire cd and it contains one song
A line in the file is organized in columns:
· 0: topic category label (books, camera, dvd, health, music, or software)
· 1: sentiment polarity label (pos or neg)
· 2: document identifier
· 3 and on: the words in the document
Here is some Python code to read the entire collection.
# Note: a __future__ import must come before any other import.
from __future__ import division
from codecs import open

def read_documents(doc_file):
    docs = []
    labels = []
    with open(doc_file, encoding='utf-8') as f:
        for line in f:
            words = line.strip().split()
            docs.append(words[3:])
            labels.append(words[1])
    return docs, labels
We first remove the document identifier, and also the topic label, which you don’t need unless you solve the first optional task. Then, we split the data into a training and an evaluation part. For instance, we may use 80% for training and the remainder for evaluation.
all_docs, all_labels = read_documents('all_sentiment_shuffled.txt')
split_point = int(0.80*len(all_docs))
train_docs = all_docs[:split_point]
train_labels = all_labels[:split_point]
eval_docs = all_docs[split_point:]
eval_labels = all_labels[split_point:]
Estimating parameters for the Naive Bayes classifier
Write a Python function that uses a training set of documents to estimate the probabilities in the Naive Bayes model. Return some data structure containing the probabilities or log probabilities. The input parameters of this function should be a list of documents and another list with the corresponding polarity labels. It could look something like this:
def train_nb(documents, labels):
    …
    (return the data you need to classify new instances)
Hint 1. In this assignment, it is acceptable if you assume that we will always use the pos and neg categories.
Hint 2. Some sort of smoothing will probably improve your results. You can implement the smoothing either in train_nb or in score_doc_label that we discuss below.
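To make the hints concrete, here is a minimal sketch of what train_nb might look like, assuming the pos/neg setup from Hint 1 and add-alpha (Laplace) smoothing. The model layout (a dictionary of log priors, per-word log probabilities, and an "unseen word" fallback) is just one possible choice, not a required format:

```python
from collections import Counter
import numpy as np

def train_nb(documents, labels, alpha=1.0):
    # Hypothetical model layout: class log priors plus smoothed
    # per-class word log probabilities (add-alpha smoothing).
    classes = ('pos', 'neg')
    word_counts = {c: Counter() for c in classes}
    label_counts = Counter(labels)
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc)

    vocab = set(word_counts['pos']) | set(word_counts['neg'])
    model = {'log_prior': {}, 'log_word_prob': {}, 'log_unseen': {}}
    for c in classes:
        model['log_prior'][c] = np.log(label_counts[c] / len(labels))
        total = sum(word_counts[c].values())
        denom = total + alpha * len(vocab)
        model['log_word_prob'][c] = {
            w: np.log((count + alpha) / denom)
            for w, count in word_counts[c].items()
        }
        # Fallback log probability for words never seen with this class.
        model['log_unseen'][c] = np.log(alpha / denom)
    return model
```

With alpha = 1 this is plain Laplace smoothing; smaller values of alpha smooth less aggressively, and you can tune it on held-out data.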
Classifying new documents
Write a function that applies the Naive Bayes formula to compute the logarithm of the probability of observing the words in a document and a sentiment polarity label.
def score_doc_label(document, label,
    …
    (return the log probability)
Sanity check 1. Try to apply score_doc_label to a few very short documents; to convert the log probability back into a probability, apply np.exp or math.exp. For instance, let’s consider small documents of length 1. The probability of a positive document containing just the word “great” should be a small number; depending on your choice of smoothing parameter, it will probably be around 0.001–0.002. In any case, it should be higher than the probability of a negative document with the same word. Conversely, if you try the word “bad” instead, the negative score should be higher than the positive one.
Sanity check 2. Your function score_doc_label should not crash for the document ['a', 'top-quality', 'performance'].
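For reference, here is one way such a function could pass both sanity checks. The model below is a toy with made-up probabilities; it is only meant to show the shape of the computation (a sum of log probabilities, with a smoothed fallback for unseen words), not a model you would actually train:

```python
import numpy as np

def score_doc_label(document, label, log_prior, log_word_prob, log_unseen):
    # Naive Bayes in log space: log P(label) + sum of log P(word | label),
    # with a fallback for words not seen during training.
    score = log_prior[label]
    for w in document:
        score += log_word_prob[label].get(w, log_unseen[label])
    return score

# Toy model with made-up probabilities, for demonstration only.
log_prior = {'pos': np.log(0.5), 'neg': np.log(0.5)}
log_word_prob = {
    'pos': {'great': np.log(0.01), 'bad': np.log(0.001)},
    'neg': {'great': np.log(0.001), 'bad': np.log(0.01)},
}
log_unseen = {'pos': np.log(1e-6), 'neg': np.log(1e-6)}

print(np.exp(score_doc_label(['great'], 'pos', log_prior, log_word_prob, log_unseen)))   # ≈ 0.005
# Unseen words use the fallback instead of raising a KeyError:
print(score_doc_label(['a', 'top-quality', 'performance'], 'pos',
                      log_prior, log_word_prob, log_unseen))
```

Note how the positive score for “great” exceeds the negative one (0.005 vs. 0.0005 in this toy), matching sanity check 1, and the unseen-word fallback takes care of sanity check 2.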
Next, based on the function you just wrote, write another function that classifies a new document.
def classify_nb(document,
    …
    (return the guess of the classifier)
Again, apply this function to a few very small documents and make sure that you get the output you’d expect.
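The classification step itself is just an argmax over labels. A sketch, with a hypothetical toy scorer standing in for your score_doc_label:

```python
def classify_nb(document, score_fn, labels=('pos', 'neg')):
    # Pick the label whose score is highest.
    return max(labels, key=lambda label: score_fn(document, label))

# Hypothetical toy scorer, just to exercise the function:
# counts occurrences of "great" versus "bad".
def toy_score(document, label):
    diff = document.count('great') - document.count('bad')
    return diff if label == 'pos' else -diff

print(classify_nb(['a', 'great', 'movie'], toy_score))   # pos
print(classify_nb(['bad'], toy_score))                   # neg
```

In your own version the scorer would be score_doc_label with your trained model, so the parameter list will look different; the argmax idea is the same.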
Evaluating the classifier
Write a function that classifies each document in the test set and returns the list of predicted sentiment labels.
def classify_documents(docs,
…
(return the classifier’s predictions for all documents in the collection)
Next, we compute the accuracy, i.e. the number of correctly classified documents divided by the total number of documents.
def accuracy(true_labels, guessed_labels):
    …
    (return the accuracy)
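One straightforward way to compute this, shown only as a sketch:

```python
def accuracy(true_labels, guessed_labels):
    # Fraction of positions where the guess matches the true label.
    correct = sum(t == g for t, g in zip(true_labels, guessed_labels))
    return correct / len(true_labels)

print(accuracy(['pos', 'neg', 'pos', 'neg'],
               ['pos', 'neg', 'neg', 'neg']))   # 0.75
```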
What accuracy do you get when evaluating the classifier on the test set?
Error analysis
Find a few misclassified documents and comment on why you think they were hard to classify. For instance, you may select a few short documents where the probabilities were particularly high in the wrong direction.
Cross-validation
Since our estimate of the accuracy is based on a fairly small test set, an interval estimate around it is quite wide. We will now use a trick to get a more reliable estimate and a tighter interval.
In a cross-validation, we divide the data into N parts (folds) of equal size. We then carry out N evaluations: each fold once becomes a test set, while the other folds form the training set. We then combine the results of the N different evaluations. This trick allows us to get results for the whole dataset, not just a small test set.
Here is a code stub that shows the idea:
for fold_nbr in range(N):
    split_point_1 = int(float(fold_nbr)/N*len(all_docs))
    split_point_2 = int(float(fold_nbr+1)/N*len(all_docs))
    train_docs_fold = all_docs[:split_point_1] + all_docs[split_point_2:]
    train_labels_fold = all_labels[:split_point_1] + all_labels[split_point_2:]
    eval_docs_fold = all_docs[split_point_1:split_point_2]
    eval_labels_fold = all_labels[split_point_1:split_point_2]
    …
    (train a classifier on train_docs_fold and train_labels_fold)
    (apply the classifier to eval_docs_fold)
…
(combine the outputs of the classifiers in all folds)
Implement the cross-validation method. Then estimate the accuracy and compute a new interval estimate. A typical value of N would be between 4 and 10.
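Putting the stub above together, a runnable sketch of the whole loop might look like this. The majority-class "classifier" here is a dummy, included only so the example runs on its own; in your solution you would plug in your train_nb and classify_nb:

```python
from collections import Counter

def cross_validate(all_docs, all_labels, train_fn, classify_fn, N=10):
    # Each fold in turn becomes the evaluation set; predictions from
    # all folds are pooled, so every document gets a prediction.
    all_predictions = []
    for fold_nbr in range(N):
        split_point_1 = fold_nbr * len(all_docs) // N
        split_point_2 = (fold_nbr + 1) * len(all_docs) // N
        train_docs_fold = all_docs[:split_point_1] + all_docs[split_point_2:]
        train_labels_fold = all_labels[:split_point_1] + all_labels[split_point_2:]
        eval_docs_fold = all_docs[split_point_1:split_point_2]
        model = train_fn(train_docs_fold, train_labels_fold)
        all_predictions.extend(classify_fn(doc, model) for doc in eval_docs_fold)
    return all_predictions

# Dummy stand-ins just to exercise the loop: the "model" is simply
# the majority class of the training labels.
def train_majority(docs, labels):
    return Counter(labels).most_common(1)[0][0]

def classify_majority(doc, model):
    return model

docs = [['x']] * 10
labels = ['pos'] * 6 + ['neg'] * 4
preds = cross_validate(docs, labels, train_majority, classify_majority, N=5)
print(len(preds))   # 10: one prediction per document
```

The pooled predictions line up with all_labels in order, so you can pass them straight to your accuracy function.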
Reference: This homework was adapted from the following website: http://www.cse.chalmers.se/~richajo/dit862/assignment2.html. You may check the original materials for details.