Overview
1. Text Classification:
In this assignment, you will use scikit-learn, a machine learning toolkit in Python, to implement text classifiers for sentiment analysis. Please read all instructions below carefully.
2. Datasets and evaluation:
You are given the following customer reviews dataset: CR.zip, which includes positive and negative reviews. CR is a small dataset that doesn’t have train/test divisions, so you are required to evaluate performance using 10-fold cross-validation. Please use the following scikit-learn modules in your implementation:
scikit-learn documentation:
Bag-of-words (or n-grams) feature extraction using CountVectorizer: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html (use binary features, i.e. 1/0 rather than counts).
Naïve Bayes classifier: http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
Logistic Regression classifier: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
Cross validation: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html
Classification report: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
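As a minimal sketch of the binary feature extraction described above, the snippet below runs CountVectorizer on three made-up sentences (they are illustrative only and are not taken from CR.zip). It assumes that `binary=True` is the intended way to get 1/0 features and that `ngram_range=(1, 2)` is the intended way to combine unigrams and bigrams.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy sentences, used only to show the shape of the features.
docs = ["good good phone", "bad battery", "good battery life"]

# Binary bag-of-words: each entry is 1 if the word occurs, 0 otherwise.
unigram_vec = CountVectorizer(binary=True)
X_uni = unigram_vec.fit_transform(docs)

# Binary bag-of-ngrams: unigrams and bigrams together.
bigram_vec = CountVectorizer(binary=True, ngram_range=(1, 2))
X_bi = bigram_vec.fit_transform(docs)

print(sorted(unigram_vec.vocabulary_))  # learned unigram vocabulary
print(X_uni.toarray())                  # "good" appears twice in the first doc but is stored as 1
print(sorted(bigram_vec.vocabulary_))   # unigrams plus bigrams such as "good phone"
```

Note that the repeated word in the first sentence still yields a feature value of 1, which is the difference between binary features and raw counts.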
Targets
Part 1:
Using the scikit-learn modules described above, implement the following models and report their performance (accuracy and F1) on the CR dataset:
a) A Naïve Bayes classifier with add-1 smoothing using binary bag-of-words features.
b) A Naïve Bayes classifier with add-1 smoothing using binary bag-of-ngrams features (with unigrams and bigrams).
c) Logistic Regression classifier with L2 regularization (and default parameters) using binary bag-of-words features.
d) Logistic Regression classifier with L2 regularization using binary bag-of-ngrams features (with unigrams and bigrams).
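One possible way to organize models a) through d) is sketched below. It is a starting point under stated assumptions, not a reference solution: the reviews are hypothetical toy sentences standing in for the contents of CR.zip (whose file format is not described here), labels are assumed to be 1 for positive and 0 for negative, add-1 smoothing is assumed to correspond to `alpha=1.0` in MultinomialNB, and LogisticRegression is left at its defaults, which use L2 regularization.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; replace with the reviews loaded from CR.zip.
positive = [
    "great camera and easy to use",
    "love the battery life",
    "excellent sound quality",
    "works great highly recommend",
    "very happy with this player",
] * 2
negative = [
    "poor battery and bad screen",
    "terrible software keeps crashing",
    "not worth the money",
    "stopped working after a week",
    "very disappointed with the quality",
] * 2
texts = positive + negative
labels = [1] * len(positive) + [0] * len(negative)  # assumed encoding: 1 = positive


def build_models():
    """Return the four configurations, keyed by the task letter."""
    unigrams = dict(binary=True)
    uni_bigrams = dict(binary=True, ngram_range=(1, 2))
    return {
        "a) NB, bag-of-words": make_pipeline(
            CountVectorizer(**unigrams), MultinomialNB(alpha=1.0)),
        "b) NB, bag-of-ngrams": make_pipeline(
            CountVectorizer(**uni_bigrams), MultinomialNB(alpha=1.0)),
        "c) LR, bag-of-words": make_pipeline(
            CountVectorizer(**unigrams), LogisticRegression()),
        "d) LR, bag-of-ngrams": make_pipeline(
            CountVectorizer(**uni_bigrams), LogisticRegression()),
    }


results = {}
for name, model in build_models().items():
    # 10-fold cross-validation; the vectorizer is refit inside each fold.
    scores = cross_validate(model, texts, labels, cv=10,
                            scoring=["accuracy", "f1"])
    results[name] = (scores["test_accuracy"].mean(), scores["test_f1"].mean())
    print(f"{name}: accuracy={results[name][0]:.3f}, F1={results[name][1]:.3f}")
```

Wrapping the vectorizer and classifier in one pipeline means the vocabulary is learned only from the training portion of each fold, so the held-out fold does not leak into feature extraction. If a per-class breakdown is wanted, the out-of-fold predictions from `cross_val_predict` could be passed to `classification_report`.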
Part 2:
[optional