These are study notes recording all of the assignment code for Course Four: Applied Text Mining in Python, part of the Coursera specialization Applied Data Science with Python offered by the University of Michigan. All assignments have passed the autograder tests with a score of 100/100.
目录
Module 1: Working with Text in Python - Assignment 1
Module 2: Basic Natural Language Processing - Assignment 2 - Introduction to NLTK
Part 1 - Analyzing Plots Summary Text
Module 3: Classification of Text - Assignment 3
Module 4: Topic Modeling - Assignment 4 - Document Similarity & Topic Modelling
Module 1: Working with Text in Python - Assignment 1
In this assignment, you'll be working with messy medical data and using regex to extract relevant information from the data.
Each line of the dates.txt file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats.
The goal of this assignment is to correctly identify all of the different date variants encoded in this dataset and to properly normalize and sort the dates.
Here is a list of some of the variants you might encounter in this dataset:
- 04/20/2009; 04/20/09; 4/20/09; 4/3/09
- Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
- 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
- Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
- Feb 2009; Sep 2009; Oct 2010
- 6/2008; 12/2009
- 2009; 2010
Once you have extracted these date patterns from the text, the next step is to sort them in ascending chronological order according to the following rules:
- Assume all dates in xx/xx/xx format are mm/dd/yy
- Assume all dates where year is encoded in only two digits are years from the 1900's (e.g. 1/5/89 is January 5th, 1989)
- If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
- If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).
- Watch out for potential typos as this is a raw, real-life derived dataset.
With these rules in mind, find the correct date in each note and return a pandas Series in chronological order of the original Series' indices. This Series should be sorted by a tie-break sort in the format of ("extracted date", "original row number").
For example if the original series was this:
0 1999
1 2010
2 1978
3 2015
4 1985
Your function should return this:
0 2
1 4
2 0
3 1
4 3
Your score will be calculated using Kendall's tau, a correlation measure for ordinal data.
This function should return a Series of length 500 and dtype int.
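Before the full solution, here is a minimal sketch of the tie-break sort and the Kendall's tau scoring on the toy example above (the use of pd.to_datetime and scipy.stats.kendalltau here is an illustrative assumption, not part of the assignment code):
import pandas as pd
from scipy import stats
toy = pd.Series(['1999', '2010', '1978', '2015', '1985'])
parsed = pd.to_datetime(toy)                       # year-only strings become January 1 of that year
order = pd.Series(parsed.sort_values(kind='stable').index)
print(order.tolist())                              # [2, 4, 0, 1, 3], matching the expected output above
tau, _ = stats.kendalltau(order, [2, 4, 0, 1, 3])  # Kendall's tau compares two orderings
print(tau)                                         # 1.0 for a perfect match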
import pandas as pd
doc = []
with open('assets/dates.txt') as file:
for line in file:
doc.append(line)
df = pd.Series(doc)
df.head(10)
def date_sorter():
#order = None
# YOUR CODE HERE
df_ = df.copy()
month_dict = {
'Jan':1, 'Feb':2, 'Mar':3, 'Apr':4, 'May':5, 'Jun':6,
'Jul':7, 'Aug':8, 'Sep':9, 'Oct':10, 'Nov':11, 'Dec':12,
'Janaury':1, 'January':1, 'February':2, 'March':3, 'April':4,'June':6, 'July':7, 'August':8,
'September':9, 'October':10, 'November':11, 'December':12, 'Decemeber':12
}
patterns = [
r'(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>(?:\d{4}|\d{2}))\b',
r'(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)(?:[\-\.,]? )(?P<day>\d{2}[a-z]{0,2}),? (?P<year>\d{4})',
r'(?P<day>\d{2}) (?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)[(?:. )(?:, )](?P<year>\d{4})',
r'[A-Za-z0-9]{1}(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*),? (?P<year>\d{4})',
r'[^0-9],? (?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*),? (?P<year>\d{4})',
r'[^/0-9](?P<month>\d{1,2})/(?P<year>\d{4})',
r'^(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*),? (?P<year>\d{4})',
r'[\(\.\"](?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*),? (?P<year>\d{4})',
r'^(?P<month>\d{1,2})[/-](?P<year>\d{4})',
r'[^0-9a-z], (?P<year>\d{4})[^0-9]', #
r'^(?P<year>\d{4})',
r'[A-Za-z\.\(~]{1}(?P<year>\d{4})',
r'Age,? \d{1,2}, (?P<year>\d{4})',
r'(?!Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|ury|ary|rch|ril|une|uly|ust|ber|\d{3}$)[a-zA-Z:;]{3} (?P<year>\d{4})',
r'[Ii]n (?P<year>\d{4})',
r' - (?P<year>\d{4})',
r'\d{3} (?P<year>\d{4})'
]
dates = []
for idx, pattern in enumerate(patterns):
date = df_.str.extractall(pattern)
dates.append(date)
dates_df = pd.concat(dates).sort_index()
dates_df['day'] = dates_df['day'].fillna(1)
dates_df['day'] = dates_df['day'].astype('int').astype('str')
dates_df['month'] = dates_df['month'].fillna('January')
dates_df['month'].replace(month_dict, inplace=True)
dates_df['month'] = dates_df['month'].astype('int').astype('str')
dates_df['year'] = dates_df['year'].apply(lambda x: '19'+x if len(x)==2 else x)
dates_df['year'] = dates_df['year'].astype('int')
dates_df = dates_df[dates_df['year']<=2023]
dates_df['year'] = dates_df['year'].astype('str')
extracted_df = dates_df.droplevel(level='match')
extracted_df['date'] = extracted_df['month'] + '/' + extracted_df['day'] + '/' + extracted_df['year']
times_df = pd.to_datetime(extracted_df['date'])
order = pd.Series(times_df.sort_values(kind='stable').index)
# raise NotImplementedError()
return order # Your answer here
Note: I am not particularly familiar with regular expressions, so the patterns list above likely has considerable room for optimization.
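As a quick illustration of the core mechanism (the sample strings below are made up, not taken from dates.txt), str.extractall with named groups is what produces the month/day/year columns that the code above concatenates:
import pandas as pd
sample = pd.Series(['Visit on 04/20/2009.', 'Total time: 6/2008 to present'])
pattern = r'(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{2,4})\b'
print(sample.str.extractall(pattern))
# -> one matching row with month=04, day=20, year=2009, indexed by (original row, match number);
#    the second string is only caught by one of the later, more permissive patterns.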
The following code can be used to self-check the result returned by date_sorter() for errors:
import numpy as np
s_test = date_sorter()
# check if running the code twice produces the same result
try:
assert (date_sorter() == s_test).all()
print("Passed repeatability check")
except:
print("Failed repeatability check")
# check if the result has the expected index
try:
assert type(date_sorter().index) == pd.RangeIndex
assert (date_sorter().index == pd.RangeIndex(start=0, stop=500, step=1)).all()
print("Passed index check")
except:
print("Failed index check")
# check the tie-break sort for a sample of records where some have the same date
# note that this only tests a sample and does not check the entire answer
try:
i_test = [s_test.index[s_test == v].values[0]
for v in [318, 369, 493, 252, 314, 410, 490]]
assert sorted(i_test) == i_test
print("Passed secondary sort sample check")
except:
print("Failed secondary sort sample check")
# check if the parsed dates appear to be correct and correctly sorted
# by producing some test checksums
# if you get for example a False entry in the agree column for
# index value 20 that would mean you have at least one incorrectly
# parsed or incorrectly sorted date in the **output** index
# range 20,21,...,29
try:
v_check = pd.DataFrame({'correct':
[6695, 14428, 16742, 9275, 12290, 14654, 9421, 10185, 11464, 16491,
11797, 14036, 15459, 9412, 13069, 10400, 10498, 14322, 13274, 11001,
11383, 11910, 10977, 9692, 10199, 10187, 15456, 13491, 9186, 13646,
11142, 13724, 10994, 12905, 15968, 16648, 13966, 14607, 16932, 14622,
17942, 18220, 17818, 18305, 19633, 12522, 13978, 18445, 20156, 14797],
'learner':[
(s_test.iloc[10*i:(i+1)*10].values * np.array(range(1,11))).sum() for i in range(50)]},
index=range(0,500,10)).assign(agree=lambda x:x['correct']==x['learner'])
print("Values checksums:")
print(v_check)
assert v_check['agree'].all()
print("Passed values check")
except:
print("Failed values check")
If all checks pass, the output is:
Passed repeatability check
Passed index check
Passed secondary sort sample check
Values checksums:
correct learner agree
0 6695 6695 True
10 14428 14428 True
20 16742 16742 True
30 9275 9275 True
40 12290 12290 True
50 14654 14654 True
60 9421 9421 True
70 10185 10185 True
80 11464 11464 True
90 16491 16491 True
100 11797 11797 True
110 14036 14036 True
120 15459 15459 True
130 9412 9412 True
140 13069 13069 True
150 10400 10400 True
160 10498 10498 True
170 14322 14322 True
180 13274 13274 True
190 11001 11001 True
200 11383 11383 True
210 11910 11910 True
220 10977 10977 True
230 9692 9692 True
240 10199 10199 True
250 10187 10187 True
260 15456 15456 True
270 13491 13491 True
280 9186 9186 True
290 13646 13646 True
300 11142 11142 True
310 13724 13724 True
320 10994 10994 True
330 12905 12905 True
340 15968 15968 True
350 16648 16648 True
360 13966 13966 True
370 14607 14607 True
380 16932 16932 True
390 14622 14622 True
400 17942 17942 True
410 18220 18220 True
420 17818 17818 True
430 18305 18305 True
440 19633 19633 True
450 12522 12522 True
460 13978 13978 True
470 18445 18445 True
480 20156 20156 True
490 14797 14797 True
Passed values check
If the agree column is False at index i, then at least one of output rows i through i+9 contains an incorrectly parsed or incorrectly sorted date (for example, i=20 means rows 20, 21, ..., 29 contain an error).
Module 2: Basic Natural Language Processing - Assignment 2 - Introduction to NLTK
In part 1 of this assignment you will use nltk to explore the CMU Movie Summary Corpus. All data is released under a Creative Commons Attribution-ShareAlike License. Then in part 2 you will create a spelling recommender function that uses nltk to find words similar to the misspelling.
Part 1 - Analyzing Plots Summary Text
import nltk
import pandas as pd
import numpy as np
nltk.data.path.append("assets/")
# If you would like to work with the raw text you can use 'plots_raw'
with open('assets/plots.txt', 'rt', encoding="utf8") as f:
plots_raw = f.read()
# If you would like to work with the plot summaries in nltk.Text format you can use 'text1'.
plots_tokens = nltk.word_tokenize(plots_raw)
text1 = nltk.Text(plots_tokens)
Example 1
How many tokens (words and punctuation symbols) are in text1?
This function should return an integer.
def example_one():
return len(nltk.word_tokenize(plots_raw)) # or alternatively len(text1)
example_one()
Returns:
374441
Example 2
How many unique tokens (unique words and punctuation) does text1 have?
This function should return an integer.
def example_two():
return len(set(nltk.word_tokenize(plots_raw))) # or alternatively len(set(text1))
example_two()
Returns:
25933
Example 3
After lemmatizing the verbs, how many unique tokens does text1 have?
This function should return an integer.
from nltk.stem import WordNetLemmatizer
def example_three():
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w,'v') for w in text1]
return len(set(lemmatized))
example_three()
Returns:
21760
Question 1
What is the lexical diversity of the given text input? (i.e. ratio of unique tokens to the total number of tokens)
This function should return a float.
def answer_one():
# YOUR CODE HERE
# raise NotImplementedError()
return example_two() / example_one()# your answer here
answer_one()
Returns:
0.06925790712021386
Question 2
What percentage of tokens is 'love' or 'Love'?
This function should return a float.
def answer_two():
# YOUR CODE HERE
dist = nltk.FreqDist(text1)
# raise NotImplementedError()
return (dist['love'] + dist['Love']) / example_one() * 100# Your answer here
answer_two()
Returns:
0.12391805384559917
Question 3
What are the 20 most frequently occurring (unique) tokens in the text? What is their frequency?
This function should return a list of 20 tuples where each tuple is of the form (token, frequency). The list should be sorted in descending order of frequency.
def answer_three():
# YOUR CODE HERE
dist = nltk.FreqDist(text1)
topFreq = sorted(dist, key=lambda x: dist[x], reverse=True)[:20]
tupList = [tuple((token, dist[token])) for token in topFreq]
# raise NotImplementedError()
return tupList # Your answer here
answer_three()
Returns:
[(',', 19420),
('the', 18698),
('.', 16624),
('to', 12149),
('and', 11400),
('a', 8979),
('of', 6510),
('is', 5699),
('in', 5109),
('his', 4693),
("'s", 3682),
('her', 3674),
('he', 3556),
('that', 3517),
('with', 3293),
('him', 2570),
('for', 2433),
('by', 2321),
('The', 2234),
('on', 1925)]
Question 4
What tokens have a length of greater than 5 and frequency of more than 200?
This function should return an alphabetically sorted list of the tokens that match the above constraints. To sort your list, use sorted()
def answer_four():
# YOUR CODE HERE
dist = nltk.FreqDist(text1)
occurList = [token for token in dist if len(token) > 5 and dist[token] > 200]
# raise NotImplementedError()
return sorted(occurList)# Your answer here
answer_four()
Returns:
['However',
'Meanwhile',
'another',
'because',
'becomes',
'before',
'begins',
'daughter',
'decides',
'escape',
'family',
'father',
'friend',
'friends',
'himself',
'killed',
'leaves',
'mother',
'people',
'police',
'returns',
'school',
'through']
Question 5
Find the longest token in text1 and that token's length.
This function should return a tuple (longest_word, length).
def answer_five():
# YOUR CODE HERE
longest_word = sorted(text1, key=lambda x: len(x), reverse=True)[0]
# raise NotImplementedError()
return longest_word, len(longest_word)# Your answer here
answer_five()
Returns:
('live-for-today-for-tomorrow-we-die', 34)
Question 6
What unique words have a frequency of more than 2000? What is their frequency?
"Hint: you may want to use isalpha()
to check if the token is a word and not punctuation."
This function should return a list of tuples of the form (frequency, word)
sorted in descending order of frequency.
def answer_six():
# YOUR CODE HERE
dist = nltk.FreqDist(text1)
tupList = [tuple((frequency, word)) for word, frequency in dist.items() if frequency > 2000 and word.isalpha()]
# raise NotImplementedError()
return sorted(tupList, key=lambda x: x[0], reverse=True)# Your answer here
answer_six()
Returns:
[(18698, 'the'),
(12149, 'to'),
(11400, 'and'),
(8979, 'a'),
(6510, 'of'),
(5699, 'is'),
(5109, 'in'),
(4693, 'his'),
(3674, 'her'),
(3556, 'he'),
(3517, 'that'),
(3293, 'with'),
(2570, 'him'),
(2433, 'for'),
(2321, 'by'),
(2234, 'The')]
Question 7
text1 is in nltk.Text format and has been constructed using the tokens output by nltk.word_tokenize(plots_raw).
Now, use nltk.sent_tokenize on the tokens in text1 by joining them using whitespace to output a sentence-tokenized copy of text1. Report the average number of whitespace-separated tokens per sentence in the sentence-tokenized copy of text1.
This function should return a float.
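A toy illustration of the join-then-sentence-tokenize approach described above (made-up tokens, not taken from text1):
import nltk
toy_tokens = ['Fish', 'are', 'friends', '.', 'Not', 'food', '.']
sentences = nltk.sent_tokenize(' '.join(toy_tokens))
print(sentences)                                                    # typically ['Fish are friends .', 'Not food .']
print(sum(len(s.split(' ')) for s in sentences) / len(sentences))   # average whitespace-separated tokens per sentence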
def answer_seven():
# YOUR CODE HERE
num_sent, num_whsp = 0, 0
for sent in nltk.sent_tokenize(' '.join(list(text1))):
num_sent += 1
num_whsp += len(sent.split(' '))
# raise NotImplementedError()
#return sent0, words
return num_whsp / num_sent# Your answer here
answer_seven()
Returns:
22.260329350216992
Question 8
What are the 5 most frequent parts of speech in text1? What is their frequency?
This function should return a list of tuples of the form (part_of_speech, frequency) sorted in descending order of frequency.
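A tiny illustration of what nltk.pos_tag returns (toy sentence; the exact tags can vary slightly with the tagger version):
import nltk
print(nltk.pos_tag(nltk.word_tokenize('The dog runs fast.')))
# roughly [('The', 'DT'), ('dog', 'NN'), ('runs', 'VBZ'), ('fast', 'RB'), ('.', '.')]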
def answer_eight():
# YOUR CODE HERE
dist = nltk.FreqDist(pos for _, pos in nltk.pos_tag(text1))
posFreq = sorted(dist, key=lambda x: dist[x], reverse=True)[:5]
tupList = [tuple((pos, dist[pos])) for pos in posFreq]
# raise NotImplementedError()
return tupList# Your answer here
answer_eight()
Returns:
[('NN', 51452), ('IN', 39225), ('NNP', 38361), ('DT', 34471), ('VBZ', 23799)]
Part 2 - Spelling Recommender
For this part of the assignment you will create three different spelling recommenders that each take a list of misspelled words and recommend a correctly spelled word for every word in the list.
For every misspelled word, the recommender should find the word in correct_spellings that has the shortest distance*, and starts with the same letter as the misspelled word, and return that word as a recommendation.
*Each of the three different recommenders will use a different distance measure (outlined below).
Each of the recommenders should provide recommendations for the three default words provided: ['cormulent', 'incendenece', 'validrate'].
from nltk.corpus import words
correct_spellings = words.words()
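Questions 9 and 10 below rely on two NLTK building blocks, nltk.util.ngrams and nltk.metrics.distance.jaccard_distance. A quick toy check (the word pair is chosen for illustration only):
from nltk.util import ngrams
from nltk.metrics.distance import jaccard_distance
set1 = set(ngrams('cormulent', 3))   # character trigrams, e.g. ('c','o','r'), ('o','r','m'), ...
set2 = set(ngrams('corpulent', 3))
print(jaccard_distance(set1, set2))  # 0.6 -- a smaller distance means more similar spellings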
Question 9
For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:
Jaccard distance on the trigrams of the two words.
This function should return a list of length three: ['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation'].
def answer_nine(entries=['cormulent', 'incendenece', 'validrate']):
# your code goes here
# YOUR CODE HERE
from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams
recommendations = []
tri_grams = lambda x: set(ngrams(x, 3))
jaccard_ = lambda xs, y: [jaccard_distance(tri_grams(x), tri_grams(y)) for x in xs]
recommendations = []
for entry in entries:
correct_spellings_ = [correct_spelling for correct_spelling in correct_spellings if correct_spelling[0] == entry[0]]
recommendations.append(correct_spellings_[np.argmin(jaccard_(correct_spellings_, entry))])
# raise NotImplementedError()
return recommendations# Your answer here
answer_nine()
Returns:
['corpulent', 'indecence', 'validate']
Question 10
For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:
Jaccard distance on the 4-grams of the two words.
This function should return a list of length three: ['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation'].
def answer_ten(entries=['cormulent', 'incendenece', 'validrate']):
# YOUR CODE HERE
from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams
four_grams = lambda x: set(ngrams(x, 4))
jaccard_ = lambda xs, y: [jaccard_distance(four_grams(x), four_grams(y)) for x in xs]
recommendations = []
for entry in entries:
correct_spellings_ = [correct_spelling for correct_spelling in correct_spellings if correct_spelling[0] == entry[0]]
recommendations.append(correct_spellings_[np.argmin(jaccard_(correct_spellings_, entry))])
# raise NotImplementedError()
return recommendations# Your answer here
answer_ten()
Returns:
['cormus', 'incendiary', 'valid']
Question 11
For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:
Edit distance on the two words with transpositions.
This function should return a list of length three: ['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation'].
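For reference, a quick check of the distance used here on a toy pair (not a graded answer):
from nltk.metrics.distance import edit_distance
# Deleting the stray 'r' turns 'validrate' into 'validate', so the distance is 1.
print(edit_distance('validrate', 'validate', transpositions=True))  # 1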
def answer_eleven(entries=['cormulent', 'incendenece', 'validrate']):
# YOUR CODE HERE
from nltk.metrics.distance import edit_distance
edit_ = lambda x, y: edit_distance(x, y, substitution_cost=2, transpositions=True)
recommendations = [correct_spellings[np.argmin([edit_(correct_spelling, entry) for correct_spelling in correct_spellings])] for entry in entries]
# raise NotImplementedError()
return recommendations# Your answer here
answer_eleven()
Returns:
['corpulent', 'intendence', 'validate']
Module 3: Classification of Text - Assignment 3
In this assignment you will explore text message data and create models to predict if a message is spam or not.
import pandas as pd
import numpy as np
spam_data = pd.read_csv('assets/spam.csv')
spam_data['target'] = np.where(spam_data['target']=='spam',1,0)
spam_data.head(10)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(spam_data['text'],
spam_data['target'],
random_state=0)
Question 1
What percentage of the documents in spam_data are spam?
This function should return a float, the percent value (i.e. ratio * 100).
def answer_one():
# YOUR CODE HERE
# raise NotImplementedError()
return len(spam_data[spam_data['target'] == 1]) / len(spam_data) * 100 #Your answer here
answer_one()
Returns:
13.406317300789663
Question 2
Fit the training data X_train using a Count Vectorizer with default parameters.
What is the longest token in the vocabulary?
This function should return a string.
from sklearn.feature_extraction.text import CountVectorizer
def answer_two():
# YOUR CODE HERE
max_length = 0
max_token = ''
vectorizer = CountVectorizer()
vectorizer.fit(X_train)
for token in vectorizer.get_feature_names_out():
token_length = len(token)
if token_length > max_length:
max_length = token_length
max_token = token
# raise NotImplementedError()
return max_token#Your answer here
answer_two()
Returns:
'com1win150ppmx3age16subscription'
Question 3
Fit and transform the training data X_train using a Count Vectorizer with default parameters.
Next, fit a multinomial Naive Bayes classifier model with smoothing alpha=0.1. Find the area under the curve (AUC) score using the transformed test data.
This function should return the AUC score as a float.
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score
def answer_three():
# YOUR CODE HERE
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
clf = MultinomialNB(alpha=0.1)
clf.fit(X_train_vectorized, y_train)
y_pred = clf.predict(X_test_vectorized)
# raise NotImplementedError()
return roc_auc_score(y_test, y_pred)#Your answer here
answer_three()
Returns:
0.9720812182741116
Question 4
Fit and transform the training data X_train using a Tfidf Vectorizer with default parameters. The transformed data will be a compressed sparse row matrix where the number of rows is the number of documents in X_train, the number of columns is the number of features found by the vectorizer, and each value in the sparse matrix is the tf-idf value. First find the max tf-idf value for every feature.
What 20 features have the smallest tf-idf and what 20 have the largest tf-idf among the max tf-idf values?
Put these features in two series where each series is sorted by tf-idf value. The index of the series should be the feature name, and the data should be the tf-idf.
The series of 20 features with the smallest tf-idfs should be sorted smallest tf-idf first; the series of 20 features with the largest tf-idfs should be sorted largest first. Any entries with identical tf-idfs should appear in lexicographically increasing order by their feature name in both series. For example, if the features "a", "b", "c" had the tf-idfs 1.0, 0.5, 1.0 in the series with the largest tf-idfs, then they should occur in the returned result in the order "a", "c", "b" with values 1.0, 1.0, 0.5.
This function should return a tuple of two series (smallest tf-idfs series, largest tf-idfs series).
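One way to get the required tie-break ordering (a sketch, not necessarily the grader's reference implementation): pandas sorts can be made stable, so sorting by feature name first and then by value keeps ties in lexicographic order.
import pandas as pd
max_tfidf = pd.Series([1.0, 0.5, 1.0], index=['a', 'b', 'c'])
largest = max_tfidf.sort_index().sort_values(ascending=False, kind='stable')
print(largest)   # a 1.0, c 1.0, b 0.5 -> ties broken lexicographically, as in the example above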
from sklearn.feature_extraction.text import TfidfVectorizer
def answer_four():
# YOUR CODE HERE
vectorizer = TfidfVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
features = vectorizer.get_feature_names_out()
max_tf_idf_values = np.max(X_train_vectorized.toarray(), axis=0)
tupList = [tuple((ftr, val)) for ftr, val in zip(features, max_tf_idf_values)]
tupList_sorted = sorted(tupList, key=lambda x: x[1])
index1, values1 = [tup[0] for tup in tupList_sorted[:20]], [tup[1] for tup in tupList_sorted[:20]]
index2, values2 = [tup[0] for tup in tupList_sorted[-20:]][::-1], [tup[1] for tup in tupList_sorted[-20:]][::-1]
Series1, Series2 = pd.Series(values1, index=index1), pd.Series(values2, index=index2)
# raise NotImplementedError()
return Series1, Series2 #Your answer here
answer_four()
Returns:
(aaniye 0.074475
athletic 0.074475
chef 0.074475
companion 0.074475
courageous 0.074475
dependable 0.074475
determined 0.074475
exterminator 0.074475
healer 0.074475
listener 0.074475
organizer 0.074475
pest 0.074475
psychiatrist 0.074475
psychologist 0.074475
pudunga 0.074475
stylist 0.074475
sympathetic 0.074475
venaam 0.074475
afternoons 0.091250
approaching 0.091250
dtype: float64,
yup 1.000000
where 1.000000
too 1.000000
thanx 1.000000
thank 1.000000
okie 1.000000
ok 1.000000
nite 1.000000
lei 1.000000
home 1.000000
havent 1.000000
er 1.000000
done 1.000000
beerage 1.000000
anytime 1.000000
anything 1.000000
645 1.000000
146tf150p 1.000000
tick 0.980166
blank 0.932702
dtype: float64)
Question 5
Fit and transform the training data X_train using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than 3.
Then fit a multinomial Naive Bayes classifier model with smoothing alpha=0.1 and compute the area under the curve (AUC) score using the transformed test data.
This function should return the AUC score as a float.
def answer_five():
# YOUR CODE HERE
vectorizer = TfidfVectorizer(min_df=3)
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
clf = MultinomialNB(alpha=0.1)
clf.fit(X_train_vectorized, y_train)
y_score = clf.predict_proba(X_test_vectorized)[:, 1]
# raise NotImplementedError()
return roc_auc_score(y_test, y_score)#Your answer here
answer_five()
Returns:
0.9954968337775665
Question 6
What is the average length of documents (number of characters) for not spam and spam documents?
This function should return a tuple (average length not spam, average length spam).
def answer_six():
# YOUR CODE HERE
not_spam_docs = spam_data[spam_data['target'] == 0]['text']
spam_docs = spam_data[spam_data['target'] == 1]['text']
avg1 = not_spam_docs.apply(lambda x: len(x)).mean()
avg2 = spam_docs.apply(lambda x: len(x)).mean()
# raise NotImplementedError()
return avg1, avg2#Your answer here
answer_six()
Returns:
(71.02362694300518, 138.8661311914324)
The following function has been provided to help you combine new features into the training data:
def add_feature(X, feature_to_add):
"""
Returns sparse feature matrix with added feature.
feature_to_add can also be a list of features.
"""
from scipy.sparse import csr_matrix, hstack
return hstack([X, csr_matrix(feature_to_add).T], 'csr')
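A toy usage of add_feature (the shapes and values are made up for illustration):
import numpy as np
from scipy.sparse import csr_matrix
X_toy = csr_matrix(np.ones((3, 4)))            # 3 documents, 4 vocabulary features
doc_lengths = [10, 25, 7]                      # one extra feature per document
print(add_feature(X_toy, doc_lengths).shape)   # (3, 5) -- the new column is appended on the right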
Question 7
Fit and transform the training data X_train using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than 5.
Using this document-term matrix and an additional feature, the length of document (number of characters), fit a Support Vector Classification model with regularization C=10000. Then compute the area under the curve (AUC) score using the transformed test data.
Hint: Since probability is set to false, use the model's decision_function on the test data when calculating the target scores to use in roc_auc_score.
This function should return the AUC score as a float.
from sklearn.svm import SVC
def answer_seven():
# YOUR CODE HERE
vectorizer = TfidfVectorizer(min_df=5)
add_ftr_train = X_train.apply(lambda x: len(x))
add_ftr_test = X_test.apply(lambda x: len(x))
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
X_train_added = add_feature(X_train_vectorized, add_ftr_train)
X_test_added = add_feature(X_test_vectorized, add_ftr_test)
clf = SVC(C=10000)
clf.fit(X_train_added, y_train)
y_score = clf.decision_function(X_test_added)
# raise NotImplementedError()
return roc_auc_score(y_test, y_score)#Your answer here
answer_seven()
Returns:
0.9963202213809143
Question 8
What is the average number of digits per document for not spam and spam documents?
Hint: Use \d for the digit character class.
This function should return a tuple (average # digits not spam, average # digits spam).
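A mini demonstration of the hint (toy messages, not from spam_data):
import pandas as pd
toy = pd.Series(['call 08001234567 now', 'see you at 5'])
print(toy.str.count(r'\d'))   # 11 and 1 digits respectively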
def answer_eight():
# YOUR CODE HERE
not_spam_docs = spam_data[spam_data['target'] == 0]['text']
spam_docs = spam_data[spam_data['target'] == 1]['text']
avg1 = not_spam_docs.str.count('\d').mean()
avg2 = spam_docs.str.count('\d').mean()
# raise NotImplementedError()
return avg1, avg2#Your answer here
answer_eight()
Returns:
(0.2992746113989637, 15.759036144578314)
Question 9
Fit and transform the training data X_train using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than 5 and using word n-grams from n=1 to n=3 (unigrams, bigrams, and trigrams).
Using this document-term matrix and the following additional features:
- the length of document (number of characters)
- number of digits per document
fit a Logistic Regression model with regularization C=100 and max_iter=1000. Then compute the area under the curve (AUC) score using the transformed test data.
This function should return the AUC score as a float.
from sklearn.linear_model import LogisticRegression
def answer_nine():
# YOUR CODE HERE
vectorizer = TfidfVectorizer(min_df=5, ngram_range=(1,3))
add_ftr_train1 = X_train.apply(lambda x: len(x))
add_ftr_test1 = X_test.apply(lambda x: len(x))
add_ftr_train2 = X_train.str.count('\d')
add_ftr_test2 = X_test.str.count('\d')
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
X_train_added = add_feature(X_train_vectorized, add_ftr_train1)
X_test_added = add_feature(X_test_vectorized, add_ftr_test1)
X_train_added = add_feature(X_train_added, add_ftr_train2)
X_test_added = add_feature(X_test_added, add_ftr_test2)
clf = LogisticRegression(C=100, max_iter=1000)
clf.fit(X_train_added, y_train)
y_score = clf.predict_proba(X_test_added)[:, 1]
# raise NotImplementedError()
return roc_auc_score(y_test, y_score)#Your answer here
answer_nine()
Returns:
0.9973218681561211
Question 10
What is the average number of non-word characters (anything other than a letter, digit or underscore) per document for not spam and spam documents?
Hint: Use the \w and \W character classes.
This function should return a tuple (average # non-word characters not spam, average # non-word characters spam).
def answer_ten():
# YOUR CODE HERE
not_spam_docs = spam_data[spam_data['target'] == 0]['text']
spam_docs = spam_data[spam_data['target'] == 1]['text']
avg1 = not_spam_docs.str.count('\W').mean()
avg2 = spam_docs.str.count('\W').mean()
# raise NotImplementedError()
return avg1, avg2#Your answer here
answer_ten()
Returns:
(17.29181347150259, 29.041499330655956)
Question 11
Fit and transform the first 2000 rows of training data X_train using a Count Vectorizer ignoring terms that have a document frequency strictly lower than 5 and using character n-grams from n=2 to n=5.
To tell Count Vectorizer to use character n-grams pass in analyzer='char_wb'
which creates character n-grams only from text inside word boundaries. This should make the model more robust to spelling mistakes.
Using this document-term matrix and the following additional features:
- the length of document (number of characters)
- number of digits per document
- number of non-word characters (anything other than a letter, digit or underscore.)
fit a Logistic Regression model with regularization C=100 and max_iter=1000. Then compute the area under the curve (AUC) score using the transformed test data.
Also find the 10 smallest and 10 largest coefficients from the model and return them along with the AUC score in a tuple.
The list of 10 smallest coefficients should be sorted smallest first, the list of 10 largest coefficients should be sorted largest first.
The three features that were added to the document term matrix should have the following names should they appear in the list of coefficients: ['length_of_doc', 'digit_count', 'non_word_char_count']
This function should return a tuple (AUC score as a float, smallest coefs list, largest coefs list).
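A quick look at what analyzer='char_wb' produces (a toy string; the graded vocabulary will of course differ):
from sklearn.feature_extraction.text import CountVectorizer
toy_vect = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2))
toy_vect.fit(['spam'])
print(toy_vect.get_feature_names_out())
# [' s' 'am' 'm ' 'pa' 'sp'] -- each word is padded with spaces before character n-grams are taken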
def answer_eleven():
# YOUR CODE HERE
vectorizer = CountVectorizer(min_df=5, ngram_range=(2,5), analyzer='char_wb')
add_ftr_train1 = X_train[:2000].apply(lambda x: len(x))
add_ftr_train2 = X_train[:2000].str.count('\d')
add_ftr_train3 = X_train[:2000].str.count('\W')
add_ftr_test1 = X_test.apply(lambda x: len(x))
add_ftr_test2 = X_test.str.count('\d')
add_ftr_test3 = X_test.str.count('\W')
X_train_vectorized = vectorizer.fit_transform(X_train[:2000])
X_test_vectorized = vectorizer.transform(X_test)
X_train_added = add_feature(X_train_vectorized, add_ftr_train1)
X_test_added = add_feature(X_test_vectorized, add_ftr_test1)
X_train_added = add_feature(X_train_added, add_ftr_train2)
X_test_added = add_feature(X_test_added, add_ftr_test2)
X_train_added = add_feature(X_train_added, add_ftr_train3)
X_test_added = add_feature(X_test_added, add_ftr_test3)
clf = LogisticRegression(C=100, max_iter=1000)
clf.fit(X_train_added, y_train[:2000])
y_score = clf.predict_proba(X_test_added)[:, 1]
auc = roc_auc_score(y_test, y_score)
features = vectorizer.get_feature_names_out().tolist() + ['length_of_doc', 'digit_count', 'non_word_char_count']
coefs = clf.coef_.tolist()[0]
tupList = [tuple((ftr, coef)) for ftr, coef in zip(features, coefs)]
tupList_sorted = sorted(tupList, key=lambda x: x[1])
coef_smallest = [tup[0] for tup in tupList_sorted[:10]]
coef_largest = [tup[0] for tup in tupList_sorted[-10:][::-1]]
# raise NotImplementedError()
return auc, coef_smallest, coef_largest#Your answer here
answer_eleven()
Returns:
(0.997568035583926,
['n ', ' i', 'at', 'he', ' m', '..', 'us', 'go', ' lo', ' bu'],
['digit_count', 'ne', ' st', 'co', 's ', 'xt', 'lt', 'xt ', ' ne', 'der'])
Module 4: Topic Modeling - Assignment 4 - Document Similarity & Topic Modelling
Part 1 - Document Similarity
For the first part of this assignment, you will complete the functions doc_to_synsets and similarity_score, which will be used by document_path_similarity to find the path similarity between two documents.
The following functions are provided:
- convert_tag: converts the tag given by nltk.pos_tag to a tag used by wordnet.synsets. You will need to use this function in doc_to_synsets.
- document_path_similarity: computes the symmetrical path similarity between two documents by finding the synsets in each document using doc_to_synsets, then computing similarities using similarity_score.
You will need to finish writing the following functions:
- doc_to_synsets: returns a list of synsets in the document. This function should first tokenize and part-of-speech tag the document using nltk.word_tokenize and nltk.pos_tag. Then it should find each token's corresponding synset using wn.synsets(token, wordnet_tag). The first synset match should be used. If there is no match, that token is skipped.
- similarity_score: returns the normalized similarity score of a list of synsets (s1) onto a second list of synsets (s2). For each synset in s1, find the synset in s2 with the largest similarity value. Sum all of the largest similarity values together and normalize this value by dividing it by the number of largest similarity values found. Be careful with data types, which should be floats. Missing values should be ignored.
Once doc_to_synsets and similarity_score have been completed, submit to the autograder which will run a test to check that these functions are running correctly.
Do not modify the functions convert_tag and document_path_similarity.
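Before the graded code, a quick look at the WordNet primitives it relies on (a toy pair of words; this assumes the wordnet data downloaded in the cell below is available):
from nltk.corpus import wordnet as wn
cat = wn.synsets('cat', 'n')[0]    # first noun synset, Synset('cat.n.01')
dog = wn.synsets('dog', 'n')[0]
print(cat.path_similarity(dog))    # 0.2 -- closer to 1.0 means more similar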
%%capture
import numpy as np
import nltk
nltk.download('punkt')
from nltk.corpus import wordnet as wn
import pandas as pd
nltk.data.path.append("assets/")
def convert_tag(tag):
"""Convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets"""
tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
try:
return tag_dict[tag[0]]
except KeyError:
return None
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4') # Added
def doc_to_synsets(doc):
"""
Returns a list of synsets in document.
Tokenizes and tags the words in the document doc.
Then finds the first synset for each word/tag combination.
If a synset is not found for that combination it is skipped.
Args:
doc: string to be converted
Returns:
list of synsets
Example:
doc_to_synsets('Fish are friends.')
Out: [Synset('fish.n.01'), Synset('be.v.01'), Synset('friend.n.01')]
"""
# YOUR CODE HERE
tokens = nltk.word_tokenize(doc)
pos_tags = nltk.pos_tag(tokens)
wordnet_tags = [convert_tag(tag[1]) for tag in pos_tags]
# raise NotImplementedError()
return [wn.synsets(token, wordnet_tag)[0] for token, wordnet_tag in zip(tokens, wordnet_tags) if len(wn.synsets(token, wordnet_tag)) > 0]# Your Answer Here
print(doc_to_synsets('Fish are friends.'))
def similarity_score(s1, s2):
"""
Calculate the normalized similarity score of s1 onto s2
For each synset in s1, finds the synset in s2 with the largest similarity value.
Sum of all of the largest similarity values and normalize this value by dividing it by the
number of largest similarity values found.
Args:
s1, s2: list of synsets from doc_to_synsets
Returns:
normalized similarity score of s1 onto s2
Example:
synsets1 = doc_to_synsets('I like cats')
synsets2 = doc_to_synsets('I like dogs')
similarity_score(synsets1, synsets2)
Out: 0.7333333333333333
"""
# YOUR CODE HERE
max_similarity_values = []
for syn1 in s1:
similarity_values = [syn1.path_similarity(syn2) for syn2 in s2 if syn1.path_similarity(syn2) is not None]
if similarity_values:
max_similarity_values.append(max(similarity_values))
# raise NotImplementedError()
return np.mean(max_similarity_values)# Your Answer Here
synsets1 = doc_to_synsets('I like cats')
synsets2 = doc_to_synsets('I like dogs')
print(similarity_score(synsets1, synsets2))
Note: three lines of code (the nltk.download calls near the top of the code block above) need to be added manually here, otherwise an error occurs.
Returns:
[Synset('fish.n.01'), Synset('be.v.01'), Synset('friend.n.01')]
0.7333333333333334
def document_path_similarity(doc1, doc2):
"""Finds the symmetrical similarity between doc1 and doc2"""
synsets1 = doc_to_synsets(doc1)
synsets2 = doc_to_synsets(doc2)
return (similarity_score(synsets1, synsets2) + similarity_score(synsets2, synsets1)) / 2
paraphrases is a DataFrame which contains the following columns: Quality, D1, and D2.
Quality is an indicator variable which indicates whether the two documents D1 and D2 are paraphrases of one another (1 for paraphrase, 0 for not paraphrase).
# Use this dataframe for questions most_similar_docs and label_accuracy
paraphrases = pd.read_csv('assets/paraphrases.csv')
paraphrases.head()
most_similar_docs
Using document_path_similarity, find the pair of documents in paraphrases which has the maximum similarity score.
This function should return a tuple (D1, D2, similarity_score).
def most_similar_docs():
# YOUR CODE HERE
tupList = [tuple((D1, D2, document_path_similarity(D1, D2))) for D1, D2 in zip(paraphrases['D1'], paraphrases['D2'])]
# raise NotImplementedError()
return sorted(tupList, key=lambda x: x[2], reverse=True)[0]# Your Answer Here
most_similar_docs()
Returns:
('"Indeed, Iran should be put on notice that efforts to try to remake Iraq in their image will be aggressively put down," he said.',
'"Iran should be on notice that attempts to remake Iraq in Iran\'s image will be aggressively put down," he said.\n',
0.9590643274853801)
label_accuracy
Provide labels for the twenty pairs of documents by computing the similarity for each pair using document_path_similarity. Let the classifier rule be that if the score is greater than 0.75, the label is paraphrase (1); otherwise the label is not paraphrase (0). Report the accuracy of the classifier using scikit-learn's accuracy_score.
This function should return a float.
def label_accuracy():
from sklearn.metrics import accuracy_score
# YOUR CODE HERE
y_true = paraphrases['Quality']
y_pred = pd.Series([document_path_similarity(D1, D2) for D1, D2 in zip(paraphrases['D1'], paraphrases['D2'])]).apply(lambda x: 1 if x > 0.75 else 0)
# raise NotImplementedError()
return accuracy_score(y_true, y_pred)# Your Answer Here
label_accuracy()
Returns:
0.7
Part 2 - Topic Modelling
For the second part of this assignment, you will use Gensim's LDA (Latent Dirichlet Allocation) model to model topics in newsgroup_data. You will first need to finish the code in the cell below by using the gensim.models.ldamodel.LdaModel constructor to estimate LDA model parameters on the corpus, and save the result to the variable ldamodel. Extract 10 topics using corpus and id_map, and with passes=25 and random_state=34.
import pickle
import gensim
from sklearn.feature_extraction.text import CountVectorizer
# Load the list of documents
with open('assets/newsgroups', 'rb') as f:
newsgroup_data = pickle.load(f)
# Use CountVectorizor to find three letter tokens, remove stop_words,
# remove tokens that don't appear in at least 20 documents,
# remove tokens that appear in more than 20% of the documents
vect = CountVectorizer(min_df=20, max_df=0.2, stop_words='english',
token_pattern='(?u)\\b\\w\\w\\w+\\b')
# Fit and transform
X = vect.fit_transform(newsgroup_data)
# Convert sparse matrix to gensim corpus.
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
# Mapping from word IDs to words (To be used in LdaModel's id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())
# Use the gensim.models.ldamodel.LdaModel constructor to estimate
# LDA model parameters on the corpus, and save to the variable `ldamodel`
# YOUR CODE HERE
ldamodel = gensim.models.ldamodel.LdaModel(corpus, id2word=id_map, num_topics=10, passes=25, random_state=34)
# raise NotImplementedError()
lda_topics
Using ldamodel, find a list of the 10 topics and the most significant 10 words in each topic. This should be structured as a list of 10 tuples where each tuple takes on the form:
(9, '0.068*"space" + 0.036*"nasa" + 0.021*"science" + 0.020*"edu" + 0.019*"data" + 0.017*"shuttle" + 0.015*"launch" + 0.015*"available" + 0.014*"center" + 0.013*"information"')
for example.
This function should return a list of tuples.
def lda_topics():
# YOUR CODE HERE
# raise NotImplementedError()
return ldamodel.print_topics(num_topics=10)# Your Answer Here
lda_topics()
Returns:
[(0,
'0.056*"edu" + 0.043*"com" + 0.033*"thanks" + 0.022*"mail" + 0.021*"know" + 0.020*"does" + 0.014*"info" + 0.012*"monitor" + 0.010*"looking" + 0.010*"don"'),
(1,
'0.024*"ground" + 0.018*"current" + 0.018*"just" + 0.013*"want" + 0.013*"use" + 0.011*"using" + 0.011*"used" + 0.010*"power" + 0.010*"speed" + 0.010*"output"'),
(2,
'0.061*"drive" + 0.042*"disk" + 0.033*"scsi" + 0.030*"drives" + 0.028*"hard" + 0.028*"controller" + 0.027*"card" + 0.020*"rom" + 0.018*"floppy" + 0.017*"bus"'),
(3,
'0.023*"time" + 0.015*"atheism" + 0.014*"list" + 0.013*"left" + 0.012*"alt" + 0.012*"faq" + 0.012*"probably" + 0.011*"know" + 0.011*"send" + 0.010*"months"'),
(4,
'0.025*"car" + 0.016*"just" + 0.014*"don" + 0.014*"bike" + 0.012*"good" + 0.011*"new" + 0.011*"think" + 0.010*"year" + 0.010*"cars" + 0.010*"time"'),
(5,
'0.030*"game" + 0.027*"team" + 0.023*"year" + 0.017*"games" + 0.016*"play" + 0.012*"season" + 0.012*"players" + 0.012*"win" + 0.011*"hockey" + 0.011*"good"'),
(6,
'0.017*"information" + 0.014*"help" + 0.014*"medical" + 0.012*"new" + 0.012*"use" + 0.012*"000" + 0.012*"research" + 0.011*"university" + 0.010*"number" + 0.010*"program"'),
(7,
'0.022*"don" + 0.021*"people" + 0.018*"think" + 0.017*"just" + 0.012*"say" + 0.011*"know" + 0.011*"does" + 0.011*"good" + 0.010*"god" + 0.009*"way"'),
(8,
'0.034*"use" + 0.023*"apple" + 0.020*"power" + 0.016*"time" + 0.015*"data" + 0.015*"software" + 0.012*"pin" + 0.012*"memory" + 0.012*"simms" + 0.011*"port"'),
(9,
'0.068*"space" + 0.036*"nasa" + 0.021*"science" + 0.020*"edu" + 0.019*"data" + 0.017*"shuttle" + 0.015*"launch" + 0.015*"available" + 0.014*"center" + 0.014*"sci"')]
topic_distribution
For the new document new_doc, find the topic distribution. Remember to use vect.transform on the new doc, and Sparse2Corpus to convert the sparse matrix to a gensim corpus.
This function should return a list of tuples, where each tuple is (#topic, probability).
new_doc = ["\n\nIt's my understanding that the freezing will start to occur because \
of the\ngrowing distance of Pluto and Charon from the Sun, due to it's\nelliptical orbit. \
It is not due to shadowing effects. \n\n\nPluto can shadow Charon, and vice-versa.\n\nGeorge \
Krumins\n-- "]
def topic_distribution():
# YOUR CODE HERE
X = vect.transform(new_doc)
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
# raise NotImplementedError()
return list(ldamodel.get_document_topics(corpus))[0]# Your Answer Here
topic_distribution()
Returns:
[(0, 0.020003108),
(1, 0.020003324),
(2, 0.020001281),
(3, 0.49674824),
(4, 0.020004038),
(5, 0.020004129),
(6, 0.020002972),
(7, 0.020002645),
(8, 0.020003129),
(9, 0.34322715)]
topic_names
From the list of the following given topics, assign topic names to the topics you found. If none of these names best matches the topics you found, create a new 1-3 word "title" for the topic.
Topics: Health, Science, Automobiles, Politics, Government, Travel, Computers & IT, Sports, Business, Society & Lifestyle, Religion, Education.
This function should return a list of 10 strings.
def topic_names():
# YOUR CODE HERE
# raise NotImplementedError()
return ['Education','Science','Computers & IT','Religion','Automobiles','Sports','Science','Religion','Computers & IT','Science']# Your Answer Here