These are study notes recording all of the assignment code for Course Four: Applied Text Mining in Python, part of the Coursera specialization Applied Data Science with Python offered by the University of Michigan. All assignments have passed the autograder tests with a score of 100/100.
目录
Module 1: Working with Text in Python - Assignment 1
Module 2: Basic Natural Language Processing - Assignment 2 - Introduction to NLTK
Part 1 - Analyzing Plots Summary Text
Module 3: Classification of Text - Assignment 3
Module 4: Topic Modeling - Assignment 4 - Document Similarity & Topic Modelling
Module 1: Working with Text in Python - Assignment 1
In this assignment, you'll be working with messy medical data and using regex to extract relevant information from the data.
Each line of the dates.txt file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats.
The goal of this assignment is to correctly identify all of the different date variants encoded in this dataset and to properly normalize and sort the dates.
Here is a list of some of the variants you might encounter in this dataset:
- 04/20/2009; 04/20/09; 4/20/09; 4/3/09
- Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
- 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
- Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
- Feb 2009; Sep 2009; Oct 2010
- 6/2008; 12/2009
- 2009; 2010
Once you have extracted these date patterns from the text, the next step is to sort them in ascending chronological order according to the following rules:
- Assume all dates in xx/xx/xx format are mm/dd/yy
- Assume all dates where year is encoded in only two digits are years from the 1900's (e.g. 1/5/89 is January 5th, 1989)
- If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
- If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).
- Watch out for potential typos as this is a raw, real-life derived dataset.
With these rules in mind, find the correct date in each note and return a pandas Series in chronological order of the original Series' indices. This Series should be sorted by a tie-break sort in the format of ("extracted date", "original row number").
For example if the original series was this:
0 1999
1 2010
2 1978
3 2015
4 1985
Your function should return this:
0 2
1 4
2 0
3 1
4 3
Your score will be calculated using Kendall's tau, a correlation measure for ordinal data.
This function should return a Series of length 500 and dtype int.
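Before the full solution, here is a minimal sketch of the tie-break sort and the Kendall's tau scoring on the toy example above (the use of pd.to_datetime and scipy.stats.kendalltau here is an illustrative assumption, not part of the assignment code):
import pandas as pd
from scipy import stats
toy = pd.Series(['1999', '2010', '1978', '2015', '1985'])
parsed = pd.to_datetime(toy)                       # year-only strings become January 1 of that year
order = pd.Series(parsed.sort_values(kind='stable').index)
print(order.tolist())                              # [2, 4, 0, 1, 3], matching the expected output above
tau, _ = stats.kendalltau(order, [2, 4, 0, 1, 3])  # Kendall's tau compares two orderings
print(tau)                                         # 1.0 for a perfect match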
import pandas as pd
doc = []
with open('assets/dates.txt') as file:
for line in file:
doc.append(line)
df = pd.Series(doc)
df.head(10)
def date_sorter():
#order = None
# YOUR CODE HERE
df_ = df.copy()
month_dict = {
'Jan':1, 'Feb':2, 'Mar':3, 'Apr':4, 'May':5, 'Jun':6,
'Jul':7, 'Aug':8, 'Sep':9, 'Oct':10, 'Nov':11, 'Dec':12,
'Janaury':1, 'January':1, 'February':2, 'March':3, 'April':4,'June':6, 'July':7, 'August':8,
'September':9, 'October':10, 'November':11, 'December':12, 'Decemeber':12
}
patterns = [
r'(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>(?:\d{4}|\d{2}))\b',
r'(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)(?:[\-\.,]? )(?P<day>\d{2}[a-z]{0,2}),? (?P<year>\d{4})',
r'(?P<day>\d{2}) (?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)[(?:. )(?:, )](?P<year>\d{4})',
r'[A-Za-z0-9]{1}(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*),? (?P<year>\d{4})',
r'[^0-9],? (?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*),? (?P<year>\d{4})',
r'[^/0-9](?P<month>\d{1,2})/(?P<year>\d{4})',
r'^(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*),? (?P<year>\d{4})',
r'[\(\.\"](?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*),? (?P<year>\d{4})',
r'^(?P<month>\d{1,2})[/-](?P<year>\d{4})',
r'[^0-9a-z], (?P<year>\d{4})[^0-9]', #
r'^(?P<year>\d{4})',
r'[A-Za-z\.\(~]{1}(?P<year>\d{4})',
r'Age,? \d{1,2}, (?P<year>\d{4})',
r'(?!Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|ury|ary|rch|ril|une|uly|ust|ber|\d{3}$)[a-zA-Z:;]{3} (?P<year>\d{4})',
r'[Ii]n (?P<year>\d{4})',
r' - (?P<year>\d{4})',
r'\d{3} (?P<year>\d{4})'
]
dates = []
for idx, pattern in enumerate(patterns):
date = df_.str.extractall(pattern)
dates.append(date)
dates_df = pd.concat(dates).sort_index()
dates_df['day'] = dates_df['day'].fillna(1)
dates_df['day'] = dates_df['day'].astype('int').astype('str')
dates_df['month'] = dates_df['month'].fillna('January')
dates_df['month'].replace(month_dict, inplace=True)
dates_df['month'] = dates_df['month'].astype('int').astype('str')
dates_df['year'] = dates_df['year'].apply(lambda x: '19'+x if len(x)==2 else x)
dates_df['year'] = dates_df['year'].astype('int')
dates_df = dates_df[dates_df['year']<=2023]
dates_df['year'] = dates_df['year'].astype('str')
extracted_df = dates_df.droplevel(level='match')
extracted_df['date'] = extracted_df['month'] + '/' + extracted_df['day'] + '/' + extracted_df['year']
times_df = pd.to_datetime(extracted_df['date'])
order = pd.Series(times_df.sort_values(kind='stable').index)
# raise NotImplementedError()
return order # Your answer here
Note: I am not particularly familiar with regular expressions, so the patterns list above likely has considerable room for optimization.
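As a quick illustration of the core mechanism (the sample strings below are made up, not taken from dates.txt), str.extractall with named groups is what produces the month/day/year columns that the code above concatenates:
import pandas as pd
sample = pd.Series(['Visit on 04/20/2009.', 'Total time: 6/2008 to present'])
pattern = r'(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{2,4})\b'
print(sample.str.extractall(pattern))
# -> one matching row with month=04, day=20, year=2009, indexed by (original row, match number);
#    the second string is only caught by one of the later, more permissive patterns.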
The following code can be used to self-check the result returned by date_sorter() for errors:
import numpy as np
s_test = date_sorter()
# check if running the code twice produces the same result
try:
assert (date_sorter() == s_test).all()
print("Passed repeatability check")
except:
print("Failed repeatability check")
# check if the result has the expected index
try:
assert type(date_sorter().index) == pd.RangeIndex
assert (date_sorter().index == pd.RangeIndex(start=0, stop=500, step=1)).all()
print("Passed index check")
except:
print("Failed index check")
# check the tie-break sort for a sample of records where some have the same date
# note that this only tests a sample and does not check the entire answer
try:
i_test = [s_test.index[s_test == v].values[0]
for v in [318, 369, 493, 252, 314, 410, 490]]
assert sorted(i_test) == i_test
print("Passed secondary sort sample check")
except:
print("Failed secondary sort sample check")
# check if the parsed dates appear to be correct and correctly sorted
# by producing some test checksums
# if you get for example a False entry in the agree column for
# index value 20 that would mean you have at least one incorrectly
# parsed or incorrectly sorted date in the **output** index
# range 20,21,...,29
try:
v_check = pd.DataFrame({'correct':
[6695, 14428, 16742, 9275, 12290, 14654, 9421, 10185, 11464, 16491,
11797, 14036, 15459, 9412, 13069, 10400, 10498, 14322, 13274, 11001,
11383, 11910, 10977, 9692, 10199, 10187, 15456, 13491, 9186, 13646,
11142, 13724, 10994, 12905, 15968, 16648, 13966, 14607, 16932, 14622,
17942, 18220, 17818, 18305, 19633, 12522, 13978, 18445, 20156, 14797],
'learner':[
(s_test.iloc[10*i:(i+1)*10].values * np.array(range(1,11))).sum() for i in range(50)]},
index=range(0,500,10)).assign(agree=lambda x:x['correct']==x['learner'])
print("Values checksums:")
print(v_check)
assert v_check['agree'].all()
print("Passed values check")
except:
print("Failed values check")
If all checks pass, the output is:
Passed repeatability check
Passed index check
Passed secondary sort sample check
Values checksums:
correct learner agree
0 6695 6695 True
10 14428 14428 True
20 16742 16742 True
30 9275 9275 True
40 12290 12290 True
50 14654 14654 True
60 9421 9421 True
70 10185 10185 True
80 11464 11464 True
90 16491 16491 True
100 11797 11797 True
110 14036 14036 True
120 15459 15459 True
130 9412 9412 True
140 13069 13069 True
150 10400 10400 True
160 10498 10498 True
170 14322 14322 True
180 13274 13274 True
190 11001 11001 True
200 11383 11383 True
210 11910 11910 True
220 10977 10977 True
230 9692 9692 True
240 10199 10199 True
250 10187 10187 True
260 15456 15456 True
270 13491 13491 True
280 9186 9186 True
290 13646 13646 True
300 11142 11142 True
310 13724 13724 True
320 10994 10994 True
330 12905 12905 True
340 15968 15968 True
350 16648 16648 True
360 13966 13966 True
370 14607 14607 True
380 16932 16932 True
390 14622 14622 True
400 17942 17942 True
410 18220 18220 True
420 17818 17818 True
430 18305 18305 True
440 19633 19633 True
450 12522 12522 True
460 13978 13978 True
470 18445 18445 True
480 20156 20156 True
490 14797 14797 True
Passed values check
If the agree column is False at index i, then at least one of output rows i through i+9 contains an incorrectly parsed or incorrectly sorted date (for example, i=20 means rows 20, 21, ..., 29 contain an error).
Module 2: Basic Natural Language Processing - Assignment 2 - Introduction to NLTK
In part 1 of this assignment you will use nltk to explore the CMU Movie Summary Corpus. All data is released under a Creative Commons Attribution-ShareAlike License. Then in part 2 you will create a spelling recommender function that uses nltk to find words similar to the misspelling.
Part 1 - Analyzing Plots Summary Text
import nltk
import pandas as pd
import numpy as np
nltk.data.path.append("assets/")
# If you would like to work with the raw text you can use 'plots_raw'
with open('assets/plots.txt', 'rt', encoding="utf8") as f:
plots_raw = f.read()
# If you would like to work with the plot summaries in nltk.Text format you can use 'text1'.
plots_tokens = nltk.word_tokenize(plots_raw)
text1 = nltk.Text(plots_tokens)
Example 1
How many tokens (words and punctuation symbols) are in text1?
This function should return an integer.
def example_one():
return len(nltk.word_tokenize(plots_raw)) # or alternatively len(text1)
example_one()
Returns:
374441
Example 2
How many unique tokens (unique words and punctuation) does text1 have?
This function should return an integer.
def example_two():
return len(set(nltk.word_tokenize(plots_raw))) # or alternatively len(set(text1))
example_two()
Returns:
25933
Example 3
After lemmatizing the verbs, how many unique tokens does text1 have?
This function should return an integer.
from nltk.stem import WordNetLemmatizer
def example_three():
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w,'v') for w in text1]
return len(set(lemmatized))
example_three()
Returns:
21760
Question 1
What is the lexical diversity of the given text input? (i.e. ratio of unique tokens to the total number of tokens)
This function should return a float.
def answer_one():
# YOUR CODE HERE
# raise NotImplementedError()
return example_two() / example_one()# your answer here
answer_one()
Returns:
0.06925790712021386
Question 2
What percentage of tokens is 'love' or 'Love'?
This function should return a float.
def answer_two():
# YOUR CODE HERE
dist = nltk.FreqDist(text1)
# raise NotImplementedError()
return (dist['love'] + dist['Love']) / example_one() * 100# Your answer here
answer_two()
Returns:
0.12391805384559917
Question 3
What are the 20 most frequently occurring (unique) tokens in the text? What is their frequency?
This function should return a list of 20 tuples where each tuple is of the form (token, frequency). The list should be sorted in descending order of frequency.
def answer_three():
# YOUR CODE HERE
dist = nltk.FreqDist(text1)
topFreq = sorted(dist, key=lambda x: dist[x], reverse=True)[:20]
tupList = [tuple((token, dist[token])) for token in topFreq]
# raise NotImplementedError()
return tupList # Your answer here
answer_three()
Returns:
[(',', 19420),
('the', 18698),
('.', 16624),
('to', 12149),
('and', 11400),
('a', 8979),
('of', 6510),
('is', 5699),
('in', 5109),
('his', 4693),
("'s", 3682),
('her', 3674),
('he', 3556),
('that', 3517),
('with', 3293),
('him', 2570),
('for', 2433),
('by', 2321),
('The', 2234),
('on', 1925)]
Question 4
What tokens have a length of greater than 5 and frequency of more than 200?
This function should return an alphabetically sorted list of the tokens that match the above constraints. To sort your list, use sorted()
def answer_four():
# YOUR CODE HERE
dist = nltk.FreqDist(text1)
occurList = [token for token in dist if len(token) > 5 and dist[token] > 200]
# raise NotImplementedError()
return sorted(occurList)# Your answer here
answer_four()
Returns:
['However',
'Meanwhile',
'another',
'because',
'becomes',
'before',
'begins',
'daughter',
'decides',
'escape',
'family',
'father',
'friend',
'friends',
'himself',
'killed',
'leaves',
'mother',
'people',
'police',
'returns',
'school',
'through']
Question 5
Find the longest token in text1 and that token's length.
This function should return a tuple (longest_word, length).
def answer_five():
# YOUR CODE HERE
longest_word = sorted(text1, key=lambda x: len(x), reverse=True)[0]
# raise NotImplementedError()
return longest_word, len(longest_word)# Your answer here
answer_five()
Returns:
('live-for-today-for-tomorrow-we-die', 34)
Question 6
What unique words have a frequency of more than 2000? What is their frequency?
"Hint: you may want to use isalpha()
to check if the token is a word and not punctuation."
This function should return a list of tuples of the form (frequency, word)
sorted in descending order of frequency.
def answer_six():
# YOUR CODE HERE
dist = nltk.FreqDist(text1)
tupList = [tuple((frequency, word)) for word, frequency in dist.items() if frequency > 2000 and word.isalpha()]
# raise NotImplementedError()
return sorted(tupList, key=lambda x: x[0], reverse=True)# Your answer here
answer_six()
Returns:
[(18698, 'the'),
(12149, 'to'),
(11400, 'and'),
(8979, 'a'),
(6510, 'of'),
(5699, 'is'),
(5109, 'in'),
(4693, 'his'),
(3674, 'her'),
(3556, 'he'),
(3517, 'that'),
(3293, 'with'),
(2570, 'him'),
(2433, 'for'),
(2321, 'by'),
(2234, 'The')]
Question 7
text1 is in nltk.Text format and has been constructed using the tokens output by nltk.word_tokenize(plots_raw).
Now, use nltk.sent_tokenize on the tokens in text1 by joining them using whitespace to output a sentence-tokenized copy of text1. Report the average number of whitespace-separated tokens per sentence in the sentence-tokenized copy of text1.
This function should return a float.
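A toy illustration of the join-then-sentence-tokenize approach described above (made-up tokens, not taken from text1):
import nltk
toy_tokens = ['Fish', 'are', 'friends', '.', 'Not', 'food', '.']
sentences = nltk.sent_tokenize(' '.join(toy_tokens))
print(sentences)                                                    # typically ['Fish are friends .', 'Not food .']
print(sum(len(s.split(' ')) for s in sentences) / len(sentences))   # average whitespace-separated tokens per sentence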
def answer_seven():
# YOUR CODE HERE
num_sent, num_whsp = 0, 0
for sent in nltk.sent_tokenize(' '.join(list(text1))):
num_sent += 1
num_whsp += len(sent.split(' '))
# raise NotImplementedError()
#return sent0, words
return num_whsp / num_sent# Your answer here
answer_seven()
Returns:
22.260329350216992
Question 8
What are the 5 most frequent parts of speech in text1? What is their frequency?
This function should return a list of tuples of the form (part_of_speech, frequency) sorted in descending order of frequency.
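A tiny illustration of what nltk.pos_tag returns (toy sentence; the exact tags can vary slightly with the tagger version):
import nltk
print(nltk.pos_tag(nltk.word_tokenize('The dog runs fast.')))
# roughly [('The', 'DT'), ('dog', 'NN'), ('runs', 'VBZ'), ('fast', 'RB'), ('.', '.')]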
def answer_eight():
# YOUR CODE HERE
dist = nltk.FreqDist(pos for _, pos in nltk.pos_tag(text1))
posFreq = sorted(dist, key=lambda x: dist[x], reverse=True)[:5]
tupList = [tuple((pos, dist[pos])) for pos in posFreq]
# raise NotImplementedError()
return tupList# Your answer here
answer_eight()
Returns:
[('NN', 51452), ('IN', 39225), ('NNP', 38361), ('DT', 34471), ('VBZ', 23799)]
Part 2 - Spelling Recommender
For this part of the assignment you will create three different spelling recommenders that each take a list of misspelled words and recommend a correctly spelled word for every word in the list.
For every misspelled word, the recommender should find the word in correct_spellings that has the shortest distance*, and starts with the same letter as the misspelled word, and return that word as a recommendation.
*Each of the three different recommenders will use a different distance measure (outlined below).
Each of the recommenders should provide recommendations for the three default words provided: ['cormulent', 'incendenece', 'validrate'].
from nltk.corpus import words
correct_spellings = words.words()
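Questions 9 and 10 below rely on two NLTK building blocks, nltk.util.ngrams and nltk.metrics.distance.jaccard_distance. A quick toy check (the word pair is chosen for illustration only):
from nltk.util import ngrams
from nltk.metrics.distance import jaccard_distance
set1 = set(ngrams('cormulent', 3))   # character trigrams, e.g. ('c','o','r'), ('o','r','m'), ...
set2 = set(ngrams('corpulent', 3))
print(jaccard_distance(set1, set2))  # 0.6 -- a smaller distance means more similar spellings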
Question 9
For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:
Jaccard distance on the trigrams of the two words.
This function should return a list of length three: ['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation'].
def answer_nine(entries=['cormulent', 'incendenece', 'validrate']):
# your code goes here
# YOUR CODE HERE
from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams
recommendations = []
tri_grams = lambda x: set(ngrams(x, 3))
jaccard_ = lambda xs, y: [jaccard_distance(tri_grams(x), tri_grams(y)) for x in xs]
recommendations = []
for entry in entries:
correct_spellings_ = [correct_spelling for correct_spelling in correct_spellings if correct_spelling[0] == entry[0]]
recommendations.append(correct_spellings_[np.argmin(jaccard_(correct_spellings_, entry))])
# raise NotImplementedError()
return recommendations# Your answer here
answer_nine()
Returns:
['corpulent', 'indecence', 'validate']
Question 10
For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:
Jaccard distance on the 4-grams of the two words.
This function should return a list of length three: ['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation'].
def answer_ten(entries=['cormulent', 'incendenece', 'validrate']):
# YOUR CODE HERE
from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams
four_grams = lambda x: set(ngrams(x, 4))
jaccard_ = lambda xs, y: [jaccard_distance(four_grams(x), four_grams(y)) for x in xs]
recommendations = []
for entry in entries:
correct_spellings_ = [correct_spelling for correct_spelling in correct_spellings if correct_spelling[0] == entry[0]]
recommendations.append(correct_spellings_[np.argmin(jaccard_(correct_spellings_, entry))])
# raise NotImplementedError()
return recommendations# Your answer here
answer_ten()
Returns:
['cormus', 'incendiary', 'valid']
Question 11
For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:
Edit distance on the two words with transpositions.
This function should return a list of length three: ['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation'].
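For reference, a quick check of the distance used here on a toy pair (not a graded answer):
from nltk.metrics.distance import edit_distance
# Deleting the stray 'r' turns 'validrate' into 'validate', so the distance is 1.
print(edit_distance('validrate', 'validate', transpositions=True))  # 1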
def answer_eleven(entries=['cormulent', 'incendenece', 'validrate']):
# YOUR CODE HERE
from nltk.metrics.distance import edit_distance
edit_ = lambda x, y: edit_distance(x, y, substitution_cost=2, transpositions=True)
recommendations = [correct_spellings[np.argmin([edit_(correct_spelling, entry) for correct_spelling in correct_spellings])] for entry in entries]
# raise NotImplementedError()
return recommendations# Your answer here
answer_eleven()
Returns:
['corpulent', 'intendence', 'validate']
Module 3: Classification of Text - Assignment 3
In this assignment you will explore text message data and create models to predict if a message is spam or not.
import pandas as pd
import numpy as np
spam_data = pd.read_csv('assets/spam.csv')
spam_data['target'] = np.where(spam_data['target']=='spam',1,0)
spam_data.head(10)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(spam_data['text'],
spam_data['target'],
random_state=0)
Question 1
What percentage of the documents in spam_data are spam?
This function should return a float, the percent value (i.e. ratio * 100).
def answer_one():
# YOUR CODE HERE
# raise NotImplementedError()
return len(spam_data[spam_data['target'] == 1]) / len(spam_data) * 100 #Your answer here
answer_one()
Returns:
13.406317300789663
Question 2
Fit the training data X_train using a Count Vectorizer with default parameters.
What is the longest token in the vocabulary?
This function should return a string.
from sklearn.feature_extraction.text import CountVectorizer
def answer_two():
# YOUR CODE HERE
max_length = 0
max_token = ''
vectorizer = CountVectorizer()
vectorizer.fit(X_train)
for token in vectorizer.get_feature_names_out():
token_length = len(token)
if token_length > max_length:
max_length = token_length
max_token = token
# raise NotImplementedError()
return max_token#Your answer here
answer_two()
Returns:
'com1win150ppmx3age16subscription'
Question 3
Fit and transform the training data X_train using a Count Vectorizer with default parameters.
Next, fit a multinomial Naive Bayes classifier model with smoothing alpha=0.1. Find the area under the curve (AUC) score using the transformed test data.
This function should return the AUC score as a float.
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score
def answer_three():
# YOUR CODE HERE
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
clf = MultinomialNB(alpha=0.1)
clf.fit(X_train_vectorized, y_train)
y_pred = clf.predict(X_test_vectorized)
# raise NotImplementedError()
return roc_auc_score(y_test, y_pred)#Your answer here
answer_three()
Returns:
0.9720812182741116
Question 4
Fit and transform the training data X_train using a Tfidf Vectorizer with default parameters. The transformed data will be a compressed sparse row matrix where the number of rows is the number of documents in X_train, the number of columns is the number of features found by the vectorizer, and each value in the sparse matrix is the tf-idf value. First find the max tf-idf value for every feature.
What 20 features have the smallest tf-idf and what 20 have the largest tf-idf among the max tf-idf values?
Put these features in two series where each series is sorted by tf-idf value. The index of the series should be the feature name, and the data should be the tf-idf.
The series of 20 features with the smallest tf-idfs should be sorted smallest tf-idf first; the series of 20 features with the largest tf-idfs should be sorted largest first. Any entries with identical tf-idfs should appear in lexicographically increasing order by their feature name in both series. For example, if the features "a", "b", "c" had the tf-idfs 1.0, 0.5, 1.0 in the series with the largest tf-idfs, then they should occur in the returned result in the order "a", "c", "b" with values 1.0, 1.0, 0.5.
This function should return a tuple of two series (smallest tf-idfs series, largest tf-idfs series).
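One way to get the required tie-break ordering (a sketch, not necessarily the grader's reference implementation): pandas sorts can be made stable, so sorting by feature name first and then by value keeps ties in lexicographic order.
import pandas as pd
max_tfidf = pd.Series([1.0, 0.5, 1.0], index=['a', 'b', 'c'])
largest = max_tfidf.sort_index().sort_values(ascending=False, kind='stable')
print(largest)   # a 1.0, c 1.0, b 0.5 -> ties broken lexicographically, as in the example above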
from sklearn.feature_extraction.text import TfidfVectorizer
def answer_four():
# YOUR CODE HERE
vectorizer = TfidfVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
features = vectorizer.get_feature_names_out()
max_tf_idf_values = np.max(X_train_vectorized.toarray(), axis=0)
tupList = [tuple((ftr, val)) for ftr, val in zip(features, max_tf_idf_values)]
tupList_sorted = sorted(tupList, key=lambda x: x[1])
index1, values1 = [tup[0] for tup in tupList_sorted[:20]], [tup[1] for tup in tupList_sorted[:20]]
index2, values2 = [tup[0] for tup in tupList_sorted[-20:]][::-1], [tup[1] for tup in tupList_sorted[-20:]][::-1]
Series1, Series2 = pd.Series(values1, index=index1), pd.Series(values2, index=index2)
# raise NotImplementedError()
return Series1, Series2 #Your answer here
answer_four()
Returns:
(aaniye 0.074475
athletic 0.074475
chef 0.074475
companion 0.074475
courageous 0.074475
dependable 0.074475
determined 0.074475
exterminator 0.074475
healer 0.074475
listener 0.074475
organizer 0.074475
pest 0.074475
psychiatrist 0.074475
psychologist 0.074475
pudunga 0.074475
stylist 0.074475
sympathetic 0.074475
venaam 0.074475
afternoons 0.091250
approaching 0.091250
dtype: float64,
yup 1.000000
where 1.000000
too 1.000000
thanx 1.000000
thank 1.000000
okie 1.000000
ok 1.000000
nite 1.000000
lei 1.000000
home 1.000000
havent 1.000000
er 1.000000
done 1.000000
beerage 1.000000
anytime 1.000000
anything 1.000000
645 1.000000
146tf150p 1.000000
tick 0.980166
blank 0.932702
dtype: float64)
Question 5
Fit and transform the training data X_train using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than 3.
Then fit a multinomial Naive Bayes classifier model with smoothing alpha=0.1 and compute the area under the curve (AUC) score using the transformed test data.
This function should return the AUC score as a float.
def answer_five():
# YOUR CODE HERE
vectorizer = TfidfVectorizer(min_df=3)
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
clf = MultinomialNB(alpha=0.1)
clf.fit(X_train_vectorized, y_train)
y_score = clf.predict_proba(X_test_vectorized)[:, 1]
# raise NotImplementedError()
return roc_auc_score(y_test, y_score)#Your answer here
answer_five()
Returns:
0.9954968337775665
Question 6
What is the average length of documents (number of characters) for not spam and spam documents?
This function should return a tuple (average length not spam, average length spam).
def answer_six():
# YOUR CODE HERE
not_spam_docs = spam_data[spam_data['target'] == 0]['text']
spam_docs = spam_data[spam_data['target'] == 1]['text']
avg1 = not_spam_docs.apply(lambda x: len(x)).mean()
avg2 = spam_docs.apply(lambda x: len(x)).mean()
# raise NotImplementedError()
return avg1, avg2#Your answer here
answer_six()
Returns:
(71.02362694300518, 138.8661311914324)
The following function has been provided to help you combine new features into the training data:
def add_feature(X, feature_to_add):
"""
Returns sparse feature matrix with added feature.
feature_to_add can also be a list of features.
"""
from scipy.sparse import csr_matrix, hstack
return hstack([X, csr_matrix(feature_to_add).T], 'csr')
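A toy usage of add_feature (the shapes and values are made up for illustration):
import numpy as np
from scipy.sparse import csr_matrix
X_toy = csr_matrix(np.ones((3, 4)))            # 3 documents, 4 vocabulary features
doc_lengths = [10, 25, 7]                      # one extra feature per document
print(add_feature(X_toy, doc_lengths).shape)   # (3, 5) -- the new column is appended on the right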
Question 7
Fit and transform the training data X_train using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than 5.
Using this document-term matrix and an additional feature, the length of document (number of characters), fit a Support Vector Classification model with regularization C=10000. Then compute the area under the curve (AUC) score using the transformed test data.
Hint: Since probability is set to false, use the model's decision_function on the test data when calculating the target scores to use in roc_auc_score.
This function should return the AUC score as a float.
from sklearn.svm import SVC
def answer_seven():
# YOUR CODE HERE
vectorizer = TfidfVectorizer(min_df=5)
add_ftr_train = X_train.apply(lambda x: len(x))
add_ftr_test = X_test.apply(lambda x: len(x))
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
X_train_added = add_feature(X_train_vectorized, add_ftr_train)
X_test_added = add_feature(X_test_vectorized, add_ftr_test)
clf = SVC(C=10000)
clf.fit(X_train_added, y_train)
y_score = clf.decision_function(X_test_added)
# raise NotImplementedError()
return roc_auc_score(y_test, y_score)#Your answer here
answer_seven()
Returns:
0.9963202213809143
Question 8
What is the average number of digits per document for not spam and spam documents?
Hint: Use \d for the digit character class.
This function should return a tuple (average # digits not spam, average # digits spam).
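A mini demonstration of the hint (toy messages, not from spam_data):
import pandas as pd
toy = pd.Series(['call 08001234567 now', 'see you at 5'])
print(toy.str.count(r'\d'))   # 11 and 1 digits respectively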
def answer_eight():
# YOUR CODE HERE
not_spam_docs = spam_data[spam_data['target'] == 0]['text']
spam_docs = spam_data[spam_data['target'] == 1]['text']
avg1 = not_spam_docs.str.count('\d').mean()
avg2 = spam_docs.str.count('\d').mean()
# raise NotImplementedError()
return avg1, avg2#Your answer here
answer_eight()
Returns:
(0.2992746113989637, 15.759036144578314)
Question 9
Fit and transform the training data X_train using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than 5 and using word n-grams from n=1 to n=3 (unigrams, bigrams, and trigrams).
Using this document-term matrix and the following additional features:
- the length of document (number of characters)
- number of digits per document
fit a Logistic Regression model with regularization C=100 and max_iter=1000. Then compute the area under the curve (AUC) score using the transformed test data.
This function should return the AUC score as a float.
from sklearn.linear_model import LogisticRegression
def answer_nine():
# YOUR CODE HERE
vectorizer = TfidfVectorizer(min_df=5, ngram_range=(1,3))
add_ftr_train1 = X_train.apply(lambda x: len(x))
add_ftr_test1 = X_test.apply(lambda x: len(x))
add_ftr_train2 = X_train.str.count('\d')
add_ftr_test2 = X_test.str.count('\d')
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
X_train_added = add_feature(X_train_vectorized, add_ftr_train1)
X_test_added = add_feature(X_test_vectorized, add_ftr_test1)
X_train_added = add_feature(X_train_added, add_ftr_train2)
X_test_added = add_feature(X_test_added, add_ftr_test2)
clf = LogisticRegression(C=100, max_iter=1000)
clf.fit(X_train_added, y_train)
y_score = clf.predict_proba(X_test_added)[:, 1]
# raise NotImplementedError()
return roc_auc_score(y_test, y_score)#Your answer here
answer_nine()
Returns:
0.9973218681561211
Question 10
What is the average number of non-word characters (anything other than a letter, digit or underscore) per document for not spam and spam documents?
Hint: Use the \w and \W character classes.
This function should return a tuple (average # non-word characters not spam, average # non-word characters spam).
def answer_ten():
# YOUR CODE HERE
not_spam_docs = spam_data[spam_data['target'] == 0]['text']
spam_docs = spam_data[spam_data['target'] == 1]['text']
avg1 = not_spam_docs.str.count('\W').mean()
avg2 = spam_docs.str.count('\W').mean()
# raise NotImplementedError()
return avg1, avg2#Your answer here
answer_ten()
Returns:
(17.29181347150259, 29.041499330655956)
Question 11
Fit and transform the first 2000 rows of training data X_train using a Count Vectorizer ignoring terms that have a document frequency strictly lower than 5 and using character n-grams from n=2 to n=5.
To tell Count Vectorizer to use character n-grams pass in analyzer='char_wb'
which creates character n-grams only from text inside word boundaries. This should make the model more robust to spelling mistakes.
Using this document-term matrix and the following additional features:
- the length of document (number of characters)
- number of digits per document
- number of non-word characters (anything other than a letter, digit or underscore.)
fit a Logistic Regression model with regularization C=100 and max_iter=1000. Then compute the area under the curve (AUC) score using the transformed test data.
Also find the 10 smallest and 10 largest coefficients from the model and return them along with the AUC score in a tuple.
The list of 10 smallest coefficients should be sorted smallest first, the list of 10 largest coefficients should be sorted largest first.
The three features that were added to the document term matrix should have the following names should they appear in the list of coefficients: ['length_of_doc', 'digit_count', 'non_word_char_count']
This function should return a tuple (AUC score as a float, smallest coefs list, largest coefs list).
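A quick look at what analyzer='char_wb' produces (a toy string; the graded vocabulary will of course differ):
from sklearn.feature_extraction.text import CountVectorizer
toy_vect = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2))
toy_vect.fit(['spam'])
print(toy_vect.get_feature_names_out())
# [' s' 'am' 'm ' 'pa' 'sp'] -- each word is padded with spaces before character n-grams are taken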
def answer_eleven():
# YOUR CODE HERE
vectorizer = CountVectorizer(min_df=5, ngram_range=(2,5), analyzer='char_wb')
add_ftr_train1 = X_train[:2000].apply(lambda x: len(x))
add_ftr_train2 = X_train[:2000].str.count('\d')
add_ftr_train3 = X_train[:2000].str.count('\W')
add_ftr_test1 = X_test.apply(lambda x: len(x))
add_ftr_test2 = X_test.str.count('\d')
add_ftr_test3 = X_test.str.count('\W')
X_train_vectorized = vectorizer.fit_transform(X_train[:2000])
X_test_vectorized = vectorizer.transform(X_test)
X_train_added = add_feature(X_train_vectorized, add_ftr_train1)
X_test_added = add_feature(X_test_vectorized, add_ftr_test1)
X_train_added = add_feature(X_train_added, add_ftr_train2)
X_test_added = add_feature(X_test_added, add_ftr_test2)
X_train_added = add_feature(X_train_added, add_ftr_train3)
X_test_added = add_feature(X_test_added, add_ftr_test3)
clf = LogisticRegression(C=100, max_iter=1000)
clf.fit(X_train_added, y_train[:2000])
y_score = clf.predict_proba(X_test_added)[:, 1]
auc = roc_auc_score(y_test, y_score)
features = vectorizer.get_feature_names_out().tolist() + ['length_of_doc', 'digit_count', 'non_word_char_count']
coefs = clf.coef_.tolist()[0]
tupList = [tuple((ftr, coef)) for ftr, coef in zip(features, coefs)]
tupList_sorted = sorted(tupList, key=lambda x: x[1])
coef_smallest = [tup[0] for tup in tupList_sorted[:10]]
coef_largest = [tup[0] for tup in tupList_sorted[-10:][::-1]]
# raise NotImplementedError()
return auc, coef_smallest, coef_largest#Your answer here
answer_eleven()
Returns:
(0.997568035583926,
['n ', ' i', 'at', 'he', ' m', '..', 'us', 'go', ' lo', ' bu'],
['digit_count', 'ne', ' st', 'co', 's ', 'xt', 'lt', 'xt ', ' ne', 'der'])
Module 4: Topic Modeling - Assignment 4 - Document Similarity & Topic Modelling
Part 1 - Document Similarity
For the first part of this assignment, you will complete the functions doc_to_synsets and similarity_score, which will be used by document_path_similarity to find the path similarity between two documents.
The following functions are provided:
- convert_tag: converts the tag given by nltk.pos_tag to a tag used by wordnet.synsets. You will need to use this function in doc_to_synsets.
- document_path_similarity: computes the symmetrical path similarity between two documents by finding the synsets in each document using doc_to_synsets, then computing similarities using similarity_score.
You will need to finish writing the following functions:
- doc_to_synsets: returns a list of synsets in the document. This function should first tokenize and part-of-speech tag the document using nltk.word_tokenize and nltk.pos_tag. Then it should find each token's corresponding synset using wn.synsets(token, wordnet_tag). The first synset match should be used. If there is no match, that token is skipped.
- similarity_score: returns the normalized similarity score of a list of synsets (s1) onto a second list of synsets (s2). For each synset in s1, find the synset in s2 with the largest similarity value. Sum all of the largest similarity values together and normalize this value by dividing it by the number of largest similarity values found. Be careful with data types, which should be floats. Missing values should be ignored.
Once doc_to_synsets and similarity_score have been completed, submit to the autograder which will run a test to check that these functions are running correctly.
Do not modify the functions convert_tag and document_path_similarity.
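Before the graded code, a quick look at the WordNet primitives it relies on (a toy pair of words; this assumes the wordnet data downloaded in the cell below is available):
from nltk.corpus import wordnet as wn
cat = wn.synsets('cat', 'n')[0]    # first noun synset, Synset('cat.n.01')
dog = wn.synsets('dog', 'n')[0]
print(cat.path_similarity(dog))    # 0.2 -- closer to 1.0 means more similar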
%%capture
import numpy as np
import nltk
nltk.download('punkt')
from nltk.corpus import wordnet as wn
import pandas as pd
nltk.data.path.append("assets/")
def convert_tag(tag):
"""Convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets"""
tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
try:
return tag_dict[tag[0]]
except KeyError:
return None
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4') # Added
def doc_to_synsets(doc):
"""
Returns a list of synsets in document.
Tokenizes and tags the words in the document doc.
Then finds the first synset for each word/tag combination.
If a synset is not found for that combination it is skipped.
Args:
doc: string to be converted
Returns:
list of synsets
Example:
doc_to_synsets('Fish are friends.')
Out: [Synset('fish.n.01'), Synset('be.v.01'), Synset('friend.n.01')]
"""
# YOUR CODE HERE
tokens = nltk.word_tokenize(doc)
pos_tags = nltk.pos_tag(tokens)
wordnet_tags = [convert_tag(tag[1]) for tag in pos_tags]
# raise NotImplementedError()
return [wn.synsets(token, wordnet_tag)[0] for token, wordnet_tag in zip(tokens, wordnet_tags) if len(wn.synsets(token, wordnet_tag)) > 0]# Your Answer Here
print(doc_to_synsets('Fish are friends.'))
def similarity_score(s1, s2):
"""
Calculate the normalized similarity score of s1 onto s2
For each synset in s1, finds the synset in s2 with the largest similarity value.
Sum of all of the largest similarity values and normalize this value by dividing it by the
number of largest similarity values found.
Args:
s1, s2: list of synsets from doc_to_synsets
Returns:
normalized similarity score of s1 onto s2
Example:
synsets1 = doc_to_synsets('I like cats')
synsets2 = doc_to_synsets('I like dogs')
similarity_score(synsets1, synsets2)
Out: 0.7333333333333333
"""
# YOUR CODE HERE
max_similarity_values = []
for syn1 in s1:
similarity_values = [syn1.path_similarity(syn2) for syn2 in s2 if syn1.path_similarity(syn2) is not None]
if similarity_values:
max_similarity_values.append(max(similarity_values))
# raise NotImplementedError()
return np.mean(max_similarity_values)# Your Answer Here
synsets1 = doc_to_synsets('I like cats')
synsets2 = doc_to_synsets('I like dogs')
print(similarity_score(synsets1, synsets2))
Note: three lines of code (the nltk.download calls near the top of the code block above) need to be added manually here, otherwise an error occurs.
Returns:
[Synset('fish.n.01'), Synset('be.v.01'), Synset('friend.n.01')]
0.7333333333333334
def document_path_similarity(doc1, doc2):
"""Finds the symmetrical similarity between doc1 and doc2"""
synsets1 = doc_to_synsets(doc1)
synsets2 = doc_to_synsets(doc2)
return (similarity_score(synsets1, synsets2) + similarity_score(synsets2, synsets1)) / 2
paraphrases is a DataFrame which contains the following columns: Quality, D1, and D2.
Quality is an indicator variable which indicates whether the two documents D1 and D2 are paraphrases of one another (1 for paraphrase, 0 for not paraphrase).
# Use this dataframe for questions most_similar_docs and label_accuracy
paraphrases = pd.read_csv('assets/paraphrases.csv')
paraphrases.head()
most_similar_docs
Using document_path_similarity, find the pair of documents in paraphrases which has the maximum similarity score.
This function should return a tuple (D1, D2, similarity_score).
def most_similar_docs():
# YOUR CODE HERE
tupList = [tuple((D1, D2, document_path_similarity(D1, D2))) for D1, D2 in zip(paraphrases['D1'], paraphrases['D2'])]
# raise NotImplementedError()
return sorted(tupList, key=lambda x: x[2], reverse=True)[0]# Your Answer Here
most_similar_docs()
Returns:
('"Indeed, Iran should be put on notice that efforts to try to remake Iraq in their image will be aggressively put down," he said.',
'"Iran should be on notice that attempts to remake Iraq in Iran\'s image will be aggressively put down," he said.\n',
0.9590643274853801)
label_accuracy
Provide labels for the twenty pairs of documents by computing the similarity for each pair using document_path_similarity. Let the classifier rule be that if the score is greater than 0.75, the label is paraphrase (1); otherwise the label is not paraphrase (0). Report the accuracy of the classifier using scikit-learn's accuracy_score.
This function should return a float.
def label_accuracy():
from sklearn.metrics import accuracy_score
# YOUR CODE HERE
y_true = paraphrases['Quality']
y_pred = pd.Series([document_path_similarity(D1, D2) for D1, D2 in zip(paraphrases['D1'], paraphrases['D2'])]).apply(lambda x: 1 if x > 0.75 else 0)
# raise NotImplementedError()
return accuracy_score(y_true, y_pred)# Your Answer Here
label_accuracy()
Returns:
0.7
Part 2 - Topic Modelling
For the second part of this assignment, you will use Gensim's LDA (Latent Dirichlet Allocation) model to model topics in newsgroup_data. You will first need to finish the code in the cell below by using the gensim.models.ldamodel.LdaModel constructor to estimate LDA model parameters on the corpus, and save the result to the variable ldamodel. Extract 10 topics using corpus and id_map, and with passes=25 and random_state=34.
import pickle
import gensim
from sklearn.feature_extraction.text import CountVectorizer
# Load the list of documents
with open('assets/newsgroups', 'rb') as f:
newsgroup_data = pickle.load(f)
# Use CountVectorizor to find three letter tokens, remove stop_words,
# remove tokens that don't appear in at least 20 documents,
# remove tokens that appear in more than 20% of the documents
vect = CountVectorizer(min_df=20, max_df=0.2, stop_words='english',
token_pattern='(?u)\\b\\w\\w\\w+\\b')
# Fit and transform
X = vect.fit_transform(newsgroup_data)
# Convert sparse matrix to gensim corpus.
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
# Mapping from word IDs to words (To be used in LdaModel's id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())
# Use the gensim.models.ldamodel.LdaModel constructor to estimate
# LDA model parameters on the corpus, and save to the variable `ldamodel`
# YOUR CODE HERE
ldamodel = gensim.models.ldamodel.LdaModel(corpus, id2word=id_map, num_topics=10, passes=25, random_state=34)
# raise NotImplementedError()
lda_topics
Using ldamodel, find a list of the 10 topics and the most significant 10 words in each topic. This should be structured as a list of 10 tuples where each tuple takes on the form:
(9, '0.068*"space" + 0.036*"nasa" + 0.021*"science" + 0.020*"edu" + 0.019*"data" + 0.017*"shuttle" + 0.015*"launch" + 0.015*"available" + 0.014*"center" + 0.013*"information"')
for example.
This function should return a list of tuples.
def lda_topics():
# YOUR CODE HERE
# raise NotImplementedError()
return ldamodel.print_topics(num_topics=10)# Your Answer Here
lda_topics()
Returns:
[(0,
'0.056*"edu" + 0.043*"com" + 0.033*"thanks" + 0.022*"mail" + 0.021*"know" + 0.020*"does" + 0.014*"info" + 0.012*"monitor" + 0.010*"looking" + 0.010*"don"'),
(1,
'0.024*"ground" + 0.018*"current" + 0.018*"just" + 0.013*"want" + 0.013*"use" + 0.011*"using" + 0.011*"used" + 0.010*"power" + 0.010*"speed" + 0.010*"output"'),
(2,
'0.061*"drive" + 0.042*"disk" + 0.033*"scsi" + 0.030*"drives" + 0.028*"hard" + 0.028*"controller" + 0.027*"card" + 0.020*"rom" + 0.018*"floppy" + 0.017*"bus"'),
(3,
'0.023*"time" + 0.015*"atheism" + 0.014*"list" + 0.013*"left" + 0.012*"alt" + 0.012*"faq" + 0.012*"probably" + 0.011*"know" + 0.011*"send" + 0.010*"months"'),
(4,
'0.025*"car" + 0.016*"just" + 0.014*"don" + 0.014*"bike" + 0.012*"good" + 0.011*"new" + 0.011*"think" + 0.010*"year" + 0.010*"cars" + 0.010*"time"'),
(5,
'0.030*"game" + 0.027*"team" + 0.023*"year" + 0.017*"games" + 0.016*"play" + 0.012*"season" + 0.012*"players" + 0.012*"win" + 0.011*"hockey" + 0.011*"good"'),
(6,
'0.017*"information" + 0.014*"help" + 0.014*"medical" + 0.012*"new" + 0.012*"use" + 0.012*"000" + 0.012*"research" + 0.011*"university" + 0.010*"number" + 0.010*"program"'),
(7,
'0.022*"don" + 0.021*"people" + 0.018*"think" + 0.017*"just" + 0.012*"say" + 0.011*"know" + 0.011*"does" + 0.011*"good" + 0.010*"god" + 0.009*"way"'),
(8,
'0.034*"use" + 0.023*"apple" + 0.020*"power" + 0.016*"time" + 0.015*"data" + 0.015*"software" + 0.012*"pin" + 0.012*"memory" + 0.012*"simms" + 0.011*"port"'),
(9,
'0.068*"space" + 0.036*"nasa" + 0.021*"science" + 0.020*"edu" + 0.019*"data" + 0.017*"shuttle" + 0.015*"launch" + 0.015*"available" + 0.014*"center" + 0.014*"sci"')]
topic_distribution
For the new document new_doc, find the topic distribution. Remember to use vect.transform on the new doc, and Sparse2Corpus to convert the sparse matrix to a gensim corpus.
This function should return a list of tuples, where each tuple is (#topic, probability).
new_doc = ["\n\nIt's my understanding that the freezing will start to occur because \
of the\ngrowing distance of Pluto and Charon from the Sun, due to it's\nelliptical orbit. \
It is not due to shadowing effects. \n\n\nPluto can shadow Charon, and vice-versa.\n\nGeorge \
Krumins\n-- "]
def topic_distribution():
# YOUR CODE HERE
X = vect.transform(new_doc)
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
# raise NotImplementedError()
return list(ldamodel.get_document_topics(corpus))[0]# Your Answer Here
topic_distribution()
Returns:
[(0, 0.020003108),
(1, 0.020003324),
(2, 0.020001281),
(3, 0.49674824),
(4, 0.020004038),
(5, 0.020004129),
(6, 0.020002972),
(7, 0.020002645),
(8, 0.020003129),
(9, 0.34322715)]
topic_names
From the list of the following given topics, assign topic names to the topics you found. If none of these names best matches the topics you found, create a new 1-3 word "title" for the topic.
Topics: Health, Science, Automobiles, Politics, Government, Travel, Computers & IT, Sports, Business, Society & Lifestyle, Religion, Education.
This function should return a list of 10 strings.
def topic_names():
# YOUR CODE HERE
# raise NotImplementedError()
return ['Education','Science','Computers & IT','Religion','Automobiles','Sports','Science','Religion','Computers & IT','Science']# Your Answer Here