A Collection of NLP Code Templates


1 Basic Word Operations

1.1 Downloading stopwords with NLTK

Difficulty Level : L1

This step only downloads the data; it does not use it yet. You must download the data before you can use it.

# Downloading packages and importing

import nltk
nltk.download('punkt')
nltk.download('stopwords')

#> [nltk_data] Downloading package punkt to /root/nltk_data...
#> [nltk_data]   Unzipping tokenizers/punkt.zip.
#> [nltk_data] Downloading package stopwords to /root/nltk_data...
#> [nltk_data]   Unzipping corpora/stopwords.zip.
#> True

1.2 Loading a language model with spaCy

Difficulty Level : L1

# Download the model (shell command)
python -m spacy download en_core_web_sm

import spacy
nlp = spacy.load("en_core_web_sm")
nlp
# More models here: https://spacy.io/models
#> <spacy.lang.en.English at 0x7facaf6cd0f0>

1.3 Removing stopwords from a sentence

Difficulty Level : L1

1.3.1 Input

text="""the outbreak of coronavirus disease 2019 (COVID-19) has created a global health crisis that has had a deep impact on the way we perceive our world and our everyday lives. Not only the rate of contagion and patterns of transmission threatens our sense of agency, but the safety measures put in place to contain the spread of the virus also require social distancing by refraining from doing what is inherently human, which is to find solace in the company of others. Within this context of physical threat, social and physical distancing, as well as public alarm, what has been (and can be) the role of the different mass media channels in our lives on individual, social and societal levels? Mass media have long been recognized as powerful forces shaping how we experience the world and ourselves. This recognition is accompanied by a growing volume of research, that closely follows the footsteps of technological transformations (e.g. radio, movies, television, the internet, mobiles) and the zeitgeist (e.g. cold war, 9/11, climate change) in an attempt to map mass media major impacts on how we perceive ourselves, both as individuals and citizens. Are media (broadcast and digital) still able to convey a sense of unity reaching large audiences, or are messages lost in the noisy crowd of mass self-communication? """

1.3.2 Desired Output

'outbreak coronavirus disease 2019 ( COVID-19 ) created global health crisis deep impact way perceive world everyday lives . rate contagion patterns transmission threatens sense agency , safety measures place contain spread virus require social distancing refraining inherently human , find solace company . context physical threat , social physical distancing , public alarm , ( ) role different mass media channels lives individual , social societal levels ? Mass media long recognized powerful forces shaping experience world . recognition accompanied growing volume research , closely follows footsteps technological transformations ( e.g. radio , movies , television , internet , mobiles ) zeitgeist ( e.g. cold war , 9/11 , climate change ) attempt map mass media major impacts perceive , individuals citizens . media ( broadcast digital ) able convey sense unity reaching large audiences , messages lost noisy crowd mass self - communication ?'

1.3.3 Solution

1.3.3.1 Method 1: Removing stopwords in nltk
# Method 1
# Removing stopwords in nltk

import nltk
from nltk.corpus import stopwords
my_stopwords=set(stopwords.words('english'))
new_tokens=[]

# Tokenization using word_tokenize()
all_tokens=nltk.word_tokenize(text)

for token in all_tokens:
  if token not in my_stopwords:
    new_tokens.append(token)

" ".join(new_tokens)
1.3.3.2 Method 2: Removing stopwords in spaCy
# Method 2
# Removing stopwords in spaCy
import spacy

nlp=spacy.load("en_core_web_sm")
doc=nlp(text)
new_tokens=[]

# Using is_stop attribute of each token to check if it's a stopword
for token in doc:
  if token.is_stop==False:
    new_tokens.append(token.text)

" ".join(new_tokens)

1.4 Adding custom stopwords in spaCy

Difficulty Level : L1

Q. Add the custom stopwords “NIL” and “JUNK” in spaCy and remove the stopwords from the text below

1.4.1 Input

text=" Jonas was a JUNK great guy NIL Adam was evil NIL Martha JUNK was more of a fool "

1.4.2 Expected Output

  'Jonas great guy Adam evil Martha fool'

1.4.3 Solution

import spacy

nlp=spacy.load("en_core_web_sm")
# list of custom stop words
customize_stop_words = ['NIL','JUNK']

# Adding these stop words
for w in customize_stop_words:
    nlp.vocab[w].is_stop = True
doc = nlp(text)
tokens = [token.text for token in doc if not token.is_stop]

" ".join(tokens)

1.5 Removing punctuation

Difficulty Level : L1

Q. Remove all punctuation from the given text

1.5.1 Input

text="The match has concluded !!! India has won the match . Will we fin the finals too ? !"

1.5.2 Desired Output

'The match has concluded India has won the match Will we fin the finals too'

1.5.3 Solution

1.5.3.1 Method 1: Removing punctuation in spaCy
# Removing punctuation in spaCy
import spacy

nlp=spacy.load("en_core_web_sm")
doc=nlp(text)
new_tokens=[]
# Check if a token is a punctuation through is_punct attribute
for token in doc:
  if token.is_punct==False:
    new_tokens.append(token.text)

" ".join(new_tokens)
1.5.3.2 Method 2: Removing punctuation in nltk with RegexpTokenizer
# Method 2
# Removing punctuation in nltk with RegexpTokenizer
import nltk

# Keep only runs of word characters, which drops the punctuation
tokenizer=nltk.RegexpTokenizer(r"\w+")

tokens=tokenizer.tokenize(text)
" ".join(tokens)

1.6 Merging words into phrases with bigrams (very important)

Difficulty Level : L3

Traditional workflows tokenize into words; splitting into phrases is rare unless you use a more complex dependency-parsing tool. The goal of this exercise is to merge word pairs that frequently occur together into single phrases.

The core tool is Gensim's Phraser.

1.6.1 Input

documents = ["the mayor of new york was there", "new york mayor was present"]

1.6.2 Desired Output

['the', 'mayor', 'of', 'new york', 'was', 'there']
['new york', 'mayor', 'was', 'present']

1.6.3 Solution

# Import Phraser from gensim
from gensim.models import Phrases
from gensim.models.phrases import Phraser

sentence_stream = [doc.split(" ") for doc in documents]

# Creating bigram phraser
# Note: with gensim >= 4.0, pass delimiter=' ' (a str) rather than bytes
bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
bigram_phraser = Phraser(bigram)

for sent in sentence_stream:
    tokens_ = bigram_phraser[sent]
    print(tokens_)

1.7 Counting bigrams and trigrams (very important)

Difficulty Level : L3

1.7.1 Input

text="Machine learning is a neccessary field in today's world. Data science can do wonders . Natural Language Processing is how machines understand text "

1.7.2 Desired Output

Bigrams are [('machine', 'learning'), ('learning', 'is'), ('is', 'a'), ('a', 'neccessary'), ('neccessary', 'field'), ('field', 'in'), ('in', "today's"), ("today's", 'world.'), ('world.', 'data'), ('data', 'science'), ('science', 'can'), ('can', 'do'), ('do', 'wonders'), ('wonders', '.'), ('.', 'natural'), ('natural', 'language'), ('language', 'processing'), ('processing', 'is'), ('is', 'how'), ('how', 'machines'), ('machines', 'understand'), ('understand', 'text')]
 Trigrams are [('machine', 'learning', 'is'), ('learning', 'is', 'a'), ('is', 'a', 'neccessary'), ('a', 'neccessary', 'field'), ('neccessary', 'field', 'in'), ('field', 'in', "today's"), ('in', "today's", 'world.'), ("today's", 'world.', 'data'), ('world.', 'data', 'science'), ('data', 'science', 'can'), ('science', 'can', 'do'), ('can', 'do', 'wonders'), ('do', 'wonders', '.'), ('wonders', '.', 'natural'), ('.', 'natural', 'language'), ('natural', 'language', 'processing'), ('language', 'processing', 'is'), ('processing', 'is', 'how'), ('is', 'how', 'machines'), ('how', 'machines', 'understand'), ('machines', 'understand', 'text')]

1.7.3 Solution

# Method 1: using nltk.ngrams
from nltk import ngrams
bigram = list(ngrams(text.lower().split(), 2))
trigram = list(ngrams(text.lower().split(), 3))

print(" Bigrams are",bigram)
print(" Trigrams are", trigram)


# Method 2: a hand-rolled ngram helper
def ngram(text, n):
    # Split the input text into a list of words on whitespace
    words = text.split()
    # Build the list of n-grams
    ngram_list = []
    for i in range(len(words) - n + 1):
        ngram_list.append(' '.join(words[i:i+n]))
    return ngram_list
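
A quick usage check of the helper above:

# Example: bigrams from a short sentence
print(ngram("Machine learning is fun", 2))
#> ['Machine learning', 'learning is', 'is fun']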

2 Tokenization

2.1 Tokenizing with NLTK or spaCy

Difficulty Level : L1

2.1.1 Input

text="Last week, the University of Cambridge shared its own research that shows if everyone wears a mask outside home,dreaded ‘second wave’ of the pandemic can be avoided."

2.1.2 Desired Output

Last
week
,
the
University
of
Cambridge
shared
...(truncated)...

2.1.3 Solution

# Method 1: Tokenization with nltk
import nltk
tokens = nltk.word_tokenize(text)
for token in tokens:
  print(token)
  

# Method 2: Tokenization with spaCy
import spacy
lm = spacy.load("en_core_web_sm")
tokens = lm(text)
for token in tokens:
  print(token.text)  

2.2 Tokenizing with transformers (very important)

Difficulty Level : L1

2.2.1 Input

text="I love spring season. I go hiking with my friends"

2.2.2 Desired Output

[101, 1045, 2293, 3500, 2161, 1012, 1045, 2175, 13039, 2007, 2026, 2814, 102]

[CLS] i love spring season. i go hiking with my friends [SEP]

2.2.3 Solution

from transformers import AutoTokenizer

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Encoding with the tokenizer
inputs = tokenizer.encode(text)
print(inputs)
# The tokenizer can also be called directly
print(tokenizer(text))

# Decode the ids back to text
print(tokenizer.decode(inputs))
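
To see the subword tokens behind those ids, a minimal sketch using the tokenizer's convert_ids_to_tokens method:

# Map each id back to its subword token (includes the [CLS] and [SEP] specials)
print(tokenizer.convert_ids_to_tokens(inputs))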

2.3 Tokenizing with stopwords as delimiters

Difficulty Level : L2

Q. Tokenize the given text with the stop words ("is", "the", "was") as delimiters. Tokenizing this way identifies meaningful phrases, which is sometimes useful for topic modeling.

2.3.1 Input

text = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know.""

2.3.2 Expected Output

['Walter',
 'feeling anxious',
 'He',
 'diagnosed today',
 'He probably',
 'best person I know']

2.3.3 Solution

text = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."

# Replace each stopword/punctuation mark with a sentinel, then split on it
stop_words_and_delims = ['was', 'is', 'the', '.', ',', '-', '!', '?']
for r in stop_words_and_delims:
    text = text.replace(r, 'DELIM')

words = [t.strip() for t in text.split('DELIM')]
words_filtered = list(filter(lambda a: a not in [''], words))
print(words_filtered)
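
A minimal alternative sketch (not in the original) that performs the same split with a single regular expression, assuming the same delimiters:

import re

text = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."
# Split on whole-word stopwords or punctuation, then drop empty pieces
parts = re.split(r'\b(?:was|is|the)\b|[.,\-!?]', text)
print([p.strip() for p in parts if p.strip()])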

2.4 How to tokenize tweets and other web text?

Difficulty Level : L2

2.4.1 Input

text=" Having lots of fun #goa #vaction #summervacation. Fancy dinner @Beachbay restro :) "

2.4.2 Desired Output

['Having',
 'lots',
 'of',
 'fun',
 'goa',
 'vaction',
 'summervacation',
 'Fancy',
 'dinner',
 'Beachbay',
 'restro']

2.4.3 Solution

import re
# Cleaning the tweet: replace every non-word character (including # and @) with a space
text=re.sub(r'[^\w]', ' ', text)

# Using nltk's TweetTokenizer
from nltk.tokenize import TweetTokenizer
tokenizer=TweetTokenizer()
print(tokenizer.tokenize(text))
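
TweetTokenizer can also handle the raw tweet directly, without the regex cleanup. A minimal sketch (an alternative not in the original) using its strip_handles and reduce_len options, which keeps hashtags as single tokens and drops the @Beachbay handle:

from nltk.tokenize import TweetTokenizer

raw = " Having lots of fun #goa #vaction #summervacation. Fancy dinner @Beachbay restro :) "
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
print(tokenizer.tokenize(raw))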

3 Basic Sentence Operations

3.1 How to split a document into sentences?

Difficulty Level : L1

Q. Print the sentences of the given text document

3.1.1 Input

text="""The outbreak of coronavirus disease 2019 (COVID-19) has created a global health crisis that has had a deep impact on the way we perceive our world and our everyday lives. Not only the rate of contagion and patterns of transmission threatens our sense of agency, but the safety measures put in place to contain the spread of the virus also require social distancing by refraining from doing what is inherently human, which is to find solace in the company of others. Within this context of physical threat, social and physical distancing, as well as public alarm, what has been (and can be) the role of the different mass media channels in our lives on individual, social and societal levels? Mass media have long been recognized as powerful forces shaping how we experience the world and ourselves. This recognition is accompanied by a growing volume of research, that closely follows the footsteps of technological transformations (e.g. radio, movies, television, the internet, mobiles) and the zeitgeist (e.g. cold war, 9/11, climate change) in an attempt to map mass media major impacts on how we perceive ourselves, both as individuals and citizens. Are media (broadcast and digital) still able to convey a sense of unity reaching large audiences, or are messages lost in the noisy crowd of mass self-communication? """

3.1.2 Desired Output

The outbreak of coronavirus disease 2019 (COVID-19) has created a global health crisis that has had a deep impact on the way we perceive our world and our everyday lives.
Not only the rate of contagion and patterns of transmission threatens our sense of agency, but the safety measures put in place to contain the spread of the virus also require social distancing by refraining from doing what is inherently human, which is to find solace in the company of others.
Within this context of physical threat, social and physical distancing, as well as public alarm, what has been (and can be)
...(truncated)...

3.1.3 Solution

# Method 1: using spaCy
import spacy
lm = spacy.load('en_core_web_sm')
doc = lm(text)
for sentence in doc.sents:
  print(sentence)

# Method 2: using NLTK
import nltk
print(nltk.sent_tokenize(text))

3.2 How to get the dependency parse of a sentence (as JSON)?

Difficulty Level : L3

3.2.1 Input

text1="Netflix has released a new series"
text2="It was shot in London"
text3="It is called Dark and the main character is Jonas"
text4="Adam is the evil character"

3.2.2 Desired Output

{'id': 0,
 'paragraphs': [{'cats': [],
   'raw': 'Netflix has released a new series',
   'sentences': [{'brackets': [],
     'tokens': [{'dep': 'nsubj',
       'head': 2,
       'id': 0,
       'ner': 'U-ORG',
       'orth': 'Netflix',
       'tag': 'NNP'},
      {'dep': 'aux',
       'head': 1,
       'id': 1,
       'ner': 'O',
       'orth': 'has',
       'tag': 'VBZ'},
      {'dep': 'ROOT',
       'head': 0,
       'id': 2,
       'ner': 'O',
       'orth': 'released',
       'tag': 'VBN'},
      {'dep': 'det', 'head': 2, 'id': 3, 'ner': 'O', 'orth': 'a', 'tag': 'DT'},
      {'dep': 'amod',
       'head': 1,
       'id': 4,
       'ner': 'O',
       'orth': 'new',
       'tag': 'JJ'},
      {'dep': 'dobj',
       'head': -3,
       'id': 5,
       'ner': 'O',
       'orth': 'series',
       'tag': 'NN'}]}]},
    ...(truncated)

3.2.3 Solution

# Convert into spaCy documents
import spacy
nlp = spacy.load("en_core_web_sm")
doc1=nlp(text1)
doc2=nlp(text2)
doc3=nlp(text3)
doc4=nlp(text4)

# Import docs_to_json (spaCy v2; in spaCy v3 it moved to spacy.training)
from spacy.gold import docs_to_json

# Converting into json format
json_data = docs_to_json([doc1,doc2,doc3,doc4])
print(json_data)

3.3 Stemming

Difficulty Level : L2

3.3.1 Input

text= "Dancing is an art. Students should be taught dance as a subject in schools . I danced in many of my school function. Some people are always hesitating to dance."

3.3.2 Desired Output

text= 'danc is an art . student should be taught danc as a subject in school . I danc in mani of my school function . some peopl are alway hesit to danc .'

3.3.3 Solution

import nltk
from nltk.stem import PorterStemmer
stemmer=PorterStemmer()
stemmed_tokens=[]
for token in nltk.word_tokenize(text):
  stemmed_tokens.append(stemmer.stem(token))

" ".join(stemmed_tokens)

# Other available stemmers:
# 1. Porter
# 2. Snowball (more commonly used)
# 3. Lancaster
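
Since Snowball is noted above as the more common choice, here is a minimal sketch with NLTK's SnowballStemmer (same loop, different stemmer):

from nltk.stem import SnowballStemmer

snowball = SnowballStemmer("english")
" ".join(snowball.stem(token) for token in nltk.word_tokenize(text))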

3.4 Lemmatization

Difficulty Level : L2

Q. Perform lemmatization on the given text

Hint: Lemmatization Approaches

Although stemming and lemmatization are rigorously distinguished in academic work, in most projects lemmatization alone is sufficient.

3.4.1 Input

text= "Dancing is an art. Students should be taught dance as a subject in schools . I danced in many of my school function. Some people are always hesitating to dance."

3.4.2 Desired Output

text= 'dancing be an art . student should be teach dance as a subject in school . -PRON- dance in many of -PRON- school function . some people be always hesitate to dance .'

3.4.3 Solution

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

# The '-PRON-' lemma in the desired output comes from spaCy v2; v3 keeps the pronoun's own form
lemmatized=[token.lemma_ for token in doc]
print(" ".join(lemmatized))

3.5 Spelling correction (important)

Difficulty Level : L2

Q. Correct the spelling errors in the following text

3.5.1 Input

text="He is a gret person. He beleives in bod"

3.5.2 Desired Output

text="He is a great person. He believes in god"

3.5.3 Solution

# Method 1: textblob
from textblob import TextBlob

# Using textblob's correct() function
text=TextBlob(text)
print(text.correct())

# Method 2: wordsegment can also be used (it splits run-together words rather than fixing typos)
from wordsegment import load, segment

load()
ent = 'Information Extraction'.strip()
new_ent = ' '.join(segment(ent))
print(new_ent)

4 Information Extraction

4.1 How to extract email usernames from a document?

Difficulty Level : L2

4.1.1 Input

text= "The new registrations are potter709@gmail.com , elixir101@gmail.com. If you find any disruptions, kindly contact granger111@gamil.com or severus77@gamil.com "

4.1.2 Desired Output

['potter709', 'elixir101', 'granger111', 'severus77']

4.1.3 Solution

import re  

# \S matches any non-whitespace character 
# @ for as in the Email 
# + for Repeats a character one or more times 
usernames= re.findall(r'(\S+)@', text)
print(usernames) 
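
A related sketch (an addition, not in the original task) for extracting the full email addresses instead of just the usernames:

# Non-whitespace, '@', then the domain ending in a dot and word characters
emails = re.findall(r'\S+@\S+\.\w+', text)
print(emails)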

4.2 Extracting all nouns from a document

Difficulty Level : L2

The idea: tokenize, obtain each token's part-of-speech tag, and keep only the tags you want.

4.2.1 Input

text="James works at Microsoft. She lives in manchester and likes to play the flute"

4.2.2 Desired Output

James
Microsoft
manchester
flute

4.2.3 Solution

# Converting the text into a spacy Doc
nlp=spacy.load("en_core_web_sm")
doc=nlp(text)

for token in doc:
  if token.pos_=='NOUN' or token.pos_=='PROPN':
    print(token.text)

4.3 Extracting all pronouns from a document

Difficulty Level : L2

Q. Extract and print all the pronouns in the text

4.3.1 Input

text="John is happy finally. He had landed his dream job finally. He told his mom. She was elated "

4.3.2 Desired Output

He
He
She

4.3.3 Solution

# Using spacy's pos_ attribute to check for part of speech tags
nlp=spacy.load("en_core_web_sm")
doc=nlp(text)

for token in doc:
  if token.pos_=='PRON':
    print(token.text)

4.4 Resolving pronouns to the names they refer to (coreference)

Difficulty Level : L2

4.4.1 Input

text=" My sister has a dog and she loves him"

4.4.2 Desired Output

[My sister,she]
[a dog ,him ]

4.4.3 Solution

# Install and import the neuralcoref library (note: neuralcoref requires spaCy 2.x)
!pip install neuralcoref
import spacy
import neuralcoref

# Add it to the pipeline ('en' is the spaCy 2.x shortcut for the English model)
nlp = spacy.load('en')
neuralcoref.add_to_pipe(nlp)

# Printing the coreferences
doc1 = nlp('My sister has a dog. She loves him.')
print(doc1._.coref_clusters)

# Visualization
# spaCy also provides a coreference visualizer. Check out https://spacy.io/universe/project/neuralcoref-vizualizer/.
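
To actually rewrite the pronouns with the phrases they refer to (the goal stated in the section title), a minimal sketch using neuralcoref's coref_resolved extension:

# The text with each pronoun replaced by the main mention of its cluster
print(doc1._.coref_resolved)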

5 Text Similarity

5.1 Extracting the most frequent words, excluding stopwords (important)

Difficulty Level : L2

5.1.1 Input

text="""Junkfood - Food that do no good to our body. And there's no need of them in our body but still we willingly eat them because they are great in taste and easy to cook or ready to eat. Junk foods have no or very less nutritional value and irrespective of the way they are marketed, they are not healthy to consume.The only reason of their gaining popularity and increased trend of consumption is 
that they are ready to eat or easy to cook foods. People, of all age groups are moving towards Junkfood as it is hassle free and often ready to grab and eat. Cold drinks, chips, noodles, pizza, burgers, French fries etc. are few examples from the great variety of junk food available in the market.
 Junkfood is the most dangerous food ever but it is pleasure in eating and it gives a great taste in mouth examples of Junkfood are kurkure and chips.. cold rings are also source of junk food... they shud nt be ate in high amounts as it results fatal to our body... it cn be eated in a limited extend ... in research its found tht ths junk foods r very dangerous fr our health
Junkfood is very harmful that is slowly eating away the health of the present generation. The term itself denotes how dangerous it is for our bodies. Most importantly, it tastes so good that people consume it on a daily basis. However, not much awareness is spread about the harmful effects of Junkfood .
The problem is more serious than you think. Various studies show that Junkfood impacts our health negatively. They contain higher levels of calories, fats, and sugar. On the contrary, they have very low amounts of healthy nutrients and lack dietary fibers. Parents must discourage their children from consuming junk food because of the ill effects it has on one’s health.
Junkfood is the easiest way to gain unhealthy weight. The amount of fats and sugar in the food makes you gain weight rapidly. However, this is not a healthy weight. It is more of fats and cholesterol which will have a harmful impact on your health. Junk food is also one of the main reasons for the increase in obesity nowadays.
This food only looks and tastes good, other than that, it has no positive points. The amount of calorie your body requires to stay fit is not fulfilled by this food. For instance, foods like French fries, burgers, candy, and cookies, all have high amounts of sugar and fats. Therefore, this can result in long-term illnesses like diabetes and high blood pressure. This may also result in kidney failure."""

5.1.2 Desired Output

text= {Junkfood: 10,
 food: 8,
 good: 5,
 harmful : 3
 body: 1,
 need: 1,

 ...(truncated)

5.1.3 Solution

from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

# Removal of stop words and punctuations
words = [str(token).strip().lower() for token in doc if token.is_stop==False and token.is_punct==False]

freq_dict = Counter(words)
print(freq_dict)
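
Counter keeps every token; to show only the top entries (as in the desired output), use Counter's most_common method:

# The 10 most frequent non-stopword tokens
print(freq_dict.most_common(10))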

5.2 Similarity between two words

5.2.1 Input

word1="amazing"
word2="terrible"
word3="excellent"

5.2.2 Desired Output

#> similarity between amazing and terrible is 0.46189071343764604
#> similarity between amazing and excellent is 0.6388207086737778

5.2.3 Solution

import spacy
!python -m spacy download en_core_web_lg
nlp=spacy.load('en_core_web_lg')
token1=nlp(word1)
token2=nlp(word2)
token3=nlp(word3)

print('similarity between', word1,'and' ,word2, 'is' ,token1.similarity(token2))
print('similarity between', word1,'and' ,word3, 'is' ,token1.similarity(token3))

5.3 Similarity between two documents

Difficulty Level : L2

5.3.1 Input

text1="John lives in Canada"
text2="James lives in America, though he's not from there"

5.3.2 Desired Output

 0.792817083631068

5.3.3 Solution

!python -m spacy download en_core_web_lg
import spacy

nlp=spacy.load('en_core_web_lg')

# Convert both texts into Docs and use the built-in similarity method
doc1=nlp(text1)
doc2=nlp(text2)
print(doc1.similarity(doc2))