A Collection of NLP Code Templates


1 Basic Word Operations

1.1 Downloading stopwords with NLTK

Difficulty Level : L1

This step only downloads the data; it does not use it yet. You must download the data before you can use it.

# Downloading packages and importing

import nltk
nltk.download('punkt')
nltk.download('stopwords')

#> [nltk_data] Downloading package punkt to /root/nltk_data...
#> [nltk_data]   Unzipping tokenizers/punkt.zip.
#> [nltk_data] Downloading package stopwords to /root/nltk_data...
#> [nltk_data]   Unzipping corpora/stopwords.zip.
#> True

1.2 Loading a language model with spaCy

Difficulty Level : L1

# Download the model (shell command)
python -m spacy download en_core_web_sm

import spacy
nlp = spacy.load("en_core_web_sm")
nlp
# More models here: https://spacy.io/models
#> <spacy.lang.en.English at 0x7facaf6cd0f0>

1.3 Removing stopwords from a sentence

Difficulty Level : L1

1.3.1 Input

text="""the outbreak of coronavirus disease 2019 (COVID-19) has created a global health crisis that has had a deep impact on the way we perceive our world and our everyday lives. Not only the rate of contagion and patterns of transmission threatens our sense of agency, but the safety measures put in place to contain the spread of the virus also require social distancing by refraining from doing what is inherently human, which is to find solace in the company of others. Within this context of physical threat, social and physical distancing, as well as public alarm, what has been (and can be) the role of the different mass media channels in our lives on individual, social and societal levels? Mass media have long been recognized as powerful forces shaping how we experience the world and ourselves. This recognition is accompanied by a growing volume of research, that closely follows the footsteps of technological transformations (e.g. radio, movies, television, the internet, mobiles) and the zeitgeist (e.g. cold war, 9/11, climate change) in an attempt to map mass media major impacts on how we perceive ourselves, both as individuals and citizens. Are media (broadcast and digital) still able to convey a sense of unity reaching large audiences, or are messages lost in the noisy crowd of mass self-communication? """

1.3.2 Desired Output

'outbreak coronavirus disease 2019 ( COVID-19 ) created global health crisis deep impact way perceive world everyday lives . rate contagion patterns transmission threatens sense agency , safety measures place contain spread virus require social distancing refraining inherently human , find solace company . context physical threat , social physical distancing , public alarm , ( ) role different mass media channels lives individual , social societal levels ? Mass media long recognized powerful forces shaping experience world . recognition accompanied growing volume research , closely follows footsteps technological transformations ( e.g. radio , movies , television , internet , mobiles ) zeitgeist ( e.g. cold war , 9/11 , climate change ) attempt map mass media major impacts perceive , individuals citizens . media ( broadcast digital ) able convey sense unity reaching large audiences , messages lost noisy crowd mass self - communication ?'

1.3.3 Solution

1.3.3.1 Method 1: Removing stopwords in nltk
# Method 1
# Removing stopwords in nltk

import nltk
from nltk.corpus import stopwords
my_stopwords=set(stopwords.words('english'))
new_tokens=[]

# Tokenization using word_tokenize()
all_tokens=nltk.word_tokenize(text)

for token in all_tokens:
  if token not in my_stopwords:
    new_tokens.append(token)

" ".join(new_tokens)
1.3.3.2 Method 2: Removing stopwords in spaCy
# Method 2
# Removing stopwords in spaCy
import spacy

nlp=spacy.load("en_core_web_sm")
doc=nlp(text)
new_tokens=[]

# Using is_stop attribute of each token to check if it's a stopword
for token in doc:
  if token.is_stop==False:
    new_tokens.append(token.text)

" ".join(new_tokens)

1.4 Adding custom stopwords in spaCy

Difficulty Level : L1

Q. Add the custom stopwords “NIL” and “JUNK” in spaCy and remove the stopwords from the text below

1.4.1 Input

text=" Jonas was a JUNK great guy NIL Adam was evil NIL Martha JUNK was more of a fool "

1.4.2 Expected Output

  'Jonas great guy Adam evil Martha fool'

1.4.3 Solution

import spacy

nlp=spacy.load("en_core_web_sm")
# list of custom stop words
customize_stop_words = ['NIL','JUNK']

# Adding these stop words
for w in customize_stop_words:
    nlp.vocab[w].is_stop = True
doc = nlp(text)
tokens = [token.text for token in doc if not token.is_stop]

" ".join(tokens)

1.5 Removing punctuation

Difficulty Level : L1

Q. Remove all punctuation from the given text

1.5.1 Input

text="The match has concluded !!! India has won the match . Will we fin the finals too ? !"

1.5.2 Desired Output

'The match has concluded India has won the match Will we fin the finals too'

1.5.3 Solution

1.5.3.1 Method 1: Removing punctuation in spaCy
# Removing punctuation in spaCy
import spacy

nlp=spacy.load("en_core_web_sm")
doc=nlp(text)
new_tokens=[]
# Check if a token is a punctuation through is_punct attribute
for token in doc:
  if token.is_punct==False:
    new_tokens.append(token.text)

" ".join(new_tokens)
1.5.3.2 Method 2: Removing punctuation in nltk with RegexpTokenizer
# Method 2
# Removing punctuation in nltk with RegexpTokenizer
import nltk

# Keep only runs of word characters, which drops the punctuation
tokenizer=nltk.RegexpTokenizer(r"\w+")

tokens=tokenizer.tokenize(text)
" ".join(tokens)

1.6 Merging words into phrases with bigrams (very important)

Difficulty Level : L3

Traditional workflows tokenize into words; splitting into phrases is rare unless you use a more complex dependency-parsing tool. The goal of this exercise is to merge word pairs that frequently occur together into single phrases.

The core tool is Gensim's Phraser.

1.6.1 Input

documents = ["the mayor of new york was there", "new york mayor was present"]

1.6.2 Desired Output

['the', 'mayor', 'of', 'new york', 'was', 'there']
['new york', 'mayor', 'was', 'present']

1.6.3 Solution

# Import Phraser from gensim
from gensim.models import Phrases
from gensim.models.phrases import Phraser

sentence_stream = [doc.split(" ") for doc in documents]

# Creating bigram phraser
# Note: with gensim >= 4.0, pass delimiter=' ' (a str) rather than bytes
bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
bigram_phraser = Phraser(bigram)

for sent in sentence_stream:
    tokens_ = bigram_phraser[sent]
    print(tokens_)

1.7 Counting bigrams and trigrams (very important)

Difficulty Level : L3

1.7.1 Input

text="Machine learning is a neccessary field in today's world. Data science can do wonders . Natural Language Processing is how machines understand text "

1.7.2 Desired Output

Bigrams are [('machine', 'learning'), ('learning', 'is'), ('is', 'a'), ('a', 'neccessary'), ('neccessary', 'field'), ('field', 'in'), ('in', "today's"), ("today's", 'world.'), ('world.', 'data'), ('data', 'science'), ('science', 'can'), ('can', 'do'), ('do', 'wonders'), ('wonders', '.'), ('.', 'natural'), ('natural', 'language'), ('language', 'processing'), ('processing', 'is'), ('is', 'how'), ('how', 'machines'), ('machines', 'understand'), ('understand', 'text')]
 Trigrams are [('machine', 'learning', 'is'), ('learning', 'is', 'a'), ('is', 'a', 'neccessary'), ('a', 'neccessary', 'field'), ('neccessary', 'field', 'in'), ('field', 'in', "today's"), ('in', "today's", 'world.'), ("today's", 'world.', 'data'), ('world.', 'data', 'science'), ('data', 'science', 'can'), ('science', 'can', 'do'), ('can', 'do', 'wonders'), ('do', 'wonders', '.'), ('wonders', '.', 'natural'), ('.', 'natural', 'language'), ('natural', 'language', 'processing'), ('language', 'processing', 'is'), ('processing', 'is', 'how'), ('is', 'how', 'machines'), ('how', 'machines', 'understand'), ('machines', 'understand', 'text')]

1.7.3 Solution

# Method 1: using nltk.ngrams
from nltk import ngrams
bigram = list(ngrams(text.lower().split(), 2))
trigram = list(ngrams(text.lower().split(), 3))

print(" Bigrams are",bigram)
print(" Trigrams are", trigram)


# Method 2: a hand-rolled ngram helper
def ngram(text, n):
    # Split the input text into a list of words on whitespace
    words = text.split()
    # Build the list of n-grams
    ngram_list = []
    for i in range(len(words) - n + 1):
        ngram_list.append(' '.join(words[i:i+n]))
    return ngram_list
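
A quick usage check of the helper above:

# Example: bigrams from a short sentence
print(ngram("Machine learning is fun", 2))
#> ['Machine learning', 'learning is', 'is fun']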

2 Tokenization

2.1 Tokenizing with NLTK or spaCy

Difficulty Level : L1

2.1.1 Input

text="Last week, the University of Cambridge shared its own research that shows if everyone wears a mask outside home,dreaded ‘second wave’ of the pandemic can be avoided."

2.1.2 Desired Output

Last
week
,
the
University
of
Cambridge
shared
...(truncated)...

2.1.3 Solution

# Method 1: Tokenization with nltk
import nltk
tokens = nltk.word_tokenize(text)
for token in tokens:
  print(token)
  

# Method 2: Tokenization with spaCy
import spacy
lm = spacy.load("en_core_web_sm")
tokens = lm(text)
for token in tokens:
  print(token.text)  

2.2 Tokenizing with transformers (very important)

Difficulty Level : L1

2.2.1 Input

text="I love spring season. I go hiking with my friends"

2.2.2 Desired Output

[101, 1045, 2293, 3500, 2161, 1012, 1045, 2175, 13039, 2007, 2026, 2814, 102]

[CLS] i love spring season. i go hiking with my friends [SEP]

2.2.3 Solution

from transformers import AutoTokenizer

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Encoding with the tokenizer
inputs = tokenizer.encode(text)
print(inputs)
# The tokenizer can also be called directly
print(tokenizer(text))

# Decode the ids back to text
print(tokenizer.decode(inputs))
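
To see the subword tokens behind those ids, a minimal sketch using the tokenizer's convert_ids_to_tokens method:

# Map each id back to its subword token (includes the [CLS] and [SEP] specials)
print(tokenizer.convert_ids_to_tokens(inputs))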

2.3 Tokenizing with stopwords as delimiters

Difficulty Level : L2

Q. Tokenize the given text with the stop words ("is", "the", "was") as delimiters. Tokenizing this way identifies meaningful phrases, which is sometimes useful for topic modeling.

2.3.1 Input

text = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know.""

2.3.2 Expected Output

['Walter',
 'feeling anxious',
 'He',
 'diagnosed today',
 'He probably',
 'best person I know']

2.3.3 Solution

text = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."

# Replace each stopword/punctuation mark with a sentinel, then split on it
stop_words_and_delims = ['was', 'is', 'the', '.', ',', '-', '!', '?']
for r in stop_words_and_delims:
    text = text.replace(r, 'DELIM')

words = [t.strip() for t in text.split('DELIM')]
words_filtered = list(filter(lambda a: a not in [''], words))
print(words_filtered)
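
A minimal alternative sketch (not in the original) that performs the same split with a single regular expression, assuming the same delimiters:

import re

text = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."
# Split on whole-word stopwords or punctuation, then drop empty pieces
parts = re.split(r'\b(?:was|is|the)\b|[.,\-!?]', text)
print([p.strip() for p in parts if p.strip()])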

2.4 How to tokenize tweets and other web text?

Difficulty Level : L2

2.4.1 Input

text=" Having lots of fun #goa #vaction #summervacation. Fancy dinner @Beachbay restro :) "

2.4.2 Desired Output

['Having',
 'lots',
 'of',
 'fun',
 'goa',
 'vaction',
 'summervacation',
 'Fancy',
 'dinner',
 'Beachbay',
 'restro']

2.4.3 Solution

import re
# Cleaning the tweet: replace every non-word character (including # and @) with a space
text=re.sub(r'[^\w]', ' ', text)

# Using nltk's TweetTokenizer
from nltk.tokenize import TweetTokenizer
tokenizer=TweetTokenizer()
print(tokenizer.tokenize(text))
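
TweetTokenizer can also handle the raw tweet directly, without the regex cleanup. A minimal sketch (an alternative not in the original) using its strip_handles and reduce_len options, which keeps hashtags as single tokens and drops the @Beachbay handle:

from nltk.tokenize import TweetTokenizer

raw = " Having lots of fun #goa #vaction #summervacation. Fancy dinner @Beachbay restro :) "
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
print(tokenizer.tokenize(raw))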

3 Basic Sentence Operations

3.1 How to split a document into sentences?

Difficulty Level : L1

Q. Print the sentences of the given text document

3.1.1 Input

text="""The outbreak of coronavirus disease 2019 (COVID-19) has created a global health crisis that has had a deep impact on the way we perceive our world and our everyday lives. Not only the rate of contagion and patterns of transmission threatens our sense of agency, but the safety measures put in place to contain the spread of the virus also require social distancing by refraining from doing what is inherently human, which is to find solace in the company of others. Within this context of physical threat, social and physical distancing, as well as public alarm, what has been (and can be) the role of the different mass media channels in our lives on individual, social and societal levels? Mass media have long been recognized as powerful forces shaping how we experience the world and ourselves. This recognition is accompanied by a growing volume of research, that closely follows the footsteps of technological transformations (e.g. radio, movies, television, the internet, mobiles) and the zeitgeist (e.g. cold war, 9/11, climate change) in an attempt to map mass media major impacts on how we perceive ourselves, both as individuals and citizens. Are media (broadcast and digital) still able to convey a sense of unity reaching large audiences, or are messages lost in the noisy crowd of mass self-communication? """

3.1.2 Desired Output

The outbreak of coronavirus disease 2019 (COVID-19) has created a global health crisis that has had a deep impact on the way we perceive our world and our everyday lives.
Not only the rate of contagion and patterns of transmission threatens our sense of agency, but the safety measures put in place to contain the spread of the virus also require social distancing by refraining from doing what is inherently human, which is to find solace in the company of others.
Within this context of physical threat, social and physical distancing, as well as public alarm, what has been (and can be)
...(truncated)...

3.1.3 Solution

# Method 1: using spaCy
import spacy
lm = spacy.load('en_core_web_sm')
doc = lm(text)
for sentence in doc.sents:
  print(sentence)

# Method 2: using NLTK
import nltk
print(nltk.sent_tokenize(text))

3.2 How to get the dependency parse of a sentence (as JSON)?

Difficulty Level : L3

3.2.1 Input

text1="Netflix has released a new series"
text2="It was shot in London"
text3="It is called Dark and the main character is Jonas"
text4="Adam is the evil character"

3.2.2 Desired Output

{'id': 0,
 'paragraphs': [{'cats': [],
   'raw': 'Netflix has released a new series',
   'sentences': [{'brackets': [],
     'tokens': [{'dep': 'nsubj',
       'head': 2,
       'id': 0,
       'ner': 'U-ORG',
       'orth': 'Netflix',
       'tag': 'NNP'},
      {'dep': 'aux',
       'head': 1,
       'id': 1,
       'ner': 'O',
       'orth': 'has',
       'tag': 'VBZ'},
      {'dep': 'ROOT',
       'head': 0,
       'id': 2,
       'ner': 'O',
       'orth': 'released',
       'tag': 'VBN'},
      {'dep': 'det', 'head': 2, 'id': 3, 'ner': 'O', 'orth': 'a', 'tag': 'DT'},
      {'dep': 'amod',
       'head': 1,
       'id': 4,
       'ner': 'O',
       'orth': 'new',
       'tag': 'JJ'},
      {'dep': 'dobj',
       'head': -3,
       'id': 5,
       'ner': 'O',
       'orth': 'series',
       'tag': 'NN'}]}]},
    ...(truncated)

3.2.3 Solution

# Convert into spaCy documents
import spacy
nlp = spacy.load("en_core_web_sm")
doc1=nlp(text1)
doc2=nlp(text2)
doc3=nlp(text3)
doc4=nlp(text4)

# Import docs_to_json (spaCy v2; in spaCy v3 it moved to spacy.training)
from spacy.gold import docs_to_json

# Converting into json format
json_data = docs_to_json([doc1,doc2,doc3,doc4])
print(json_data)

3.3 Stemming

Difficulty Level : L2

3.3.1 Input

text= "Dancing is an art. Students should be taught dance as a subject in schools . I danced in many of my school function. Some people are always hesitating to dance."

3.3.2 Desired Output

text= 'danc is an art . student should be taught danc as a subject in school . I danc in mani of my school function . some peopl are alway hesit to danc .'

3.3.3 Solution

import nltk
from nltk.stem import PorterStemmer
stemmer=PorterStemmer()
stemmed_tokens=[]
for token in nltk.word_tokenize(text):
  stemmed_tokens.append(stemmer.stem(token))

" ".join(stemmed_tokens)

# Other available stemmers:
# 1. Porter
# 2. Snowball (more commonly used)
# 3. Lancaster
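
Since Snowball is noted above as the more common choice, here is a minimal sketch with NLTK's SnowballStemmer (same loop, different stemmer):

from nltk.stem import SnowballStemmer

snowball = SnowballStemmer("english")
" ".join(snowball.stem(token) for token in nltk.word_tokenize(text))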

3.4 Lemmatization

Difficulty Level : L2

Q. Perform lemmatization on the given text

Hint: Lemmatization Approaches

Although stemming and lemmatization are rigorously distinguished in academic work, in most projects lemmatization alone is sufficient.

3.4.1 Input

text= "Dancing is an art. Students should be taught dance as a subject in schools . I danced in many of my school function. Some people are always hesitating to dance."

3.4.2 Desired Output

text= 'dancing be an art . student should be teach dance as a subject in school . -PRON- dance in many of -PRON- school function . some people be always hesitate to dance .'

3.4.3 Solution

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

# The '-PRON-' lemma in the desired output comes from spaCy v2; v3 keeps the pronoun's own form
lemmatized=[token.lemma_ for token in doc]
print(" ".join(lemmatized))

3.5 Spelling correction (important)

Difficulty Level : L2

Q. Correct the spelling errors in the following text

3.5.1 Input

text="He is a gret person. He beleives in bod"

3.5.2 Desired Output

text="He is a great person. He believes in god"

3.5.3 Solution

# Method 1: textblob
from textblob import TextBlob

# Using textblob's correct() function
text=TextBlob(text)
print(text.correct())

# Method 2: wordsegment can also be used (it splits run-together words rather than fixing typos)
from wordsegment import load, segment

load()
ent = 'Information Extraction'.strip()
new_ent = ' '.join(segment(ent))
print(new_ent)

4 Information Extraction

4.1 How to extract email usernames from a document?

Difficulty Level : L2

4.1.1 Input

text= "The new registrations are potter709@gmail.com , elixir101@gmail.com. If you find any disruptions, kindly contact granger111@gamil.com or severus77@gamil.com "

4.1.2 Desired Output

['potter709', 'elixir101', 'granger111', 'severus77']

4.1.3 Solution

import re  

# \S matches any non-whitespace character 
# @ for as in the Email 
# + for Repeats a character one or more times 
usernames= re.findall(r'(\S+)@', text)
print(usernames) 
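
A related sketch (an addition, not in the original task) for extracting the full email addresses instead of just the usernames:

# Non-whitespace, '@', then the domain ending in a dot and word characters
emails = re.findall(r'\S+@\S+\.\w+', text)
print(emails)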

4.2 Extracting all nouns from a document

Difficulty Level : L2

The idea: tokenize, obtain each token's part-of-speech tag, and keep only the tags you want.

4.2.1 Input

text="James works at Microsoft. She lives in manchester and likes to play the flute"

4.2.2 Desired Output

James
Microsoft
manchester
flute

4.2.3 Solution

# Converting the text into a spacy Doc
nlp=spacy.load("en_core_web_sm")
doc=nlp(text)

for token in doc:
  if token.pos_=='NOUN' or token.pos_=='PROPN':
    print(token.text)

4.3 Extracting all pronouns from a document

Difficulty Level : L2

Q. Extract and print all the pronouns in the text

4.3.1 Input

text="John is happy finally. He had landed his dream job finally. He told his mom. She was elated "

4.3.2 Desired Output

He
He
She

4.3.3 Solution

# Using spacy's pos_ attribute to check for part of speech tags
nlp=spacy.load("en_core_web_sm")
doc=nlp(text)

for token in doc:
  if token.pos_=='PRON':
    print(token.text)

4.4 Resolving pronouns to the names they refer to (coreference)

Difficulty Level : L2

4.4.1 Input

text=" My sister has a dog and she loves him"

4.4.2 Desired Output

[My sister,she]
[a dog ,him ]

4.4.3 Solution

# Install and import the neuralcoref library (note: neuralcoref requires spaCy 2.x)
!pip install neuralcoref
import spacy
import neuralcoref

# Add it to the pipeline ('en' is the spaCy 2.x shortcut for the English model)
nlp = spacy.load('en')
neuralcoref.add_to_pipe(nlp)

# Printing the coreferences
doc1 = nlp('My sister has a dog. She loves him.')
print(doc1._.coref_clusters)

# Visualization
# spaCy also provides a coreference visualizer. Check out https://spacy.io/universe/project/neuralcoref-vizualizer/.
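
To actually rewrite the pronouns with the phrases they refer to (the goal stated in the section title), a minimal sketch using neuralcoref's coref_resolved extension:

# The text with each pronoun replaced by the main mention of its cluster
print(doc1._.coref_resolved)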

5 Text Similarity

5.1 Extracting the most frequent words, excluding stopwords (important)

Difficulty Level : L2

5.1.1 Input

text="""Junkfood - Food that do no good to our body. And there's no need of them in our body but still we willingly eat them because they are great in taste and easy to cook or ready to eat. Junk foods have no or very less nutritional value and irrespective of the way they are marketed, they are not healthy to consume.The only reason of their gaining popularity and increased trend of consumption is 
that they are ready to eat or easy to cook foods. People, of all age groups are moving towards Junkfood as it is hassle free and often ready to grab and eat. Cold drinks, chips, noodles, pizza, burgers, French fries etc. are few examples from the great variety of junk food available in the market.
 Junkfood is the most dangerous food ever but it is pleasure in eating and it gives a great taste in mouth examples of Junkfood are kurkure and chips.. cold rings are also source of junk food... they shud nt be ate in high amounts as it results fatal to our body... it cn be eated in a limited extend ... in research its found tht ths junk foods r very dangerous fr our health
Junkfood is very harmful that is slowly eating away the health of the present generation. The term itself denotes how dangerous it is for our bodies. Most importantly, it tastes so good that people consume it on a daily basis. However, not much awareness is spread about the harmful effects of Junkfood .
The problem is more serious than you think. Various studies show that Junkfood impacts our health negatively. They contain higher levels of calories, fats, and sugar. On the contrary, they have very low amounts of healthy nutrients and lack dietary fibers. Parents must discourage their children from consuming junk food because of the ill effects it has on one’s health.
Junkfood is the easiest way to gain unhealthy weight. The amount of fats and sugar in the food makes you gain weight rapidly. However, this is not a healthy weight. It is more of fats and cholesterol which will have a harmful impact on your health. Junk food is also one of the main reasons for the increase in obesity nowadays.
This food only looks and tastes good, other than that, it has no positive points. The amount of calorie your body requires to stay fit is not fulfilled by this food. For instance, foods like French fries, burgers, candy, and cookies, all have high amounts of sugar and fats. Therefore, this can result in long-term illnesses like diabetes and high blood pressure. This may also result in kidney failure."""

5.1.2 Desired Output

text= {Junkfood: 10,
 food: 8,
 good: 5,
 harmful : 3
 body: 1,
 need: 1,

 ...(truncated)

5.1.3 Solution

from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

# Removal of stop words and punctuations
words = [str(token).strip().lower() for token in doc if token.is_stop==False and token.is_punct==False]

freq_dict = Counter(words)
print(freq_dict)
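
Counter keeps every token; to show only the top entries (as in the desired output), use Counter's most_common method:

# The 10 most frequent non-stopword tokens
print(freq_dict.most_common(10))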

5.2 Similarity between two words

5.2.1 Input

word1="amazing"
word2="terrible"
word3="excellent"

5.2.2 Desired Output

#> similarity between amazing and terrible is 0.46189071343764604
#> similarity between amazing and excellent is 0.6388207086737778

5.2.3 Solution

import spacy
!python -m spacy download en_core_web_lg
nlp=spacy.load('en_core_web_lg')
token1=nlp(word1)
token2=nlp(word2)
token3=nlp(word3)

print('similarity between', word1,'and' ,word2, 'is' ,token1.similarity(token2))
print('similarity between', word1,'and' ,word3, 'is' ,token1.similarity(token3))

5.3 Similarity between two documents

Difficulty Level : L2

5.3.1 Input

text1="John lives in Canada"
text2="James lives in America, though he's not from there"

5.3.2 Desired Output

 0.792817083631068

5.3.3 Solution

!python -m spacy download en_core_web_lg
import spacy

nlp=spacy.load('en_core_web_lg')

# Convert both texts into Docs and use the built-in similarity method
doc1=nlp(text1)
doc2=nlp(text2)
print(doc1.similarity(doc2))