An example of using spaCy for tokenization, entity recognition, and labeling

Download the data:

aws s3 cp s3://applied-nlp-book/data/ data --recursive --no-sign-request
aws s3 cp s3://applied-nlp-book/models/ag_dataset/ models/ag_dataset --recursive --no-sign-request

The first download above is close to 1 GB; the second is close to 3 GB.

Example code:

import spacy

# load the pretrained transformer pipeline (en_core_web_trf is built on RoBERTa-base)
nlp = spacy.load("en_core_web_trf")

# tokenize the sentence without running the rest of the pipeline
sentence = nlp.tokenizer("We live in Paris.")
print("The tokens:")
for token in sentence:
    print(token)
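Running this script should print each token on its own line, roughly as follows:

The tokens:
We
live
in
Paris
.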


import os

import pandas as pd

cwd = os.getcwd()
# read the Jeopardy! questions from the CSV file downloaded above
data = pd.read_csv(os.path.join(cwd, "data", "jeopardy_questions", "jeopardy_questions.csv"))
# normalize the column names: lowercase and strip surrounding whitespace
data.columns = map(lambda x: x.lower().strip(), data.columns)
# keep only the first 1,000 rows to limit processing time
data = data[0:1000]
# run the full spaCy pipeline on every question
data["question_tokens"] = data["question"].apply(lambda x: nlp(x))

# inspect the first row
example_question = data.question[0]
example_question_tokens = data.question_tokens[0]
print("The first question is:")
print(example_question)

print("The tokens from the first question are:")
for token in example_question_tokens:
    print(token)

A sample of the file's contents:

jeopardy_questions.csv:

Show Number, Air Date, Round, Category, Value, Question, Answer
4680,2004-12-31,Jeopardy!,"HISTORY","$200","For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory","Copernicus"
4680,2004-12-31,Jeopardy!,"ESPN's TOP 10 ALL-TIME ATHLETES","$200","No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves","Jim Thorpe"
4680,2004-12-31,Jeopardy!,"EVERYBODY TALKS ABOUT IT...","$200","The city of Yuma in this state has a record average of 4,055 hours of sunshine each year","Arizona"
4680,2004-12-31,Jeopardy!,"THE COMPANY LINE","$200","In 1963, live on ""The Art Linkletter Show"", this company served its billionth burger","McDonald's"
4680,2004-12-31,Jeopardy!,"EPITAPHS & TRIBUTES","$200","Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States","John Adams"
4680,2004-12-31,Jeopardy!,"3-LETTER WORDS","$200","In the title of an Aesop fable, this insect shared billing with a grasshopper","the ant"
4680,2004-12-31,Jeopardy!,"HISTORY","$400","Built in 312 B.C. to link Rome & the South of Italy, it's still in use today","the Appian Way"
4680,2004-12-31,Jeopardy!,"ESPN's TOP 10 ALL-TIME ATHLETES","$400","No. 8: 30 steals for the Birmingham Barons; 2,306 steals for the Bulls","Michael Jordan"
4680,2004-12-31,Jeopardy!,"EVERYBODY TALKS ABOUT IT...","$400","In the winter of 1971-72, a record 1,122 inches of snow fell at Rainier Paradise Ranger Station in this state","Washington"
4680,2004-12-31,Jeopardy!,"THE COMPANY LINE","$400","This housewares store was named for the packaging its merchandise came in & was first displayed on","Crate & Barrel"
4680,2004-12-31,Jeopardy!,"EPITAPHS & TRIBUTES","$400","""And away we go""","Jackie Gleason"
4680,2004-12-31,Jeopardy!,"3-LETTER WORDS","$400","Cows regurgitate this from the first stomach to the mouth & chew it again","the cud"
4680,2004-12-31,Jeopardy!,"HISTORY","$600","In 1000 Rajaraja I of the Cholas battled to take this Indian Ocean island now known for its tea","Ceylon (or Sri Lanka)"
4680,2004-12-31,Jeopardy!,"ESPN's TOP 10 ALL-TIME ATHLETES","$600","No. 1: Lettered in hoops, football & lacrosse at Syracuse & if you think he couldn't act, ask his 11 ""unclean"" buddies","Jim Brown"
4680,2004-12-31,Jeopardy!,"EVERYBODY TALKS ABOUT IT...","$600","On June 28, 1994 the nat'l weather service began issuing this index that rates the intensity of the sun's radiation","the UV index"
4680,2004-12-31,Jeopardy!,"THE COMPANY LINE","$600","This company's Accutron watch, introduced in 1960, had a guarantee of accuracy to within one minute a  month","Bulova"
4680,2004-12-31,Jeopardy!,"EPITAPHS & TRIBUTES","$600","Outlaw: ""Murdered by a traitor and a coward whose name is not worthy to appear here""","Jesse James"
4680,2004-12-31,Jeopardy!,"3-LETTER WORDS","$600","A small demon, or a mischievous child (who might be a little demon!)","imp"

Sample output:
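Given the first row of the CSV sample above, the script should print something like this (tokenization details may vary slightly with the model version):

The first question is:
For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory

The tokens from the first question are:
For
the
last
8
years
of
his
life
,
Galileo
was
under
house
arrest
for
espousing
this
man
's
theory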

label_spacy.py

Source code:

import spacy

nlp = spacy.load("en_core_web_trf")

example_sentence = "George Washington was an American political leader, military general, statesman, and Founding Father who served as the first president of the United States from 1789 to 1797.\n"

print(example_sentence)

# print each named entity with its character offsets and label
print("Text Start End Label")
doc = nlp(example_sentence)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

import requests

def returnGraphResult(query, key, entityType):
    # key is an API key for the Google Knowledge Graph Search API
    if entityType == "PERSON":
        google = f"https://kgsearch.googleapis.com/v1/entities:search?query={query}&key={key}"
        resp = requests.get(google)
        result = resp.json()["itemListElement"][0]["result"]
        url = result["detailedDescription"]["url"]
        description = result["detailedDescription"]["articleBody"]
        return url, description
    else:
        return "no_match", "no_match"

# for ent in doc.ents:
#     url, description = returnGraphResult(ent.text, key, ent.label_)
#     print(ent.text, ent.label_, url, description)
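As a sketch of a slightly more defensive variant (the params-based encoding, the limit parameter, the returnGraphResultSafe name, and the missing-field guards are additions here, not part of the original), letting requests build the query string avoids problems with spaces and punctuation in entity names:

import requests

def returnGraphResultSafe(query, key, entityType):
    # look up a PERSON entity in the Google Knowledge Graph Search API
    if entityType != "PERSON":
        return "no_match", "no_match"
    resp = requests.get(
        "https://kgsearch.googleapis.com/v1/entities:search",
        params={"query": query, "key": key, "limit": 1},  # requests URL-encodes these
    )
    items = resp.json().get("itemListElement", [])
    if not items:
        return "no_match", "no_match"  # no Knowledge Graph match for this query
    detail = items[0]["result"].get("detailedDescription", {})
    return detail.get("url", "no_match"), detail.get("articleBody", "no_match")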

Run:

$ /usr/bin/python3 label_spacy.py
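With en_core_web_trf, the entity output should look roughly like the following (the character offsets are computed from the example sentence; the exact entities and labels can vary with the model version):

George Washington 0 17 PERSON
American 25 33 NORP
first 119 124 ORDINAL
the United States 138 155 GPE
1789 to 1797 161 173 DATE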

Noun chunks:

import spacy

# load the pretrained transformer pipeline (en_core_web_trf is built on RoBERTa-base)
nlp = spacy.load("en_core_web_trf")

sentence = nlp("My parents live in New York City.")

# individual tokens
for token in sentence:
    print(token.text)

# base noun phrases
for chunk in sentence.noun_chunks:
    print(chunk.text)

Run:
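The tokens should print first, followed by the two noun chunks; roughly:

My
parents
live
in
New
York
City
.
My parents
New York City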
