NLP (I): Regular Expressions
Fetching Twitter data with tweepy
Scraping
Note: the bearer_token in the code below comes from my own Twitter developer account. To run this yourself you need to apply for your own developer account and token, which will differ from mine, so do not copy this one verbatim.
import tweepy
client = tweepy.Client(bearer_token='AAAAAAAAAAAAAAAAAAAAACZyhAEAAAAAwq%2ByW0tqZ4TbIbtKlDx3w9dpghc%3DG7qdk7bQvPvl5r9W5fKXhLAx4SFSiOD32I5kIZjVISOF79F2WG')
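As a safer alternative (a sketch, not part of the original notebook), the token can be read from an environment variable instead of being pasted into the source; BEARER_TOKEN is an assumed variable name:

```python
import os

# Assumed environment variable name; set it before running, e.g.
#   export BEARER_TOKEN='AAAA...'
bearer_token = os.environ.get('BEARER_TOKEN', '')

# The client would then be created as:
# client = tweepy.Client(bearer_token=bearer_token)
```

This keeps the secret out of version control.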
query = 'nft lang:en -is:retweet -has:links' # tweets containing 'nft', written in English, not retweets, without links
tweets = list(tweepy.Paginator(client.search_recent_tweets, query=query,
                               tweet_fields=['context_annotations', 'created_at'],
                               max_results=100).flatten(limit=10000))
print("{} tweets are collected.".format(len(tweets)))
10000 tweets are collected.
Saving the data as a CSV file
import csv
driveFolderDirectory = './'
savedFileName = 'tweets.csv'
pathToSave = driveFolderDirectory + savedFileName
with open(pathToSave, 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['idx', 'tweetId', 'createdTime', 'tweetText']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for i, tweet in enumerate(tweets):
        writer.writerow({'idx': i, 'tweetId': tweet.id, 'createdTime': tweet.data['created_at'], 'tweetText': tweet.data['text']})
Converting all uppercase to lowercase
Read in the CSV file
csv_data = []
with open(pathToSave, encoding='utf-8') as csvfile:
    for row in csv.DictReader(csvfile, skipinitialspace=True):
        csv_data.append(row)
Process the data
for row in csv_data:
    row['tweetText'] = row['tweetText'].lower()
Removing hashtags, URLs, and numbers
Text matching each of these regular expressions is removed with re.sub().
regex_task2 = [r'\B(\#[a-zA-Z0-9]+\b)(?!;)',
               r'\b(http(s)?:\/\/)?(www.)?([a-zA-Z0-9])+([\-\.]{1}[a-zA-Z0-9]+)*\.[a-zA-Z]{2,5}(:[0-9]{1,5})?(\/[^\s]*)?\b',
               r'\d?\.?\d+']
import re
for row in csv_data:
    for ptn in regex_task2:
        row['tweetText'] = re.sub(ptn, '', row['tweetText'])
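A quick sanity check of the three patterns on an invented sample tweet (the text and values below are made up for illustration):

```python
import re

regex_task2 = [r'\B(\#[a-zA-Z0-9]+\b)(?!;)',  # hashtags
               r'\b(http(s)?:\/\/)?(www.)?([a-zA-Z0-9])+([\-\.]{1}[a-zA-Z0-9]+)*\.[a-zA-Z]{2,5}(:[0-9]{1,5})?(\/[^\s]*)?\b',  # URLs / domains
               r'\d?\.?\d+']  # numbers

sample = 'check my #nft drop at opensea.io price 0.5 eth'
for ptn in regex_task2:
    sample = re.sub(ptn, '', sample)
print(sample)  # the hashtag, the domain, and the price are all stripped out
```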
Removing suffixes
Strip -ing, -ed, and -ly from the end of each word. Note that only the suffix is removed, not the whole word.
regex_task3 = [r'([a-zA-Z]+)ing\b', r'([a-zA-Z]+)ed\b', r'([a-zA-Z]+)ly\b']
for row in csv_data:
    for ptn in regex_task3:
        row['tweetText'] = re.sub(ptn, r'\1', row['tweetText'])
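Applied to a few invented words, the three suffix patterns behave like this; note this is crude suffix stripping rather than real stemming, so 'running' becomes 'runn':

```python
import re

regex_task3 = [r'([a-zA-Z]+)ing\b', r'([a-zA-Z]+)ed\b', r'([a-zA-Z]+)ly\b']

text = 'running played quickly'
for ptn in regex_task3:
    text = re.sub(ptn, r'\1', text)
print(text)  # 'runn play quick'
```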
Lemmatization
Replace every infinitive 'to do' form with the bare 'do'.
regex_task4 = r'\bto\b\s+([a-zA-Z]+)\b'
for row in csv_data:
    row['tweetText'] = re.sub(regex_task4, r'\1', row['tweetText'])
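A small check on invented sentences; note the pattern drops 'to' before any following word, so prepositional uses such as 'go to school' are rewritten as well:

```python
import re

regex_task4 = r'\bto\b\s+([a-zA-Z]+)\b'
print(re.sub(regex_task4, r'\1', 'i want to buy art'))  # 'i want buy art'
print(re.sub(regex_task4, r'\1', 'we go to school'))    # 'we go school'
```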
Removing stop words
The stop words used here are the ten most common English words given by Wikipedia: https://en.wikipedia.org/wiki/Most_common_words_in_English
regex_task5 = [r'\bthe\b', r'\bbe\b', r'\bto\b', r'\bof\b', r'\band\b', r'\ba\b', r'\bin\b', r'\bthat\b', r'\bhave\b', r'\bi\b']  # 'i' must be lowercase here, since the text was lowercased earlier
for row in csv_data:
    for ptn in regex_task5:
        row['tweetText'] = re.sub(ptn, "", row['tweetText'])
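Because each match is replaced with an empty string, the removal leaves stray whitespace behind; a follow-up whitespace collapse (a small addition, not in the original) tidies this up:

```python
import re

regex_task5 = [r'\bthe\b', r'\bbe\b', r'\bto\b', r'\bof\b', r'\band\b',
               r'\ba\b', r'\bin\b', r'\bthat\b', r'\bhave\b', r'\bi\b']

text = 'the cat in a hat'
for ptn in regex_task5:
    text = re.sub(ptn, "", text)
print(repr(text))                         # stray spaces remain
text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
print(text)  # 'cat hat'
```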
Tokenization
tokens is a list; each of its elements is one sentence split into tokens.
tokens = []
for row in csv_data:
    sentences = re.split(r' *[\.\?!][\'"\)\]]* *', row['tweetText'])
    for sentence in sentences:
        raw_words = sentence.split()
        pure_words = []
        for raw_word in raw_words:
            pure_words.append(re.sub(r'[^a-z]', "", raw_word))
        tokens.append(pure_words)
print(tokens[0])
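One caveat: the [^a-z] cleanup can leave empty strings inside pure_words (e.g. a token that was all digits or punctuation). A filtering pass, sketched here as a standalone helper, drops them:

```python
import re

def tokenize(text):
    # Same sentence/word split as above, but empty tokens are filtered out.
    token_lists = []
    for sentence in re.split(r' *[\.\?!][\'"\)\]]* *', text):
        words = [re.sub(r'[^a-z]', '', w) for w in sentence.split()]
        words = [w for w in words if w]  # drop tokens emptied by the cleanup
        if words:
            token_lists.append(words)
    return token_lists

print(tokenize('nfts are cool! buy now 123'))
```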
Simple visualization
Plot a histogram of the length of each sentence.
import matplotlib.pyplot as plt
lengths = []
for token_list in tokens:
    lengths.append(len(token_list))
plt.hist(lengths)
plt.show()