NLP (I): Regular Expressions

Fetching Twitter data with tweepy

Scraping

Note: the bearer_token in the code below comes from my own Twitter developer account. To run this yourself you need to apply for a developer account and get your own token; do not copy mine directly.

import tweepy
client = tweepy.Client(bearer_token='AAAAAAAAAAAAAAAAAAAAACZyhAEAAAAAwq%2ByW0tqZ4TbIbtKlDx3w9dpghc%3DG7qdk7bQvPvl5r9W5fKXhLAx4SFSiOD32I5kIZjVISOF79F2WG')

query = 'nft lang:en -is:retweet -has:links' # tweets containing 'nft', written in English, not retweets, and without links
tweets = list(tweepy.Paginator(client.search_recent_tweets, query=query, tweet_fields=['context_annotations', 'created_at'], max_results=100).flatten(limit=10000))
print("{} tweets are collected.".format(len(tweets)))
10000 tweets are collected.
Saving the data as a CSV file
import csv


driveFolderDirectory = './'
savedFileName = 'tweets.csv'
pathToSave = driveFolderDirectory + savedFileName

with open(pathToSave, 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['idx','tweetId', 'createdTime', 'tweetText']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for i,tweet in enumerate(tweets):
        writer.writerow({'idx': i, 'tweetId': tweet.id, 'createdTime': tweet.data['created_at'], 'tweetText': tweet.data['text']})

Converting everything to lowercase

Reading in the CSV file

csv_data = []
with open(pathToSave, encoding='utf-8') as csvfile:
    for row in csv.DictReader(csvfile, skipinitialspace=True):
        csv_data.append(row)

Manipulating the data

for row in csv_data:
    row['tweetText'] = row['tweetText'].lower()

Removing hashtags, URLs, and numbers

To remove substrings matching a given regular expression, use the re.sub() method.

regex_task2 = [r'\B(\#[a-zA-Z0-9]+\b)(?!;)', 
               r'\b(http(s)?:\/\/)?(www\.)?([a-zA-Z0-9])+([\-\.]{1}[a-zA-Z0-9]+)*\.[a-zA-Z]{2,5}(:[0-9]{1,5})?(\/[^\s]*)?\b',
               r'\d?\.?\d+']
import re

for row in csv_data:
    for ptn in regex_task2:
        row['tweetText'] = re.sub(ptn, '', row['tweetText'])
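As a quick sanity check, we can run the same three patterns over a made-up example string (not real tweet data) and confirm that the hashtag, the bare domain, and the digit all disappear:

```python
import re

# Same patterns as above: hashtags, URLs, numbers (applied in that order).
patterns = [r'\B(\#[a-zA-Z0-9]+\b)(?!;)',
            r'\b(http(s)?:\/\/)?(www\.)?([a-zA-Z0-9])+([\-\.]{1}[a-zA-Z0-9]+)*\.[a-zA-Z]{2,5}(:[0-9]{1,5})?(\/[^\s]*)?\b',
            r'\d?\.?\d+']

sample = "just sold 3 pieces #nftart details at opensea.io"  # fabricated example
for ptn in patterns:
    sample = re.sub(ptn, '', sample)
print(sample)  # '#nftart', 'opensea.io', and '3' are all gone
```

Note that re.sub() leaves the surrounding spaces behind, so the cleaned text contains doubled spaces; these are harmless here because tokenization later splits on whitespace.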

Removing suffixes

Strip -ing, -ed, and -ly from the end of each word. Note that only the suffix is removed, not the whole word.

regex_task3 = [r'([a-zA-Z]+)ing\b', r'([a-zA-Z]+)ed\b', r'([a-zA-Z]+)ly\b']


for row in csv_data:
    for ptn in regex_task3:
        row['tweetText'] = re.sub(ptn, r'\1', row['tweetText'])
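A small demonstration (on an invented sentence) shows what the capture group `\1` does: the stem is kept and only the suffix is dropped. It also shows a limitation of this crude approach, since "running" becomes the non-word "runn":

```python
import re

suffix_patterns = [r'([a-zA-Z]+)ing\b', r'([a-zA-Z]+)ed\b', r'([a-zA-Z]+)ly\b']

text = "she was running and quickly jumped"  # fabricated example
for ptn in suffix_patterns:
    text = re.sub(ptn, r'\1', text)
print(text)  # "she was runn and quick jump"
```

For real stemming you would normally reach for a library such as NLTK's PorterStemmer, but the regex version is enough to illustrate the mechanics here.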

Lemmatization

Replace every infinitive of the form "to do" with the bare verb "do".

regex_task4 = r'\bto\b\s+([a-zA-Z]+)\b'


for row in csv_data:
    row['tweetText'] = re.sub(regex_task4, r'\1', row['tweetText'])
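On a made-up sentence, the pattern matches "to" plus the following word and substitutes just the captured word, so every infinitive loses its "to":

```python
import re

pattern = r'\bto\b\s+([a-zA-Z]+)\b'
text = "i want to buy art and to sell it later"  # fabricated example
result = re.sub(pattern, r'\1', text)
print(result)  # "i want buy art and sell it later"
```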

Removing stop words

The stop words used here are the ten most common English words according to Wikipedia: https://en.wikipedia.org/wiki/Most_common_words_in_English

regex_task5 = [r'\bthe\b', r'\bbe\b', r'\bto\b', r'\bof\b', r'\band\b', r'\ba\b', r'\bin\b', r'\bthat\b', r'\bhave\b', r'\bi\b'] # lowercase 'i': the text was already lowercased above


for row in csv_data:
    for ptn in regex_task5:
        row['tweetText'] = re.sub(ptn, "", row['tweetText'])
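Again a quick check on a fabricated sentence: the `\b` anchors ensure only whole words are removed, so the "a" inside "stake" survives while the standalone "a" is deleted. The leftover double spaces are collapsed by split() during tokenization:

```python
import re

stopwords = [r'\bthe\b', r'\bbe\b', r'\bto\b', r'\bof\b', r'\band\b',
             r'\ba\b', r'\bin\b', r'\bthat\b', r'\bhave\b', r'\bi\b']

text = "i have a stake in the project"  # fabricated example
for ptn in stopwords:
    text = re.sub(ptn, '', text)
print(text.split())  # ['stake', 'project']
```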

Tokenization

tokens is a list whose elements are sentences split into individual tokens.

tokens = []

for row in csv_data:
    sentences = re.split(r' *[\.\?!][\'"\)\]]* *', row['tweetText'])
    for sentence in sentences:
        raw_words = sentence.split()
        pure_words = []
        for raw_word in raw_words:
            pure_words.append(re.sub(r'[^a-z]', "", raw_word))
        tokens.append(pure_words)
        
print(tokens[0])
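To see how the sentence-splitting regex behaves, here it is applied to an invented two-sentence string: it splits on ., ?, or ! (optionally followed by a closing quote or bracket) and swallows the surrounding spaces. If the text ends with a sentence terminator, re.split() also emits a trailing empty string, which is worth filtering out in practice:

```python
import re

sample = "great drop today! minting went fine. see you"  # fabricated example
parts = re.split(r' *[\.\?!][\'"\)\]]* *', sample)
print(parts)  # ['great drop today', 'minting went fine', 'see you']
```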

Simple visualization

Plot a histogram of the number of tokens in each sentence.

import matplotlib.pyplot as plt
 

lengths = []
for token_list in tokens:
    lengths.append(len(token_list))

plt.hist(lengths)
plt.show()
