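Preparing the data
The data is a novel stored in a single txt file. We want to analyze it chapter by chapter, as a list whose first element is the content of chapter 1, whose second element is the content of chapter 2, and so on. Regular expressions are the tool for tidying the data into that shape.
For example, let's look at how chapters appear in "01-Harry Potter and the Sorcerer's Stone.txt" by opening the txt file.
Searching through the file shows that every chapter heading follows the same pattern:
[Chapter][space][integer][newline \n][English title, possibly containing spaces][newline \n]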
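The structure above can be mapped piece by piece onto regex tokens. The following is a minimal sketch, not the code used in this post: `sample`, `heading`, and the group names `number` and `title` are made up for illustration, and the sample text is a stand-in for the real file.

```python
import re

# Hypothetical mini-sample that mimics the heading layout described above.
sample = (
    "Chapter 1\nThe Boy Who Lived\nBody text one.\n"
    "Chapter 2\nThe Vanishing Glass\nBody text two.\n"
)

# re.VERBOSE lets each piece of the pattern carry its own comment.
heading = re.compile(
    r"""
    Chapter                  # the literal word
    [ ]                      # one space (a class, because VERBOSE ignores bare spaces)
    (?P<number>\d+)          # the chapter number
    \n                       # newline after the number
    (?P<title>[a-zA-Z ]+)    # title made of letters and spaces
    \n                       # newline after the title
    """,
    re.VERBOSE,
)

# Collect (chapter number, title) pairs in document order.
headings = [(int(m.group("number")), m.group("title"))
            for m in heading.finditer(sample)]
print(headings)
```

Named groups make it possible to pull out the number and the title separately, which is handy for checking later that no chapter is missing.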
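Let's first get familiar with regular expressions by turning this structure into a template pattern that extracts the chapter headings.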
import re
import nltk
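# Read the whole novel into one string; the with-block closes the file afterwards
with open("data/01-Harry Potter and the Sorcerer's Stone.txt") as f:
    raw_text = f.read()
# Raw string, so that \d reaches the regex engine without an invalid-escape warning
pattern = r'Chapter \d+\n[a-zA-Z ]+\n'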
re.findall(pattern, raw_text)
['Chapter 1\nThe Boy Who Lived\n',
'Chapter 2\nThe Vanishing Glass\n',
'Chapter 3\nThe Letters From No One\n',
'Chapter 4\nThe Keeper Of The Keys\n',
'Chapter 5\nDiagon Alley\n',
'Chapter 7\nThe Sorting Hat\n',
'Chapter 8\nThe Potions Master\n',
'Chapter 9\nThe Midnight Duel\n',
'Chapter 10\nHalloween\n',
'Chapter 11\nQuidditch\n',
'Chapter 12\nThe Mirror Of Erised\n',
'Chapter 13\nNicholas Flamel\n',
'Chapter 14\nNorbert the Norwegian Ridgeback\n',
'Chapter 15\nThe Forbidden Forest\n',
'Chapter 16\nThrough the Trapdoor\n',
'Chapter 17\nThe Man With Two Faces\n']
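The output above jumps from Chapter 5 to Chapter 7. A likely cause is that the Chapter 6 title contains a character outside `[a-zA-Z ]`, such as a hyphen, so the pattern never reaches the closing newline. The following is a minimal sketch under that assumption; the sample text and the names `sample`, `strict`, and `loose` are made up for illustration. Replacing the title class with `[^\n]+` accepts any run of non-newline characters.

```python
import re

# Hypothetical sample: the middle title contains a hyphen.
sample = (
    "Chapter 5\nDiagon Alley\nBody text.\n"
    "Chapter 6\nThe Journey from Platform Nine and Three-Quarters\nBody text.\n"
    "Chapter 7\nThe Sorting Hat\nBody text.\n"
)

strict = r'Chapter \d+\n[a-zA-Z ]+\n'  # the pattern used above
loose = r'Chapter \d+\n[^\n]+\n'       # title = anything up to the next newline

strict_hits = re.findall(strict, sample)
loose_hits = re.findall(loose, sample)

print(len(strict_hits), len(loose_hits))
```

The looser pattern trades precision for recall: it would also match any body line that happens to follow a "Chapter N" line, so it is worth comparing the number of hits with the expected chapter count.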
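Now that we are familiar with the regex above, let's be more precise. I prepared a test text whose chapter headings look like those in the actual novel, only much shorter and therefore easier to follow. The test data has only 5 chapters, so the resulting list should have length 5: its first element is the content of chapter 1, its second element is the content of chapter 2, and so on.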
import re
test = """Chapter 1\nThe Boy Who Lived\nMr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense.\nMr. Dursley was the director of a firm called Grunnings,
Chapter 2\nThe Vanishing Glass\nFor a second, Mr. Dursley didn’t realize what he had seen — then he jerked his head around to look again. There was a tabby cat standing on the corner of Privet Drive, but there wasn’t a map in sight. What could he have been thinking of? It must have been a trick of the light. Mr. Dursley blinked and stared at the cat.
Chapter 3\nThe Letters From No One\nThe traffic moved on and a few minutes later, Mr. Dursley arrived in the Grunnings parking lot, his mind back on drills.\nMr. Dursley always sat with his back to the window in his office on the ninth floor. If he hadn’t, he might have found it harder to concentrate on drills that morning.
Chapter 4\nThe Keeper Of The Keys\nHe didn’t know why, but they made him uneasy. This bunch were whispering excitedly, too, and he couldn’t see a single collecting tin.
Chapter 5\nDiagon Alley\nIt was a few seconds before Mr. Dursley realized that the man was wearing a violet cloak. """
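# Build the list of chapter contents (first element is chapter 1, second element is chapter 2, ...)
# A condition filters out empty strings so the list length matches the expected number of chapters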
chapter_contents = [c