Suppose I have successfully scraped this text and assigned it to textToModify:
textToModify = "
abcde abcde
Title: Director, lorem company
Phone: 123.647.4555
Mobile: 123.123.1234 E-mail: try1@umich.edu Assistant: my name Assistant Phone: 667.889.9910
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Linkedin: www.linkedin.com/in/lorem-ipsum/
Twitter: www.twitter.com/ipsum
"
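Now I want to extract the title, name, phone number, LinkedIn, Twitter, and other important information from this text. Is there a library for this, or do you have any ideas? Assume the overall format of the text is random, but the word "Title" always appears right next to the title itself, the word "Phone" always appears right next to the phone number, and so on.
My initial ideas were:
The nltk library won't work, because it basically assigns tags to words, and the problem is that this text is not split into words but into characters. For example, accessing textToModify[20] returns only a single character.
My other idea: what if I visit the link, take a screenshot, run it through an image-to-text library in Python, and go from there?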
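Since each label always sits right next to its value, one possible direction is a plain regular expression over the known labels. This is only a minimal sketch under assumptions: the `LABELS` list and the `extract_fields` helper are hypothetical names made up for illustration, every label is assumed to be followed by a colon, and each value is assumed to end at the next label or at the end of its line.

```python
import re

# Hypothetical label list (an assumption); extend it to the labels that occur.
LABELS = ["Title", "Phone", "Mobile", "E-mail", "Assistant Phone",
          "Assistant", "Linkedin", "Twitter"]

# Try longer labels first so "Assistant Phone" wins over "Assistant".
_alternatives = "|".join(
    re.escape(label) for label in sorted(LABELS, key=len, reverse=True)
)
LABEL_RE = re.compile(rf"\b({_alternatives})\s*:", re.IGNORECASE)


def extract_fields(text):
    """Return {label: value} for each known label found in text."""
    canonical = {label.lower(): label for label in LABELS}
    matches = list(LABEL_RE.finditer(text))
    fields = {}
    for i, match in enumerate(matches):
        # The value runs until the next label (or the end of the text) ...
        stop = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        segment = text[match.end():stop].strip()
        # ... but never past the end of its own line.
        value = segment.splitlines()[0].strip() if segment else ""
        # Keep the first occurrence if a label is repeated.
        fields.setdefault(canonical[match.group(1).lower()], value)
    return fields


sample = (
    "abcde abcde\n"
    "Title: Director, lorem company\n"
    "Phone: 123.647.4555\n"
    "Mobile: 123.123.1234 E-mail: try1@umich.edu Assistant: my name "
    "Assistant Phone: 667.889.9910\n"
    "Lorem Ipsum is simply dummy text.\n"
    "Linkedin: www.linkedin.com/in/lorem-ipsum/\n"
    "Twitter: www.twitter.com/ipsum\n"
)

for label, value in extract_fields(sample).items():
    print(f"{label}: {value}")
```

The name on the first line has no label, so this sketch does not pick it up; that part would need a separate rule.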
Thank you!