Text Analysis: Accessing Files with NLTK

小白xyz

Category: Text Analysis

Article link: https://blog.csdn.net/kevinelstri/article/details/70145629


# -*-coding:utf-8-*-

from __future__ import division
import nltk, re, pprint

"""
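    Accessing text from the web and from disk:
        1. E-books
        2. Processing HTML
        3. Processing search engine results
        4. Reading local files
        5. Extracting text from PDF, Word and other binary formats
        6. Capturing user input
        7. The NLP pipeline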
"""
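# 1. E-books
# Browse the free online books at http://www.gutenberg.org/ to find the URL of a plain-text file.
# (In Python 3, urlopen lives in urllib.request and read() returns bytes.)
# from urllib.request import urlopen
#
# url = 'http://www.gutenberg.org/cache/epub/105/pg105.txt'
# raw = urlopen(url).read().decode('utf8')  # download the book; assumes the file is UTF-8
# # print(raw)
# print(type(raw))
# print(len(raw))
# print(raw[:60])
#
# tokens = nltk.word_tokenize(raw)  # tokenize (needs NLTK's Punkt tokenizer data)
# print(type(tokens))
# print(len(tokens))
# print(tokens[:10])

# text = nltk.Text(tokens)
# print(type(text))
# # print(text[1020:1060])
# text.collocations()  # prints the collocations itself and returns None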
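
# A minimal sketch (Python 3, standard library only) of a step that often follows the
# download: trimming the Project Gutenberg header and footer with find() and rfind().
# The helper name trim_boilerplate, the sample string and both marker phrases are made up
# for illustration; real files use different wording, so the markers must be chosen by hand.

```python
def trim_boilerplate(raw, start_marker, end_marker):
    """Return the part of raw from the first start_marker up to the last end_marker."""
    start = raw.find(start_marker)   # -1 if the marker is missing
    end = raw.rfind(end_marker)      # -1 if the marker is missing
    if start == -1:
        start = 0                    # no start marker: keep from the beginning
    if end == -1:
        end = len(raw)               # no end marker: keep to the end
    return raw[start:end]


# Hypothetical stand-in for a downloaded e-book.
sample = (
    "Produced by volunteers\n"
    "CHAPTER 1\n"
    "It was a quiet morning.\n"
    "End of the sample e-book\n"
    "License text follows\n"
)

body = trim_boilerplate(sample, "CHAPTER 1", "End of the sample e-book")
print(body)
```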

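# 2. Processing HTML
# import nltk
# from urllib.request import urlopen
# from bs4 import BeautifulSoup  # third-party package beautifulsoup4
# url = 'http://www.baidu.com'
# html = urlopen(url).read()
# print(html)  # prints the whole page, tags and content alike
# print(html[:60])
# # nltk.clean_html() is no longer implemented in NLTK 3 (it raises NotImplementedError),
# # so strip the markup with BeautifulSoup's get_text() instead
# raw = BeautifulSoup(html, 'html.parser').get_text()
# print(raw[10:20])
# tokens = nltk.word_tokenize(raw)  # nltk.tokenize is a module, not a function
# print(tokens)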

# 3. Reading local files
# f = open('Dictionnaire.txt')
# raw = f.read()
# print(raw)
#
# for line in raw.splitlines():  # iterating over raw itself would yield single characters
#     print(line.strip())
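
# A small sketch (Python 3) of the difference between iterating over a file object and
# iterating over the string returned by read(). It writes its own throwaway file, so the
# file name sample.txt and its two lines are assumptions, not the contents of Dictionnaire.txt.

```python
import os
import tempfile

# Create a small temporary file to read back.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('first line\nsecond line\n')

# Iterating over the file object yields one line at a time.
with open(path, encoding='utf-8') as f:
    lines = [line.strip() for line in f]

# Iterating over the string from read() yields single characters instead.
with open(path, encoding='utf-8') as f:
    raw = f.read()
chars = list(raw)

print(lines)
print(chars[:5])
```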

# 4. Extracting text from PDF and Word files

# 5. User input
# s = input("Please enter some text:")  # raw_input() in Python 2
# print(len(s))


"""
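    NLP pipeline:
        1. Open a URL, read the HTML it contains, and strip the markup
        2. Tokenize the resulting text and convert it into a Text object
        3. Lowercase all the words and extract the vocabulary (deduplicated and sorted)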
"""

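# The three pipeline steps can be sketched as follows (Python 3). To stay runnable without
# a network connection or NLTK data, this version swaps in the standard library: an inline
# HTML string instead of a URL, html.parser instead of an HTML-cleaning library, and a
# regular expression instead of nltk.word_tokenize. The names TextExtractor, strip_tags,
# tokenize and vocabulary are illustrative only, and no nltk.Text object is built.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document and ignore the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    # Step 1: remove the markup, keeping only the text.
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.chunks)


def tokenize(raw):
    # Step 2: crude tokenizer that keeps runs of word characters and drops punctuation.
    return re.findall(r'\w+', raw)


def vocabulary(tokens):
    # Step 3: lowercase, deduplicate and sort.
    return sorted(set(token.lower() for token in tokens))


html = '<html><body><h1>The Cat</h1><p>The cat sat on the mat.</p></body></html>'
raw = strip_tags(html)
tokens = tokenize(raw)
vocab = vocabulary(tokens)
print(vocab)
```
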
"""
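    String operations:
        s.find(t)          index of the first occurrence of t in s (-1 if not found)
        s.rfind(t)         index of the last occurrence of t in s (-1 if not found)
        s.index(t)         like s.find(t), but raises ValueError if t is not found
        s.rindex(t)        like s.rfind(t), but raises ValueError if t is not found
        s.join(text)       join the words of text into one string, using s as the separator
        s.split(t)         split s into a list wherever t is found
        s.splitlines()     split s into a list of strings, one per line
        s.lower()          lowercase version of s
        s.upper()          uppercase version of s
        s.title()          version of s with the first letter of each word capitalized
        s.strip()          copy of s with leading and trailing whitespace removed
        s.replace(t, u)    copy of s with every occurrence of t replaced by u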
"""

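# A short demonstration of the string methods listed above (Python 3). The sample string
# is arbitrary and chosen only for illustration.

```python
s = 'colorless green ideas'

first_e = s.find('e')        # index of the first 'e'
last_e = s.rfind('e')        # index of the last 'e'
missing = s.find('z')        # find() returns -1 when nothing matches

# index() behaves like find() but raises ValueError when nothing matches.
try:
    s.index('z')
    index_raised = False
except ValueError:
    index_raised = True

words = s.split(' ')                 # split into a list of words
joined = '-'.join(words)             # join the words back with a separator
lines = 'one\ntwo'.splitlines()      # split into a list of lines
titled = s.title()                   # capitalize each word (str has no titlecase() method)
stripped = '  padded  '.strip()      # remove leading and trailing whitespace
replaced = s.replace('green', 'red')

print(first_e, last_e, missing, index_raised)
print(words, joined, lines)
print(titled, stripped, replaced)
```
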
"""
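    Unicode:
        Unicode supports over a million characters; each character is assigned a number called a code point.
        In Python, a code point is written as a backslash-u escape with four hexadecimal digits, for example \u4e2d.

        Text in a file is always stored in some particular encoding, so a mechanism is needed to translate it into Unicode; this process is called decoding.
        To write Unicode to a file or a terminal, it must first be converted into a suitable encoding; this process is called encoding.

        GB2312  --> decode --> unicode --> encode --> GB2312
        Latin-2 --> decode --> unicode --> encode --> Latin-2
        UTF-8   --> decode --> unicode --> encode --> UTF-8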
"""
# Reading text from a file that is stored in a particular encoding
# path = nltk.data.find('History of France.txt')
# import codecs
# f = codecs.open(path, encoding='utf8')
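
# A sketch (Python 3) of reading an encoded file, where the built-in open() accepts an
# encoding argument directly. The file name sample_latin2.txt and the Polish sample text
# are assumptions made for illustration; the file is created by the example itself.

```python
import os
import tempfile

text = 'Pruska Biblioteka Państwowa'
path = os.path.join(tempfile.mkdtemp(), 'sample_latin2.txt')

# Write the text to disk encoded as Latin-2 (ISO 8859-2).
with open(path, 'w', encoding='iso-8859-2') as f:
    f.write(text)

# On disk the file is only bytes; Latin-2 uses one byte per character.
with open(path, 'rb') as f:
    raw_bytes = f.read()

# Passing the matching encoding decodes the bytes back into text.
with open(path, encoding='iso-8859-2') as f:
    decoded = f.read()

print(len(raw_bytes), decoded)
```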


s = u'中华人民共和国'
# u = s.decode('utf8')  # not needed: s is already a Unicode string
print(s.encode('utf8'))  # the text encoded as UTF-8 bytes
print('-------------------------')
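
# A minimal round-trip sketch (Python 3) of the decode/encode chains from the Unicode
# notes, done in memory. The variable names are illustrative only.

```python
text = '中华人民共和国'

# encode: Unicode string -> bytes in a specific encoding
utf8_bytes = text.encode('utf-8')
gb_bytes = text.encode('gb2312')

# decode: bytes -> Unicode string, using the encoding the bytes were written in
from_utf8 = utf8_bytes.decode('utf-8')
from_gb = gb_bytes.decode('gb2312')

# The same text occupies a different number of bytes in each encoding.
print(len(text), len(utf8_bytes), len(gb_bytes))

# A code point can be written as an escape: U+4E2D is the first character of the text.
first_is_escape = (text[0] == '\u4e2d')

# Decoding these GB2312 bytes as UTF-8 fails because they are not valid UTF-8.
try:
    gb_bytes.decode('utf-8')
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True
```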

with open('RedDream.txt') as f:  # the with block closes the file when done
    raw = f.read()
print(raw)