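Take care when handling Chinese text and Windows paths that contain Chinese, especially with the encoding.
ASCII cannot represent Chinese characters.
Unicode is how Chinese text is represented in memory.
UTF-8 is a common encoding for Chinese text stored on disk.
You have to convert between the two, especially when reading from or writing to storage.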
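A minimal sketch of the points above, written in Python 3 syntax (an assumption; the listing in this post is Python 2, where `unicode` plays the role of Python 3's `str`). The sample string is only for illustration.

```python
text = "文本"  # a str holds Unicode code points in memory

# ASCII has no Chinese characters, so encoding to ASCII fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 is the byte form you would write to disk.
utf8_bytes = text.encode("utf-8")

# Two characters in memory, three bytes each once encoded as UTF-8.
print(ascii_ok, len(text), len(utf8_bytes))
```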
The code below decodes first in order to turn the UTF-8 bytes stored on disk into Unicode; the result is then converted back to UTF-8 for display.
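The decode-then-encode round trip can be sketched as follows. This is Python 3 syntax under stated assumptions: `read_utf8_text` is a hypothetical helper, and the temporary file and its contents are made up for the demo.

```python
import os
import tempfile


def read_utf8_text(path):
    """Read raw bytes from disk and decode them from UTF-8 into a str."""
    with open(path, "rb") as f:
        raw = f.read()
    return raw.decode("utf-8")


# Write some UTF-8 bytes to a temporary file to stand in for the corpus.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.txt")
with open(demo_path, "wb") as f:
    f.write("迈向/v 新世纪/n\n".encode("utf-8"))

decoded = read_utf8_text(demo_path)      # bytes on disk -> Unicode in memory
encoded_again = decoded.encode("utf-8")  # Unicode -> UTF-8 bytes for output

# Clean up the temporary file and directory.
os.remove(demo_path)
os.rmdir(tmp_dir)
```

In Python 3 you can also let `open(path, encoding="utf-8")` do the decoding for you.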
Also, split is very useful when working with word-segmented text.
Remember that Python indices start at 0.
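A small sketch of how split and zero-based indexing work together on a line of `word/tag` items. It is Python 3 syntax; `parse_tagged_line` is a hypothetical helper and the sample line is invented for illustration, not taken from the real file.

```python
def parse_tagged_line(line):
    """Split a line of 'word/tag' items into (word, tag) pairs.

    The first field is skipped, as the listing in this post does.
    """
    fields = line.split()        # split on whitespace
    pairs = []
    for item in fields[1:]:      # index 0 is the first field, so start at 1
        parts = item.split("/")
        if len(parts) == 2:
            pairs.append((parts[0], parts[1]))  # parts[0] = word, parts[1] = tag
    return pairs


# Invented sample line in the same 'word/tag' shape.
sample = "19980101-01-001-001/m 江泽民/nr 在/p 北京/ns 发表/v 讲话/n"
pairs = parse_tagged_line(sample)

# Keep only the words whose tag is one of nr, ns, nz, nt.
names = [word for word, tag in pairs if tag in ("nr", "ns", "nz", "nt")]
print(names)
```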
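# -*- coding: UTF-8 -*-
# Python 2 script: unicode() does not exist in Python 3.
import os,sys
import re

# This source file is UTF-8, so the literal below is a UTF-8 byte string.
str2 = 'C:/Users/Hit/Desktop/文本/199801.txt'
# Decode the path to a Unicode string before opening it.
path = unicode(str2,"utf8")
fo = open(path)
fw = open('new.txt','w')
count = 0
done = 0
while not done:
    line = fo.readline()
    if line:
        count = count+1
        if count != 0:
            split_line = line.split(" ")
            clear_time = 1
            for item in split_line:
                # Skip the first field of every line.
                if clear_time == 1:
                    clear_time = clear_time + 1
                    continue
                else:
                    # Each item looks like word/tag: term[0] is the word, term[1] the tag.
                    term = re.split('/',item)
                    if term[0] != '\n':
                        for word in term[1].split():
                            # Only the tags nr, ns, nz and nt are handled here.
                            if word in ('nr', 'ns', 'nz', 'nt'):
                                count_nr = 0
                                isfirst = 1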