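I need to clean lines of text in Python before processing them further. I use Pandas to clean each line and return a new, clean Excel file in the same format as the original. For the tokenizer and stemmer to be able to read the Excel file, the Pandas dataframe needs to be in string format.
It more or less works, but the code below splits the text in each row into individual words, so every row ends up containing a single (cleaned) word rather than a sentence as in the original file. How can I make sure it does not split the text of each row?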
(Simplified) code:
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import SnowballStemmer
tokenizer = RegexpTokenizer(r'\w+')
stemmer = SnowballStemmer('english')
stop_words = set(stopwords.words('english'))
excel = pd.read_excel(open('example.xls', 'rb'))
data_to_string = pd.DataFrame.to_string(excel)
for line in data_to_string:
    tokens = tokenizer.tokenize(data_to_string)
    stopped = [word for word in tokens if word not in stop_words]  # removes stop words
    trimmed = [word for word in stopped if len(word) >= 3]  # takes out all words of two characters or less
    stemmed = [stemmer.stem(word) for word in trimmed]  # stems the words
    return_to_dataframe = pd.DataFrame(stemmed)  # resets back to pandas dataframe
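A likely reason for the one-word-per-row result can be sketched with made-up data. Iterating over a string yields single characters rather than rows, and tokenizing the output of `to_string` produces one flat list of words for the whole table; a flat list passed to `pd.DataFrame` becomes one row per element. The sample sentences and the use of `re.findall` as a stand-in for `RegexpTokenizer(r'\w+')` are assumptions for illustration only.

```python
import re
import pandas as pd

# Hypothetical two-row table standing in for the Excel sheet.
df = pd.DataFrame({"text": ["The cats are running", "A dog barked loudly"]})

# to_string() renders the whole table as ONE string (header and index included).
table_text = df.to_string()

# Iterating over a string yields single characters, not rows.
first_items = list(table_text)[:3]

# Tokenizing the whole string gives one flat list of words for the entire table.
flat_tokens = re.findall(r"\w+", table_text)

# A flat list becomes one row per word: this is the unwanted output.
one_word_per_row = pd.DataFrame(flat_tokens)

# Tokenizing row by row and re-joining keeps one sentence per row.
row_tokens = [re.findall(r"\w+", s) for s in df["text"]]
one_row_per_sentence = pd.DataFrame({"text": [" ".join(t) for t in row_tokens]})

print(one_word_per_row.shape)       # many rows, one column
print(one_row_per_sentence.shape)   # same number of rows as the input
```

Note that the flat token list also picks up the column header and the index labels, because `to_string` includes them in the rendered text.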
I also tried the following, but it did not work:
[code snippet missing from the source]
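Edit: Maarten asked whether I could upload images of my current and desired output. The format of the original (uncleaned) input file is shown on the left. The middle shows the desired result (stemmed, stop words removed, etc.), and the image on the right shows the current output.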
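Edit: I managed to solve it; the main problem was the tokenization. First, I had to convert the pandas dataframe into a list of lists (see strdata in the code below) and then tokenize every item in each list. The rest is handled by a simple for loop that appends each cleaned row to a list, which is then converted back into a pandas dataframe. remove_NaN is there because pandas treats every None-type element as a string of alphanumeric characters (i.e. the word "None") rather than as an empty cell, so that string has to be removed. In addition, pandas puts each tokenized word into a separate column; mergeddf merges all the words of a row back into a single column.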
Working code:
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
import pandas as pd
import numpy as np
#load tokenizer, stemmer and stop words
tokenizer = RegexpTokenizer(r'\w+')
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
excel = pd.read_excel(inFilePath) #use pandas to read excel file; inFilePath is the path to the input file, defined elsewhere
strdata = excel.values.tolist() #convert values to list of lists (each row becomes a separate list)
tokens = [tokenizer.tokenize(str(i)) for i in strdata] #tokenize words in lists
cleaned_list = []
for m in tokens:
    stopped = [i for i in m if str(i).lower() not in stop_words] #remove stop words
    stemmed = [stemmer.stem(i) for i in stopped] #stem words
    cleaned_list.append(stemmed) #append stemmed words to list
backtodf = pd.DataFrame(cleaned_list) #convert list back to pandas dataframe
remove_NaN = backtodf.replace(np.nan, '', regex=True) #remove None (which return as words (str))
mergeddf = remove_NaN.astype(str).apply(lambda x: ' '.join(x), axis=1) #convert cells to strings, merge columns
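As a possible alternative to the list-of-lists round trip above, the same steps (tokenize, remove stop words, stem, re-join) could be applied cell by cell with `Series.apply`, which keeps one sentence per row and handles empty cells before they turn into strings. This is only a sketch under assumptions: the column names `text` and `text_clean`, the sample data, and the tiny hand-written stop-word set are hypothetical; in practice the stop words would come from `stopwords.words('english')` as in the code above.

```python
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+")
stemmer = PorterStemmer()

# Tiny hand-written stop-word set (an assumption) so the sketch runs without
# downloading the NLTK stopwords corpus.
STOP_WORDS = {"the", "a", "are", "is", "and", "of"}


def clean_cell(value):
    """Tokenize, drop stop words, stem, and re-join one cell into one string."""
    if pd.isna(value):
        # Keep empty cells empty instead of producing the word "None" or "nan".
        return ""
    tokens = tokenizer.tokenize(str(value))
    kept = [t for t in tokens if t.lower() not in STOP_WORDS]
    return " ".join(stemmer.stem(t) for t in kept)


# Hypothetical data standing in for the result of pd.read_excel(...).
df = pd.DataFrame({"text": ["The cats are running", None, "A dog barked"]})

# One cleaned sentence per original row.
df["text_clean"] = df["text"].apply(clean_cell)
print(df)
```

With several text columns, the function can be applied to each column in turn. The cleaned frame could then be written out with `DataFrame.to_excel`, which needs an Excel writer engine such as openpyxl to be installed.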