Text Processing
一颗小铁球
Cleaning text
    import re

    def clean_text(text):
        text = text.lower()  # lowercase
        text = re.sub(r'[!]+', '!', text)
        text = re.sub(r'[?]+', '?', text)
        text = re.sub(r'[.]+', '.', text)
        text = re.sub(r"'", "", text)
        text = re.sub(r'\s+', ' ', text).strip()  # R…
Original · 2021-03-09 14:53:37 · 193 views · 0 comments
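The preview above is cut off before the function ends. Here is a minimal sketch of how the complete function might look, assuming it simply returns the cleaned string; the `return` statement and the sample input are assumptions, not part of the original post.

```python
import re


def clean_text(text: str) -> str:
    """Lowercase, collapse repeated punctuation, drop apostrophes, squeeze whitespace."""
    text = text.lower()
    text = re.sub(r"[!]+", "!", text)   # "!!!" -> "!"
    text = re.sub(r"[?]+", "?", text)   # "??"  -> "?"
    text = re.sub(r"[.]+", ".", text)   # "..." -> "."
    text = re.sub(r"'", "", text)       # remove apostrophes
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text  # assumed: the original preview is truncated before this point


cleaned = clean_text("  Wow!!!  It's   WORKING...  Really??  ")
print(cleaned)
```

Using a raw string for `\s+` avoids the invalid-escape warning that newer Python versions emit for `'\s+'`.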
FastText
A simple introduction to fastText and its applications
Original · 2021-03-07 14:59:34 · 88 views · 0 comments
Replacing specific characters
    import os
    with open(r'input.txt', 'r', encoding='utf-8') as f:
        countent = f.readlines()
        for line in countent:
            countent = line.replace("A", "B")  # replace A with B
            with open(r'output.txt', "a+", encoding='utf-8') as fa:
                fa.writ…
Original · 2021-03-06 12:43:13 · 93 views · 0 comments
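Since the preview is truncated, here is a self-contained sketch of the same idea: read a file line by line and write each line with one substring replaced. The helper name `replace_in_file` and the temporary demo files are hypothetical and only serve to make the example runnable.

```python
import os
import tempfile


def replace_in_file(src_path, dst_path, old, new):
    """Copy src_path to dst_path, replacing every occurrence of old with new."""
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line.replace(old, new))


# Demo on a throwaway directory (hypothetical file contents).
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "input.txt")
dst = os.path.join(workdir, "output.txt")
with open(src, "w", encoding="utf-8") as f:
    f.write("ABA\nxAy\n")

replace_in_file(src, dst, "A", "B")

with open(dst, encoding="utf-8") as f:
    result = f.read()
```

This sketch opens the output once in `"w"` mode instead of reopening it in `"a+"` mode for every line, so running it twice does not append duplicate output.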
Removing entries shorter than a given length
    import re
    f = open(r"input.csv", 'r', encoding='utf-8')
    alllines = f.readlines()
    f.close()
    f = open(r"output.csv", 'w+', encoding='utf-8')
    for eachline in alllines:
        a = eachline
        try:
            b = list(a)
            c = b[0]
            if c != ' ':…
Original · 2021-03-06 12:40:36 · 233 views · 0 comments
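The preview stops before the length check itself, so the exact criterion is unknown. As a rough sketch, assuming "length" means the number of characters in a line after stripping surrounding whitespace, the filter could look like this; `drop_short_lines` and `min_len` are hypothetical names.

```python
def drop_short_lines(lines, min_len):
    """Keep only lines whose stripped length is at least min_len characters."""
    return [line for line in lines if len(line.strip()) >= min_len]


rows = ["ok\n", "a long enough entry\n", "\n", "12345\n"]
kept = drop_short_lines(rows, 5)
print(kept)
```

To apply it to files, read the lines with `readlines()`, filter them, and pass the result to `writelines()`.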
Randomly sampling part of the data from a text file
    # coding:utf-8
    import random
    import time
    """Note: the drive letter is lowercase"""
    DATA_DIR = r'D:\\'
    DATA_DIR2 = r'D:\\hello.txt'
    a = range(3)
    def test():
        for item in a:
            f = open("13fix.txt", "r", encoding='utf-8')  # source file
            fw = open("hello.txt",…
Original · 2021-03-06 12:38:29 · 420 views · 0 comments
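How the original script picks its lines is not visible in the preview. One common way to do it, sketched here with the standard library's `random.sample`, is to draw `k` distinct lines without replacement; `sample_lines`, `k`, and `seed` are hypothetical names.

```python
import random


def sample_lines(lines, k, seed=None):
    """Return up to k distinct lines chosen at random, without replacement."""
    rng = random.Random(seed)  # a private generator keeps results reproducible per seed
    return rng.sample(lines, min(k, len(lines)))


lines = [f"line {i}\n" for i in range(10)]
picked = sample_lines(lines, 3, seed=0)
print(picked)
```

Passing a fixed `seed` makes the selection repeatable, which helps when the sampled subset is used as a held-out set.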
Merging two txt files
    file1 = open(r"1.txt", "r", encoding='utf-8')
    file2 = open(r"2.txt", "r", encoding='utf-8')
    file1_lists = file1.readlines()
    file2_lists = file2.readlines()
    file3_list = []
    file4_list = []
    for i in file1_lists:
        temp_list = i.split()
        file3_…
Original · 2021-03-06 12:32:12 · 420 views · 0 comments
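The original splits each line into tokens before merging, but the rest of that logic is truncated and is not reproduced here. The sketch below shows only the simplest interpretation, appending the lines of the second file after the first; `merge_lines` is a hypothetical helper.

```python
def merge_lines(first, second):
    """Concatenate two lists of lines, making sure every line ends with a newline."""
    merged = []
    for line in list(first) + list(second):
        merged.append(line if line.endswith("\n") else line + "\n")
    return merged


merged = merge_lines(["a\n", "b"], ["c\n"])
print(merged)
```

Normalizing the trailing newline prevents the last line of the first file from running into the first line of the second.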
Splitting a text into three unequal parts
Split your training text
    with open(r"<source text to process>.txt", 'r', encoding='utf-8') as f:
        countent = f.readlines()
        count = 0
        # pat = re.compile('\n')
        for line in countent:
            count = count + 1
            if 0 <= count <= (number: how many entries go in the first part):
                with open(r"train.t…
Original · 2021-03-06 12:36:10 · 100 views · 0 comments
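The counter-based loop in the preview can also be expressed with list slicing. This is a sketch under the assumption that the three parts are consecutive blocks of lines; `split_three`, `n_first`, and `n_second` are hypothetical names, and only the `train` output is hinted at in the original.

```python
def split_three(lines, n_first, n_second):
    """Split lines into three consecutive parts of sizes n_first, n_second, and the rest."""
    first = lines[:n_first]
    second = lines[n_first:n_first + n_second]
    third = lines[n_first + n_second:]
    return first, second, third


data = [f"sample {i}" for i in range(10)]
train, valid, held_out = split_three(data, 7, 2)
print(len(train), len(valid), len(held_out))
```

Slicing reads the file once and avoids reopening an output file for every line.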
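Removing unnecessary characters
Removing unnecessary characters (such as spaces)
    import re
    f = open(r"1.txt", 'r', encoding='utf-8')
    alllines = f.readlines()
    f.close()
    f = open(r"output.txt", 'w+', encoding='utf-8')
    for eachline in alllines:
        # replaces each space with '. '; use '' as the replacement to delete it instead
        a = re.sub(' ', '. ', eachline)
        f.writelines(a)
    f.close()
Original · 2021-03-06 12:33:36 · 98 views · 0 comments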
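Note that the script above substitutes `'. '` for each space and does not delete anything. If the goal is really to delete characters, a sketch along these lines may be closer to the title; `remove_chars` and its default pattern (spaces and tabs) are assumptions for illustration.

```python
import re


def remove_chars(line, pattern=r"[ \t]+"):
    """Delete every match of pattern (by default, runs of spaces and tabs) from line."""
    return re.sub(pattern, "", line)


stripped = remove_chars("a b\tc\n")
print(repr(stripped))
```

The newline is left in place so that the function can be mapped over the output of `readlines()` and written back unchanged.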