Reading and Saving Files

Latest recommended article published on 2023-04-24 22:11:19

行走的五花肉

Category columns: text classification, python

Article link: https://blog.csdn.net/weixin_42545466/article/details/107504455

TXT

Reading

Reading a TXT file line by line

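# The key is to check which separator the file uses, i.e. what to pass to split() (here '\n').
# Read the whole file with open().read(), then split it into lines.
def readLines(filename):
    with open(filename, encoding='utf-8') as f:
        lines = f.read().strip().split('\n')
    return lines
# file.read([size]): returns the whole file as one string, or at most size characters if size is given.
# file.readline(): returns a single line.
readline() returns only one line per call, so print(f.readline()) prints one line. Note that "for ch in f.readline():" loops over the characters of that one line, not over the lines of the file; to go through the file line by line, use "for line in f:". print(f.read()) returns the content of the whole file at once, so no loop is needed.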
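
The difference between read(), readline(), and iterating over a file object can be checked with a small sketch. The file name demo.txt and its contents are made up for illustration, and the helper compare_read_methods is hypothetical, not part of the original post.

```python
import os
import tempfile


def compare_read_methods(path):
    """Return what read(), readline(), and iteration give for the same file."""
    with open(path, encoding='utf-8') as f:
        whole = f.read()                      # the entire file as one string
    with open(path, encoding='utf-8') as f:
        first = f.readline()                  # only the first line, newline included
    with open(path, encoding='utf-8') as f:
        chars = [ch for ch in f.readline()]   # iterating over a line yields characters
    with open(path, encoding='utf-8') as f:
        lines = [line.rstrip('\n') for line in f]  # iterating over the file yields lines
    return whole, first, chars, lines


# Demo on a throwaway file (name and contents are illustrative assumptions).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.txt')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('ab\ncd\n')
    whole, first, chars, lines = compare_read_methods(demo_path)
    print(repr(whole))
    print(repr(first))
    print(chars)
    print(lines)
```

Iterating over the file object itself is usually the simplest way to process a file line by line, since it does not load the whole file into memory.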


Classifying while reading files

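import os

category_lines = {}
all_categories = []
# Build the dictionary while reading: take the category from each file name, then use it as the key.
# findFiles and readLines are the helper functions defined elsewhere in this post.
for filename in findFiles(r'F:\谷歌下载\data\names\*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines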

Splitting

Splitting a TXT file on '\n' into multiple txt files
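
The post gives no code for this step, so the following is only a minimal sketch of one way to do it: each non-empty line of the source file is written to its own txt file. The function name split_txt_by_line, the prefix parameter, and the part_N.txt naming scheme are assumptions made for illustration.

```python
import os


def split_txt_by_line(src_path, out_dir, prefix='part'):
    """Write each non-empty line of src_path to its own file in out_dir.

    Returns the list of created file paths. Output files are named
    <prefix>_<line index>.txt (a naming scheme assumed for this sketch).
    """
    os.makedirs(out_dir, exist_ok=True)
    with open(src_path, encoding='utf-8') as f:
        lines = f.read().strip().split('\n')
    out_paths = []
    for i, line in enumerate(lines):
        if not line:
            continue  # skip blank lines instead of creating empty files
        out_path = os.path.join(out_dir, f'{prefix}_{i}.txt')
        with open(out_path, 'w', encoding='utf-8') as out:
            out.write(line + '\n')
        out_paths.append(out_path)
    return out_paths
```

If the file uses a different separator, the argument of split() is the only thing that would need to change.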

CSV

Reading

Reading a CSV file line by line

Extracting a compressed archive

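import zipfile

# Replace the two placeholder strings with your own paths.
with zipfile.ZipFile(r'path of the zip file', 'r') as zip_ref:
    zip_ref.extractall(r'path of the output folder')
    # List the names of the archive members.
    for name in zip_ref.namelist():
        print(name)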

Reading a CSV file

import csv

dictLabels = {}
with open(r'F:\研一\NLP\数据集\ag_news_csv\test.csv') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',')
    next(csvreader, None)  # skip the header row (filename, label)
    # enumerate yields each row of the iterator together with its index.
    for i, row in enumerate(csvreader):
        print(i, row)

1. Reading specific columns of a CSV file

import csv

dictLabels = {}
with open(r'F:\研一\NLP\数据集\ag_news_csv\test.csv') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',')
    next(csvreader, None)  # skip the header row (filename, label)
    # enumerate yields each row of the iterator together with its index.
    for i, row in enumerate(csvreader):
        print(i, row)
        text = row[2]   # third column
        label = row[0]  # first column
        # Optionally group the rows by label:
        #if label in dictLabels.keys():
            #dictLabels[label].append(row)
        #else:
            #dictLabels[label] = [row]
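
The commented-out grouping step above can be sketched with collections.defaultdict. This variant stores only the text column under each label rather than the whole row. The function name group_texts_by_label, its parameters, and the sample data are assumptions for illustration; the column positions follow the row[0] and row[2] indices used above.

```python
import csv
import io
from collections import defaultdict


def group_texts_by_label(csvfile, label_col=0, text_col=2, skip_header=True):
    """Group the text column of a CSV by the label column."""
    reader = csv.reader(csvfile, delimiter=',')
    if skip_header:
        next(reader, None)  # drop the header row if there is one
    groups = defaultdict(list)
    for row in reader:
        groups[row[label_col]].append(row[text_col])
    return dict(groups)


# In-memory sample standing in for a real file (made-up data).
sample = io.StringIO(
    'label,title,text\n'
    '3,T1,first text\n'
    '4,T2,second text\n'
    '3,T3,third text\n'
)
print(group_texts_by_label(sample))
```

Note that csv.reader returns every field as a string, so the labels here are the strings '3' and '4', not integers.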

2. Getting the index and value of each element

with open(filename,encoding="utf-8") as f:
    reader = csv.reader(f)
    header_row = next(reader)    
    for index,column_header in enumerate(header_row):
        print(index,column_header)

3. Reading all file names in a folder

import os
def readname():
    filePath = 'G:\\workplace\\first\\SamplingAlgorithm\\datasets\\'
    name = os.listdir(filePath)
    return name

if __name__ == "__main__":
    name = readname()
    print(name)
    for i in name:
        print(i)

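Saving data as a CSV or TXT file
When exporting results to CSV with Python, Chinese text in the results often comes out garbled. In that case, export the CSV with the utf_8_sig encoding (see the to_csv statement at the end of this post).

4. Reading all files in a folder and building a dictionary from them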

import glob
import os

category_lines = {}
all_categories = []

def findFiles(path):
    # glob.glob() returns a list of all file paths that match the pattern.
    return glob.glob(path)

def readLines(filename):
    with open(filename, encoding='utf-8') as f:
        lines = f.read().strip().split('\n')
    return lines

for filename in findFiles(r'F:\谷歌下载\data\names\*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    # Collect all files into one big dictionary keyed by category.
    category_lines[category] = lines

The glob method and the os.path.splitext method
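
As a rough illustration of what these two calls do: glob.glob returns the paths that match a wildcard pattern, and os.path.splitext splits a file name into its stem and extension. The helper categories_from_files is a hypothetical name used only for this sketch.

```python
import glob
import os


def categories_from_files(pattern):
    """Map each matched file's name without extension to its full path."""
    result = {}
    # sorted() is used because glob.glob does not guarantee any order.
    for path in sorted(glob.glob(pattern)):
        # basename drops the directories, splitext separates the extension.
        name, ext = os.path.splitext(os.path.basename(path))
        result[name] = path
    return result


# splitext on a bare file name returns a (stem, extension) pair.
print(os.path.splitext('English.txt'))
```

With a pattern such as `*.txt`, files with other extensions in the same folder are simply not matched.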

Fix for garbled Chinese text in saved results
data.to_csv('3A_test.csv', index=False, encoding='utf_8_sig')
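
The to_csv call above is a pandas DataFrame method. When pandas is not in use, the same idea can be sketched with the standard csv module: the utf-8-sig codec writes a byte order mark at the start of the file, which is what typically lets spreadsheet programs such as Excel detect UTF-8. The function name save_rows_utf8_sig and its parameters are assumptions for illustration.

```python
import csv


def save_rows_utf8_sig(path, header, rows):
    """Write rows to a CSV file encoded as UTF-8 with a byte order mark."""
    # newline='' is what the csv module documentation recommends for files
    # passed to csv.writer; 'utf-8-sig' is an alias of 'utf_8_sig'.
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
```

Reading the file back with encoding='utf-8-sig' strips the byte order mark again, so the first header cell is returned unchanged.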

Reference: https://www.jb51.net/article/159025.htm