Knowing These, You Can Cover 99% of File Operations in Python

weixin_26737625

Original link: https://towardsdatascience.com/knowing-these-you-can-cover-99-of-file-operations-in-python-84725d82c2df

Working with files is one of the most common tasks we do every day. Python has several built-in modules for performing file operations, such as reading files, moving files, getting file attributes, etc. This article summarizes many functions that you need to know to cover the most common file operations and good practices in Python.

Here is a graph of modules/functions you will see in this article. To know more about each operation, please continue reading.

Open & Close a file

When you want to read or write a file, the first thing to do is to open it. Python has a built-in function open() that opens the file and returns a file object. The type of the file object depends on the mode in which the file is opened: it can be a text file object, a raw binary file, or a buffered binary file. Every file object has methods such as read() and write().

There is a problem in this code block, can you recognize it? We will discuss it later.

file = open("test_file.txt","w+")
file.read()
file.write("a new line")

The Python documentation lists all the possible file modes. The most common ones are r (read, the default), w (write), a (append) and x (exclusive creation), optionally combined with b (binary) or + (read and write). An important rule is that any w-related mode truncates the file if it already exists and creates it otherwise. Be careful with this mode if you don’t want to overwrite the file, and use the append mode a where possible.
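
The difference between w and a is easy to check on a scratch file. The following is a minimal sketch; the temporary directory and the file name modes_demo.txt are assumptions made only for this illustration.

```python
import os
import tempfile

# Hypothetical scratch file; the name is only for illustration.
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

with open(path, "w") as f:   # "w" creates the file
    f.write("first\n")

with open(path, "a") as f:   # "a" appends, existing content is kept
    f.write("second\n")

with open(path) as f:        # the default mode is "r"
    after_append = f.read()

with open(path, "w") as f:   # "w" again: the previous content is truncated
    f.write("third\n")

with open(path) as f:
    after_truncate = f.read()

print(repr(after_append))    # 'first\nsecond\n'
print(repr(after_truncate))  # 'third\n'
```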

The problem in the previous code block is that we opened the file but never closed it. It’s important to always close a file when you are done with it. Leaving a file object open can cause problems such as resource leaks or data that is still buffered and not yet written to disk. There are two ways to make sure that a file is closed properly.

1. Use close()

The first way is to explicitly use close(). A good practice is to put it in finally, so that we can make sure the file will be closed in any case. It brings more clarity to the code, but on the other hand, the developer should take responsibility and not forget to close it.

import logging

file = None
try:
    file = open("test_file.txt", "w+")
    file.write("a new line")
except Exception as e:
    logging.exception(e)
finally:
    # file is still None if open() itself failed
    if file is not None:
        file.close()

2. Use context manager with open(...) as f

The second way is to use a context manager. If you are not familiar with context managers, check out Context Managers and the “with” Statement in Python by Dan Bader. The file object returned by open() implements the __enter__ and __exit__ methods, so with open(...) as f closes the file automatically when the block is left. In effect the context manager wraps a try/finally statement, which means we can never forget to close the file.

with open("test_file","w+") as file:
    file.write("a new line")

Is the context manager always better than close()? It depends on where you use it. The following example implements 3 different ways of writing 50,000 records to a file. As the output shows, use_context_manager_2() is far slower than the others. Because its with statement lives in a separate helper function, the file is opened and closed once per record, and that repeated I/O overhead hurts performance tremendously.

def _write_to_file(file, line):
    with open(file, "a") as f:
        f.write(line)


def _valid_records():
    for i in range(100000):
        if i % 2 == 0:
            yield i


def use_context_manager_2(file):
    for line in _valid_records():
        _write_to_file(file, str(line))


def use_context_manager_1(file):
    with open(file, "a") as f:
        for line in _valid_records():
            f.write(str(line))


def use_close_method(file):
    f = open(file, "a")
    for line in _valid_records():
        f.write(str(line))
    f.close()
    
use_close_method("test.txt")
use_context_manager_1("test.txt")
use_context_manager_2("test.txt")


# Finished 'use_close_method' in 0.0253 secs
# Finished 'use_context_manager_1' in 0.0231 secs
# Finished 'use_context_manager_2' in 4.6302 secs
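
The "Finished ... in ... secs" lines look like the output of a timing helper that is not shown in the example. A minimal sketch of such a helper follows, assuming a decorator named timer (a hypothetical name, not taken from the original code) and a stand-in workload instead of the file-writing functions.

```python
import functools
import time


def timer(func):
    """Print how long each call to func takes (hypothetical helper)."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"Finished {func.__name__!r} in {elapsed:.4f} secs")
        return result

    return wrapper


@timer
def count_even(limit):
    # Stand-in workload, unrelated to file I/O.
    return sum(1 for i in range(limit) if i % 2 == 0)


total = count_even(100000)
```

Decorating use_close_method, use_context_manager_1 and use_context_manager_2 with a helper like this would produce timing lines in the same format, although the actual numbers depend on the machine.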

Read & Write to a file

After you open a file, you will want to read from or write to it. The file object provides 3 methods for reading: read(), readline() and readlines().

By default, read(size=-1) returns the entire contents of a file. If the file is bigger than the memory, the optional parameter size can help you to limit the size of the returned characters (text mode) or bytes (binary mode).

readline(size=-1) returns an entire line, including the \n character at the end. If size is bigger than 0, it returns at most size characters from the line.

readlines(hint=-1) returns all the lines of a file in a list. The optional parameter hint means if the number of characters returned exceeds hint, no more lines will be returned.

Among these 3 methods, read() and readlines() are less memory efficient because by default they return the complete file, either as a string or as a list. A more memory-efficient way to go through the lines is to call readline() repeatedly until it returns an empty string. The empty string "" means the pointer has reached the end of the file.

with open('test.txt', 'r') as reader:
    line = reader.readline()
    while line != "":
        print(line)  # handle the current line before fetching the next
        line = reader.readline()
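
A file object is also iterable, which gives the same line-by-line behavior without calling readline() by hand, and read(size) can be used to process a file in fixed-size chunks. The sketch below assumes a small scratch file (lines_demo.txt, a name chosen only for this example) and a chunk size of 4 characters.

```python
import os
import tempfile

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "lines_demo.txt")
with open(path, "w") as f:
    f.write("one\ntwo\nthree\n")

lines = []
with open(path) as reader:
    for line in reader:            # reads lazily, one line at a time
        lines.append(line.rstrip("\n"))

chunks = []
with open(path) as reader:
    while True:
        chunk = reader.read(4)     # at most 4 characters per call
        if chunk == "":            # empty string: end of file
            break
        chunks.append(chunk)

print(lines)   # ['one', 'two', 'three']
```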

In terms of writing, there are 2 methods: write() and writelines(). As the names suggest, write() writes a single string and writelines() writes a list of strings. Neither adds line separators, so it’s the responsibility of the developer to add \n at the end of each line.

with open("test.txt", "w+") as f:
    f.write("hi\n")
    f.writelines(["this is a line\n", "this is another line\n"])
    
# >>> cat test.txt 
# hi
# this is a line
# this is another line

If you write text to a special file type such as JSON or csv, then you should use Python built-in module json or csv on top of file object.

import csv
import json


with open("cities.csv", "w+", newline="") as file:
    writer = csv.DictWriter(file, fieldnames=["city", "country"])
    writer.writeheader()
    writer.writerow({"city": "Amsterdam", "country": "Netherlands"})
    writer.writerows(
        [
            {"city": "Berlin", "country": "Germany"},
            {"city": "Shanghai", "country": "China"},
        ]
    )
    
# >>> cat cities.csv 
# city,country
# Amsterdam,Netherlands
# Berlin,Germany
# Shanghai,China


with open("cities.json", "w+") as file:
    json.dump({"city": "Amsterdam", "country": "Netherlands"}, file)


# >>> cat cities.json 
# {"city": "Amsterdam", "country": "Netherlands"}
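
Reading the data back works the same way, with csv.DictReader and json.load on top of a file object. This is a minimal round-trip sketch; writing into a temporary directory is an assumption made so that the example does not touch existing files.

```python
import csv
import json
import os
import tempfile

# Hypothetical scratch locations for the round trip.
folder = tempfile.mkdtemp()
csv_path = os.path.join(folder, "cities.csv")
json_path = os.path.join(folder, "cities.json")

with open(csv_path, "w", newline="") as file:
    writer = csv.DictWriter(file, fieldnames=["city", "country"])
    writer.writeheader()
    writer.writerow({"city": "Amsterdam", "country": "Netherlands"})

with open(csv_path, newline="") as file:
    rows = list(csv.DictReader(file))   # one mapping per data row

with open(json_path, "w") as file:
    json.dump({"city": "Amsterdam", "country": "Netherlands"}, file)

with open(json_path) as file:
    city = json.load(file)              # back to a Python dict

print(rows[0]["city"], city["country"])  # Amsterdam Netherlands
```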

Move pointer within the file

When we open a file, we get a file handle that points to a certain position. In r and w modes, the handle points to the beginning of the file. In a mode, it points to the end of the file.

tell() and seek()

As we read from the file, the pointer moves to the place where the next read will start from, unless we tell the pointer to move around. You can do this using 2 methods: tell() and seek().

tell() returns the current position of the pointer as the number of bytes/characters from the beginning of the file. seek(offset, whence=0) moves the pointer to a position offset characters away from whence. whence can be:

0: from the beginning of the file
1: from the current position
2: from the end of the file

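In text mode, seeks should be relative to the beginning of the file (whence=0), with an offset that is either zero or a value returned by tell(). The one exception is seek(0, 2), which jumps to the very end of the file.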

with open("text.txt", "w+") as f:
    f.write("0123456789abcdef")
    f.seek(9)
    print(f.tell()) # 9 (pointer moves to 9, next read starts from 9)
    print(f.read()) # 9abcdef
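
In binary mode, all three whence values can be used with arbitrary offsets. The sketch below assumes a scratch file named seek_demo.bin (hypothetical) and uses the os.SEEK_CUR and os.SEEK_END constants, which are the named equivalents of 1 and 2.

```python
import os
import tempfile

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "seek_demo.bin")
with open(path, "wb") as f:
    f.write(b"0123456789abcdef")

with open(path, "rb") as f:
    f.seek(-6, os.SEEK_END)      # 6 bytes before the end of the file
    tail = f.read()
    f.seek(2)                    # absolute position 2
    f.seek(3, os.SEEK_CUR)       # 3 bytes forward, so position 5
    position = f.tell()
    middle = f.read(4)

print(tail)      # b'abcdef'
print(position)  # 5
print(middle)    # b'5678'
```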

Understand the file status

The file system can tell you a lot of practical information about a file, for example its size and when it was created and modified. To get this information in Python, you can use the os or pathlib module. The two modules have a lot in common, but pathlib is more object-oriented than os.

os

One way to get the complete status is to use os.stat("text.txt"). It returns a result object with many statistics such as st_size (size of the file in bytes), st_atime (timestamp of the most recent access), st_mtime (timestamp of the most recent modification), etc.

print(os.stat("text.txt"))
# os.stat_result(st_mode=33188, st_ino=8618932538, st_dev=16777220, st_nlink=1, st_uid=501, st_gid=20, st_size=16, st_atime=1597527409, st_mtime=1597527409, st_ctime=1597527409)

You can also get statistics individually using os.path.

os.path.getatime()
os.path.getctime()
os.path.getmtime()
os.path.getsize()
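
Each of these functions takes a path. The time-related ones return seconds since the epoch as a float, which can be turned into a readable datetime. A minimal sketch, assuming a scratch file created just for the demonstration:

```python
import datetime
import os
import tempfile

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "stat_demo.txt")
with open(path, "w") as f:
    f.write("0123456789abcdef")

size = os.path.getsize(path)                  # size in bytes
mtime = os.path.getmtime(path)                # seconds since the epoch
modified = datetime.datetime.fromtimestamp(mtime)

# getmtime() reports the same value as the st_mtime field of os.stat().
same_as_stat = os.stat(path).st_mtime == mtime

print(size, modified.isoformat())
```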

pathlib

Another way to get the complete status is to use pathlib.Path("text.txt").stat(). It returns the same object as os.stat().

print(pathlib.Path("text.txt").stat())
# os.stat_result(st_mode=33188, st_ino=8618932538, st_dev=16777220, st_nlink=1, st_uid=501, st_gid=20, st_size=16, st_atime=1597528703, st_mtime=1597528703, st_ctime=1597528703)

We will compare more aspects of os and pathlib in the following sections.

Copy, Move and Delete a file

Python has many built-in modules to handle file movement. Before you trust the first answer returned by Google, you should be aware that different choices of modules can lead to different performances. Some modules will block the thread until the file movement is done, while others might do it asynchronously.

shutil

shutil is the best-known module for moving, copying, and deleting both files and folders. It provides several functions for copying a single file; this section covers copy(), copy2() and copyfile().

copy() vs. copy2(): copy2() is very similar to copy(). The difference is that copy2() also tries to copy the file’s metadata, such as the most recent access time and the most recent modification time. But according to the Python docs, even copy2() cannot copy all the metadata due to constraints of the operating system.

import pathlib
import shutil

shutil.copy("1.csv", "copy.csv")
shutil.copy2("1.csv", "copy2.csv")


print(pathlib.Path("1.csv").stat())
print(pathlib.Path("copy.csv").stat())
print(pathlib.Path("copy2.csv").stat())
# 1.csv
# os.stat_result(st_mode=33152, st_ino=8618884732, st_dev=16777220, st_nlink=1, st_uid=501, st_gid=20, st_size=11, st_atime=1597570395, st_mtime=1597259421, st_ctime=1597570360)


# copy.csv
# os.stat_result(st_mode=33152, st_ino=8618983930, st_dev=16777220, st_nlink=1, st_uid=501, st_gid=20, st_size=11, st_atime=1597570387, st_mtime=1597570395, st_ctime=1597570395)


# copy2.csv
# os.stat_result(st_mode=33152, st_ino=8618983989, st_dev=16777220, st_nlink=1, st_uid=501, st_gid=20, st_size=11, st_atime=1597570395, st_mtime=1597259421, st_ctime=1597570395)

copy() vs. copyfile(): copy() gives the new file the same permissions as the original, whereas copyfile() copies only the contents and not the permission mode. Secondly, the destination of copy() can be a directory, in which case the file is copied into it under its original name, overwriting any existing file of that name. The destination of copyfile(), by contrast, must be the complete target file name.

shutil.copy("1.csv", "copy.csv")
shutil.copyfile("1.csv", "copyfile.csv")


print(pathlib.Path("1.csv").stat())
print(pathlib.Path("copy.csv").stat())
print(pathlib.Path("copyfile.csv").stat())


# 1.csv
# os.stat_result(st_mode=33152, st_ino=8618884732, st_dev=16777220, st_nlink=1, st_uid=501, st_gid=20, st_size=11, st_atime=1597570395, st_mtime=1597259421, st_ctime=1597570360)


# copy.csv
# os.stat_result(st_mode=33152, st_ino=8618983930, st_dev=16777220, st_nlink=1, st_uid=501, st_gid=20, st_size=11, st_atime=1597570387, st_mtime=1597570395, st_ctime=1597570395)


# copyfile.csv
# permission (st_mode) is changed
# os.stat_result(st_mode=33188, st_ino=8618984694, st_dev=16777220, st_nlink=1, st_uid=501, st_gid=20, st_size=11, st_atime=1597570387, st_mtime=1597570395, st_ctime=1597570395)


shutil.copyfile("1.csv", "./source")
# IsADirectoryError: [Errno 21] Is a directory: './source'
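
The same comparisons can be reproduced without relying on existing files. The sketch below is an illustration under assumptions: it works in a temporary directory, and it sets an artificially old modification time (1,000,000,000 seconds after the epoch) with os.utime so that the effect of copy2() is easy to see.

```python
import os
import pathlib
import shutil
import tempfile

# Hypothetical scratch directory and file for this illustration.
folder = pathlib.Path(tempfile.mkdtemp())
src = folder / "1.csv"
src.write_text("a,b\n1,2\n")
os.utime(src, (1_000_000_000, 1_000_000_000))  # set an old atime/mtime

target_dir = folder / "backup"
target_dir.mkdir()

# copy() accepts a directory and keeps the original file name.
copied = pathlib.Path(shutil.copy(src, target_dir))

# copy2() also carries over the modification time.
copied2 = pathlib.Path(shutil.copy2(src, folder / "copy2.csv"))

print(copied.name)                    # 1.csv
print(int(copied2.stat().st_mtime))   # 1000000000
```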

os

The os module has a function system() that executes a command in a subshell. You pass the command as a string to system(), and the effect is the same as running it from the operating system’s shell, so commands such as cp, mv and rm only work on Unix-like systems. For moving and deleting files, you can also use the dedicated functions in the os module.

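import os

# copy
os.system("cp 1.csv copy.csv")


# rename/move (two alternatives, use only one of them)
os.system("mv 1.csv move.csv")
# os.rename("1.csv", "move.csv")


# delete (two alternatives, use only one of them)
os.system("rm move.csv")
# os.remove("move.csv")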
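
The dedicated functions avoid the shell entirely and therefore behave the same on every platform. A minimal sketch, assuming scratch files in a temporary directory; os.replace() is used here instead of os.rename() because it overwrites an existing destination on all platforms.

```python
import os
import tempfile

# Hypothetical scratch files for this illustration.
folder = tempfile.mkdtemp()
src = os.path.join(folder, "1.csv")
dst = os.path.join(folder, "move.csv")
with open(src, "w") as f:
    f.write("a,b\n")

os.replace(src, dst)   # rename/move, replacing dst if it already exists
moved = os.path.exists(dst) and not os.path.exists(src)

os.remove(dst)         # delete the file
deleted = not os.path.exists(dst)

print(moved, deleted)  # True True
```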

Copy/Move a file asynchronously

So far, all the solutions have been synchronous, which means the program may be blocked for a while if the file is huge and takes time to move. If you want to avoid that, you can use the threading, multiprocessing or subprocess module to run the file operation in a separate thread or a separate process. Note that join() and subprocess.call() in the example below still wait for the operation to finish; the main program is only free to do other work between starting the operation and waiting for it.

import shutil
import threading
import subprocess
import multiprocessing


src = "1.csv"
dst = "dst_thread.csv"


thread = threading.Thread(target=shutil.copy, args=[src, dst])
thread.start()
thread.join()


dst = "dst_multiprocessing.csv"
process = multiprocessing.Process(target=shutil.copy, args=[src, dst])
process.start()
process.join()


cmd = "cp 1.csv dst_subprocess.csv"
status = subprocess.call(cmd, shell=True)
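
Another option from the standard library is concurrent.futures, which hands back a future so the main thread decides when to wait. This is a sketch of that alternative, not part of the original example; the file names and the temporary directory are assumptions.

```python
import concurrent.futures
import pathlib
import shutil
import tempfile

# Hypothetical scratch files for this illustration.
folder = pathlib.Path(tempfile.mkdtemp())
src = folder / "1.csv"
src.write_text("a,b\n1,2\n")
dst = folder / "dst_future.csv"

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(shutil.copy, src, dst)   # returns immediately
    # ... the main thread is free to do other work here ...
    result = future.result()                      # wait only when the result is needed

copied_text = dst.read_text()
print(copied_text == src.read_text())  # True
```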

Search a file

After copying and moving files, you will probably want to search for filenames that match a particular pattern. Python provides a number of built-in functions for you to choose from.

glob

The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell. It supports wildcard characters such as * ? [].

glob.glob("*.csv") searches for all the files that have csv extension in the current directory. glob module makes it possible to search for files in the subdirectories as well.

>>> import glob
>>> glob.glob("*.csv")
['1.csv', '2.csv']
>>> glob.glob("**/*.csv",recursive=True)
['1.csv', '2.csv', 'source/3.csv']

os

The os module is powerful enough to cover most file operations. We can simply list all the entries in a directory using os.listdir() and use the string methods endswith() and startswith() to detect the pattern. If you want to traverse the directory tree, use os.walk().

import os


for file in os.listdir("."):
    if file.endswith(".csv"):
        print(file)
 
for root, dirs, files in os.walk("."):
    for file in files:
        if file.endswith(".csv"):
            # join with root, since file is only the bare name
            print(os.path.join(root, file))

pathlib

pathlib offers glob-style matching similar to the glob module, and it can search recursively as well. Compared to the previous solution based on os, pathlib needs less code and offers a more object-oriented solution.

from pathlib import Path


p = Path(".")
for name in p.glob("**/*.csv"): # recursive
    print(name)
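
Path.rglob(pattern) is a shorthand for a recursive glob. The sketch below builds a small directory tree in a temporary folder (the layout is an assumption that mirrors the 1.csv, 2.csv and source/3.csv files used earlier) and compares a recursive search with a top-level one.

```python
import pathlib
import tempfile

# Hypothetical directory tree for this illustration.
root = pathlib.Path(tempfile.mkdtemp())
(root / "source").mkdir()
for name in ("1.csv", "2.csv", "notes.txt", "source/3.csv"):
    (root / name).touch()

# rglob() searches the whole tree below root.
found = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.csv"))

# glob() without "**" only looks at the top level.
top_level = sorted(p.name for p in root.glob("*.csv"))

print(found)      # ['1.csv', '2.csv', 'source/3.csv']
print(top_level)  # ['1.csv', '2.csv']
```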

Play around with file path

Working with file paths is another common task. It can mean getting the relative or absolute path of a file, joining multiple paths, finding the parent directory, and so on.

Relative and absolute path

Both os and pathlib offer functions to get the relative path and absolute path of a file or a directory.

import os
import pathlib


print(os.path.abspath("1.txt"))  # absolute
print(os.path.relpath("1.txt"))  # relative


print(pathlib.Path("1.txt").absolute())  # absolute
print(pathlib.Path("1.txt"))  # relative
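
Both modules can also compute a path relative to a directory other than the current one. A minimal sketch, assuming a hypothetical project/source/2.csv layout; the paths do not need to exist for these calls.

```python
import os
import pathlib

# Hypothetical layout; nothing is read from disk here.
base = pathlib.Path("project")
target = base / "source" / "2.csv"

rel_os = os.path.relpath(target, start=base)   # a string
rel_pathlib = target.relative_to(base)         # a Path object

print(rel_os)
print(target.is_absolute())            # False
print(target.resolve().is_absolute())  # True
```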

Joining paths

This is how we can join paths in os and pathlib independently of the environment. pathlib overloads the / operator to create child paths.

import os
import pathlib


print(os.path.join("/home", "file.txt"))
print(pathlib.Path("/home") / "file.txt")

Getting the parent directory

os.path.dirname() is the function to get the parent directory in os, while in pathlib you can just use Path().parent.

import os
import pathlib


# relative path
print(os.path.dirname("source/2.csv"))
# source
print(pathlib.Path("source/2.csv").parent)
# source


# absolute path
print(pathlib.Path("source/2.csv").resolve().parent)
# /Users/<...>/project/source
print(os.path.dirname(os.path.abspath("source/2.csv")))
# /Users/<...>/project/source
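
A path object exposes more than just its parent. The sketch below uses PurePosixPath with a made-up path so that the output is the same on every platform; with a regular Path the same attributes are available.

```python
import pathlib

# Made-up path; PurePosixPath never touches the file system.
path = pathlib.PurePosixPath("/home/user/project/source/2.csv")

name = path.name                 # file name with extension
stem = path.stem                 # file name without extension
suffix = path.suffix             # extension, including the dot
parent = path.parent             # direct parent directory
grandparent = path.parents[1]    # one level further up
as_json = path.with_suffix(".json")
joined = pathlib.PurePosixPath("/home") / "user" / "file.txt"

print(name, stem, suffix)  # 2.csv 2 .csv
print(parent)              # /home/user/project/source
print(grandparent)         # /home/user/project
print(as_json)             # /home/user/project/source/2.json
print(joined)              # /home/user/file.txt
```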

os vs. pathlib

Last but not least, I want to briefly talk about os and pathlib. As the Python docs say, pathlib is a more object-oriented solution than os. It represents each file path as a proper object instead of a string. This brings a lot of advantages to developers: it is easier to join multiple paths, behavior is more consistent across operating systems, and methods are directly accessible from the object.

I hope this article can boost your efficiency in working with files.
