Python Cool Library Tour: The Third-Party Library Pandas (004)

神奇夜光杯

Column: Myelsa's Python Cool Library Tour. Tags: python, pandas, programming languages, standard and third-party library basics, learning and growth

Article link: https://blog.csdn.net/ygb_1024/article/details/140205408

I. Usage in Detail

5. The pandas.DataFrame.to_csv function

5-1. Syntax

# 5. The pandas.DataFrame.to_csv function
DataFrame.to_csv(path_or_buf=None, *, sep=',', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, mode='w', encoding=None, compression='infer', quoting=None, quotechar='"', lineterminator=None, chunksize=None, date_format=None, doublequote=True, escapechar=None, decimal='.', errors='strict', storage_options=None)
Write object to a comma-separated values (csv) file.

Parameters:
path_or_buf : str, path object, file-like object, or None, default None
String, path object (implementing os.PathLike[str]), or file-like object implementing a write() function. If None, the result is returned as a string. If a non-binary file object is passed, it should be opened with newline=’’, disabling universal newlines. If a binary file object is passed, mode might need to contain a ‘b’.

sep : str, default ‘,’
String of length 1. Field delimiter for the output file.

na_rep : str, default ‘’
Missing data representation.

float_format : str, Callable, default None
Format string for floating point numbers. If a Callable is given, it takes precedence over other numeric formatting parameters, like decimal.

columns : sequence, optional
Columns to write.

header : bool or list of str, default True
Write out the column names. If a list of strings is given it is assumed to be aliases for the column names.

index : bool, default True
Write row names (index).

index_label : str or sequence, or False, default None
Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the object uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R.

mode : {‘w’, ‘x’, ‘a’}, default ‘w’
Forwarded to either open(mode=) or fsspec.open(mode=) to control the file opening. Typical values include:

‘w’, truncate the file first.

‘x’, exclusive creation, failing if the file already exists.

‘a’, append to the end of file if it exists.

encoding : str, optional
A string representing the encoding to use in the output file, defaults to ‘utf-8’. encoding is not supported if path_or_buf is a non-binary file object.

compression : str or dict, default ‘infer’
For on-the-fly compression of the output data. If ‘infer’ and ‘path_or_buf’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). Set to None for no compression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'xz', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdCompressor, lzma.LZMAFile or tarfile.TarFile, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.

New in version 1.5.0: Added support for .tar files.

May be a dict with key ‘method’ as compression mode and other entries as additional compression options if compression mode is ‘zip’.

Passing compression options as keys in dict is supported for compression modes ‘gzip’, ‘bz2’, ‘zstd’, and ‘zip’.

quoting : optional constant from csv module
Defaults to csv.QUOTE_MINIMAL. If you have set a float_format then floats are converted to strings and thus csv.QUOTE_NONNUMERIC will treat them as non-numeric.

quotechar : str, default ‘"’
String of length 1. Character used to quote fields.

lineterminator : str, optional
The newline character or character sequence to use in the output file. Defaults to os.linesep, which depends on the OS in which this method is called (’\n’ for linux, ‘\r\n’ for Windows, i.e.).

Changed in version 1.5.0: Previously was line_terminator, changed for consistency with read_csv and the standard library ‘csv’ module.

chunksize : int or None
Rows to write at a time.

date_format : str, default None
Format string for datetime objects.

doublequote : bool, default True
Control quoting of quotechar inside a field.

escapechar : str, default None
String of length 1. Character used to escape sep and quotechar when appropriate.

decimal : str, default ‘.’
Character recognized as decimal separator. E.g. use ‘,’ for European data.

errors : str, default ‘strict’
Specifies how encoding and decoding errors are to be handled. See the errors argument for open() for a full list of options.

storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

Returns:
None or str
If path_or_buf is None, returns the resulting csv format as a string. Otherwise returns None.

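5-2. Parameters

5-2-1. path_or_buf (optional, default None): The file path (a string or path object) or any file-like object to write to. If None, the output is returned as a string instead of being written to a file.

5-2-2. sep (optional, default ','): The field delimiter. It can be changed to another single character as needed, such as a tab ('\t') for tab-separated values (TSV).

5-2-3. na_rep (optional, default ''): The representation of missing values (NaN). Any string can be used.

5-2-4. float_format (optional, default None): The format string for floating-point numbers. For example, '%.2f' formats floats as strings with two decimal places.

5-2-5. columns (optional, default None): The list of columns to write. If None, all columns are written.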

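5-2-6. header (optional, default True): Whether to write the column names as the first row of the file. If False, no column names are written. A list of strings is treated as aliases for the column names: it renames the columns in the output but does not select or reorder them (use columns for that).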

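5-2-7. index (optional, default True): Whether to write the row index to the file. If False, the index is not written.

5-2-8. index_label (optional, default None): The column label(s) for the index column(s). If False, no index names are written. A string, or a sequence of strings for a MultiIndex, is used as the index column name(s).

5-2-9. mode (optional, default 'w'): The file opening mode. 'w' truncates an existing file first, 'x' creates the file and fails if it already exists, and 'a' appends to the end of an existing file.

5-2-10. encoding (optional, default None): The encoding of the output file. It defaults to 'utf-8' when not specified.

5-2-11. compression (optional, default 'infer'): A string naming the compression method (such as 'gzip', 'bz2', 'zip', 'xz', 'zstd', or 'tar'), or a dict containing compression options. With 'infer' and a path-like path_or_buf, the method is detected from the file extension ('.gz', '.bz2', '.zip', '.xz', '.zst', '.tar', and so on); otherwise no compression is applied.

5-2-12. quoting (optional, default None): Controls when fields are quoted, using one of the constants from the csv module. Defaults to csv.QUOTE_MINIMAL.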

5-2-13. quotechar (optional, default '"'): The single character used to quote fields, for example fields that contain the delimiter.

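5-2-14. lineterminator (optional, default None): The newline character or character sequence. Defaults to os.linesep.

5-2-15. chunksize (optional, default None): The number of rows to write at a time. Writing in chunks may help when exporting large DataFrames.

5-2-16. date_format (optional, default None): The format string for datetime objects.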

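5-2-17. doublequote (optional, default True): Controls how a quotechar that appears inside a field is written. If True, the character is doubled (for example, " is written as ""). If False, it is prefixed with escapechar instead, so escapechar must be set.

5-2-18. escapechar (optional, default None): A single character used to escape sep and quotechar when appropriate. Following the csv module, it escapes quotechar when doublequote=False, and escapes the delimiter when quoting=csv.QUOTE_NONE.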

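5-2-19. decimal (optional, default '.'): The character used as the decimal separator. This is useful for locale-specific data, since some regions use a comma (,) as the decimal separator.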

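5-2-20. errors (optional, default 'strict'): Specifies how encoding errors are handled; see the errors argument of open() for the full list of options. Common values include 'strict' (the default, raises an exception), 'ignore' (drops characters that cannot be encoded), 'replace' (substitutes a replacement marker such as ?), and 'surrogatepass' (allows surrogate code points to be encoded and decoded).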

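5-2-21. storage_options (optional, default None): Extra options for storage backends that need them (such as S3 or GCS). For non-HTTP URLs the key-value pairs are forwarded to fsspec.open, so the accepted keys depend on the backend. For example, with the s3fs backend, credentials can typically be passed as storage_options={'key': '...', 'secret': '...'}; the bucket name belongs in the s3:// URL itself rather than in storage_options.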
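
The formatting parameters above can be combined in a single call. The following is a minimal sketch, assuming a small made-up DataFrame (the column names and values are illustrative only). It writes to a string so that the effect of columns, header, na_rep, float_format, and the default quoting behaviour is easy to inspect.

```python
import csv

import pandas as pd

# Illustrative data: one field contains a quote, one contains the delimiter,
# and two cells are missing.
df = pd.DataFrame({
    "name": ["Alice", 'Bob "Bobby"', "Carol, Jr."],
    "score": [90.456, None, 77.0],
    "city": ["New York", "Chicago", None],
})

text = df.to_csv(
    index=False,                # do not write the row index
    columns=["name", "score"],  # select which columns are written
    header=["Name", "Score"],   # aliases for the selected columns
    na_rep="N/A",               # how missing values are written
    float_format="%.2f",        # two decimal places for floats
    quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it (the default)
    lineterminator="\n",
)
print(text)
```

With the default doublequote=True, the quote characters inside the second name are doubled, and the name containing a comma is wrapped in quotes so that it still reads back as one field.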

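5-3. Function

Writes the data in a DataFrame to the specified file path or file-like object.

5-4. Return value

5-4-1. If path_or_buf is a file path or file-like object, DataFrame.to_csv() returns None, because the data is written directly to that target.

5-4-2. If path_or_buf is None, the function returns a string containing the CSV representation of the DataFrame, so the CSV text can be obtained without writing a file.

5-5. Notes

None.

5-6. Usage

5-6-1. Code examples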

# 5. The pandas.DataFrame.to_csv function
# 5-1. No return value
import pandas as pd
# Create a simple DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 27, 22],
    'City': ['New York', 'Los Angeles', 'Chicago']
})
# Export the DataFrame to a CSV file
csv_str = df.to_csv('people.csv', index=False)  # Note: writing to a path returns None
print(csv_str)

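# 5-2. With a return value
import pandas as pd
# Create a dictionary containing the data
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 27, 22],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
# Create a DataFrame from the dictionary
df = pd.DataFrame(data)
# Convert the DataFrame to a CSV-formatted string
# index=False: do not include the row index
# sep=';': use a semicolon as the delimiter
# na_rep='N/A': write missing values as 'N/A'
# lineterminator='\n': end each row with a newline (named line_terminator before pandas 1.5.0)
csv_string = df.to_csv(index=False, sep=';', na_rep='N/A', lineterminator='\n')
# Print the CSV string
print(csv_string)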

# 5-3. Specifying a file path
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 27, 22], 'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
csv_string = df.to_csv('data.csv', index=False)
print(csv_string)

# 5-4. Using a file object
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 27, 22], 'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
with open('data.csv', 'w', newline='') as file:  # newline='' disables universal newlines, as the docs recommend
    csv_string = df.to_csv(file, index=False)
print(csv_string)

# 5-5. Using StringIO
import pandas as pd
from io import StringIO
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 27, 22], 'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
buffer = StringIO()
df.to_csv(buffer, index=False)
csv_string = buffer.getvalue()
print(csv_string)

5-6-2. Output

# 5-1. No return value
None

# 5-2. With a return value
Name;Age;City
Alice;24;New York
Bob;27;Los Angeles
Charlie;22;Chicago

# 5-3. Specifying a file path
None

# 5-4. Using a file object
None

# 5-5. Using StringIO
Name,Age,City
Alice,24,New York
Bob,27,Los Angeles
Charlie,22,Chicago
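
The mode and compression parameters of to_csv are not exercised by the examples above. The following is a small sketch, assuming temporary files with made-up names (people.csv and people.csv.gz are illustrative only): it appends rows to an existing file with mode='a', and relies on compression='infer' to gzip the output based on the '.gz' extension.

```python
import os
import tempfile

import pandas as pd

first = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [24, 27]})
second = pd.DataFrame({"Name": ["Charlie"], "Age": [22]})

with tempfile.TemporaryDirectory() as tmp:
    # Append mode: write the header once, then add rows without repeating it.
    plain = os.path.join(tmp, "people.csv")
    first.to_csv(plain, index=False)                            # mode='w' by default
    second.to_csv(plain, index=False, mode="a", header=False)   # append rows only
    combined = pd.read_csv(plain)

    # compression='infer' (the default) selects gzip from the '.gz' extension.
    gz = os.path.join(tmp, "people.csv.gz")
    first.to_csv(gz, index=False)
    with open(gz, "rb") as fh:
        magic = fh.read(2)          # gzip files start with the bytes 1f 8b
    restored = pd.read_csv(gz)      # read_csv infers the compression as well

print(combined)
print(magic)
print(restored)
```

Passing header=False on the appending call matters: otherwise the column names would be written a second time in the middle of the file.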

6. The pandas.read_fwf function

6-1. Syntax

# 6. The pandas.read_fwf function
pandas.read_fwf(filepath_or_buffer, *, colspecs='infer', widths=None, infer_nrows=100, dtype_backend=_NoDefault.no_default, iterator=False, chunksize=None, **kwds)
Read a table of fixed-width formatted lines into DataFrame.

Also supports optionally iterating or breaking of the file into chunks.

Additional help can be found in the online docs for IO Tools.

Parameters:
filepath_or_buffer : str, path object, or file-like object
String, path object (implementing os.PathLike[str]), or file-like object implementing a text read() function. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.csv.

colspecs : list of tuple (int, int) or ‘infer’, optional
A list of tuples giving the extents of the fixed-width fields of each line as half-open intervals (i.e., [from, to[ ). String value ‘infer’ can be used to instruct the parser to try detecting the column specifications from the first 100 rows of the data which are not being skipped via skiprows (default=’infer’).

widths : list of int, optional
A list of field widths which can be used instead of ‘colspecs’ if the intervals are contiguous.

infer_nrows : int, default 100
The number of rows to consider when letting the parser determine the colspecs.

dtype_backend : {‘numpy_nullable’, ‘pyarrow’}, default ‘numpy_nullable’
Back-end data type applied to the resultant DataFrame (still experimental). Behaviour is as follows:

"numpy_nullable": returns nullable-dtype-backed DataFrame (default).

"pyarrow": returns pyarrow-backed nullable ArrowDtype DataFrame.

New in version 2.0.

**kwds : optional
Optional keyword arguments can be passed to TextFileReader.

Returns:
DataFrame or TextFileReader
A comma-separated values (csv) file is returned as two-dimensional data structure with labeled axes.

6-2. Parameters

6-2-1. filepath_or_buffer (required): A string, path object, or file-like object identifying the file to read. If it is a file path, make sure pandas can access the file.

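6-2-2. colspecs (optional, default 'infer'): The extents of the fixed-width fields. It is a list of tuples, each containing two integers that give the start and end position of a column as a half-open interval (positions start at 0 and the end position is excluded). If set to 'infer', pandas tries to detect the column boundaries automatically.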

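6-2-3. widths (optional, default None): An alternative to colspecs for contiguous fields. It takes a list of integers giving the width of each column directly. Specify either colspecs or widths, not both: passing an explicit colspecs list together with widths raises a ValueError.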

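6-2-4. infer_nrows (optional, default 100): The number of rows used when inferring the column boundaries. With colspecs='infer', pandas reads the first infer_nrows rows (excluding any skipped via skiprows) to detect the column specifications. The value can be adjusted to suit the size and complexity of the file.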

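6-2-5. dtype_backend (optional): Selects the back-end data types of the resulting DataFrame. 'numpy_nullable' returns a DataFrame backed by nullable dtypes, and 'pyarrow' returns one backed by pyarrow ArrowDtype. The option was added in pandas 2.0 and is still marked experimental; most users do not need to set it.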

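6-2-6. iterator (optional, default False): A boolean. If True, a TextFileReader object is returned, which reads the file in chunks by iteration instead of loading the whole file into memory at once. This is useful for large files.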

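6-2-7. chunksize (optional, default None): The number of rows per chunk. When it is set, read_fwf returns a TextFileReader that yields DataFrames of up to chunksize rows each. It is independent of infer_nrows.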

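6-2-8. **kwds (optional): Additional keyword arguments, which are passed on to TextFileReader. Commonly used ones include header (the row number to use as column names; pass header=None when the file has no header row) and names (a custom list of column names, used when the file has no header row).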
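
The chunksize parameter described above can be sketched as follows. This is a minimal illustration, assuming made-up fixed-width data held in a StringIO buffer rather than a real file; the column names and widths are assumptions chosen for the example.

```python
from io import StringIO

import pandas as pd

# Build fixed-width lines: 5-character ID, 10-character name, 2-character age.
rows = [(1, "Alice", 24), (2, "Bob", 27), (3, "Charlie", 22),
        (4, "Dana", 31), (5, "Eve", 29)]
raw = "".join(f"{i:05d}{name:<10}{age:2d}\n" for i, name, age in rows)

# With chunksize set, read_fwf returns a TextFileReader instead of a DataFrame.
reader = pd.read_fwf(
    StringIO(raw),
    widths=[5, 10, 2],
    header=None,
    names=["ID", "Name", "Age"],
    chunksize=2,
)

sizes = []
total_age = 0
for chunk in reader:            # each chunk is a DataFrame of up to 2 rows
    sizes.append(len(chunk))
    total_age += int(chunk["Age"].sum())

print(sizes)
print(total_age)
```

Processing each chunk as it arrives keeps only a few rows in memory at a time, which is the point of iterating over large fixed-width files.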

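6-3. Function

Parses a fixed-width formatted text file into a pandas DataFrame.

6-4. Return value

Returns a DataFrame, or a TextFileReader when iterator=True or chunksize is set.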

6-5. Notes

The dtype_backend parameter was added in pandas 2.0 and is still marked experimental; it is not deprecated. In most cases users do not need to set it.
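
A small sketch of what dtype_backend changes in practice, assuming pandas 2.0 or later and made-up data with one missing age. Only 'numpy_nullable' is shown, since 'pyarrow' additionally requires the pyarrow package to be installed.

```python
from io import StringIO

import pandas as pd

# 10-character name followed by a 2-character age; Bob's age is missing.
rows = [("Alice", "24"), ("Bob", ""), ("Charlie", "22")]
raw = "".join(f"{name:<10}{age:>2}\n" for name, age in rows)

# Without dtype_backend, the missing value forces the column to float with NaN.
default_df = pd.read_fwf(StringIO(raw), widths=[10, 2],
                         header=None, names=["Name", "Age"])

# With the nullable backend, the column can stay integer and holds <NA> instead.
nullable_df = pd.read_fwf(StringIO(raw), widths=[10, 2],
                          header=None, names=["Name", "Age"],
                          dtype_backend="numpy_nullable")

print(default_df["Age"].dtype)
print(nullable_df["Age"].dtype)
print(nullable_df)
```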

6-6. Usage

6-6-1. Code examples

# 6. The pandas.read_fwf function
# 6-1. Create a .txt file for testing
# Write the strings directly with Python file operations
with open('example.txt', 'w') as f:
    f.write('12345John Doe  25  New York\n')
    f.write('67890Jane Smith30  Los Angeles\n')

# 6-2. Basic usage
import pandas as pd
# Assume the column widths are 5, 10, 2, and 14
colspecs = [(0, 5), (5, 15), (15, 17), (17, 31)]
# Read the file
df = pd.read_fwf('example.txt', colspecs=colspecs, header=None, names=['ID', 'Name', 'Age', 'City'])
# Display the DataFrame
print(df)

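# 6-3. Inferring column widths automatically
import pandas as pd
# Try to infer the column widths; this assumes the first 100 rows are enough for inference
df = pd.read_fwf('example.txt', colspecs='infer', header=None, names=['ID', 'Name', 'Age', 'City'], infer_nrows=100)
# Display the DataFrame
print(df)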

# 6-4. Using the widths parameter
import pandas as pd
# Specify the column widths with the widths parameter
widths = [5, 10, 2, 14]  # Widths of ID, Name, Age, and City respectively
# Read the file
df = pd.read_fwf('example.txt', widths=widths, header=None, names=['ID', 'Name', 'Age', 'City'])
# Display the DataFrame
print(df)

6-6-2. Output

# 6-1. Create a .txt file for testing
# None

# 6-2. Basic usage
#       ID        Name  Age         City
# 0  12345    John Doe   25     New York
# 1  67890  Jane Smith   30  Los Angeles

# 6-3. Inferring column widths automatically
#           ID     Name  Age     City
# 0  12345John  Doe  25  New     York
# 1  67890Jane  Smith30  Los  Angeles

# 6-4. Using the widths parameter
#       ID        Name  Age         City
# 0  12345    John Doe   25     New York
# 1  67890  Jane Smith   30  Los Angeles
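
The inferred result in 6-3 splits the fields in the wrong places because inference relies on character positions that are blank in every sampled row: in example.txt the ID runs straight into the name, "Smith30" has no gap, and the names and cities contain spaces. The sketch below, assuming made-up data in which every field is separated by at least one blank column and no field contains a space, shows a layout where colspecs='infer' would be expected to recover the intended columns.

```python
from io import StringIO

import pandas as pd

# Each field is padded to a fixed width and separated by one blank column.
rows = [(1, "Alice", 24), (2, "Bob", 27), (3, "Charlie", 22)]
raw = "".join(f"{i:05d} {name:<10} {age:2d}\n" for i, name, age in rows)

# colspecs defaults to 'infer': boundaries are detected from the blank columns.
df = pd.read_fwf(StringIO(raw), header=None, names=["ID", "Name", "Age"])
print(df)
```

When a file does not have such clean gaps, as in example.txt, passing explicit colspecs or widths is the more reliable choice.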