The Python Library pandas, Part 2

IT_Beijing_BIT

Published 2024-10-03 04:52:31

Article link: https://blog.csdn.net/IT_Beijing_BIT/article/details/142486629

Basic Data Structures
- DataFrame

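Basic Data Structures

pandas provides two classes for handling data:

Series: a one-dimensional array that holds data of any type, such as integers, strings, or Python objects.
DataFrame: a two-dimensional data structure that holds data like a 2D array or a table with rows and columns.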
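
To make the two classes concrete, here is a minimal sketch that builds one Series and one DataFrame by hand. The labels and values are illustrative only (they mirror the sample data used later in this article).

```python
import pandas as pd

# A Series is one-dimensional: one value per index label.
s = pd.Series([35, 45], index=["John", "Tom"], name="age")

# A Series can hold mixed Python objects; the dtype then falls back to object.
mixed = pd.Series([1, "two", 3.0])

# A DataFrame is two-dimensional: each dict key becomes a column.
df = pd.DataFrame({"name": ["John", "Tom"], "age": [35, 45]})

print(s.ndim, df.ndim)   # number of dimensions of each object
print(df.shape)          # (rows, columns)
```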

DataFrame

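A DataFrame holds two-dimensional, size-mutable, potentially heterogeneous tabular data.

The data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. A DataFrame can be thought of as a dict-like container for Series objects. It is the primary pandas data structure.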
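
The label alignment described above can be sketched as follows. The frames `a` and `b` and their labels are made up for illustration; the point is that `+` matches cells by row and column label, not by position.

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["r1", "r2"])
b = pd.DataFrame({"y": [10, 20], "z": [30, 40]}, index=["r2", "r3"])

# The result covers the union of the labels of both frames.
# Only the cell (r2, y) exists in both inputs, so every other cell is NaN.
total = a + b
print(total)

# add() with fill_value treats a cell that is missing in one frame as 0;
# cells missing in both frames stay NaN.
filled = a.add(b, fill_value=0)
print(filled)

# Dict-like access: selecting one column returns a Series.
column_x = a["x"]
```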

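Constructor

Syntax: pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)

Parameters

data: an ndarray, Iterable, dict, or DataFrame.
A dict can contain Series, arrays, constants, dataclass instances, or list-like objects.
If data is a dict, the column order follows the insertion order. If the dict contains Series that have an index defined, the data is aligned by that index.
The same alignment happens when data is itself a Series or DataFrame.
If data is a list of dicts, the column order follows the insertion order.
index: an Index or array-like.
The index to use for the resulting frame. Defaults to a RangeIndex if the input data carries no index information and no index is provided.
columns: an Index or array-like.
The column labels to use for the resulting frame when the data has none; defaults to RangeIndex(0, 1, 2, …, n). If the data already contains column labels, column selection is performed instead.
dtype: a dtype, default None.
The data type to force. Only a single dtype is allowed. If None, the dtype is inferred.
copy: bool or None, default None.
Whether to copy data from the inputs. For dict data, the default of None behaves like copy=True. For DataFrame or 2D ndarray input, the default of None behaves like copy=False. If data is a dict containing one or more Series (possibly of different dtypes), copy=False ensures that these inputs are not copied.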
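
The parameter rules above can be tried out with a short sketch. All variable names and values below are illustrative assumptions, not taken from the original article.

```python
import pandas as pd

# dict of Series: the Series are aligned on their own index labels.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0], index=["b", "c"])
df_aligned = pd.DataFrame({"x": s1, "y": s2})   # y is NaN for label "a"

# list of dicts plus an explicit index; column order follows insertion order.
records = [{"name": "John", "age": 35}, {"name": "Tom", "age": 45}]
df_records = pd.DataFrame(records, index=["r1", "r2"])

# The dict already has column labels, so `columns` selects and orders them.
df_selected = pd.DataFrame(
    {"a": [1, 2], "b": [3, 4], "c": [5, 6]}, columns=["c", "a"]
)

# Plain nested lists have no labels, so `columns` supplies them;
# dtype forces one data type for the whole frame.
df_float = pd.DataFrame([[1, 2], [3, 4]], columns=["p", "q"], dtype="float64")

print(df_aligned, df_records, df_selected, df_float, sep="\n\n")
```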

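Attributes

Attribute	Description
T	The transpose of the DataFrame.
at	Access a single value for a row/column label pair.
attrs	Dictionary of global attributes of this dataset.
axes	Return a list representing the axes of the DataFrame.
columns	The column labels of the DataFrame.
dtypes	Return the dtypes in the DataFrame.
empty	Indicator whether the Series/DataFrame is empty.
flags	Get the properties associated with this pandas object.
iat	Access a single value for a row/column pair by integer position.
iloc	Purely integer-location based indexing for selection by position.
index	The index (row labels) of the DataFrame.
loc	Access a group of rows and columns by label(s) or a boolean array.
ndim	Return an int representing the number of axes / array dimensions.
shape	Return a tuple representing the dimensionality of the DataFrame.
size	Return an int representing the number of elements in this object.
style	Return a Styler object.
values	Return a NumPy representation of the DataFrame.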
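
As a quick illustration of the indexing attributes in the table, the sketch below builds the same two rows as test_1.csv in memory (an assumption made so the snippet runs without the file) and reads values through at, iat, loc, and iloc.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Tom"],
    "age": [35, 45],
    "weight": [150, 170],
    "height": [170, 180],
    "salary": [10000.19, 8000.51],
})

age_by_label = df.at[0, "age"]   # single value, addressed by row/column label
age_by_pos = df.iat[0, 1]        # same value, addressed by integer position
first_row = df.iloc[0]           # whole first row as a Series, by position
older = df.loc[df["age"] > 40, ["name", "salary"]]   # boolean mask + labels

print(age_by_label, age_by_pos)
print(first_row)
print(older)
print(df.ndim, df.size, df.empty)
print(df.dtypes)
```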

Using the attributes

Contents of test_1.csv

name,age,weight,height,salary
"John", 35, 150, 170,10000.19
"Tom", 45, 170, 180,8000.51

The T attribute

>>> import pandas as pd
>>> df = pd.read_csv("test_1.csv")
>>> print(df)
   name  age  weight  height    salary
0  John   35     150     170  10000.19
1   Tom   45     170     180   8000.51
>>> print(df.T)
               0        1
name        John      Tom
age           35       45
weight       150      170
height       170      180
salary  10000.19  8000.51
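
One side effect of transposing is worth a small sketch: each column of a DataFrame has a single dtype, so when the original columns have different types, the transposed columns fall back to object. The frame below is an in-memory stand-in for test_1.csv.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Tom"],
    "age": [35, 45],
    "weight": [150, 170],
    "height": [170, 180],
    "salary": [10000.19, 8000.51],
})

t = df.T
print(t.shape)    # rows and columns are swapped
print(df.dtypes)  # one dtype per original column
print(t.dtypes)   # every transposed column mixes text and numbers

# Transposing twice restores the original shape and labels.
round_trip = t.T
```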

The axes, index, and shape attributes

>>> df = pd.read_csv("test_1.csv")
>>> print(df.axes)
[RangeIndex(start=0, stop=2, step=1), Index(['name', 'age', 'weight', 'height', 'salary'], dtype='object')]
>>> print(df.axes[0])
RangeIndex(start=0, stop=2, step=1)
>>> print(df.axes[1])
Index(['name', 'age', 'weight', 'height', 'salary'], dtype='object')
>>> print(df.index)
RangeIndex(start=0, stop=2, step=1)
>>> print(df.shape)
(2, 5)

The values attribute

>>> df = pd.read_csv("test_1.csv")
>>> print(df.values)
[['John' 35 150 170 10000.19]
 ['Tom' 45 170 180 8000.51]]
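
The array printed above has a single dtype for all cells, which is why strings and numbers appear side by side. The sketch below, again using an in-memory copy of the sample data, compares values with to_numpy() on a numeric subset; the column subsets are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Tom"],
    "age": [35, 45],
    "weight": [150, 170],
    "height": [170, 180],
    "salary": [10000.19, 8000.51],
})

# Mixed columns: NumPy needs one dtype, so the array holds Python objects.
mixed = df.values
print(mixed.dtype)

# Selecting only integer columns keeps an integer array.
numeric = df[["age", "weight", "height"]].to_numpy()
print(numeric.dtype, numeric.shape)

# to_numpy() can also force a dtype explicitly.
as_float = df[["age", "salary"]].to_numpy(dtype="float64")
print(as_float)
```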

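Member functions

Member function	Description
abs()	Return a Series/DataFrame with the absolute numeric value of each element.
add(other[, axis, level, fill_value])	Get the addition of the DataFrame and other, element-wise (binary operator add).
add_prefix(prefix[, axis])	Prefix labels with the string prefix.
add_suffix(suffix[, axis])	Suffix labels with the string suffix.
agg([func, axis])	Aggregate using one or more operations over the specified axis.
aggregate([func, axis])	Aggregate using one or more operations over the specified axis.
align(other[, join, axis, level, copy, …])	Align two objects on their axes with the specified join method.
all([axis, bool_only, skipna])	Return whether all elements are True, potentially over an axis.
any(*[, axis, bool_only, skipna])	Return whether any element is True, potentially over an axis.
apply(func[, axis, raw, result_type, args, …])	Apply a function along an axis of the DataFrame.
applymap(func[, na_action])	Apply a function to a DataFrame element-wise. Deprecated.
asfreq(freq[, method, how, normalize, …])	Convert a time series to the specified frequency.
asof(where[, subset])	Return the last row(s) without any NaNs before where.
assign(**kwargs)	Assign new columns to a DataFrame.
astype(dtype[, copy, errors])	Cast a pandas object to the specified dtype.
at_time(time[, asof, axis])	Select values at a particular time of day (e.g., 9:30am).
backfill(*[, axis, inplace, limit, downcast])	Fill NA/NaN values by using the next valid observation to fill the gap. Deprecated.
between_time(start_time, end_time[, …])	Select values between particular times of the day (e.g., 9:00-9:30am).
bfill(*[, axis, inplace, limit, limit_area, …])	Fill NA/NaN values by using the next valid observation to fill the gap.
bool()	Return the bool of a single-element Series or DataFrame. Deprecated.
boxplot([column, by, ax, fontsize, rot, …])	Make a box plot from DataFrame columns.
clip([lower, upper, axis, inplace])	Trim values at the input threshold(s).
combine(other, func[, fill_value, overwrite])	Perform a column-wise combine with another DataFrame.
combine_first(other)	Update null elements with the value in the same location in other.
compare(other[, align_axis, keep_shape, …])	Compare to another DataFrame and show the differences.
convert_dtypes([infer_objects, …])	Convert columns to the best possible dtypes using dtypes that support pd.NA.
copy([deep])	Make a copy of this object's indices and data.
corr([method, min_periods, numeric_only])	Compute the pairwise correlation of columns, excluding NA/null values.
corrwith(other[, axis, drop, method, …])	Compute pairwise correlation.
count([axis, numeric_only])	Count non-NA cells for each column or row.
cov([min_periods, ddof, numeric_only])	Compute the pairwise covariance of columns, excluding NA/null values.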
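cummax([axis, skipna])	Return the cumulative maximum over a DataFrame or Series axis.
cummin([axis, skipna])	Return the cumulative minimum over a DataFrame or Series axis.
cumprod([axis, skipna])	Return the cumulative product over a DataFrame or Series axis.
cumsum([axis, skipna])	Return the cumulative sum over a DataFrame or Series axis.
describe([percentiles, include, exclude])	Generate descriptive statistics.
diff([periods, axis])	First discrete difference of element.
div(other[, axis, level, fill_value])	Get the floating division of the DataFrame and other, element-wise (binary operator truediv).
divide(other[, axis, level, fill_value])	Get the floating division of the DataFrame and other, element-wise (binary operator truediv).
dot(other)	Compute the matrix multiplication between the DataFrame and other.
drop([labels, axis, index, columns, level, …])	Drop the specified labels from rows or columns.
drop_duplicates([subset, keep, inplace, …])	Return a DataFrame with duplicate rows removed.
droplevel(level[, axis])	Return a Series/DataFrame with the requested index/column level(s) removed.
dropna(*[, axis, how, thresh, subset, …])	Remove missing values.
duplicated([subset, keep])	Return a boolean Series denoting duplicate rows.
eq(other[, axis, level])	Get equal to of the DataFrame and other, element-wise (binary operator eq).
equals(other)	Test whether two objects contain the same elements.
eval(expr, *[, inplace])	Evaluate a string describing operations on DataFrame columns.
ewm([com, span, halflife, alpha, …])	Provide exponentially weighted (EW) calculations.
expanding([min_periods, axis, method])	Provide expanding window calculations.
explode(column[, ignore_index])	Transform each element of a list-like to a row, replicating index values.
ffill(*[, axis, inplace, limit, limit_area, …])	Fill NA/NaN values by propagating the last valid observation to the next valid one.
fillna([value, method, axis, inplace, …])	Fill NA/NaN values using the specified method.
filter([items, like, regex, axis])	Subset the DataFrame rows or columns according to the specified index labels.
first(offset)	Select the initial periods of time series data based on a date offset. Deprecated.
first_valid_index()	Return the index of the first non-NA value, or None if no non-NA value is found.
floordiv(other[, axis, level, fill_value])	Get the integer division of the DataFrame and other, element-wise (binary operator floordiv).
from_dict(data[, orient, dtype, columns])	Construct a DataFrame from a dict of array-likes or dicts.
from_records(data[, index, exclude, …])	Convert a structured or record ndarray to a DataFrame.
ge(other[, axis, level])	Get greater than or equal to of the DataFrame and other, element-wise (binary operator ge).
get(key[, default])	Get an item from the object for the given key (e.g., a DataFrame column).
groupby([by, axis, level, as_index, sort, …])	Group the DataFrame using a mapper or by a Series of columns.
gt(other[, axis, level])	Get greater than of the DataFrame and other, element-wise (binary operator gt).
head([n])	Return the first n rows.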
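hist([column, by, grid, xlabelsize, xrot, …])	Make a histogram of the DataFrame's columns.
idxmax([axis, skipna, numeric_only])	Return the index of the first occurrence of the maximum over the requested axis.
idxmin([axis, skipna, numeric_only])	Return the index of the first occurrence of the minimum over the requested axis.
infer_objects([copy])	Attempt to infer better dtypes for object columns.
info([verbose, buf, max_cols, memory_usage, …])	Print a concise summary of a DataFrame.
insert(loc, column, value[, allow_duplicates])	Insert a column into the DataFrame at the specified location.
interpolate([method, axis, limit, inplace, …])	Fill NaN values using an interpolation method.
isetitem(loc, value)	Set the given value in the column with position loc.
isin(values)	Whether each element in the DataFrame is contained in values.
isna()	Detect missing values.
isnull()	DataFrame.isnull is an alias for DataFrame.isna.
items()	Iterate over (column name, Series) pairs.
iterrows()	Iterate over DataFrame rows as (index, Series) pairs.
itertuples([index, name])	Iterate over DataFrame rows as namedtuples.
join(other[, on, how, lsuffix, rsuffix, …])	Join columns of another DataFrame.
keys()	Get the "info axis".
kurt([axis, skipna, numeric_only])	Return the unbiased kurtosis over the requested axis.
kurtosis([axis, skipna, numeric_only])	Return the unbiased kurtosis over the requested axis.
last(offset)	Select the final periods of time series data based on a date offset. Deprecated.
last_valid_index()	Return the index of the last non-NA value, or None if no non-NA value is found.
le(other[, axis, level])	Get less than or equal to of the DataFrame and other, element-wise (binary operator le).
lt(other[, axis, level])	Get less than of the DataFrame and other, element-wise (binary operator lt).
map(func[, na_action])	Apply a function to a DataFrame element-wise.
mask(cond[, other, inplace, axis, level])	Replace values where the condition is True.
max([axis, skipna, numeric_only])	Return the maximum of the values over the requested axis.
mean([axis, skipna, numeric_only])	Return the mean of the values over the requested axis.
median([axis, skipna, numeric_only])	Return the median of the values over the requested axis.
melt([id_vars, value_vars, var_name, …])	Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
memory_usage([index, deep])	Return the memory usage of each column in bytes.
merge(right[, how, on, left_on, right_on, …])	Merge DataFrame or named Series objects with a database-style join.
min([axis, skipna, numeric_only])	Return the minimum of the values over the requested axis.
mod(other[, axis, level, fill_value])	Get the modulo of the DataFrame and other, element-wise (binary operator mod).
mode([axis, numeric_only, dropna])	Get the mode(s) of each element along the selected axis.
mul(other[, axis, level, fill_value])	Get the multiplication of the DataFrame and other, element-wise (binary operator mul).
multiply(other[, axis, level, fill_value])	Get the multiplication of the DataFrame and other, element-wise (binary operator mul).
ne(other[, axis, level])	Get not equal to of the DataFrame and other, element-wise (binary operator ne).
nlargest(n, columns[, keep])	Return the first n rows ordered by columns in descending order.
notna()	Detect existing (non-missing) values.
notnull()	DataFrame.notnull is an alias for DataFrame.notna.
nsmallest(n, columns[, keep])	Return the first n rows ordered by columns in ascending order.
nunique([axis, dropna])	Count the number of distinct elements in the specified axis.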
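pad(*[, axis, inplace, limit, downcast])	Fill NA/NaN values by propagating the last valid observation to the next valid one. Deprecated.
pct_change([periods, fill_method, limit, freq])	Fractional change between the current and a prior element.
pipe(func, *args, **kwargs)	Apply chainable functions that expect a Series or DataFrame.
pivot(*, columns[, index, values])	Return a reshaped DataFrame organized by the given index / column values.
pivot_table([values, index, columns, …])	Create a spreadsheet-style pivot table as a DataFrame.
pop(item)	Return the item and drop it from the frame.
pow(other[, axis, level, fill_value])	Get the exponential power of the DataFrame and other, element-wise (binary operator pow).
prod([axis, skipna, numeric_only, min_count])	Return the product of the values over the requested axis.
product([axis, skipna, numeric_only, min_count])	Return the product of the values over the requested axis.
quantile([q, axis, numeric_only, …])	Return the values at the given quantile over the requested axis.
query(expr, *[, inplace])	Query the columns of a DataFrame with a boolean expression.
radd(other[, axis, level, fill_value])	Get the addition of the DataFrame and other, element-wise (binary operator radd).
rank([axis, method, numeric_only, …])	Compute numerical data ranks (1 through n) along the axis.
rdiv(other[, axis, level, fill_value])	Get the floating division of the DataFrame and other, element-wise (binary operator rtruediv).
reindex([labels, index, columns, axis, …])	Conform the DataFrame to a new index with optional filling logic.
reindex_like(other[, method, copy, limit, …])	Return an object with indices matching those of the other object.
rename([mapper, index, columns, axis, copy, …])	Rename columns or index labels.
rename_axis([mapper, index, columns, axis, …])	Set the name of the axis for the index or columns.
reorder_levels(order[, axis])	Rearrange index levels using the input order.
replace([to_replace, value, inplace, limit, …])	Replace the values given in to_replace with value.
resample(rule[, axis, closed, label, …])	Resample time-series data.
reset_index([level, drop, inplace, …])	Reset the index, or a level of it.
rfloordiv(other[, axis, level, fill_value])	Get the integer division of the DataFrame and other, element-wise (binary operator rfloordiv).
rmod(other[, axis, level, fill_value])	Get the modulo of the DataFrame and other, element-wise (binary operator rmod).
rmul(other[, axis, level, fill_value])	Get the multiplication of the DataFrame and other, element-wise (binary operator rmul).
rolling(window[, min_periods, center, …])	Provide rolling window calculations.
round([decimals])	Round a DataFrame to a variable number of decimal places.
rpow(other[, axis, level, fill_value])	Get the exponential power of the DataFrame and other, element-wise (binary operator rpow).
rsub(other[, axis, level, fill_value])	Get the subtraction of the DataFrame and other, element-wise (binary operator rsub).
rtruediv(other[, axis, level, fill_value])	Get the floating division of the DataFrame and other, element-wise (binary operator rtruediv).
sample([n, frac, replace, weights, …])	Return a random sample of items from an axis of the object.
select_dtypes([include, exclude])	Return a subset of the DataFrame's columns based on the column dtypes.
sem([axis, skipna, ddof, numeric_only])	Return the unbiased standard error of the mean over the requested axis.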
set_axis(labels, *[, axis, copy])	Assign desired index to given axis.
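set_flags(*[, copy, allows_duplicate_labels])	Return a new object with updated flags.
set_index(keys, *[, drop, append, inplace, …])	Set the DataFrame index using existing columns.
shift([periods, freq, axis, fill_value, suffix])	Shift the index by the desired number of periods with an optional time frequency.
skew([axis, skipna, numeric_only])	Return the unbiased skew over the requested axis.
sort_index(*[, axis, level, ascending, …])	Sort the object by labels (along an axis).
sort_values(by, *[, axis, ascending, …])	Sort by the values along either axis.
squeeze([axis])	Squeeze one-dimensional axis objects into scalars.
stack([level, dropna, sort, future_stack])	Stack the prescribed level(s) from columns to index.
std([axis, skipna, ddof, numeric_only])	Return the sample standard deviation over the requested axis.
sub(other[, axis, level, fill_value])	Get the subtraction of the DataFrame and other, element-wise (binary operator sub).
subtract(other[, axis, level, fill_value])	Get the subtraction of the DataFrame and other, element-wise (binary operator sub).
sum([axis, skipna, numeric_only, min_count])	Return the sum of the values over the requested axis.
swapaxes(axis1, axis2[, copy])	Interchange axes and swap values axes appropriately. Deprecated.
swaplevel([i, j, axis])	Swap levels i and j in a MultiIndex.
tail([n])	Return the last n rows.
take(indices[, axis])	Return the elements in the given positional indices along an axis.
to_clipboard(*[, excel, sep])	Copy the object to the system clipboard.
to_csv([path_or_buf, sep, na_rep, …])	Write the object to a comma-separated values (csv) file.
to_dict([orient, into, index])	Convert the DataFrame to a dictionary.
to_excel(excel_writer, *[, sheet_name, …])	Write the object to an Excel sheet.
to_feather(path, **kwargs)	Write a DataFrame to the binary Feather format.
to_gbq(destination_table, *[, project_id, …])	Write a DataFrame to a Google BigQuery table. Deprecated.
to_hdf(path_or_buf, *, key[, mode, …])	Write the contained data to an HDF5 file using HDFStore.
to_html([buf, columns, col_space, header, …])	Render a DataFrame as an HTML table.
to_json([path_or_buf, orient, date_format, …])	Convert the object to a JSON string.
to_latex([buf, columns, header, index, …])	Render the object to a LaTeX tabular, longtable, or nested table.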
to_markdown([buf, mode, index, storage_options])	Print the DataFrame in a Markdown-friendly format.
to_numpy([dtype, copy, na_value])	Convert the DataFrame to a NumPy array.
to_orc([path, engine, index, engine_kwargs])	Write a DataFrame to the ORC format.
to_parquet([path, engine, compression, …])	Write a DataFrame to the binary parquet format.
to_period([freq, axis, copy])	Convert the DataFrame from a DatetimeIndex to a PeriodIndex.
to_pickle(path, *[, compression, protocol, …])	Pickle (serialize) the object to a file.
to_records([index, column_dtypes, index_dtypes])	Convert the DataFrame to a NumPy record array.
to_sql(name, con, *[, schema, if_exists, …])	Write the records stored in a DataFrame to a SQL database.
to_stata(path, *[, convert_dates, …])	Export the DataFrame object to the Stata dta format.
to_string([buf, columns, col_space, header, …])	Render a DataFrame to console-friendly tabular output.
to_timestamp([freq, how, axis, copy])	Cast to a DatetimeIndex of timestamps, at the beginning of the period.
to_xarray()	Return an xarray object from the pandas object.
to_xml([path_or_buffer, index, root_name, …])	Render a DataFrame to an XML document.
transform(func[, axis])	Call func on self, producing a DataFrame with the same axis shape as self.
transpose(*args[, copy])	Transpose the index and columns.
truediv(other[, axis, level, fill_value])	Get the floating division of the DataFrame and other, element-wise (binary operator truediv).
truncate([before, after, axis, copy])	Truncate a Series or DataFrame before and after some index value.
tz_convert(tz[, axis, level, copy])	Convert a tz-aware axis to the target time zone.
tz_localize(tz[, axis, level, copy, …])	Localize the tz-naive index of a Series or DataFrame to the target time zone.
unstack([level, fill_value, sort])	Pivot a level of the (necessarily hierarchical) index labels.
update(other[, join, overwrite, …])	Modify in place using non-NA values from another DataFrame.
value_counts([subset, normalize, sort, …])	Return a Series containing the frequency of each distinct row in the DataFrame.
var([axis, skipna, ddof, numeric_only])	Return the unbiased variance over the requested axis.
where(cond[, other, inplace, axis, level])	Replace values where the condition is False.
xs(key[, axis, level, drop_level])	Return a cross-section from the Series/DataFrame.
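
A handful of the member functions listed above can be combined in a short sketch. The data is made up for illustration: the third row ("Ann") and the missing age are hypothetical additions to the sample data, included so that sorting and filling have something to do. ffill() is used rather than pad(), since the table marks pad() as deprecated.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Tom", "Ann"],        # "Ann" is a hypothetical extra row
    "age": [35, 45, None],                 # None becomes NaN
    "salary": [10000.19, 8000.51, 9000.00],
})

print(df.head(2))                                   # first two rows

by_salary = df.sort_values("salary", ascending=False)
top_two = df.nlargest(2, "salary")                  # two highest salaries
filled = df.ffill()                                 # propagate last valid value down
with_bonus = df.assign(bonus=df["salary"] * 0.1)    # returns a new frame
over_9000 = df.query("salary > 9000")               # boolean expression on columns
stats = df[["age", "salary"]].agg(["min", "max", "mean"])

print(by_salary, filled, with_bonus, over_9000, stats, sep="\n\n")
```

None of these calls modifies df itself; each returns a new object, which is why assign() leaves the original frame without a bonus column.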

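Using the member functions

The script below (pandas_data_2.py) reads a CSV file, then uses Series.map to convert the name column to upper case and to add 15 to every value in the salary column.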

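import sys

import pandas as pd


def upper_string(p):
    """Return the string p converted to upper case."""
    return p.upper()


def add_salary(p):
    """Return the salary p increased by 15."""
    return p + 15


def apply_member_func(dat):
    """Read the CSV file dat and transform two columns with Series.map."""
    df = pd.read_csv(dat)
    print(df)
    # Series.map calls upper_string once for every value in the name column.
    df['name'] = df['name'].map(upper_string)
    print()
    print(df)
    # Likewise, add_salary is applied to every value in the salary column.
    df['salary'] = df['salary'].map(add_salary)
    print()
    print(df)


if __name__ == "__main__":
    # Usage: python pandas_data_2.py test_1.csv
    apply_member_func(sys.argv[1])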

Screen output

C:\>python pandas_data_2.py test_1.csv
   name  age  weight  height    salary
0  John   35     150     170  10000.19
1   Tom   45     170     180   8000.51

   name  age  weight  height    salary
0  JOHN   35     150     170  10000.19
1   TOM   45     170     180   8000.51

   name  age  weight  height    salary
0  JOHN   35     150     170  10015.19
1   TOM   45     170     180   8015.51
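
For simple transformations like these two, the same result can usually be obtained without a per-element Python function. The following is an alternative sketch, not the approach used in the script above: it relies on the .str accessor and on plain arithmetic on a column, and it builds the sample data in memory instead of reading test_1.csv.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Tom"],
    "age": [35, 45],
    "weight": [150, 170],
    "height": [170, 180],
    "salary": [10000.19, 8000.51],
})

# Vectorized string method instead of map(upper_string).
df["name"] = df["name"].str.upper()

# Arithmetic on the whole column instead of map(add_salary).
df["salary"] = df["salary"] + 15

print(df)
```

Series.map remains the more general tool when the transformation cannot be expressed as a built-in vectorized operation.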