Pandas Usage Notes

阿懿_

Last modified 2022-06-15 16:14:39

First published 2021-09-18 17:24:42

Article link: https://blog.csdn.net/Lay_ZRS/article/details/120342613

Basic operations
Checking DataFrame memory usage
Data analysis and visualization with pandas
References

Basic operations

Reading data

Reading data from a CSV file
```
import pandas as pd
df = pd.read_csv("data.csv", encoding="utf-8")
```
If you hit a character-encoding error, change the encoding parameter; for files containing Chinese text, "gbk" often works.
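
The encoding advice above can be sketched as a small fallback loop. The helper name `read_csv_with_fallback`, the candidate encoding list, and the demo file are assumptions made for illustration; they are not part of pandas.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "gbk")):
    """Hypothetical helper: try each encoding in order, return the first success."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error


# Demo: write a small GBK-encoded file; decoding it as UTF-8 fails.
csv_path = os.path.join(tempfile.mkdtemp(), "data_gbk.csv")
with open(csv_path, "w", encoding="gbk") as f:
    f.write("name,score\n张三,90\n李四,85\n")

df_fallback = read_csv_with_fallback(csv_path)
print(df_fallback)
```

Trying "utf-8" first is a deliberate choice here: GBK bytes rarely form valid UTF-8, so a wrong guess tends to fail loudly instead of silently producing garbled text.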

A commonly used helper function for reducing pandas memory usage

import time

import numpy as np
import pandas as pd


def reduce_mem(df):
    """Downcast numeric columns in place to the narrowest dtype that covers their value range."""
    starttime = time.time()
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            # Skip columns whose min/max are NaN (for example, all-NaN columns)
            if pd.isnull(c_min) or pd.isnull(c_max):
                continue
            if str(col_type)[:3] == 'int':
                # Inclusive bounds, so values exactly at a type's limit still fit
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                # Otherwise the column stays int64
            else:
                # Caution: float16/float32 cover the value range but lose precision
                if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    print('-- Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction), time spent: {:2.2f} min'.format(
        end_mem, 100 * (start_mem - end_mem) / start_mem, (time.time() - starttime) / 60))
    return df
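
As an alternative to hand-written range checks, pandas' own `pd.to_numeric(..., downcast=...)` can pick a narrower dtype. The following is a minimal sketch on made-up data, not a replacement for the function above: `downcast="float"` stops at float32 and never goes down to float16.

```python
import numpy as np
import pandas as pd

# Made-up demo data; the column names are hypothetical.
df_demo = pd.DataFrame({
    "small_int": np.arange(100, dtype=np.int64),        # 0..99 fits in int8
    "big_int": np.arange(100, dtype=np.int64) * 1000,   # up to 99000 needs int32
    "ratio": np.arange(100) * 0.25,                     # float64 by default
})

before = df_demo.memory_usage().sum()

# Integers: choose the smallest signed integer dtype that holds every value.
for col in ["small_int", "big_int"]:
    df_demo[col] = pd.to_numeric(df_demo[col], downcast="integer")

# Floats: downcast to float32 where the values allow it.
df_demo["ratio"] = pd.to_numeric(df_demo["ratio"], downcast="float")

after = df_demo.memory_usage().sum()
print(df_demo.dtypes)
print(f"{before} bytes -> {after} bytes")
```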

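Reading from Excel

# sheet_name=None reads every sheet and returns a dict mapping sheet name to DataFrame
sheets = pd.read_excel(path, sheet_name=None)

Reading a specific sheet with pandas:

# sheets.keys() lists all sheet names
df1 = pd.read_excel(path, sheet_name="sheet1")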

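Sampling

In some scenarios you want to experiment on a small subset of the data first to save time. In that case you can sample the data:

SEED = 42  # any fixed integer works; 42 is just an example value
df = df.sample(frac=0.1, random_state=SEED)
# frac sets the sampling fraction
# random_state sets the random seed so the sample is reproducible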
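
A minimal sketch of why fixing the seed matters, using a made-up 100-row frame (the names `df_full` and `SEED` and the seed value are assumptions for the example): two calls with the same `random_state` return the same rows.

```python
import pandas as pd

SEED = 42  # arbitrary example seed
df_full = pd.DataFrame({"uid": range(100), "score": range(100, 200)})

# Same fraction and same seed: identical samples.
sample_a = df_full.sample(frac=0.1, random_state=SEED)
sample_b = df_full.sample(frac=0.1, random_state=SEED)
print(sample_a.equals(sample_b))

# n requests an absolute number of rows instead of a fraction.
sample_n = df_full.sample(n=5, random_state=SEED)
print(len(sample_a), len(sample_n))
```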

Creating a DataFrame

Creating an empty DataFrame

pd.DataFrame(columns=["a", "b", "c"], index=[0, 1, 2])
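
One detail worth checking: when an index is passed, the result is not empty in the pandas sense; it has three rows whose cells are all NaN. A short sketch comparing the two forms (the variable names are illustrative):

```python
import pandas as pd

# With an index: 3 rows x 3 columns, every cell is NaN.
with_index = pd.DataFrame(columns=["a", "b", "c"], index=[0, 1, 2])
# Without an index: 0 rows x 3 columns.
no_rows = pd.DataFrame(columns=["a", "b", "c"])

print(with_index.shape)                   # (3, 3)
print(no_rows.shape)                      # (0, 3)
print(with_index.empty, no_rows.empty)    # False True
print(with_index.isnull().all().all())    # True
```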

Creating a DataFrame from a list

data_list = [[1,2,3], [4,5,6], [7,8,9]]
df = pd.DataFrame(data_list, columns=["a", "b", "c"])

Creating a DataFrame from a dict

data_dict = {"a": [1,2,3], "b": [4, 5, 6], "c": [7, 8, 9]}
df = pd.DataFrame(data_dict)
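
Note that the two snippets above do not build the same table: a nested list is read row by row, while a dict is read column by column. A quick check, reusing the same data under new illustrative variable names:

```python
import pandas as pd

data_list = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
data_dict = {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]}

df_from_list = pd.DataFrame(data_list, columns=["a", "b", "c"])  # each inner list is a row
df_from_dict = pd.DataFrame(data_dict)                           # each dict value is a column

print(df_from_list)
print(df_from_dict)
# The values of one frame are the transpose of the other.
print((df_from_list.to_numpy().T == df_from_dict.to_numpy()).all())
```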

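Iterating over DataFrame rows

Looping over a DataFrame directly with for yields the column labels, not the rows. The following three approaches iterate row by row:

# iterrows(): yields (index, row) pairs, with each row as a Series; relatively slow
# itertuples(): faster than iterrows(); yields one tuple per row, where row[0] is the index
# zip: the fastest, but gives no access to the row index

The zip approach: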

for tup in zip(df["uid"], df["iid"]):
    print(tup)
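
For comparison, here is a minimal sketch of all three approaches on a made-up frame (the `uid`/`iid` column names follow the snippet above; the data and index labels are invented). It also shows that zipping `df.index` in recovers the index when needed.

```python
import pandas as pd

df_iter = pd.DataFrame({"uid": [1, 2, 3], "iid": [10, 20, 30]}, index=["x", "y", "z"])

# iterrows(): each row is a Series, accessed by column label.
rows_iterrows = []
for idx, row in df_iter.iterrows():
    rows_iterrows.append((idx, row["uid"], row["iid"]))

# itertuples(): each row is a namedtuple; row.Index (also row[0]) is the index.
rows_itertuples = []
for row in df_iter.itertuples():
    rows_itertuples.append((row.Index, row.uid, row.iid))

# zip: include df.index explicitly if the index is needed.
rows_zip = list(zip(df_iter.index, df_iter["uid"], df_iter["iid"]))

print(rows_zip)
```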

Saving

Saving to a CSV file:

df.to_csv("data.csv", encoding="utf-8", index=False)

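If you hit a character-encoding error, change the encoding parameter: "gbk" can be used for Chinese text, and "utf_8_sig" (UTF-8 with a byte order mark) helps when the file contains emoji or other special characters and must display correctly in tools such as Excel.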
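
A small round-trip sketch of the "utf_8_sig" option, assuming a temporary file and made-up data: the file starts with the three-byte UTF-8 byte order mark, and reading it back with the same encoding strips the mark again.

```python
import os
import tempfile

import pandas as pd

df_out = pd.DataFrame({"name": ["张三", "李四"], "note": ["ok 😀", "fine"]})
out_path = os.path.join(tempfile.mkdtemp(), "data_bom.csv")
df_out.to_csv(out_path, encoding="utf_8_sig", index=False)

# The first three bytes are the UTF-8 byte order mark.
with open(out_path, "rb") as f:
    first_bytes = f.read(3)
print(first_bytes == b"\xef\xbb\xbf")

# Reading with the same encoding removes the mark, so the column names stay clean.
df_back = pd.read_csv(out_path, encoding="utf_8_sig")
print(list(df_back.columns))
```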

Saving to different sheets of an Excel file

with pd.ExcelWriter("file.xlsx") as writer:
    df1.to_excel(writer, sheet_name="sheet1")
    df2.to_excel(writer, sheet_name="sheet2")

Checking DataFrame memory usage

df.memory_usage()  # per column, in bytes
df.memory_usage().sum()  # total, in bytes
df.info()  # detailed summary, including memory usage
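
By default these calls do not inspect the contents of object (for example, string) columns. A minimal sketch on made-up data showing the `deep=True` option, which measures the actual Python objects and therefore usually reports a larger number for such columns:

```python
import pandas as pd

df_mem = pd.DataFrame({"n": range(1000), "s": ["some text"] * 1000})

shallow = df_mem.memory_usage().sum()         # object columns are not inspected
deep = df_mem.memory_usage(deep=True).sum()   # object columns are measured in full

print(f"shallow: {shallow / 1024**2:.3f} MB, deep: {deep / 1024**2:.3f} MB")

# info() accepts the same idea via memory_usage="deep".
df_mem.info(memory_usage="deep")
```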

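Data analysis and visualization with pandas

Summary statistics

info()
Prints a concise summary of the DataFrame, including the index dtype, the column dtypes, the non-null counts, and the memory usage.
describe()
Generates descriptive statistics. For numeric columns these include the count, mean, standard deviation, minimum, maximum, and quantiles; for categorical columns they include the count, the number of unique values, the most frequent value, and its frequency. The output varies with the data provided.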
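
A short sketch of both kinds of `describe()` output, using a made-up frame with one numeric column and one categorical column (the column names are illustrative):

```python
import pandas as pd

df_stats = pd.DataFrame({
    "score": [1.0, 2.0, 3.0, 4.0],
    "type": ["new", "new", "image", "new"],
})

# On a DataFrame, describe() summarizes the numeric columns by default.
numeric_summary = df_stats.describe()
print(numeric_summary)

# On a non-numeric Series it reports count, unique, top, and freq.
type_summary = df_stats["type"].describe()
print(type_summary)
```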

Counting how often each value occurs in a column

df.type.value_counts()

new    128
image    32
Name: type, dtype: int64
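
If proportions are more useful than raw counts, `value_counts` accepts `normalize=True`. A minimal sketch on a small made-up Series (not the data behind the output above):

```python
import pandas as pd

s_type = pd.Series(["new"] * 4 + ["image"], name="type")

counts = s_type.value_counts()                 # absolute counts, most frequent first
shares = s_type.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares)
```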

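Selecting specific rows and columns

Slicing

# Returns the rows at positions 1 through 9 (the end of the slice is excluded)
print(df[1:10])
# Selects columns "a" and "c"
print(df[["a", "c"]])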

loc[]
```
# Returns the rows with index labels 1 through 9 (both ends included)
print(df.loc[1:9])
```
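
The difference between positional slicing and label-based `loc` is easiest to see when the index is not the default 0, 1, 2, ... The sketch below uses a made-up frame with string labels; `iloc` is shown as the explicit positional counterpart.

```python
import pandas as pd

df_sel = pd.DataFrame({"a": range(6), "c": range(10, 16)}, index=list("uvwxyz"))

by_position = df_sel[1:3]        # positions 1 and 2 -> labels "v", "w"
by_iloc = df_sel.iloc[1:3]       # same rows, explicitly positional
by_label = df_sel.loc["v":"x"]   # labels "v" through "x", end included

print(list(by_position.index))
print(list(by_label.index))
```
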
Selecting rows whose values satisfy a condition
```
print(df[df.type == "new"])
```
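
Conditions can also be combined. The following sketch uses a made-up frame and bracket access (`df["type"]`), which also works when a column name is not a valid attribute or shadows an existing attribute; each condition needs its own parentheses when joined with `&` or `|`.

```python
import pandas as pd

df_f = pd.DataFrame({
    "type": ["new", "image", "new", "video"],
    "uid": [1, 2, 3, 4],
})

new_rows = df_f[df_f["type"] == "new"]
# Combine conditions with & (and) or | (or); wrap each one in parentheses.
new_and_big = df_f[(df_f["type"] == "new") & (df_f["uid"] > 1)]
# isin() tests membership in a list of allowed values.
in_set = df_f[df_f["type"].isin(["new", "image"])]

print(new_and_big)
```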

Checking for missing data

df.isnull().any()
Checks, column by column, whether any value is null

	Unnamed: 0       False
	uid              False
	createtime       False
	iid              False
	dtype: bool

df.isnull().values.sum()
Total number of missing values
df.isnull().sum()
Number of missing values per column
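
The three checks side by side, on a small made-up frame with three missing cells (column names borrowed from the output above where possible; `tag` is invented):

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame({
    "uid": [1, 2, 3],
    "iid": [10.0, np.nan, 30.0],
    "tag": ["a", None, None],
})

has_null = df_na.isnull().any()              # per column: is anything missing?
per_col = df_na.isnull().sum()               # per column: how many are missing?
total = int(df_na.isnull().values.sum())     # whole frame: how many are missing?

print(has_null)
print(per_col)
print(total)
```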

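Filling missing values

df = df.fillna(0)
Fills every missing position with 0. fillna returns a new DataFrame and leaves the original unchanged, so assign the result.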
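
A single fill value is not always appropriate for every column. As a minimal sketch on made-up data, `fillna` also accepts a dict that maps column names to fill values; using the column mean for `score` is just one illustrative choice.

```python
import numpy as np
import pandas as pd

df_fill = pd.DataFrame({
    "score": [1.0, np.nan, 3.0],
    "count": [np.nan, 2.0, np.nan],
})

# One value for every column; the original frame is left unchanged.
filled_all = df_fill.fillna(0)

# A different value per column: the mean for "score", 0 for "count".
filled_per_col = df_fill.fillna({"score": df_fill["score"].mean(), "count": 0})

print(filled_per_col)
```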

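Visualization

Plotting with plot()
Drawing a bar chart and adding data labels

import matplotlib.pyplot as plt

counts = df.type.value_counts()  # compute once; reused for the bars and the labels
plt.figure()
counts.plot(kind="bar")
# Add data labels
for x, y in enumerate(counts):
    # x and y give the label position, by default in data coordinates; "%s" % y is the text to display
    plt.text(x, y, "%s" % y, fontsize=10, ha="center", va="bottom")
plt.title("xx")
plt.xlabel("xxx")
plt.ylabel("xxx")
plt.show()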