Introduction to Basic Pandas Operations

Latest recommended article published on 2024-02-04 11:05:13

珍妮的选择

Category: Python  Tags: python, data analysis, Pandas

Article link: https://blog.csdn.net/Eric_1993/article/details/104326620


This post walks through the basic operations of Pandas, as a quick reference for later lookup.

Loading the required libraries

import warnings
warnings.filterwarnings('ignore')
import os
import sys
from os.path import join, exists
from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# IPython/Jupyter magic; remove this line when running as a plain Python script
%matplotlib inline

Creating the sample data

Assume the following data is stored in a file named data.csv.

日期,年龄,收入,支出
2020-02-02,22,100000.9,20000.6
2020-02-03,18,20800.8,10000.3
2020-02-02,32,508008.8,10300.4
2020-02-02,26,332342.8,101240.3
2020-02-03,24,332344.6,101240.5
2020-02-04,54,134364.2,11240.5
2020-02-03,53,254354.8,11240.0
2020-02-04,13,254234.2,1140.0
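
To follow along, the rows above first have to exist on disk. The snippet below is a minimal sketch for creating the file; the helper name `write_sample_csv` and the choice of UTF-8 are assumptions made for illustration, not part of the original post.

```python
from pathlib import Path

import pandas as pd

# The same eight rows listed above, including the Chinese header line.
SAMPLE_CSV = """日期,年龄,收入,支出
2020-02-02,22,100000.9,20000.6
2020-02-03,18,20800.8,10000.3
2020-02-02,32,508008.8,10300.4
2020-02-02,26,332342.8,101240.3
2020-02-03,24,332344.6,101240.5
2020-02-04,54,134364.2,11240.5
2020-02-03,53,254354.8,11240.0
2020-02-04,13,254234.2,1140.0
"""


def write_sample_csv(path="data.csv"):
    """Write the sample rows to `path` as UTF-8 and return the path (hypothetical helper)."""
    target = Path(path)
    target.write_text(SAMPLE_CSV, encoding="utf-8")
    return target


csv_path = write_sample_csv()
# Read the file back as a quick sanity check: 8 data rows, 4 columns.
df_check = pd.read_csv(csv_path, encoding="utf-8")
print(df_check.shape)
```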

Reading data

The functions below read data files, so that the contents of a csv or excel file can be loaded into pandas.

Reading data from a file

def read_raw_csv_file(file_name):
    """
    Read a csv file directly, without any preprocessing.
    """
    df = pd.read_csv(file_name, encoding='utf-8')
    return df

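def preprocess(raw_csv_file, output_file):
    """
    Work around encoding problems: decode each line as GB2312,
    strip trailing commas, and write the result out as UTF-8.
    """
    print('>>>[Preprocess] Preprocessing Raw CSV File: {}'.format(raw_csv_file))
    with open(raw_csv_file, 'rb') as f:
        # Write UTF-8 explicitly so that read_raw_csv_file (encoding='utf-8') can read the result
        with open(output_file, 'w', encoding='utf-8') as out:
            for line in f:
                line = line.decode('gb2312').strip().rstrip(',')
                out.write('{}\n'.format(line))
    print('>>>[Preprocess] Saving CSV File To: {}'.format(output_file))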

def read_csv_with_preprocess(raw_csv_file, tmp_file='tmp.csv', rm_tmp=True):
    """
    Read a raw csv file, running the preprocessing step first.
    The preprocessing exists to fix Chinese content that cannot be read correctly.
    """
    preprocess(raw_csv_file, tmp_file)
    df = read_raw_csv_file(tmp_file)
    if rm_tmp:
        os.remove(tmp_file)
    return df

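def read_xlsx_file(file_name):
    """
    Read an xlsx file; in the usual case only the first sheet is needed.
    """
    # sheet_name=None loads every sheet into a dict keyed by sheet name
    ordered_dict_dfs = pd.read_excel(file_name, sheet_name=None)
    # Keep only the first sheet
    dfs = list(ordered_dict_dfs.values())[0]
    return dfs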
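
As an alternative to rewriting the file, `pd.read_csv` can decode a GB2312 file itself through its `encoding` argument. The sketch below assumes a small made-up file and does not strip the trailing commas that `preprocess` removes, so it only covers the encoding part of the problem.

```python
import os
import tempfile

import pandas as pd

# Build a small GB2312-encoded file to stand in for a legacy export (made-up data).
text = "日期,年龄\n2020-02-02,22\n2020-02-03,18\n"
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(text.encode("gb2312"))
    gb_path = tmp.name

# Let pandas do the decoding; no intermediate UTF-8 file is needed.
df_gb = pd.read_csv(gb_path, encoding="gb2312")
os.remove(gb_path)
print(df_gb.columns.tolist())
```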

## After the data is loaded, look at the first five rows or the last three rows
file_name = 'data.csv'
df = read_raw_csv_file(file_name)
df.head() # same as df.head(5)

	日期	年龄	收入	支出
0	2020-02-02	22	100000.9	20000.6
1	2020-02-03	18	20800.8	10000.3
2	2020-02-02	32	508008.8	10300.4
3	2020-02-02	26	332342.8	101240.3
4	2020-02-03	24	332344.6	101240.5

## Look at the last three rows
df.tail(3)

	日期	年龄	收入	支出
5	2020-02-04	54	134364.2	11240.5
6	2020-02-03	53	254354.8	11240.0
7	2020-02-04	13	254234.2	1140.0

## Sometimes a csv file has no header (in the example above, the file would lack the line "日期,年龄,收入,支出").
## To read such a file correctly, pass `header=None`. Here `data1.csv` is `data.csv` with that first line removed.
df = pd.read_csv('data1.csv', header=None, encoding='utf-8')
## Displaying df shows that the DataFrame columns are 0 to 3
df.head(3)

	0	1	2	3
0	2020-02-02	22	100000.9	20000.6
1	2020-02-03	18	20800.8	10000.3
2	2020-02-02	32	508008.8	10300.4

## The columns of df can be renamed
df.columns = ['Date', 'Age', 'Income', 'Expense']
df.head(3)

	Date	Age	Income	Expense
0	2020-02-02	22	100000.9	20000.6
1	2020-02-03	18	20800.8	10000.3
2	2020-02-02	32	508008.8	10300.4
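
The two steps above (read with `header=None`, then assign `df.columns`) can also be done in one call by passing `names` to `pd.read_csv`. A minimal sketch, using an in-memory buffer in place of `data1.csv`:

```python
import io

import pandas as pd

# Two header-less rows standing in for data1.csv.
headerless = io.StringIO(
    "2020-02-02,22,100000.9,20000.6\n"
    "2020-02-03,18,20800.8,10000.3\n"
)

# `names` supplies the column labels while the file is being parsed.
df_named = pd.read_csv(
    headerless,
    header=None,
    names=['Date', 'Age', 'Income', 'Expense'],
)
print(df_named.columns.tolist())
```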

Reading data from memory

## Build a DataFrame from a list; note that the "Income" column is deliberately given as strings
data_list = [
    ['2020-02-02',22,'100000.9',20000.6],
    ['2020-02-03',18,'20800.8',10000.3],
    ['2020-02-02',32,'508008.8',10300.4],
]

df = pd.DataFrame(data_list, columns=['Date', 'Age', 'Income', 'Expense'], index=['A', 'B', 'C'])
df.head()

	Date	Age	Income	Expense
A	2020-02-02	22	100000.9	20000.6
B	2020-02-03	18	20800.8	10000.3
C	2020-02-02	32	508008.8	10300.4

## Check the dtype of each column

## Method 1:
print(df.dtypes)

## Method 2:
print(list(map(lambda x: x.dtype, [df[col] for col in df.columns])))

## Method 3 (`iteritems` was removed in pandas 2.0, so `items` is used here):
print(list(map(lambda x: x[1].dtype, df.items())))

Date        object
Age          int64
Income      object
Expense    float64
dtype: object
[dtype('O'), dtype('int64'), dtype('O'), dtype('float64')]
[dtype('O'), dtype('int64'), dtype('O'), dtype('float64')]

## Change the dtype of the Income column
df['Income'] = df['Income'].astype(np.float64)
df.dtypes

Date        object
Age          int64
Income     float64
Expense    float64
dtype: object
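
`astype` raises an error if any string cannot be parsed. When the data may be dirty, `pd.to_numeric` with `errors='coerce'` is a common alternative, and `pd.to_datetime` does the same job for date strings. The sketch below uses a made-up bad value (`"n/a"`) to show the effect.

```python
import pandas as pd

df_types = pd.DataFrame({
    "Date": ["2020-02-02", "2020-02-03"],
    "Income": ["100000.9", "n/a"],  # "n/a" is a made-up bad value
})

# Unparseable strings become NaN instead of raising an error.
df_types["Income"] = pd.to_numeric(df_types["Income"], errors="coerce")
# Turn the date strings into real datetime values.
df_types["Date"] = pd.to_datetime(df_types["Date"])
print(df_types.dtypes)
```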

## Build a DataFrame from a dict
data_dict = {
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
}
df = pd.DataFrame(data_dict)
df.head()

	A	B	C
0	1	4	7
1	2	5	8
2	3	6	9
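
In the dict example the keys become column labels. If the keys should become row labels instead, `pd.DataFrame.from_dict` with `orient='index'` is one option. A short sketch; the column names `x`, `y`, `z` are placeholders chosen for illustration.

```python
import pandas as pd

data_dict = {
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9],
}

# orient='index' turns each dict key into a row label.
df_rows = pd.DataFrame.from_dict(data_dict, orient='index', columns=['x', 'y', 'z'])
print(df_rows)
```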

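Accessing data

Iterating over the rows and columns of a DataFrame

## 1. Use `items` to iterate over the columns of a DataFrame; it yields (column name, Series) pairs.
##    (`iteritems` did the same in older versions but was removed in pandas 2.0.)
## 2. Use `iterrows` or `itertuples` to iterate over the rows; the former yields (index, Series) pairs,
##    the latter yields namedtuples.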
df = pd.DataFrame(
    [
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ],
    columns=['A', 'B', 'C'],
    index=['a', 'b', 'c']
)

## Iterate over the columns with items (called iteritems before pandas 2.0)
for label, content in df.items():
    print('label: {}'.format(label))
    print('content:\n{}'.format(content))
    print('content (without index):\n{}'.format(content.to_string(index=False, header=False)))
    break

label: A
content:
a    1
b    4
c    7
Name: A, dtype: int64
content (without index):
1
4
7

## Iterate over the rows with iterrows
for label, content in df.iterrows():
    print('label: {}'.format(label))
    print('content:\n{}'.format(content))
    print('content (without index):\n{}'.format(content.to_string(index=False, header=False).replace('\n', '  ')))
    break

label: a
content:
A    1
B    2
C    3
Name: a, dtype: int64
content (without index):
1  2  3

## Iterate over the rows with itertuples
for tp in df.itertuples(index=True, name='Pandas'):
    print(tp)
    print(getattr(tp, 'A'), getattr(tp, 'B'), getattr(tp, 'C'))

Pandas(Index='a', A=1, B=2, C=3)
1 2 3
Pandas(Index='b', A=4, B=5, C=6)
4 5 6
Pandas(Index='c', A=7, B=8, C=9)
7 8 9
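
When neither the index nor the field names are needed, `itertuples` can also yield plain tuples by passing `index=False` and `name=None`. A minimal sketch on the same 3x3 frame:

```python
import pandas as pd

df_it = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=['A', 'B', 'C'],
    index=['a', 'b', 'c'],
)

# index=False drops the index field; name=None yields regular tuples instead of namedtuples.
rows = list(df_it.itertuples(index=False, name=None))
print(rows)
```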

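Row and column operations

Reference: "pandas: deleting rows, deleting columns, adding rows, adding columns"

Adding rows

Use the DataFrame append method to add a pd.Series as a new row; append usually needs the ignore_index=True argument. Note that append was removed in pandas 2.0, where pd.concat takes its place;
Use the DataFrame loc or at accessors to add a new row.
Add rows one at a time
Insert a row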

df = pd.DataFrame(np.arange(9).reshape(3, 3), index=list('abc'), columns=list('ABC'))

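## `DataFrame.append` was removed in pandas 2.0, so `pd.concat` is used here to add a pd.Series as one row.
## The pd.Series needs index=df.columns so that its values line up with the columns.
## (With the old `append`, a pd.Series without a name required ignore_index=True,
## otherwise an error was raised.)
new_row = pd.Series([9, 0, 1], index=df.columns, name=4)
df = pd.concat([df, new_row.to_frame().T], ignore_index=True)
df.head()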

	A	B	C
0	0	1	2
1	3	4	5
2	6	7	8
3	9	0	1
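
Adding rows one call at a time copies the whole frame on every call. When several rows arrive together, a common pattern is to collect them first and concatenate once. The sketch below assumes the new rows are available as a list of dicts; the values are made up.

```python
import numpy as np
import pandas as pd

df_base = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list('ABC'))

# Collect the new rows first (made-up values) ...
new_rows = [
    {'A': 9, 'B': 0, 'C': 1},
    {'A': 2, 'B': 3, 'C': 4},
]

# ... then build one DataFrame from them and concatenate a single time.
df_all = pd.concat([df_base, pd.DataFrame(new_rows)], ignore_index=True)
print(df_all)
```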

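## Using loc or at
## Note: `at` is meant for reading and writing single values; whether whole-row assignment through `at`
## works as shown depends on the pandas version, so `loc` is the safer choice.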
df = pd.DataFrame(np.arange(9).reshape(3, 3), index=list('abc'), columns=list('ABC'))
df.loc[4] = [9, 0, 1]
df.at[5] = [0, 9, 1]
df.head()

	A	B	C
a	0.0	1.0	2.0
b	3.0	4.0	5.0
c	6.0	7.0	8.0
4	9.0	0.0	1.0
5	0.0	9.0	1.0

## Adding rows one at a time
## Caveat: len(df) is an int, so if that integer already exists as an index label in df, the existing row is overwritten
df = pd.DataFrame(np.arange(9).reshape(3, 3), index=list('abc'), columns=list('ABC'))
df.loc[len(df)] = [5, 6, 7]
df.head()

	A	B	C
a	0	1	2
b	3	4	5
c	6	7	8
3	5	6	7

## Inserting a row
## For example, insert a row above 'c':
## reindex first, then fill in the data
df = pd.DataFrame(np.arange(9).reshape(3, 3), index=list('abc'), columns=list('ABC'))
df = df.reindex(index=df.index.insert(2, 4)) # 'c' is the third row, i.e. position 2
df.loc[4] = [5, 6, 6]
df.head()

	A	B	C
a	0.0	1.0	2.0
b	3.0	4.0	5.0
4	5.0	6.0	6.0
c	6.0	7.0	8.0
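
The reindex approach first creates a row of NaN values, which is why the output above shows floats. One possible alternative is to split the frame at the target position and concatenate the pieces around the new row, which keeps the integer dtype. This is a sketch, not the original author's method.

```python
import numpy as np
import pandas as pd

df_src = pd.DataFrame(np.arange(9).reshape(3, 3), index=list('abc'), columns=list('ABC'))

# The new row as a one-row DataFrame with the label 4.
new_row = pd.DataFrame([[5, 6, 6]], index=[4], columns=df_src.columns)

# Position of 'c' (2 here); everything before it goes above the new row.
pos = df_src.index.get_loc('c')
df_ins = pd.concat([df_src.iloc[:pos], new_row, df_src.iloc[pos:]])
print(df_ins)
```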

Deleting rows

Use drop; the axis parameter selects whether rows or columns are dropped. axis defaults to 0 (rows).

See also drop_duplicates() for removing duplicate rows.

df = pd.DataFrame(np.arange(9).reshape(3, 3), index=list('abc'), columns=list('ABC'))
df.drop(df.index[[0, 2]], axis=0, inplace=True) # drop the first and third rows in place; axis defaults to 0
df.head()

	A	B	C
b	3	4	5

df = pd.DataFrame(np.arange(9).reshape(3, 3), index=list('abc'), columns=list('ABC'))
df.drop(['a', 'b'], inplace=True) # drop the first and second rows in place; axis defaults to 0
df.head()

	A	B	C
c	6	7	8
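
Since `drop_duplicates()` is recommended above for deduplication, here is a short sketch of how it behaves. The data is made up; `subset` and `keep` control which columns are compared and which of the duplicates survives.

```python
import pandas as pd

df_dup = pd.DataFrame({
    'Date': ['2020-02-02', '2020-02-02', '2020-02-03'],
    'Age': [22, 22, 18],
})

# Rows that are identical in every column are reduced to their first occurrence.
deduped = df_dup.drop_duplicates()

# Compare only the Date column and keep the last row for each date.
last_per_date = df_dup.drop_duplicates(subset=['Date'], keep='last')

print(deduped.index.tolist())
print(last_per_date.index.tolist())
```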

Adding columns

Compute a sequence by iterating over the DataFrame (no column is added)
Add a column with [ ] or loc (the common way)
Add a column with insert

## Compute a sequence from the DataFrame; no column is added here
df = pd.DataFrame(np.arange(9).reshape(3, 3), index=list('abc'), columns=list('ABC'))
s = [a + c for a, c in zip(df['A'], df['C'])]          # plain iteration over two columns; s is a list
print(s)
s = [row['A'] + row['C'] for i, row in df.iterrows()]  # via iterrows(); s is a list
print(s)
s = df.apply(lambda row: row['A'] + row['C'], axis=1)  # via apply; s is a Series
print(s)
s = df['A'] + df['C']                                  # vectorized Series addition; s is a Series
print(s)
s = df['A'].values + df['C'].values                    # vectorized NumPy addition; s is an ndarray
print(s)

[2, 8, 14]
[2, 8, 14]
a     2
b     8
c    14
dtype: int64
a     2
b     8
c    14
dtype: int64
[ 2  8 14]

## Add columns with [] or loc
df = pd.DataFrame(np.arange(9).reshape(3, 3), index=list('abc'), columns=list('ABC'))
df['D'] = df['A'] + df['B']
df.loc[:, 'E'] = df['A'] + df['B']
df.head()

	A	B	C	D	E
a	0	1	2	1	1
b	3	4	5	7	7
c	6	7	8	13	13

## Add a column with insert, which places it at a given position
df = pd.DataFrame(np.arange(9).reshape(3, 3), index=list('abc'), columns=list('ABC'))
df.insert(2, 'D', [3, 4, 5])
df.head()

	A	B	D	C
a	0	1	3	2
b	3	4	4	5
c	6	7	5	8
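
All three approaches above modify `df` in place. When the original frame should stay untouched, `DataFrame.assign` returns a new frame with the extra column. A minimal sketch:

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.arange(9).reshape(3, 3), index=list('abc'), columns=list('ABC'))

# assign returns a copy with the new column; df_a itself is not changed.
df_b = df_a.assign(D=lambda d: d['A'] + d['B'])

print(df_a.columns.tolist())
print(df_b.columns.tolist())
```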

Deleting columns

Use drop; the axis parameter selects whether rows or columns are dropped.

df = pd.DataFrame(np.arange(9).reshape(3, 3), index=list('abc'), columns=list('ABC'))
df.drop(df.columns[[0, 2]], axis=1, inplace=True) # drop the first and third columns in place; axis defaults to 0, so axis=1 is required here
df.head()

	B
a	1
b	4
c	7

df = pd.DataFrame(np.arange(9).reshape(3, 3), index=list('abc'), columns=list('ABC'))
df.drop(['A', 'C'], axis=1, inplace=True) # drop the first and third columns in place; axis defaults to 0, so axis=1 is required here
df.head()

	B
a	1
b	4
c	7
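
`drop` also accepts the keyword arguments `columns=` and `index=`, which say directly what is being removed without an `axis` value. The sketch below leaves out `inplace=True`, so the original frame is kept and a new one is returned.

```python
import numpy as np
import pandas as pd

df_c = pd.DataFrame(np.arange(9).reshape(3, 3), index=list('abc'), columns=list('ABC'))

# Same effect as drop(['A', 'C'], axis=1), but returns a new frame.
df_cols = df_c.drop(columns=['A', 'C'])
# Same effect as drop(['a'], axis=0).
df_rows_dropped = df_c.drop(index=['a'])

print(df_cols.columns.tolist())
print(df_rows_dropped.index.tolist())
```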

Grouping data with Groupby

## Create the data
data = np.array([['2020-02-02', 18, 100000.9, 20000.6],
           ['2020-02-03', 18, 20800.8, 10000.3],
           ['2020-02-02', 20, 508008.8, 10300.4],
           ['2020-02-02', 18, 332342.8, 101240.3],
           ['2020-02-03', 18, 332344.6, 101240.5],
           ['2020-02-04', 20, 134364.2, 11240.5],
           ['2020-02-03', 20, 254354.8, 11240.0],
           ['2020-02-04', 18, 254234.2, 1140.0]], dtype=object)
df = pd.DataFrame(data, columns=['Date', 'Age', 'Income', 'Expense'])
df.head(3)

	Date	Age	Income	Expense
0	2020-02-02	18	100001	20000.6
1	2020-02-03	18	20800.8	10000.3
2	2020-02-02	20	508009	10300.4

## Group by Date
t = df.groupby(['Date']).sum()
t.head()

	Age	Income	Expense
Date
2020-02-02	56	940352.5	131541.3
2020-02-03	56	607500.2	122480.8
2020-02-04	38	388598.4	12380.5

## Group by Date and Age
t = df.groupby(['Date', 'Age']).sum()
t.head()

		Income	Expense
Date	Age
2020-02-02	18	432343.7	121240.9
2020-02-02	20	508008.8	10300.4
2020-02-03	18	353145.4	111240.8
2020-02-03	20	254354.8	11240.0
2020-02-04	18	254234.2	1140.0
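
`sum()` applies the same aggregation to every column. When different columns need different aggregations, `agg` with named aggregation is one way to express that. The sketch below uses simplified, made-up numbers rather than the data above, and the output column names are placeholders.

```python
import pandas as pd

df_g = pd.DataFrame({
    'Date': ['2020-02-02', '2020-02-03', '2020-02-02', '2020-02-03'],
    'Age': [18, 18, 20, 20],
    'Income': [100.0, 200.0, 300.0, 400.0],
})

# Each keyword names an output column and maps to (input column, aggregation).
summary = df_g.groupby('Date').agg(
    total_income=('Income', 'sum'),
    mean_age=('Age', 'mean'),
    n_rows=('Income', 'count'),
)
print(summary)
```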

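Turning the groupby result into a flat DataFrame

In the examples above, the grouping keys end up in the index of the aggregated result (a MultiIndex when grouping by several columns). In some situations we want the keys as ordinary columns of a flat DataFrame instead; for that, pass as_index=False to groupby.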

t = df.groupby(['Date', 'Age'], as_index=False).sum()
t.head()

	Date	Age	Income	Expense
0	2020-02-02	18	432343.7	121240.9
1	2020-02-02	20	508008.8	10300.4
2	2020-02-03	18	353145.4	111240.8
3	2020-02-03	20	254354.8	11240.0
4	2020-02-04	18	254234.2	1140.0
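
If the aggregation has already been computed with the keys in the index, calling `reset_index()` on the result gives the same flat layout. A minimal sketch with made-up numbers, comparing the two routes:

```python
import pandas as pd

df_r = pd.DataFrame({
    'Date': ['2020-02-02', '2020-02-02', '2020-02-03'],
    'Age': [18, 18, 20],
    'Income': [100.0, 50.0, 300.0],
})

# Keys live in a MultiIndex after a plain groupby + sum.
keyed = df_r.groupby(['Date', 'Age']).sum()
# reset_index moves them back into ordinary columns.
flat = keyed.reset_index()
# as_index=False produces the flat layout directly.
direct = df_r.groupby(['Date', 'Age'], as_index=False).sum()

print(flat)
```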

Custom requirements

Printing a DataFrame without the row number/index

Reference: https://stackoverflow.com/questions/52396477/printing-a-pandas-dataframe-without-row-number-index

df = pd.DataFrame(
    [
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ],
    columns=['A', 'B', 'C'],
    index=['a', 'b', 'c']
)
## 1. Printed this way, neither the index nor the header is shown
print(df.to_string(index=False, header=False))

## 2. Use df.values to get a numpy.array, then convert it with tolist()
print(df.values.tolist())
print('\n'.join(['  '.join(map(str, row)) for row in df.values.tolist()]))

    1  2  3
    4  5  6
    7  8  9
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    1  2  3
    4  5  6
    7  8  9
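
A third option, offered here as a sketch, is `to_csv` without a path: it returns the text as a string, and `sep` sets the delimiter. Tab is an arbitrary choice in this example.

```python
import pandas as pd

df_p = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=['A', 'B', 'C'],
    index=['a', 'b', 'c'],
)

# With no path given, to_csv returns a string; index and header are both left out.
text = df_p.to_csv(sep='\t', index=False, header=False)
print(text, end='')
```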
