Python: The Pandas Data Analysis Library

Reference: Pandas: a powerful Python data analysis toolkit

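Pandas is a core Python data analysis library. It provides fast, flexible, and expressive data structures designed to make working with relational and labeled data simple and intuitive.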

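Pandas is well suited to the following kinds of data:

  • Tabular data with heterogeneously typed columns, as in an SQL table or an Excel spreadsheet;
  • Ordered and unordered (not necessarily fixed-frequency) time series data;
  • Matrix data with row and column labels, whether homogeneously or heterogeneously typed;
  • Any other form of observational or statistical data set. The data does not need to be labeled before it is placed into a Pandas data structure.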

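The two primary data structures in Pandas are Series (one-dimensional) and DataFrame (two-dimensional). Between them they handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame offers richer functionality than R's data.frame.

Pandas is built on top of NumPy and is designed to integrate well with other third-party scientific computing libraries.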
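As a minimal sketch of that NumPy relationship (the variable names `s`, `roots`, and `arr` are illustrative and not from the original), a Series can be passed straight to a NumPy function or converted to a plain array:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=['a', 'b', 'c'])

# NumPy universal functions accept a Series and return a Series,
# keeping the index labels.
roots = np.sqrt(s)

# to_numpy() exposes the values as a plain NumPy array.
arr = s.to_numpy()

print(roots)
print(type(arr), arr)
```

The labels survive the NumPy call, which is usually the reason to keep data in a Series rather than dropping to a bare array.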

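Data structures

Dimensions  Name       Description
1           Series     Labeled, one-dimensional, homogeneously typed array
2           DataFrame  Labeled, size-mutable, two-dimensional table with heterogeneously typed columns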
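The "homogeneous" versus "heterogeneous" distinction in the table can be checked directly. The following is only a small sketch; the names `frame`, `qty`, `ratio`, and `label` are made up for illustration:

```python
import pandas as pd

# A Series holds values of a single dtype.
s = pd.Series([1, 2, 3])
print(s.dtype)

# A DataFrame can hold a different dtype in each column.
frame = pd.DataFrame({'qty': [1, 2], 'ratio': [0.5, 1.5]})
print(frame.dtypes)

# "Size-mutable": columns can be added after construction.
frame['label'] = ['x', 'y']
print(frame.shape)  # (2, 3)
```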

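Working with data generally falls into several stages: wrangling and cleaning the data, analyzing and modeling it, and then visualizing or tabulating the results. Pandas is an ideal tool for this kind of work.

Basic operations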

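import numpy as np
import pandas as pd

# Object creation
# Create a Series with the default integer index
a = pd.Series([1, 2, 3, 4])
# print(a)
# print(a.values)  # the values of the Series
# print(a.index)   # the index of the Series

# Create a Series with an explicit index
a2 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
# print(a2)
# print(a2[['a', 'b', 'c']])  # select several entries by label

# print(2 in a)  # `in` tests the index labels, not the values

# Create a Series from a dict; the keys become the index
dic1 = {'a': 5, 'b': 3, 'r': 2}
a3 = pd.Series(dic1)
# print(a3)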

# Create a DataFrame from a NumPy array, with a datetime index and labeled columns
dates = pd.date_range('20200801', periods=6)
# print(dates)

# The values are random, so they differ on every run
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# print(df)

# Create a DataFrame from a dict of Series-like objects:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20200801'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
print(df2)
print("````````````````````````````````````````")
# Viewing data
# Show the first rows (5 by default) and the last 3 rows of the DataFrame
print(df.head())
print("````````````````````````````````````````")
print(df.tail(3))
print("````````````````````````````````````````")
# Show the index and the column names
print(df.index)
print("````````````````````````````````````````")
print(df.columns)
print("````````````````````````````````````````")
# Show a quick statistical summary of the data
print(df.describe())
print("````````````````````````````````````````")
# Transpose
print(df.T)
print("````````````````````````````````````````")
# Sort by axis (here: column labels in descending order):
print(df.sort_index(axis=1, ascending=False))
print("````````````````````````````````````````")

# Sort by the values in column B
print(df.sort_values(by='B'))


   

Output:

     A          B    C  D      E    F
0  1.0 2020-08-01  1.0  3   test  foo
1  1.0 2020-08-01  1.0  3  train  foo
2  1.0 2020-08-01  1.0  3   test  foo
3  1.0 2020-08-01  1.0  3  train  foo
````````````````````````````````````````
                   A         B         C         D
2020-08-01  0.236278  1.103618 -0.397312 -0.061134
2020-08-02 -0.479919 -0.133846  0.252811  0.439448
2020-08-03  2.183216 -1.582146 -1.572844 -0.832187
2020-08-04  1.995344 -0.053535 -1.593044  0.110176
2020-08-05  0.934193  0.344933  2.016774  0.348698
````````````````````````````````````````
                   A         B         C         D
2020-08-04  1.995344 -0.053535 -1.593044  0.110176
2020-08-05  0.934193  0.344933  2.016774  0.348698
2020-08-06  2.209469 -0.795880 -0.387335  1.173128
````````````````````````````````````````
DatetimeIndex(['2020-08-01', '2020-08-02', '2020-08-03', '2020-08-04',
               '2020-08-05', '2020-08-06'],
              dtype='datetime64[ns]', freq='D')
````````````````````````````````````````
Index(['A', 'B', 'C', 'D'], dtype='object')
````````````````````````````````````````
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   1.179763 -0.186143 -0.280158  0.196355
std    1.134672  0.925899  1.340140  0.658485
min   -0.479919 -1.582146 -1.593044 -0.832187
25%    0.410756 -0.630372 -1.278961 -0.018307
50%    1.464768 -0.093691 -0.392323  0.229437
75%    2.136248  0.245316  0.092774  0.416760
max    2.209469  1.103618  2.016774  1.173128
````````````````````````````````````````
   2020-08-01  2020-08-02  2020-08-03  2020-08-04  2020-08-05  2020-08-06
A    0.236278   -0.479919    2.183216    1.995344    0.934193    2.209469
B    1.103618   -0.133846   -1.582146   -0.053535    0.344933   -0.795880
C   -0.397312    0.252811   -1.572844   -1.593044    2.016774   -0.387335
D   -0.061134    0.439448   -0.832187    0.110176    0.348698    1.173128
````````````````````````````````````````
                   D         C         B         A
2020-08-01 -0.061134 -0.397312  1.103618  0.236278
2020-08-02  0.439448  0.252811 -0.133846 -0.479919
2020-08-03 -0.832187 -1.572844 -1.582146  2.183216
2020-08-04  0.110176 -1.593044 -0.053535  1.995344
2020-08-05  0.348698  2.016774  0.344933  0.934193
2020-08-06  1.173128 -0.387335 -0.795880  2.209469
````````````````````````````````````````
                   A         B         C         D
2020-08-03  2.183216 -1.582146 -1.572844 -0.832187
2020-08-06  2.209469 -0.795880 -0.387335  1.173128
2020-08-02 -0.479919 -0.133846  0.252811  0.439448
2020-08-04  1.995344 -0.053535 -1.593044  0.110176
2020-08-05  0.934193  0.344933  2.016774  0.348698
2020-08-01  0.236278  1.103618 -0.397312 -0.061134
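The Series calls in the script are commented out, so their results do not appear in the output above. The following is a minimal sketch of what they return, rebuilt from the same three Series. The `4 in a` and `4 in a.values` checks are added here for contrast and are not part of the original script.

```python
import pandas as pd

a = pd.Series([1, 2, 3, 4])                               # default index 0..3
a2 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])  # explicit labels
a3 = pd.Series({'a': 5, 'b': 3, 'r': 2})                  # dict keys become the index

print(list(a.index))                 # [0, 1, 2, 3]
print(a2[['a', 'b', 'c']].tolist())  # [1, 2, 3]
print(list(a3.index))                # ['a', 'b', 'r']

# `in` tests index labels, not values.
print(2 in a)         # True: 2 is one of the labels 0..3
print(4 in a)         # False: 4 is a value, but not a label
print(4 in a.values)  # True: test the values explicitly
```

Testing membership against `a.values` is one way to ask about the data rather than the labels.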


Missing values: Pandas primarily uses np.nan to represent missing data. By default, missing values are excluded from computations.
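A short sketch of that default, using a made-up three-element Series: the reductions skip the NaN entry unless `skipna=False` is passed.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

print(s.sum())               # 4.0, the NaN is skipped
print(s.mean())              # 2.0, averaged over the two non-missing values
print(s.count())             # 2, count() only counts non-missing entries
print(s.mean(skipna=False))  # nan, NaN propagates when skipping is disabled
print(s.isna().tolist())     # [False, True, False]
```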
