Jupyter Notebook: pandas

莫迟_

Article link: https://blog.csdn.net/qq_39457834/article/details/100128170

Pandas

Python Data Analysis

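pandas is a data analysis package for Python. Development began at AQR Capital Management in April 2008, and the project was open-sourced at the end of 2009. It is now developed and maintained by the PyData development team, which focuses on Python data packages, and it is part of the PyData project. pandas was originally built as a tool for financial data analysis, so it offers strong support for time series analysis. The name pandas comes from panel data and Python data analysis. Panel data is an economics term for multidimensional data sets; early versions of pandas also provided a Panel data type, which has since been removed.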

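1. Built on NumPy and Matplotlib, wrapping both libraries.
2. Provides two data types, Series and DataFrame (a Series is both a sequence and a hash table; a DataFrame treats each Series as a column).
3. Makes reading files much simpler.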

In [6]:

# The three musketeers of data analysis

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

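Reading a CSV file with pd

filepath_or_buffer    path to the file

sep=','               delimiter

header='infer'        row that holds the column names; inferred by default, pass None if the file has no header row

names=None            column names to use when the file has none; must be a sequence (['name','sex','age'])

engine=None           parser engine, {'c', 'python'} (the C engine is faster; the python engine supports more features)

skiprows=None         number of rows to skip at the start of the file

nrows=None            number of rows to read

skipfooter=0          number of rows to skip at the end of the file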
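To make the parameters above concrete, here is a minimal sketch. The CSV text and the column names are made up for illustration, and io.StringIO stands in for a real file path so the example runs without any file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content without a header row (stands in for a real file).
raw = "Tom,M,31\nAnn,F,28\nBob,M,45\nEve,F,39\n"

people = pd.read_csv(
    io.StringIO(raw),
    sep=",",
    header=None,                   # the data has no header row
    names=["name", "sex", "age"],  # so the column names are supplied explicitly
    skiprows=1,                    # skip the first line (Tom)
    nrows=2,                       # then read only two rows (Ann and Bob)
)

print(people)
```

Combining skiprows and nrows like this is one simple way to look at a slice of a large file without loading all of it.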

In [7]:

#DataFrame

# Automated operations: a system log can be a single 60 GB file

# so the data flow has to be controlled

AAPL = pd.read_csv('AAPL.csv',sep=',')

In [3]:

# Show a summary of the data, including how much memory it occupies

AAPL.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9759 entries, 0 to 9758
Data columns (total 7 columns):
Date         9759 non-null object
Open         9758 non-null float64
High         9758 non-null float64
Low          9758 non-null float64
Close        9758 non-null float64
Adj Close    9758 non-null float64
Volume       9758 non-null float64
dtypes: float64(6), object(1)
memory usage: 533.8+ KB

In [8]:

AAPL.iloc[:1000,:-1].plot(figsize=(18,10))

Out[8]:

<matplotlib.axes._subplots.AxesSubplot at 0x269703f21d0>

Series

A Series is an object similar to a one-dimensional array. It consists of two parts:

values: a one-dimensional ndarray
index: a one-dimensional sequence of labels

In [11]:

nd1 = np.random.randint(0,10,10)

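Passing copy=True when creating a Series makes a deep copy of the data.

When nd1 is an ndarray such as array([1,2,3]), the Series is a shallow copy by default in the pandas version used here (only the reference is copied, not the underlying data).
When nd1 is a dict ({}), or when copy=True is passed, the data is deep-copied.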
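A small sketch of the copy behaviour described above. Only the copy=True case is stated as a fact here; whether the default constructor shares memory with the source array depends on the pandas version, so the example prints that result instead of assuming it. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

nd = np.array([1, 2, 3])

s_copy = pd.Series(nd, copy=True)  # explicit deep copy of the data
s_default = pd.Series(nd)          # default: sharing depends on the pandas version

nd[0] = 1000  # change the source array afterwards

print(s_copy.iloc[0])  # still 1, because the copy is independent of nd
# True means the default Series still points at the same memory as nd.
print(np.shares_memory(nd, s_default.to_numpy()))
```

If the Series must never change when the source array changes, passing copy=True is the explicit way to say so.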

In [10]:

S = pd.Series(nd1,index=list('abcdefghij'),copy=True)

Out[10]:

a    9
b    6
c    2
d    6
e    7
f    5
g    8
h    7
i    8
j    3
dtype: int32

In [7]:

# A Series can share memory with its source array (pass by reference)

nd1[0] = 1000

S[0]

Out[7]:

In [8]:

# attribute access: object.attribute

S.a

Out[8]:

In [9]:

#key

S['a']

Out[9]:

In [10]:

# positional index (newer pandas versions prefer S.iloc[0])

S[0]

Out[10]:
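The three access styles in the cells above can be put side by side in a short sketch. The values are fixed rather than random so the result is predictable, and .iloc is used for the positional lookup as an assumption about what works across pandas versions.

```python
import pandas as pd

s = pd.Series([9, 6, 2], index=list("abc"))

by_attribute = s.a       # attribute access, works when the label is a valid identifier
by_key = s["a"]          # dictionary-style access by label
by_position = s.iloc[0]  # positional access, unambiguous even with integer labels

print(by_attribute, by_key, by_position)  # 9 9 9
```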

Indexing with a sequence

In [11]:

nd1

Out[11]:

array([1000,    6,    2,    7,    5,    4,    9,    1,    3,    2])

In [12]:

nd1[[0,1,0]]

Out[12]:

array([1000,    6, 1000])

In [13]:

S[[0,1,2,0,1]]

Out[13]:

a    6
b    6
c    2
a    6
b    6
dtype: int64

Conditional queries

A conditional query returns indices

In [14]:

np.where(nd1>5) # returns a tuple

Out[14]:

(array([0, 1, 3, 6]),)

In [15]:

indexs = np.argwhere(nd1>5).ravel()  # returns a 2-D array; ravel flattens it to 1-D

In [16]:

nd1[indexs]

Out[16]:

array([1000,    6,    7,    9])

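pandas has no direct counterpart to these NumPy search functions here, but because pandas is built on NumPy they can be applied to a Series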

In [17]:

cond = np.argwhere(S > 5).ravel()

In [18]:

S[cond]

Out[18]:

a    6
b    6
d    7
g    9
dtype: int64

Using broadcasting, the comparison returns boolean values

In [19]:

bls = S>5

In [20]:

# keep only the elements where the mask is True

S[bls]

Out[20]:

a    6
b    6
d    7
g    9
dtype: int64
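The two approaches shown above, integer positions and a boolean mask, give the same rows. Here is a minimal sketch with fixed values; np.flatnonzero on the underlying array and .iloc are my substitutions for np.argwhere(...).ravel() and plain brackets, chosen because they should behave the same across pandas versions.

```python
import numpy as np
import pandas as pd

# Fixed values instead of random ones, so the result is reproducible.
s = pd.Series([1000, 6, 2, 7, 5, 4, 9, 1, 3, 2], index=list("abcdefghij"))

mask = s > 5       # element-wise comparison gives a boolean Series
by_mask = s[mask]  # boolean indexing keeps the entries where mask is True

positions = np.flatnonzero(s.to_numpy() > 5)  # integer positions of the True entries
by_position = s.iloc[positions]               # positional lookup with those integers

print(by_mask.index.tolist())  # ['a', 'b', 'd', 'g']
```

The boolean mask is usually the shorter form, since it needs no detour through integer positions.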

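Explicit and implicit indexing

Explicit indexing, .loc[]: accepts only label-based (associative) indices; slices are closed intervals; suited to looking up a specific value
Implicit indexing, .iloc[]: accepts only position-based (enumerated) indices; slices are half-open intervals; suited to looking up a range of values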

In [45]:

# discrete: label-based (associative)

S['a':'j']

. . .

In [46]:

# contiguous: position-based (enumerated)

S[0:]

. . .

In [37]:

S.loc['a']

Out[37]:

In [44]:

S.iloc[0:]

. . .
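Since the outputs above are collapsed, here is a small sketch of the closed versus half-open slicing rule, using a made-up four-element Series.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=list("abcd"))

closed = s.loc["a":"c"]  # label slice: both endpoints are included
half_open = s.iloc[0:2]  # positional slice: the stop position is excluded

print(closed.tolist())     # [10, 20, 30]
print(half_open.tolist())  # [10, 20]
```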

DataFrame

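A DataFrame is a table-like two-dimensional data structure with rows (index) and columns (columns). It is made up of multiple Series, and each column is one Series.

dtypes: the data type of each column
columns: the column names
index: the row labels
shape: the shape
values: the values, returned as a two-dimensional array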

In [50]:

AAPL.dtypes

Out[50]:

Date          object
Open         float64
High         float64
Low          float64
Close        float64
Adj Close    float64
Volume       float64
dtype: object

In [65]:

AAPL.columns

Out[65]:

Index(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')

In [75]:

AAPL.index

Out[75]:

RangeIndex(start=0, stop=9759, step=1)

In [77]:

AAPL.shape

Out[77]:

(9759, 7)

In [78]:

AAPL.values

Out[78]:

array([['1980-12-12', 0.513393, 0.515625, ..., 0.513393,
        0.40897100000000003, 117258400.0],
       ['1980-12-15', 0.488839, 0.488839, ..., 0.48660699999999996,
        0.387633, 43971200.0],
       ['1980-12-16', 0.453125, 0.453125, ..., 0.45089300000000004,
        0.359183, 26432000.0],
       ...,
       ['2019-08-22', 213.190002, 214.440002, ..., 212.460007,
        212.460007, 22253700.0],
       ['2019-08-23', 209.429993, 212.05000299999998, ..., 202.639999,
        202.639999, 46818000.0],
       ['2019-08-26', 205.860001, 207.190002, ..., 206.490005,
        206.490005, 26043600.0]], dtype=object)

Indexing a DataFrame

head() returns the first 5 rows by default
tail() returns the last 5 rows by default

In [85]:

AAPL.head()

Out[85]:

	Date	Open	High	Low	Close	Adj Close	Volume
0	1980-12-12	0.513393	0.515625	0.513393	0.513393	0.408971	117258400.0
1	1980-12-15	0.488839	0.488839	0.486607	0.486607	0.387633	43971200.0
2	1980-12-16	0.453125	0.453125	0.450893	0.450893	0.359183	26432000.0
3	1980-12-17	0.462054	0.464286	0.462054	0.462054	0.368074	21610400.0
4	1980-12-18	0.475446	0.477679	0.475446	0.475446	0.378743	18362400.0

In [94]:

# indexing a DataFrame with square brackets selects columns by name

# a string key returns a Series

# a sequence of keys returns a DataFrame

AAPL['Date']

. . .

In [97]:

# slicing a DataFrame with square brackets slices rows

# the result is always a DataFrame

AAPL[0:100]

. . .
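The bracket rules from the last two cells, sketched on a tiny made-up frame (the column names only imitate the AAPL data):

```python
import pandas as pd

prices = pd.DataFrame({"Open": [1.0, 2.0, 3.0], "Close": [1.5, 2.5, 3.5]})

one_column = prices["Open"]   # string key: a Series
sub_frame = prices[["Open"]]  # list of keys: a DataFrame
first_rows = prices[0:2]      # slice: rows 0 and 1, as a DataFrame

print(type(one_column).__name__, type(sub_frame).__name__, first_rows.shape)
# Series DataFrame (2, 2)
```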

Explicit and implicit indexing on a DataFrame (rows first, then columns)

In [113]:

AAPL

. . .

In [107]:

# If a DataFrame has no label-based index, explicit indexing with loc uses the default integer labels

AAPL.loc[0,'Date']

Out[107]:

'1980-12-12'

In [119]:

AAPL.loc[:10,:'High']

. . .

In [115]:

AAPL.iloc[:10,:2]

. . .

Creating a DataFrame

pd.DataFrame(data,index,columns)

In [196]:

df = pd.DataFrame(data={'A':[1,2,3,4,5],'B':[5,6,7,8,9],'C':[11,12,13,14,15]},

                  index=['西毒','绿帝','北丐','东邪','中李野'],

                 columns=['A','B','C','D'])

df

Out[196]:

	A	B	C	D
西毒	1	5	11	NaN
绿帝	2	6	12	NaN
北丐	3	7	13	NaN
东邪	4	8	14	NaN
中李野	5	9	15	NaN

In [139]:

df1 = pd.DataFrame(np.random.randint(1,100,size=(5,3)),columns=['体育','音乐','生物'])

df1

Out[139]:

	体育	音乐	生物
0	73	14	53
1	32	39	44
2	97	87	15
3	1	32	37
4	44	1	86

In [146]:

# select the 生物 (biology) column

df1['生物']

df1.iloc[:,2]

df1.loc[:,'生物']

df1.生物

Out[146]:

0    53
1    44
2    15
3    37
4    86
Name: 生物, dtype: int64

In [148]:

# the second row of the 音乐 (music) column

df1.loc[1,'音乐']

Out[148]:

DataFrame CRUD

Create:

df.loc['李晶'] = ['东北','男',1045,np.nan]
Delete:
df.drop(labels=['中李野','李晶'],axis=0,inplace=True)
Update:
indexs = df.query("A == '东北'").index
s = df.loc[indexs]
s['D'] = 'ABC'
df.loc[indexs] = s
df

Read:
df.query("B == '男' | C>5 ")

In [162]:

# C: insert (create)

df

Out[162]:

	A	B	C	D
西毒	1	5	11	NaN
绿帝	2	6	12	NaN
北丐	3	7	13	NaN
东邪	4	8	14	NaN
中李野	5	9	15	NaN
李晶	东北	男	1045	NaN

In [197]:

# df.insert()

# insert a row

df.loc['李晶'] = ['东北','男',1045,np.nan]

In [198]:

# query: B == '男' and C > 5

# condition on column B: B == '男'

cond = df.loc[:,'B'] == '男'

In [216]:

# apply the first condition, then test the second one on the filtered rows

cond_c = df[cond].loc[:,'C'] > 5

In [217]:

df[cond][cond_c]

Out[217]:

	A	B	C	D
李晶	东北	男	1045	NaN

In [12]:

#query()

df.query("B == '男' | C>5 ")

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-12-1c83fd6081c3> in <module>()
      1 #query()
----> 2 df.query("B == '男' | C>5 ")

NameError: name 'df' is not defined
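The traceback above only says that df was not defined in that session. As a hedged sketch of what query() does with combined conditions, here is a small made-up frame; it is not the df from the cells above.

```python
import pandas as pd

people = pd.DataFrame(
    {"B": ["M", "F", "M"], "C": [3, 10, 1045]},
    index=["r1", "r2", "r3"],
)

both = people.query("B == 'M' and C > 5")   # rows that satisfy both conditions
either = people.query("B == 'M' or C > 5")  # rows that satisfy at least one condition

# The same "and" filter written with boolean masks instead of query().
same_as_both = people[(people["B"] == "M") & (people["C"] > 5)]

print(both.index.tolist())    # ['r3']
print(either.index.tolist())  # ['r1', 'r2', 'r3']
```

Writing and / or inside the query string makes it explicit which of the two combinations is meant.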

Update

Find the index of the data to change
Use the index to locate the row
Modify the data
Assign the modified row (res) back to the same row of the table (df.loc[indexs])

In [236]:

#update

# equivalent to SQL: set D='ABC' where A = '东北'

indexs = df.query('A == "东北"').index

In [242]:

res = df.loc[indexs]

In [244]:

res['D'] = 'ABC'

In [247]:

df.loc[indexs] = res

In [248]:

df

Out[248]:

	A	B	C	D
西毒	1	5	11	NaN
绿帝	2	6	12	NaN
北丐	3	7	13	NaN
东邪	4	8	14	NaN
中李野	5	9	15	NaN
李晶	东北	男	1045	ABC
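The four update steps can also be collapsed into a single .loc assignment with a boolean mask. This is an alternative sketch, not the method used in the cells above, and the frame is a made-up one with string columns so that no dtype conversion is involved.

```python
import pandas as pd

table = pd.DataFrame(
    {"A": ["a", "东北", "c"], "D": ["-", "-", "-"]},
    index=["r1", "r2", "r3"],
)

mask = table["A"] == "东北"  # rows to update (the WHERE part)
table.loc[mask, "D"] = "ABC"  # set column D on those rows (the SET part)

print(table["D"].tolist())  # ['-', 'ABC', '-']
```

Assigning through .loc on the original frame avoids working on an intermediate copy of the rows.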

In [252]:

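# delete (manually, by rebuilding the frame without the unwanted row)

a = df.iloc[0:4].copy()  # copy so the assignment below does not write to a view of df

b = df.iloc[-1]

a.loc[b.name] = b.values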

In [264]:

#drop()

# labels=None    the row or column labels to drop

# axis=0 drops rows, axis=1 drops columns

# inplace=False  by default the original object is left unchanged

df.drop(labels=['中李野','李晶'],axis=0,inplace=True)

In [265]:

df

Out[265]:

	A	B	C	D
西毒	1	5	11	NaN
绿帝	2	6	12	NaN
北丐	3	7	13	NaN
东邪	4	8	14	NaN
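A short sketch of the drop() parameters listed above, on a made-up frame, showing the difference between the default behaviour and inplace=True.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["x", "y", "z"])

without_row = frame.drop(labels=["z"], axis=0)  # returns a new frame; frame is unchanged
without_col = frame.drop(labels=["B"], axis=1)  # axis=1 drops columns instead of rows

frame.drop(labels=["x"], axis=0, inplace=True)  # modifies frame itself and returns None

print(frame.index.tolist())  # ['y', 'z']
```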

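Detecting missing values

MySQL uses null, Python uses None, and pandas data uses NaN (Not a Number), which is a float

isnull() checks whether elements are missing
notnull() checks whether elements are not missing
dropna() drops the rows or columns that contain NaN
fillna() fills in missing values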

In [270]:

type(np.nan)

Out[270]:

float

In [275]:

# any arithmetic between NaN and a number returns NaN

np.nan - np.array([1,2,3,4,5])

Out[275]:

array([nan, nan, nan, nan, nan])

In [291]:

# iterating over a DataFrame iterates over its columns by default

for c in df:

    display(df[c].isnull())

. . .

In [294]:

# by row: for i in df.index:

#     display(df.loc[i])

# by column

for i in df.columns:

    print(df.loc[:,i])

. . .

In [300]:

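#dropna()

# should rows or columns be dropped?

# a row holds the information of one sample

# a column holds one attribute across all samples

# if a row has too many missing values, drop the row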

df.dropna(axis=0, how='any')

Out[300]:

	A	B	C	D
西毒	1	5	11	1
东邪	4	8	14	1

In [312]:

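#fillna()

# method: how to fill, axis: the axis to fill along

# method is one of {'backfill', 'bfill', 'pad', 'ffill', None}

# (newer pandas versions deprecate method= in favour of df.ffill() / df.bfill())

df.fillna(method='ffill',axis=1)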

Out[312]:

	A	B	C	D
西毒	1	5	11	1
绿帝	2	6	12	12
北丐	3	7	13	13
东邪	4	8	14	1

In [325]:

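# fill missing values with the column mean

# assign the result back instead of using inplace=True on a selected column

df['D'] = df['D'].fillna(value=df['D'].mean())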

In [326]:

df

Out[326]:

	A	B	C	D
西毒	1	5	11	1.0
绿帝	2	6	12	1.0
北丐	3	7	13	1.0
东邪	4	8	14	1.0

In [331]:

df.iloc[1].median()

Out[331]:

4.0
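Because the notebook cells above were run out of order, their outputs do not line up with one another. The following self-contained sketch runs the same missing-value tools on a made-up frame; df.ffill(axis=1) is used as the equivalent of fillna(method='ffill', axis=1).

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "D": [1.0, np.nan, np.nan, 3.0]},
    index=["r1", "r2", "r3", "r4"],
)

missing_per_column = scores.isnull().sum()   # number of NaN values in each column
rows_kept = scores.dropna(axis=0, how="any")  # drop every row that contains a NaN
filled_forward = scores.ffill(axis=1)         # fill each NaN from the column to its left
filled_mean = scores.fillna({"D": scores["D"].mean()})  # fill D with its mean, 2.0

print(rows_kept.index.tolist())   # ['r1', 'r4']
print(filled_mean["D"].tolist())  # [1.0, 2.0, 2.0, 3.0]
```

None of these calls changes scores itself; each one returns a new object.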

Connecting to a database with pandas

In [349]:

import pymysql

conn = pymysql.connect(host='localhost',port=3306,user='root',passwd='123456',db='python')

cursor = conn.cursor(cursor=pymysql.cursors.DictCursor)

cursor.execute('select * from userinfo1')

res = cursor.fetchall()

cursor.close()

conn.close()

In [336]:

pd.DataFrame(data=res)

Out[336]:

	id	name	pwd
0	1	张三	123456
1	2	赵四	123456
2	3	王五	123456
3	4	赵六	123456
4	5	鬼脚七	123456

sqlalchemy

In [13]:

from sqlalchemy import create_engine

In [14]:

db_info = dict(

    user='root',

    password='123456',

    host='192.168.1.215',

    port=3306,

    database='a',

    charset='utf8',

)

In [17]:

import pymysql

engine=create_engine('mysql+pymysql://{user}:{password}@{host}:{port}/{database}?charset={charset}'.format(**db_info))

In [21]:

da = pd.read_sql('abc',engine)

Saving the table data to a file

In [ ]:

da.to_csv(path_or_buf='userinfo1.csv')

Saving data from a file into the database

In [ ]:

AAPL.to_sql('aapl2',engine,index=False)
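The cells above need a running MySQL server. To try the same round trip without one, here is a sketch that swaps in an in-memory SQLite database from the standard library; the table name userinfo_demo and the sample rows are made up, and with a plain sqlite3 connection read_sql is given a SQL query instead of a table name.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")  # throwaway database, nothing is written to disk

users = pd.DataFrame({"id": [1, 2], "name": ["alice", "bob"]})

users.to_sql("userinfo_demo", conn, index=False)  # DataFrame -> table
loaded = pd.read_sql("SELECT * FROM userinfo_demo ORDER BY id", conn)  # table -> DataFrame

conn.close()

print(loaded)
```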
