Notes compiled from parts of the pandas Cookbook

pandas Cookbook (http://pandas.pydata.org/pandas-docs/stable/tutorials.html)

1   Reading your data into pandas is pretty much the easiest thing. Even when the encoding is wrong!

1.1 Reading a CSV file

import pandas as pd

fixed_df = pd.read_csv('../data/bikes.csv', sep=';', encoding='latin1', parse_dates=['Date'], dayfirst=True, index_col='Date')

1.2 Selecting a column (works like a dictionary lookup; the column name goes inside the square brackets)

fixed_df['Berri 1']

1.3 Plotting a column

fixed_df['Berri 1'].plot()  # plot a single column
fixed_df.plot(figsize=(15, 10))  # plot all columns
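Outside a Jupyter notebook (where %matplotlib inline is not available), a minimal sketch assuming matplotlib is installed needs an explicit show() call:

import matplotlib.pyplot as plt

fixed_df['Berri 1'].plot()  # draws onto a matplotlib figure
plt.show()                  # needed in a plain script to actually display the figure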

2   It’s not totally obvious how to select data from a pandas dataframe. Here we explain the basics (how to take slices and get columns)

2.1 Selecting a column or a slice of rows

complaints = pd.read_csv('../data/311-service-requests.csv')
complaints['Complaint Type']       # the column name goes inside the square brackets
complaints[:5]                     # first 5 rows of the whole dataframe
complaints['Complaint Type'][:5]   # first 5 rows of one column; equivalent to complaints[:5]['Complaint Type']

2.2 Selecting multiple columns

complaints[['Complaint Type', 'Borough']]        # pass a list of column names
complaints[['Complaint Type', 'Borough']][:10]   # first 10 rows

2.3 Counting how many times each value appears in a column

complaint_counts = complaints['Complaint Type'].value_counts()
complaint_counts[:10].plot(kind='bar')  # bar chart of the 10 most common complaint types

3  Here we get into serious slicing and dicing and learn how to filter dataframes in complicated ways, really fast.

3.1 Filtering rows where a column equals a specific value

is_noise = complaints['Complaint Type'] == "Noise - Street/Sidewalk"  # boolean Series of True/False
in_brooklyn = complaints['Borough'] == "BROOKLYN"
complaints[is_noise & in_brooklyn][:5]  # first 5 rows where both conditions are True

3.2 numpy and Series

import numpy as np

pd.Series([1, 2, 3]).values  # output: array([1, 2, 3])
arr = np.array([1, 2, 3])
arr != 2                     # output: array([ True, False,  True])
arr[arr != 2]                # output: array([1, 3]) -- keeps only the positions that are True

3.3 Boolean conditions and filtering

is_noise = complaints['Complaint Type'] == "Noise - Street/Sidewalk"  # boolean mask of True/False
noise_complaints = complaints[is_noise]                               # the rows where the mask is True
noise_complaints['Borough'].value_counts()

4  Groupby/aggregate is seriously my favorite thing about pandas and I use it all the time. You should probably read this.

4.1 Getting the day of the month and day of the week from dates, and adding a new column to the dataframe

berri_bikes.index           # DatetimeIndex of dates [2012-01-01, ..., 2012-11-05]
berri_bikes.index.day       # day of the month for each date: array([ 1,  2,  3, ..., 4, 5], dtype=int32)
berri_bikes.index.weekday   # day of the week for each date: array([6, 0, 1, ..., 6, 0], dtype=int32); 0 is Monday
berri_bikes.loc[:, 'weekday'] = berri_bikes.index.weekday  # add a new 'weekday' column
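The groupby/aggregate step the chapter is named after is not shown above; a minimal sketch of it, assuming berri_bikes holds one numeric column of daily ride counts:

weekday_counts = berri_bikes.groupby('weekday').sum()  # total rides for each day of the week
weekday_counts.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']  # rename 0..6
weekday_counts.plot(kind='bar')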

5  Web scraping with pandas is fun! Here we combine dataframes.

5.1 Reading data from a URL (.read_csv)

url_template = "http://climate.weather.gc.ca/climateData/bulkdata_e.html?format=csv&stationID=5415&Year={year}&Month={month}"
url = url_template.format(year=2012, month=3)  # fill the placeholders to request March 2012 only; loop over the months to fetch the whole year (see the sketch below)
weather_mar2012 = pd.read_csv(url, skiprows=15, index_col='Date/Time', parse_dates=True, encoding='latin1', header=0)
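To fetch the whole year, a sketch along the lines of the cookbook (download_weather_month is an illustrative helper, assuming every month's CSV has the same layout as March):

def download_weather_month(year, month):
    url = url_template.format(year=year, month=month)
    return pd.read_csv(url, skiprows=15, index_col='Date/Time',
                       parse_dates=True, encoding='latin1', header=0)

weather_2012 = pd.concat([download_weather_month(2012, m) for m in range(1, 13)])  # stack the 12 months into one dataframe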

5.2 Dropping columns or rows that contain NaN (dropna)

weather_mar2012 = weather_mar2012.dropna(axis=1, how='any')  # axis=1 drops columns, axis=0 drops rows; how='any' drops it if any value is null

5.3 Dropping whole rows or columns by name

weather_mar2012 = weather_mar2012.drop(['Year', 'Month', 'Day', 'Time', 'Data Quality'], axis=1)  # list of labels to drop; axis=1 for columns, axis=0 for rows

5.4 Extracting the hour from the timestamps

temperatures = weather_mar2012[[u'Temp (C)']].copy()
temperatures.loc[:, 'Hour'] = weather_mar2012.index.hour  # the index holds timestamps like 2012-03-01 01:00:00
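A typical use of that Hour column, as in the cookbook, is to group by it and look at the median temperature for each hour of the day; a minimal sketch:

temperatures.groupby('Hour').median().plot()  # median Temp (C) at each hour of the day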

6  Strings with pandas are great. It has all these vectorized string operations and they’re the best. We will turn a bunch of strings into vectors of numbers in a trice.

6.1 Monthly statistics with .resample

weather_2012['Temp (C)'].resample('M').median()  # median temperature for each month (newer pandas syntax; older versions wrote resample('M', how=np.median))
is_snowing.astype(float).resample('M').mean()    # fraction of the time it was snowing each month; astype(float) turns True into 1 and False into 0
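is_snowing is not defined in these notes; in the cookbook it comes from a vectorized string operation on the 'Weather' column, roughly:

is_snowing = weather_2012['Weather'].str.contains('Snow')  # True whenever the weather description mentions snow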

7  Cleaning up messy data is never a joy, but with pandas it’s easier.

7.1 Checking whether the data is clean

requests['Incident Zip'].unique()  # unique() lists every distinct value, which makes bad entries easy to spot
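Once the bad values are visible, one way to clean them up (a sketch following the cookbook, assuming the zips should be treated as strings) is:

na_values = ['NO CLUE', 'N/A', '0']  # known junk values to treat as NaN
requests = pd.read_csv('../data/311-service-requests.csv',
                       na_values=na_values, dtype={'Incident Zip': str})
requests['Incident Zip'] = requests['Incident Zip'].str.slice(0, 5)  # truncate 9-digit zips like '12345-6789' to 5 characters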

8  Parsing Unix timestamps is confusing at first but it turns out to be really easy

8.1 Converting integer timestamps to datetimes with to_datetime()

popcon['atime'] = popcon['atime'].astype(int)                # first cast to integers such as 1387295796
popcon['atime'] = pd.to_datetime(popcon['atime'], unit='s')  # 1387295796 becomes 2013-12-17 15:56:37

8.2 Using the '~' (negation) operator

nonlibraries = popcon[~popcon['package-name'].str.contains('lib')]  # rows whose package-name does not contain 'lib'

9  Reading data from SQL databases (SQLite: import sqlite3; MySQL: import MySQLdb)

9.1 Reading data from a SQL database

con = sqlite3.connect("../data/weather_2012.sqlite")  # connect to a SQLite database
con = MySQLdb.connect(host="localhost", db="test")    # or connect to a MySQL database instead
df = pd.read_sql("SELECT * from weather_2012 LIMIT 3", con, index_col='id')  # index_col uses a database column ('id' here) as the index; omit it to get a fresh default index

9.2 Writing data to a SQL database

weather_df = pd.read_csv('../data/weather_2012.csv')
con = sqlite3.connect("../data/test_db.sqlite")
con.execute("DROP TABLE IF EXISTS weather_2012")  # start from a clean table
weather_df.to_sql("weather_2012", con)            # write the dataframe into the database
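An equivalent, slightly more idiomatic sketch relies on to_sql's if_exists parameter instead of a manual DROP TABLE:

weather_df.to_sql("weather_2012", con, if_exists='replace')  # drop and recreate the table before inserting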

 

 
