前言
python必知必会的数据处理知识1(pandas))
python必知必会的数据处理知识2(pandas))
pandas中高频函数详细说明
之前有文章介绍pandas这个包,以上为对应链接。本文使用pandas实现SQL的语言功能,可以当做是pandas的练习。
建议本文的使用方法:
首先看SQL这一列想要取出来的数据,然后在jupyter上使用pandas操作取出对应数据。最后与答案进行核对。
首先准备数据:
所有数据从该链接下载: http://ourairports.com/data/
import pandas as pd
airports = pd.read_csv('data/airports.csv')
airport_freq = pd.read_csv('data/airport-frequencies.csv')
runways = pd.read_csv('data/runways.csv')
1. SELECT, WHERE, DISTINCT, LIMIT
在SQL中的一些SELECT命令:我们使用where过滤数据,使用limit截断数据,使用distinct移除重复数据。那么在于pandas的对应中:
SQL | Pandas |
---|---|
select * from airports | airports |
select * from airports limit 3 | airports.head(3) |
select id from airports where ident = ‘KLAX’ | airports[airports.ident==‘KLAX’].id |
select distinct type from airport | airports.type.unique() |
2. SELECT with multiple conditions
SQL | Pandas |
---|---|
select * from airports where iso_region = ‘US-CA’ and type=‘seaplane_base’ | airports[(airports.iso_region == ‘US-CA’) & (airports.type==‘seaplane_base’)] |
select ident, name, municipality from airports where iso_region = ‘US-CA’ and type=‘large_airport’ | airports[(airports.iso_region == ‘US-CA’) & (airports.type==‘large_airport’)][[‘ident’,‘name’,‘municipality’]] |
3. ORDER BY
默认pandas会按照升序排序。关键函数sort_values()
SQL | Pandas |
---|---|
select * from airport_freq where airport_ident = ‘KLAX’ order by type | airports_freq[airports_freq.airport_ident == ‘KLAX’].sort_values(‘type’) |
select * from airport_freq where airport_ident = ‘KLAX’ order by type desc | airports_freq[airports_freq.airport_ident == ‘KLAX’].sort_values(‘type’, ascending=False) |
4. IN … NOT IN
pandas的.isin()
能够同样起到效果。取反的话,使用 ~
SQL | Pandas |
---|---|
select * from airport where type in (‘heliport’, ‘balloonport’) | airports[airports.type.isin([‘heliport’, ‘balloonport’])] |
select * from airports where type not in (‘heliport’, ‘balloonport’) | airports[~airports.type.isin([‘heliport’, ‘balloonport’])] |
5. GROUP BY, COUNT, ORDER BY
group by 很直接,在pandas中直接用.groupby()
count 有区别,pandas的count统计的是NaN value 的个数,如果想实现SQL中同样的语义,应该使用.size()
SQL | Pandas |
---|---|
select iso_country, type, count(*) from airports group by iso_country, type order by iso_country, type | airports.groupby([‘iso_country’, ‘type’]).size() |
select iso_country, type, count(*) from airports group by iso_country, type order by iso_country, count(*) desc | airports.groupby([‘iso_country’,‘type’]).size().to_frame(‘size’).reset_index().sort_values([‘iso_country’,‘size’],ascending=[True, False]) |
第二个稍微复杂些,这里做些详细解释:
因为我们想要使用size的大小排序,所以.size()
输出的结果需要成为DataFrame数据结构的一部分。而.groupby()
操作之后,我们得到的是一个GroupByObject的数据类型。所以我们需要使用.to_frame()
将其转换为DataFrame,使用.reset_index()
重新设置索引。
6. HAVING
SQL在过滤数据的时候,可以使用having
Pandas 中可以使用 .filter()
并提供func(lambda)
SQL | Pandas |
---|---|
select type, count(*) from airports where iso_country = ‘US’ group by type having count(*) > 1000 order by count(*) desc | airports[airports.iso_country == ‘US’].groupby(‘type’).filter(lambda g: len(g) > 1000).groupby(‘type’).size().sort_values(ascending=False) |
7. Top N records
by_country = airports.groupby(['iso_country']).size().to_frame('airport_count').reset_index()
首先通过如上命令获取一个by_country
的dataframe
如下例子中,首先依据count选取最多的十个国家;然后选出第十一到第二十的国家。
SQL | Pandas |
---|---|
select iso_country from by_country order by size desc limit 10 | by_country.nlargest(10, columns=‘airport_count’) |
select iso_country from by_country order by size desc limit 10 offset 10 | by_country.nlargest(20, columns=‘airport_count’).tail(10) |
8. aggregate functions(MIN, MAX, MEAN)
SQL | Pandas |
---|---|
select max(length_ft), min(length_ft),avg(length_ft), median(length_ft) from runways | runways.agg({‘length_ft’:[‘min’,‘max’,‘mean’,‘median’]}).T |
9. JOIN
Pandas对应的是.merge()函数,之前的文章有详细说明函数具体参数的含义。
SQL | Pandas |
---|---|
select airport_ident, type, description, frequency_mhz from airport_freq join airports on airport_freq.airport_ref = airports.id where airports.ident=‘KLAX’ | airports_freq.merge(airports[airports.ident == ‘KLAX’][[‘id’]], left_on=‘airport_ref’,right_on=‘id’,how=‘inner’)[[‘airport_ident’,‘type’,‘description’,‘frequency_mhz’]] |
10. UNION ALL and UNION
使用pd.concat()
进行对应操作。
SQL | Pandas |
---|---|
select name, municipality from airports where ident = ‘KLAX’ union all select name, municipality from airports where ident=‘KLGB’ | pd.concat([airports[airports.ident == ‘KLAX’][[‘name’,‘municipality’]], airports[airports.ident==‘KLGB’][[‘name’,‘municipality’]]]) |
11. INSERT
SQL | Pandas |
---|---|
create table heroes(id integer, name text); | df1 = pd.DataFrame({‘id’:[1,2],‘name’:[‘harry potter’, ‘ron weasley’]}) |
insert into heroes values(3, ‘hermione granger’) | pd.concat([df1, pd.DataFrame({‘id’:[3],‘name’:[‘hermione granger’]})]).reset_index(drop=True) |
12. UPDATE
SQL | Pandas |
---|---|
update airports set home_link=‘http://www.lawa.org/welcomelax.aspx’ where ident=='KLAX | airports.loc[airports.ident==‘KLAX’,‘home_link’] = ‘http://www.lawa.org/welcomelax.aspx’ |
13. DELETE
SQL | Pandas |
---|---|
delete from lax_freq where type=‘MISC’ | lax_freq=lax_freq[lax_freq[‘type’] !=‘MISC’] |
lax_freq.drop(lax_freq[lax_freq.type==‘MISC’].index) |
两种方法,一种直接重写,一种使用drop
14. and more
导出到其格式:
df.to_csv(...) # csv file
df.to_hdf(...) # HDF5 file
df.to_pickle(...) # serialized object
df.to_sql(...) # to SQL database
df.to_excel(...) # to Excel sheet
df.to_json(...) # to JSON string
df.to_html(...) # render as HTML table
df.to_feather(...) # binary feather-format
df.to_latex(...) # tabular environment table
df.to_stata(...) # Stata binary data files
df.to_msgpack(...) # msgpack (serialize) object
df.to_gbq(...) # to a Google BigQuery table.
df.to_string(...) # console-friendly tabular output.
df.to_clipboard(...) # clipboard that can be pasted into Excel
绘图:
top_10.plot(
x='iso_country',
y='airport_count',
kind='barh',
figsize=(10, 7),
title='Top 10 countries with most airports')