[pandas练习册] 如何在pandas中重写SQL命令实现同样的数据查询效果

前言

python必知必会的数据处理知识1(pandas))
python必知必会的数据处理知识2(pandas))
pandas中高频函数详细说明

之前有文章介绍pandas这个包,以上为对应链接。本文使用pandas实现SQL的语言功能,可以当做是pandas的练习。

建议本文的使用方法:

首先看SQL这一列想要取出来的数据,然后在jupyter上使用pandas操作取出对应数据。最后与答案进行核对。

首先准备数据:

所有数据从该链接下载: http://ourairports.com/data/

import pandas as pd
airports = pd.read_csv('data/airports.csv')
airport_freq = pd.read_csv('data/airport-frequencies.csv')
runways = pd.read_csv('data/runways.csv')

1. SELECT, WHERE, DISTINCT, LIMIT

在SQL中的一些SELECT命令:我们使用where过滤数据,使用limit截断数据,使用distinct移除重复数据。那么在于pandas的对应中:

SQLPandas
select * from airportsairports
select * from airports limit 3airports.head(3)
select id from airports where ident = ‘KLAX’airports[airports.ident==‘KLAX’].id
select distinct type from airportairports.type.unique()

2. SELECT with multiple conditions

SQLPandas
select * from airports where iso_region = ‘US-CA’ and type=‘seaplane_base’airports[(airports.iso_region == ‘US-CA’) & (airports.type==‘seaplane_base’)]
select ident, name, municipality from airports where iso_region = ‘US-CA’ and type=‘large_airport’airports[(airports.iso_region == ‘US-CA’) & (airports.type==‘large_airport’)][[‘ident’,‘name’,‘municipality’]]

3. ORDER BY

默认pandas会按照升序排序。关键函数sort_values()

SQLPandas
select * from airport_freq where airport_ident = ‘KLAX’ order by typeairports_freq[airports_freq.airport_ident == ‘KLAX’].sort_values(‘type’)
select * from airport_freq where airport_ident = ‘KLAX’ order by type descairports_freq[airports_freq.airport_ident == ‘KLAX’].sort_values(‘type’, ascending=False)

4. IN … NOT IN

pandas的.isin() 能够同样起到效果。取反的话,使用 ~

SQLPandas
select * from airport where type in (‘heliport’, ‘balloonport’)airports[airports.type.isin([‘heliport’, ‘balloonport’])]
select * from airports where type not in (‘heliport’, ‘balloonport’)airports[~airports.type.isin([‘heliport’, ‘balloonport’])]

5. GROUP BY, COUNT, ORDER BY

group by 很直接,在pandas中直接用.groupby()

count 有区别,pandas的count统计的是NaN value 的个数,如果想实现SQL中同样的语义,应该使用.size()

SQLPandas
select iso_country, type, count(*) from airports group by iso_country, type order by iso_country, typeairports.groupby([‘iso_country’, ‘type’]).size()
select iso_country, type, count(*) from airports group by iso_country, type order by iso_country, count(*) descairports.groupby([‘iso_country’,‘type’]).size().to_frame(‘size’).reset_index().sort_values([‘iso_country’,‘size’],ascending=[True, False])

第二个稍微复杂些,这里做些详细解释:

因为我们想要使用size的大小排序,所以.size()输出的结果需要成为DataFrame数据结构的一部分。而.groupby()操作之后,我们得到的是一个GroupByObject的数据类型。所以我们需要使用.to_frame()将其转换为DataFrame,使用.reset_index()重新设置索引。

6. HAVING

SQL在过滤数据的时候,可以使用having
Pandas 中可以使用 .filter()并提供func(lambda)

SQLPandas
select type, count(*) from airports where iso_country = ‘US’ group by type having count(*) > 1000 order by count(*) descairports[airports.iso_country == ‘US’].groupby(‘type’).filter(lambda g: len(g) > 1000).groupby(‘type’).size().sort_values(ascending=False)

7. Top N records

by_country = airports.groupby(['iso_country']).size().to_frame('airport_count').reset_index()

首先通过如上命令获取一个by_country的dataframe

如下例子中,首先依据count选取最多的十个国家;然后选出第十一到第二十的国家。

SQLPandas
select iso_country from by_country order by size desc limit 10by_country.nlargest(10, columns=‘airport_count’)
select iso_country from by_country order by size desc limit 10 offset 10by_country.nlargest(20, columns=‘airport_count’).tail(10)

8. aggregate functions(MIN, MAX, MEAN)

SQLPandas
select max(length_ft), min(length_ft),avg(length_ft), median(length_ft) from runwaysrunways.agg({‘length_ft’:[‘min’,‘max’,‘mean’,‘median’]}).T

9. JOIN

Pandas对应的是.merge()函数,之前的文章有详细说明函数具体参数的含义。

SQLPandas
select airport_ident, type, description, frequency_mhz from airport_freq join airports on airport_freq.airport_ref = airports.id where airports.ident=‘KLAX’airports_freq.merge(airports[airports.ident == ‘KLAX’][[‘id’]], left_on=‘airport_ref’,right_on=‘id’,how=‘inner’)[[‘airport_ident’,‘type’,‘description’,‘frequency_mhz’]]

10. UNION ALL and UNION

使用pd.concat()进行对应操作。

SQLPandas
select name, municipality from airports where ident = ‘KLAX’ union all select name, municipality from airports where ident=‘KLGB’pd.concat([airports[airports.ident == ‘KLAX’][[‘name’,‘municipality’]], airports[airports.ident==‘KLGB’][[‘name’,‘municipality’]]])

11. INSERT

SQLPandas
create table heroes(id integer, name text);df1 = pd.DataFrame({‘id’:[1,2],‘name’:[‘harry potter’, ‘ron weasley’]})
insert into heroes values(3, ‘hermione granger’)pd.concat([df1, pd.DataFrame({‘id’:[3],‘name’:[‘hermione granger’]})]).reset_index(drop=True)

12. UPDATE

SQLPandas
update airports set home_link=‘http://www.lawa.org/welcomelax.aspx’ where ident=='KLAXairports.loc[airports.ident==‘KLAX’,‘home_link’] = ‘http://www.lawa.org/welcomelax.aspx’

13. DELETE

SQLPandas
delete from lax_freq where type=‘MISC’lax_freq=lax_freq[lax_freq[‘type’] !=‘MISC’]
lax_freq.drop(lax_freq[lax_freq.type==‘MISC’].index)

两种方法,一种直接重写,一种使用drop

14. and more

导出到其格式:

df.to_csv(...)  # csv file
df.to_hdf(...)  # HDF5 file
df.to_pickle(...)  # serialized object
df.to_sql(...)  # to SQL database
df.to_excel(...)  # to Excel sheet
df.to_json(...)  # to JSON string
df.to_html(...)  # render as HTML table
df.to_feather(...)  # binary feather-format
df.to_latex(...)  # tabular environment table
df.to_stata(...)  # Stata binary data files
df.to_msgpack(...)	# msgpack (serialize) object
df.to_gbq(...)  # to a Google BigQuery table.
df.to_string(...)  # console-friendly tabular output.
df.to_clipboard(...) # clipboard that can be pasted into Excel

绘图:


top_10.plot(
    x='iso_country', 
    y='airport_count',
    kind='barh',
    figsize=(10, 7),
    title='Top 10 countries with most airports')

在这里插入图片描述

  • 0
    点赞
  • 4
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值