数据科学 5.3 数据整合 代码sql

import pandas as pd
import os 
# os.chdir(r"D:\Python_Training\script_Python\5Preprocessing")
sale=pd.read_csv(r".\sale.csv",encoding="gbk")
sale.head()
yearmarketsaleprofit
02010339122641
12010322462699
22010西347922574
32010318842673
42011316512437

5.1 SQL语句介绍

  • SQL2数据过滤与排序
  • 选择表中指定列

5.1.1 SQL数据库连接

import sqlite3 # sqlite3相当于轻量版,更多功能可使用SQLAlchemy
con = sqlite3.connect(':memory:') # 数据库连接
sale.to_sql('sale', con) # 将DataFrame注册成可用sql查询的表
newTable = pd.read_sql_query("select year, market, sale, profit from sale", con) # 也可使用read_sql
newTable.head()
yearmarketsaleprofit
02010339122641
12010322462699
22010西347922574
32010318842673
42011316512437

5.1.2 选择表中所有列

sqlResult = pd.read_sql_query('select * from sale', con)
sqlResult.head()
indexyearmarketsaleprofit
002010339122641
112010322462699
222010西347922574
332010318842673
442011316512437

5.1.3 删除重复的行

pd.read_sql_query("select DISTINCT  year from sale", con)
year
02010
12011
22012

5.1.4 选择满足条件的行

pd.read_sql_query("select * from sale where market in ('东','西') and year=2012", con)
indexyearmarketsaleprofit
082012316192106
1102012西321032593

5.1.5 对行进行排序

sql = '''select year, market, sale, profit
      from sale
      order by  sale desc'''
pd.read_sql_query(sql, con).head()
yearmarketsaleprofit
02010西347922574
12011西341752877
22010339122641
32012324433124
42010322462699

5.2纵向连接表

one = pd.read_csv(r"One.csv")
one.to_sql('One', con, index=False)
one.T
0123456
x1112346
aaabcveg
two = pd.read_csv(r"Two.csv")
two.to_sql('Two', con, index=False)
two.T
01234
x12335
bxyzvw

5.2.1 union 和 union all

  • UNION:返回两个结果集的并集
    • union剔除重复值
    • union all不剔除重复值
union = pd.read_sql('select * from one UNION select * from two', con)
union_all = pd.read_sql('select * from one UNION ALL select * from two', con)
union.T
0123456789
x1112233456
aabxcyvzewg
union_all.T
01234567891011
x111234612335
aaabcvegxyzvw

5.2.2 EXCEPT 和 INTERSECT

  • EXCEPT: 返回两个结果集的差(即从左查询中返回右查询没有找到的所有非重复值)
  • INTERSECT: 返回两个结果集的交集(即两个查询都返回的所有非重复值)
exceptTable = pd.read_sql('select * from one EXCEPT select * from two', con)
intersectTable = pd.read_sql('select * from one INTERSECT select * from two', con)
exceptTable.T
01234
x11246
aabceg
intersectTable.T
0
x3
av

5.3 练习: 多表纵向连接

5.3.1DataFrame操作

pd.concat([one, two], axis=0, join='outer',sort=False, ignore_index=True) # 更多参数可查看文档或帮助
xab
01aNaN
11aNaN
21bNaN
32cNaN
43vNaN
54eNaN
66gNaN
71NaNx
82NaNy
93NaNz
103NaNv
115NaNw

5.3.2 横向连接表

1、读取数据
table1 = pd.read_csv(r"Table1.csv")
table1.to_sql('table1', con, index=False)
table1.head()
ida
01a
12b
23c
table2 = pd.read_csv(r"Table2.csv")
table2.to_sql('table2', con, index=False)
table2.head()
idb
04d
13e
2、笛卡尔积
pd.read_sql("select * from table1, table2", con)
idaidb
01a4d
11a3e
22b4d
32b3e
43c4d
53c3e
3、内连接(使用inner join或使用where子句)
pd.read_sql("select * from table1 as a inner join table2 as b on a.id=b.id", con)
# pd.read_sql("select * from table1 as a, table2 as b where a.id=b.id", con)
idaidb
03c3e
4、左连接
pd.read_sql("select * from table1 as a left join table2 as b on a.id=b.id", con)
idaidb
01aNaNNone
12bNaNNone
23c3.0e
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

irober

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值