34. Pandas和数据库查询语言SQL的对比

Pandas和数据库查询语言SQL的对比

  • Pandas:Python最流行的数据处理与数据分析的类库
  • SQL:结构化查询语言,用于对MySQL、Oracle等关系型数据库的增删改查

两者都是对“表格型”数据的操作和查询,所以很多语法都能对应起来

对比列表:

  1. SELECT数据查询
  2. WHERE按条件查询
  3. in和not in的条件查询
  4. groupby分组统计
  5. JOIN数据关联
  6. UNION数据合并
  7. Order Limit先排序后分页
  8. 取每个分组group的top n
  9. UPDATE数据更新
  10. DELETE删除数据

0. 读取泰坦尼克数据集

import pandas as pd
import numpy as np
df = pd.read_csv("./datas/titanic/titanic_train.csv")
df.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS

1. SELECT数据查询

# SQL:
sql = """
    SELECT PassengerId, Sex, Age, Survived
    FROM titanic
    LIMIT 5;
"""
# Pandas
df[["PassengerId", "Sex", "Age", "Survived"]].head(5)
PassengerIdSexAgeSurvived
01male22.00
12female38.01
23female26.01
34female35.01
45male35.00

df.head(5)类似select * from table limit 5,查询所有的字段

2. WHERE按条件查询

# SQL:
sql = """
    SELECT *
    FROM titanic
    where Sex='male' and Age>=20.0 and Age<=40.0
    LIMIT 5;
"""
# 使用括号的方式,级联多个条件|
condition = (df["Sex"]=="male") & (df["Age"]>=20.0) & (df["Age"]<=40.0)
condition.value_counts()
False    629
True     262
dtype: int64
df[condition].head(5)
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.250NaNS
4503Allen, Mr. William Henrymale35.0003734508.050NaNS
121303Saundercock, Mr. William Henrymale20.000A/5. 21518.050NaNS
131403Andersson, Mr. Anders Johanmale39.01534708231.275NaNS
202102Fynney, Mr. Joseph Jmale35.00023986526.000NaNS

3. in和not in的条件查询

df["Pclass"].unique()
array([3, 1, 2], dtype=int64)
# SQL:
sql = """
    SELECT *
    FROM titanic
    where Pclass in (1,2)
    LIMIT 5;
"""
# in 
df[df["Pclass"].isin((1,2))].head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
6701McCarthy, Mr. Timothy Jmale54.0001746351.8625E46S
91012Nasser, Mrs. Nicholas (Adele Achem)female14.01023773630.0708NaNC
111211Bonnell, Miss. Elizabethfemale58.00011378326.5500C103S
# not in 
df[~df["Pclass"].isin((1,2))].head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
5603Moran, Mr. JamesmaleNaN003308778.4583NaNQ
7803Palsson, Master. Gosta Leonardmale2.03134990921.0750NaNS

4. groupby分组统计

4.1 单个列的聚合
# SQL:
sql = """
    SELECT 
        -- 分性别的存活人数
        sum(Survived),
        -- 分性别的平均年龄
        mean(Age)
        -- 分性别的平均票价
        mean(Fare)
    FROM titanic
    group by Sex
"""
df.groupby("Sex").agg({"Survived":np.sum, "Age":np.mean, "Fare":np.mean})
SurvivedAgeFare
Sex
female23327.91570944.479818
male10930.72664525.523893
4.2 多个列的聚合
# SQL:
sql = """
    SELECT 
        -- 不同存活和性别分组的,平均年龄
        mean(Age)
        -- 不同存活和性别分组的,平均票价
        mean(Fare)
    FROM titanic
    group by Survived, Sex
"""
df.groupby(["Survived", "Sex"]).agg({"Age":np.mean, "Fare":np.mean})
AgeFare
SurvivedSex
0female25.04687523.024385
male31.61805621.960993
1female28.84771651.938573
male27.27602240.821484

5. JOIN数据关联

# 电影评分数据集,评分表
df_rating = pd.read_csv("./datas/ml-latest-small/ratings.csv")
df_rating.head(5)
userIdmovieIdratingtimestamp
0114.0964982703
1134.0964981247
2164.0964982224
31475.0964983815
41505.0964982931
# 电影评分数据集,电影信息表
df_movies = pd.read_csv("./datas/ml-latest-small/movies.csv")
df_movies.head(5)
movieIdtitlegenres
01Toy Story (1995)Adventure|Animation|Children|Comedy|Fantasy
12Jumanji (1995)Adventure|Children|Fantasy
23Grumpier Old Men (1995)Comedy|Romance
34Waiting to Exhale (1995)Comedy|Drama|Romance
45Father of the Bride Part II (1995)Comedy
# SQL:
sql = """
    SELECT *
    FROM 
        rating join movies 
        on(rating.movieId=movies.movieId)
    limit 5
"""
df_merged = pd.merge(left=df_rating, right=df_movies, on="movieId")
df_merged.head(5)
userIdmovieIdratingtimestamptitlegenres
0114.0964982703Toy Story (1995)Adventure|Animation|Children|Comedy|Fantasy
1514.0847434962Toy Story (1995)Adventure|Animation|Children|Comedy|Fantasy
2714.51106635946Toy Story (1995)Adventure|Animation|Children|Comedy|Fantasy
31512.51510577970Toy Story (1995)Adventure|Animation|Children|Comedy|Fantasy
41714.51305696483Toy Story (1995)Adventure|Animation|Children|Comedy|Fantasy

6. UNION数据合并

df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
                    'rank': range(1, 4)}) 
df1
cityrank
0Chicago1
1San Francisco2
2New York City3
df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],
                    'rank': [1, 4, 5]})
df2
cityrank
0Chicago1
1Boston4
2Los Angeles5
# SQL:
sql = """
    SELECT city, rank
    FROM df1
    
    UNION ALL
    
    SELECT city, rank
    FROM df2;
"""
# pandas
pd.concat([df1, df2])
cityrank
0Chicago1
1San Francisco2
2New York City3
0Chicago1
1Boston4
2Los Angeles5

7. Order Limit先排序后分页

# SQL:
sql = """
    SELECT *
    from titanic
    order by Fare
    limit 5
"""
df.sort_values("Fare", ascending=False).head(5)
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
25825911Ward, Miss. Annafemale35.000PC 17755512.3292NaNC
73773811Lesurer, Mr. Gustave Jmale35.000PC 17755512.3292B101C
67968011Cardeza, Mr. Thomas Drake Martinezmale36.001PC 17755512.3292B51 B53 B55C
888911Fortune, Miss. Mabel Helenfemale23.03219950263.0000C23 C25 C27S
272801Fortune, Mr. Charles Alexandermale19.03219950263.0000C23 C25 C27S

8. 取每个分组group的top n

# MYSQL不支持
# Oracle有ROW_NUMBER语法
# 按(Survived,Sex)分组,取Age的TOP 2
df.groupby(["Survived", "Sex"]).apply(
    lambda df:df.sort_values("Age", ascending=False).head(2))
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
SurvivedSex
0female77277302Mack, Mrs. (Mary)female57.000S.O./P.P. 310.5000E77S
17717801Isham, Miss. Ann Elizabethfemale50.000PC 1759528.7125C49C
male85185203Svensson, Mr. Johanmale74.0003470607.7750NaNS
49349401Artagaveytia, Mr. Ramonmale71.000PC 1760949.5042NaNC
1female48348413Turkula, Mrs. (Hedwig)female63.00041349.5875NaNS
27527611Andrews, Miss. Kornelia Theodosiafemale63.0101350277.9583D7S
male63063111Barkworth, Mr. Algernon Henry Wilsonmale80.0002704230.0000A23S
57057112Harris, Mr. Georgemale62.000S.W./PP 75210.5000NaNS

9. UPDATE数据更新

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
# SQL:
sql = """
    UPDATE titanic
    set Age=0
    where Age is null
"""
condition = df["Age"].isna()
condition.value_counts()
False    714
True     177
Name: Age, dtype: int64
df[condition] = 0
df["Age"].isna().value_counts()
False    891
Name: Age, dtype: int64

10. DELETE删除数据

# SQL:
sql = """
    DELETE FROM titanic
    where Age=0
"""
df_new = df[df["Age"]!=0]
df_new[df_new["Age"]==0]
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值