Pandas和数据库查询语言SQL的对比
- Pandas:Python最流行的数据处理与数据分析的类库
- SQL:结构化查询语言,用于对MySQL、Oracle等关系型数据库的增删改查
两者都是对“表格型”数据的操作和查询,所以很多语法都能对应起来
对比列表:
- SELECT数据查询
- WHERE按条件查询
- in和not in的条件查询
- groupby分组统计
- JOIN数据关联
- UNION数据合并
- Order Limit先排序后分页
- 取每个分组group的top n
- UPDATE数据更新
- DELETE删除数据
0. 读取泰坦尼克数据集
import pandas as pd
import numpy as np
df = pd.read_csv("./datas/titanic/titanic_train.csv")
df.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
1. SELECT数据查询
sql = """
SELECT PassengerId, Sex, Age, Survived
FROM titanic
LIMIT 5;
"""
df[["PassengerId", "Sex", "Age", "Survived"]].head(5)
| PassengerId | Sex | Age | Survived |
---|
0 | 1 | male | 22.0 | 0 |
1 | 2 | female | 38.0 | 1 |
2 | 3 | female | 26.0 | 1 |
3 | 4 | female | 35.0 | 1 |
4 | 5 | male | 35.0 | 0 |
df.head(5)类似select * from table limit 5,查询所有的字段
2. WHERE按条件查询
sql = """
SELECT *
FROM titanic
where Sex='male' and Age>=20.0 and Age<=40.0
LIMIT 5;
"""
condition = (df["Sex"]=="male") & (df["Age"]>=20.0) & (df["Age"]<=40.0)
condition.value_counts()
False 629
True 262
dtype: int64
df[condition].head(5)
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.250 | NaN | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.050 | NaN | S |
12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.0 | 0 | 0 | A/5. 2151 | 8.050 | NaN | S |
13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.0 | 1 | 5 | 347082 | 31.275 | NaN | S |
20 | 21 | 0 | 2 | Fynney, Mr. Joseph J | male | 35.0 | 0 | 0 | 239865 | 26.000 | NaN | S |
3. in和not in的条件查询
df["Pclass"].unique()
array([3, 1, 2], dtype=int64)
sql = """
SELECT *
FROM titanic
where Pclass in (1,2)
LIMIT 5;
"""
df[df["Pclass"].isin((1,2))].head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S |
df[~df["Pclass"].isin((1,2))].head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
4. groupby分组统计
4.1 单个列的聚合
sql = """
SELECT
-- 分性别的存活人数
sum(Survived),
-- 分性别的平均年龄
mean(Age)
-- 分性别的平均票价
mean(Fare)
FROM titanic
group by Sex
"""
df.groupby("Sex").agg({"Survived":np.sum, "Age":np.mean, "Fare":np.mean})
| Survived | Age | Fare |
---|
Sex | | | |
---|
female | 233 | 27.915709 | 44.479818 |
male | 109 | 30.726645 | 25.523893 |
4.2 多个列的聚合
sql = """
SELECT
-- 不同存活和性别分组的,平均年龄
mean(Age)
-- 不同存活和性别分组的,平均票价
mean(Fare)
FROM titanic
group by Survived, Sex
"""
df.groupby(["Survived", "Sex"]).agg({"Age":np.mean, "Fare":np.mean})
| | Age | Fare |
---|
Survived | Sex | | |
---|
0 | female | 25.046875 | 23.024385 |
male | 31.618056 | 21.960993 |
1 | female | 28.847716 | 51.938573 |
male | 27.276022 | 40.821484 |
5. JOIN数据关联
df_rating = pd.read_csv("./datas/ml-latest-small/ratings.csv")
df_rating.head(5)
| userId | movieId | rating | timestamp |
---|
0 | 1 | 1 | 4.0 | 964982703 |
1 | 1 | 3 | 4.0 | 964981247 |
2 | 1 | 6 | 4.0 | 964982224 |
3 | 1 | 47 | 5.0 | 964983815 |
4 | 1 | 50 | 5.0 | 964982931 |
df_movies = pd.read_csv("./datas/ml-latest-small/movies.csv")
df_movies.head(5)
| movieId | title | genres |
---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
sql = """
SELECT *
FROM
rating join movies
on(rating.movieId=movies.movieId)
limit 5
"""
df_merged = pd.merge(left=df_rating, right=df_movies, on="movieId")
df_merged.head(5)
| userId | movieId | rating | timestamp | title | genres |
---|
0 | 1 | 1 | 4.0 | 964982703 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
1 | 5 | 1 | 4.0 | 847434962 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
2 | 7 | 1 | 4.5 | 1106635946 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
3 | 15 | 1 | 2.5 | 1510577970 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
4 | 17 | 1 | 4.5 | 1305696483 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
6. UNION数据合并
df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
'rank': range(1, 4)})
df1
| city | rank |
---|
0 | Chicago | 1 |
1 | San Francisco | 2 |
2 | New York City | 3 |
df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],
'rank': [1, 4, 5]})
df2
| city | rank |
---|
0 | Chicago | 1 |
1 | Boston | 4 |
2 | Los Angeles | 5 |
sql = """
SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;
"""
pd.concat([df1, df2])
| city | rank |
---|
0 | Chicago | 1 |
1 | San Francisco | 2 |
2 | New York City | 3 |
0 | Chicago | 1 |
1 | Boston | 4 |
2 | Los Angeles | 5 |
7. Order Limit先排序后分页
sql = """
SELECT *
from titanic
order by Fare
limit 5
"""
df.sort_values("Fare", ascending=False).head(5)
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|
258 | 259 | 1 | 1 | Ward, Miss. Anna | female | 35.0 | 0 | 0 | PC 17755 | 512.3292 | NaN | C |
737 | 738 | 1 | 1 | Lesurer, Mr. Gustave J | male | 35.0 | 0 | 0 | PC 17755 | 512.3292 | B101 | C |
679 | 680 | 1 | 1 | Cardeza, Mr. Thomas Drake Martinez | male | 36.0 | 0 | 1 | PC 17755 | 512.3292 | B51 B53 B55 | C |
88 | 89 | 1 | 1 | Fortune, Miss. Mabel Helen | female | 23.0 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
27 | 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19.0 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
8. 取每个分组group的top n
df.groupby(["Survived", "Sex"]).apply(
lambda df:df.sort_values("Age", ascending=False).head(2))
| | | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|
Survived | Sex | | | | | | | | | | | | | |
---|
0 | female | 772 | 773 | 0 | 2 | Mack, Mrs. (Mary) | female | 57.0 | 0 | 0 | S.O./P.P. 3 | 10.5000 | E77 | S |
177 | 178 | 0 | 1 | Isham, Miss. Ann Elizabeth | female | 50.0 | 0 | 0 | PC 17595 | 28.7125 | C49 | C |
male | 851 | 852 | 0 | 3 | Svensson, Mr. Johan | male | 74.0 | 0 | 0 | 347060 | 7.7750 | NaN | S |
493 | 494 | 0 | 1 | Artagaveytia, Mr. Ramon | male | 71.0 | 0 | 0 | PC 17609 | 49.5042 | NaN | C |
1 | female | 483 | 484 | 1 | 3 | Turkula, Mrs. (Hedwig) | female | 63.0 | 0 | 0 | 4134 | 9.5875 | NaN | S |
275 | 276 | 1 | 1 | Andrews, Miss. Kornelia Theodosia | female | 63.0 | 1 | 0 | 13502 | 77.9583 | D7 | S |
male | 630 | 631 | 1 | 1 | Barkworth, Mr. Algernon Henry Wilson | male | 80.0 | 0 | 0 | 27042 | 30.0000 | A23 | S |
570 | 571 | 1 | 2 | Harris, Mr. George | male | 62.0 | 0 | 0 | S.W./PP 752 | 10.5000 | NaN | S |
9. UPDATE数据更新
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
sql = """
UPDATE titanic
set Age=0
where Age is null
"""
condition = df["Age"].isna()
condition.value_counts()
False 714
True 177
Name: Age, dtype: int64
df[condition] = 0
df["Age"].isna().value_counts()
False 891
Name: Age, dtype: int64
10. DELETE删除数据
sql = """
DELETE FROM titanic
where Age=0
"""
df_new = df[df["Age"]!=0]
df_new[df_new["Age"]==0]
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|