目的:对于同一种数据查询/操作需求,比较SQL和Python Pandas的操作。
说明:
- 我的学习路径是先SQL后Python Pandas,所以本文也按照这个顺序。
- SQL的方言是T-SQL(SqlServer)
# %%
import pandas as pd
# %%
wine_reviews = pd.read_csv("./data/winemag-data/winemag-data-130k-v2.csv", index_col=0)
# %%
wine_reviews.head()
DDL
- ADD COLUMN
- DROP COLUMN
- ADD INDEX
- DROP INDEX
# %%
# -- ADD NEW COLUMN
# ALTER TABLE WINE_REVIEWS_NEW ADD price_point_ratio DECIMAL(17,4) NULL
# UPDATE WINE_REVIEWS_NEW SET price_point_ratio = price/(case when points=0 then 1 else points end)
wine_reviews['price_point_ratio'] = wine_reviews['price']/wine_reviews['points']
wine_reviews.head()
# %%
# -- DROP COLUMN
# ALTER TABLE WINE_REVIEWS_NEW DROP COLUMN price_point_ratio
wine_reviews = wine_reviews.drop(['price_point_ratio'], axis = 1) # drop column based on column name
# wine_reviews = wine_reviews.drop(wine_reviews.columns[13], axis = 1) # drop column based on column name
# %%
# -- ADD INDEX
# CREATE NONCLUSTERED INDEX IDX_WINE_REVIEWS_NEW_PROVINCE ON WINE_REVIEWS_NEW
# (
# province
# )
wine_test = wine_reviews.set_index(['province']) # set province column as index
wine_test.head()
## note: the original index column will be overwritten, so copy the original index to another column before-hand if neccessary. see above
# wine_test['province'] =wine_test.index
# %%
# -- DROP INDEX
# DROP INDEX IDX_WINE_REVIEWS_NEW_PROVINCE ON WINE_REVIEWS_NEW
# there no method for dropping index, you simply set index to another column
wine_test = wine_test.set_index(['country'])
DML
- SELECT VALUES
- SELECT … WHERE AND , OR , IS NULL, IS NOT NULL, IN,
- ORDER BY
- COLUMN OPERATION
- DISTINCT VALUES
- GROUP BY
- JOIN
- INNER JOIN
- LEFT JOIN
- FULL JO