[Exercises + Annotations + Answers] Pandas -- Python for Data Analysis

Table of Contents

Preface

1. MovieLens 1M Dataset

2. US Baby Names, 1880-2010

3. USDA Food Database

4. 2012 Federal Election Commission Database


Preface

First, these exercises are meant to be run in a Jupyter Notebook environment.

If you have not set up Jupyter Notebook yet, you can refer to this tutorial: Jupyter Notebook打开教程_jupyter notebook怎么打开_XF鸭的博客-CSDN博客

The data packages the exercises need can be downloaded from this Baidu Netdisk link:

Link: https://pan.baidu.com/s/1HyrRgJommbv_VW09LNjmRg
Extraction code: jzqz

Note: 【Pandas-网络对抗赛题目.ipynb】 contains the exercises and should be opened in Jupyter Notebook, while the 【datasets】 folder holds the accompanying data; the two must be placed in the same Jupyter Notebook directory.


1. MovieLens 1M Dataset

The GroupLens lab provides collections of movie rating data gathered from MovieLens users from the late 1990s through the early 2000s. These data include movie ratings, genres, years, and viewer information (age, zip code, gender, occupation).
The MovieLens 1M dataset contains one million ratings of 4,000 movies by 6,000 users, spread across three tables: ratings, user information, and movie information.

To begin, read in each file; make sure the files sit in the same location relative to the notebook as shown below.

import pandas as pd
# All paths are relative to the notebook's directory
# Read "users.dat" with column names unames and separator "::". header=None
# tells pandas the file has no header row, so the names in unames are used.
unames = ["user_id", "gender", "age", "occupation", "zip"]
users = pd.read_table("datasets/movielens/users.dat", sep="::", header=None, names=unames, engine="python")

# Read "ratings.dat"
rnames = ["user_id", "movie_id", "rating", "timestamp"]
ratings = pd.read_table("datasets/movielens/ratings.dat", sep="::", header=None, names=rnames, engine="python")

# Read "movies.dat"
mnames = ["movie_id", "title", "genres"]
movies = pd.read_table("datasets/movielens/movies.dat", sep="::", header=None, names=mnames, engine="python")
# This snippet reads three files ("users.dat", "ratings.dat", "movies.dat")
# into separate pandas DataFrames (users, ratings, movies) for further analysis.
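As a quick sanity check, the same read_table pattern can be tried on a couple of inline sample rows (the rows below are invented, in the "::"-separated shape of users.dat), using io.StringIO as a stand-in for the file:

```python
import io
import pandas as pd

# Two made-up rows in the users.dat format: user_id::gender::age::occupation::zip
sample = "1::F::1::10::48067\n2::M::56::16::70072"
unames = ["user_id", "gender", "age", "occupation", "zip"]
# Multi-character separators like "::" require the python engine
users = pd.read_table(io.StringIO(sample), sep="::", header=None,
                      names=unames, engine="python")
print(users.loc[0, "gender"])  # F
```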

Now try running the following four lines:

users.head(5)      # first 5 rows of the users DataFrame
ratings.head(5)    # first 5 rows of the ratings DataFrame
movies.head(5)     # first 5 rows of the movies DataFrame
ratings            # the full ratings DataFrame
         user_id  movie_id  rating  timestamp
0              1      1193       5  978300760
1              1       661       3  978302109
2              1       914       3  978301968
3              1      3408       4  978300275
4              1      2355       5  978824291
...          ...       ...     ...        ...
1000204     6040      1091       1  956716541
1000205     6040      1094       5  956704887
1000206     6040       562       5  956704746
1000207     6040      1096       4  956715648
1000208     6040      1097       4  956715569

1000209 rows × 4 columns

data = pd.merge(pd.merge(ratings, users), movies)  # merge the three tables
data
data.iloc[0]  # extract the row at positional index 0, all columns
# The output:
user_id                                            1
movie_id                                        1193
rating                                             5
timestamp                                  978300760
gender                                             F
age                                                1
occupation                                        10
zip                                            48067
title         One Flew Over the Cuckoo's Nest (1975)
genres                                         Drama
Name: 0, dtype: object
mean_ratings = data.pivot_table("rating", index="title",
                                columns="gender", aggfunc="mean")
mean_ratings.head(5)  # mean rating per movie title, split by gender
gender                                F         M
title
$1,000,000 Duck (1971)         3.375000  2.761905
'Night Mother (1986)           3.388889  3.352941
'Til There Was You (1997)      2.675676  2.733333
'burbs, The (1989)             2.793478  2.962085
...And Justice for All (1979)  3.828571  3.689024
# Count how many ratings each movie title received
ratings_by_title = data.groupby("title").size()
ratings_by_title.head()

# Titles rated at least 250 times count as "active" and become the index
active_titles = ratings_by_title.index[ratings_by_title >= 250]
active_titles

# The resulting index

Index([''burbs, The (1989)', '10 Things I Hate About You (1999)',
       '101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)',
       '13th Warrior, The (1999)', '2 Days in the Valley (1996)',
       '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
       '2010 (1984)',
       ...
       'X-Men (2000)', 'Year of Living Dangerously (1982)',
       'Yellow Submarine (1968)', 'You've Got Mail (1998)',
       'Young Frankenstein (1974)', 'Young Guns (1988)',
       'Young Guns II (1990)', 'Young Sherlock Holmes (1985)',
       'Zero Effect (1998)', 'eXistenZ (1999)'],
      dtype='object', name='title', length=1216)
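The count-then-filter pattern used for active_titles can be checked on a tiny made-up frame (titles and ratings invented):

```python
import pandas as pd

# Count ratings per title, then keep only titles with at least 2 ratings
data = pd.DataFrame({"title": ["A", "A", "B", "C", "C", "C"],
                     "rating": [5, 4, 3, 2, 5, 4]})
counts = data.groupby("title").size()
active = counts.index[counts >= 2]
print(list(active))  # ['A', 'C']
```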
mean_ratings = mean_ratings.loc[active_titles]  # mean ratings for active titles only
mean_ratings
gender                                    F         M
title
'burbs, The (1989)                 2.793478  2.962085
10 Things I Hate About You (1999)  3.646552  3.311966
101 Dalmatians (1961)              3.791444  3.500000
101 Dalmatians (1996)              3.240000  2.911215
12 Angry Men (1957)                4.184397  4.328421
...                                     ...       ...
Young Guns (1988)                  3.371795  3.425620
Young Guns II (1990)               2.934783  2.904025
Young Sherlock Holmes (1985)       3.514706  3.363344
Zero Effect (1998)                 3.864407  3.723140
eXistenZ (1999)                    3.098592  3.289086

1216 rows × 2 columns

# Rename one movie title in the index
mean_ratings = mean_ratings.rename(index={"Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)":"Seven Samurai (Shichinin no samurai) (1954)"})

# Sort mean_ratings in descending order by female ratings
top_female_ratings = mean_ratings.sort_values("F", ascending=False)
top_female_ratings.head()
gender                                               F         M
title
Clean Slate (Coup de Torchon) (1981)               5.0  3.857143
Ballad of Narayama, The (Narayama Bushiko) (1958)  5.0  3.428571
Raw Deal (1948)                                    5.0  3.307692
Bittersweet Motel (2000)                           5.0       NaN
Skipped Parts (2000)                               5.0  4.000000

 

# Define diff as the rating gap (male mean minus female mean)
mean_ratings["diff"] = mean_ratings["M"] - mean_ratings["F"]
# Sort mean_ratings by diff in ascending order
sorted_by_diff = mean_ratings.sort_values("diff")
sorted_by_diff.head()
gender                            F         M      diff
title
Dirty Dancing (1987)       3.790378  2.959596 -0.830782
Jumpin' Jack Flash (1986)  3.254717  2.578358 -0.676359
Grease (1978)              3.975265  3.367041 -0.608224
Little Women (1994)        3.870588  3.321739 -0.548849
Steel Magnolias (1989)     3.901734  3.365957 -0.535777
sorted_by_diff[::-1].head()  # descending order
gender                                         F         M      diff
title
Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  0.726351
Kentucky Fried Movie, The (1977)        2.878788  3.555147  0.676359
Dumb & Dumber (1994)                    2.697987  3.336595  0.638608
Longest Day, The (1962)                 3.411765  4.031447  0.619682
Cable Guy, The (1996)                   2.250000  2.863787  0.613787
# Compute the standard deviation of ratings per title, then keep active titles
rating_std_by_title = data.groupby("title")["rating"].std()
rating_std_by_title = rating_std_by_title.loc[active_titles]
rating_std_by_title.head()
# The standard deviations:
title
'burbs, The (1989)                   1.107760
10 Things I Hate About You (1999)    0.989815
101 Dalmatians (1961)                0.982103
101 Dalmatians (1996)                1.098717
12 Angry Men (1957)                  0.812731
Name: rating, dtype: float64
# Sort the standard deviations in descending order and take the top 10
rating_std_by_title.sort_values(ascending=False)[:10]
title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Eyes Wide Shut (1999)                    1.259624
Evita (1996)                             1.253631
Billy Madison (1995)                     1.249970
Fear and Loathing in Las Vegas (1998)    1.246408
Bicentennial Man (1999)                  1.245533
Name: rating, dtype: float64
# Split the genres string of the movies table on "|"
movies["genres"].head()
movies["genres"].head().str.split("|")
movies["genre"] = movies.pop("genres").str.split("|")
movies.head()
   movie_id                               title                             genre
0         1                    Toy Story (1995)   [Animation, Children's, Comedy]
1         2                      Jumanji (1995)  [Adventure, Children's, Fantasy]
2         3             Grumpier Old Men (1995)                 [Comedy, Romance]
3         4            Waiting to Exhale (1995)                   [Comedy, Drama]
4         5  Father of the Bride Part II (1995)                          [Comedy]
# Expand each movie's genre list into separate rows; e.g. Toy Story has
# three genres, so it becomes three rows
movies_exploded = movies.explode("genre")
movies_exploded[:10]
   movie_id                     title       genre
0         1          Toy Story (1995)   Animation
0         1          Toy Story (1995)  Children's
0         1          Toy Story (1995)      Comedy
1         2            Jumanji (1995)   Adventure
1         2            Jumanji (1995)  Children's
1         2            Jumanji (1995)     Fantasy
2         3   Grumpier Old Men (1995)      Comedy
2         3   Grumpier Old Men (1995)     Romance
3         4  Waiting to Exhale (1995)      Comedy
3         4  Waiting to Exhale (1995)       Drama
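The explode() call above can be verified on a tiny invented frame, where one list-valued cell becomes one row per element:

```python
import pandas as pd

# Two made-up movies, one with two genres and one with one
movies = pd.DataFrame({"title": ["Toy Story", "Jumanji"],
                       "genre": [["Animation", "Comedy"], ["Adventure"]]})
exploded = movies.explode("genre")  # one row per (title, genre) pair
print(len(exploded))  # 3
```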
ratings_with_genre = pd.merge(pd.merge(movies_exploded, ratings), users)  # merge movies_exploded, ratings, and users
ratings_with_genre.iloc[0]  # extract row 0
genre_ratings = (ratings_with_genre.groupby(["genre", "age"])
                 ["rating"].mean()
                 .unstack("age"))
genre_ratings[:10]  # mean rating of each genre by age group
age                 1         18        25        35        45        50        56
genre
Action       3.506385  3.447097  3.453358  3.538107  3.528543  3.611333  3.610709
Adventure    3.449975  3.408525  3.443163  3.515291  3.528963  3.628163  3.649064
Animation    3.476113  3.624014  3.701228  3.740545  3.734856  3.780020  3.756233
Children's   3.241642  3.294257  3.426873  3.518423  3.527593  3.556555  3.621822
Comedy       3.497491  3.460417  3.490385  3.561984  3.591789  3.646868  3.650949
Crime        3.710170  3.668054  3.680321  3.733736  3.750661  3.810688  3.832549
Documentary  3.730769  3.865865  3.946690  3.953747  3.966521  3.908108  3.961538
Drama        3.794735  3.721930  3.726428  3.782512  3.784356  3.878415  3.933465
Fantasy      3.317647  3.353778  3.452484  3.482301  3.532468  3.581570  3.532700
Film-Noir    4.145455  3.997368  4.058725  4.064910  4.105376  4.175401  4.125932
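The groupby-mean-unstack chain used for genre_ratings can be sketched on invented (genre, age, rating) rows:

```python
import pandas as pd

# Four made-up ratings across two genres and two age groups
df = pd.DataFrame({"genre": ["Action", "Action", "Drama", "Drama"],
                   "age": [18, 25, 18, 25],
                   "rating": [4.0, 3.0, 5.0, 4.0]})
# Mean rating per (genre, age), then pivot ages into columns
table = df.groupby(["genre", "age"])["rating"].mean().unstack("age")
print(table.loc["Drama", 18])  # 5.0
```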

2. US Baby Names, 1880-2010

The US Social Security Administration (SSA) provides data on baby name frequencies from 1880 to the present. These data support many analyses:
Visualize the proportion of babies given a particular name over time
Determine a name's relative rank
Find the most popular names of each year, or the names whose popularity rose or fell the most

 

# Import the necessary packages
import matplotlib.pyplot as plt
import numpy as np
# Read "yob1880.txt" into a pandas DataFrame named names1880
names1880 = pd.read_csv("datasets/babynames/yob1880.txt",
                        names=["name", "sex", "births"])
names1880
           name sex  births
0          Mary   F    7065
1          Anna   F    2604
2          Emma   F    2003
3     Elizabeth   F    1939
4        Minnie   F    1746
...         ...  ..     ...
1995     Woodie   M       5
1996     Worthy   M       5
1997     Wright   M       5
1998       York   M       5
1999  Zachariah   M       5

2000 rows × 3 columns

# Total births by sex
names1880.groupby("sex")["births"].sum()
sex
F     90993
M    110493
Name: births, dtype: int64
pieces = []

# Read the files for 1880 through 2010, storing each in its own DataFrame
for year in range(1880, 2011):
    path = f"datasets/babynames/yob{year}.txt"
    frame = pd.read_csv(path, names=["name", "sex", "births"])

    # Add a column for the year
    frame["year"] = year
    pieces.append(frame)

# Concatenate all the DataFrames with concat() into a final DataFrame named names
names = pd.concat(pieces, ignore_index=True)
names    # inspect names
              name sex  births  year
0             Mary   F    7065  1880
1             Anna   F    2604  1880
2             Emma   F    2003  1880
3        Elizabeth   F    1939  1880
4           Minnie   F    1746  1880
...            ...  ..     ...   ...
1690779    Zymaire   M       5  2010
1690780     Zyonne   M       5  2010
1690781  Zyquarius   M       5  2010
1690782      Zyran   M       5  2010
1690783      Zzyzx   M       5  2010

1690784 rows × 4 columns

# Pivot names with pivot_table() into a new DataFrame named total_births
total_births = names.pivot_table("births", index="year",
                                 columns="sex", aggfunc=sum)

# Print the last few rows of total_births
total_births.tail()

# Plot total births by year and sex as a line chart
total_births.plot(title="Total births by sex and year")
<Axes: title={'center': 'Total births by sex and year'}, xlabel='year'>

# For each group in names, compute the proportion of births and add it to
# the group as a new column named "prop".
def add_prop(group):
    group["prop"] = group["births"] / group["births"].sum()
    return group

names = names.groupby(["year", "sex"], group_keys=False).apply(add_prop)
names    # inspect names
              name sex  births  year      prop
0             Mary   F    7065  1880  0.077643
1             Anna   F    2604  1880  0.028618
2             Emma   F    2003  1880  0.022013
3        Elizabeth   F    1939  1880  0.021309
4           Minnie   F    1746  1880  0.019188
...            ...  ..     ...   ...       ...
1690779    Zymaire   M       5  2010  0.000003
1690780     Zyonne   M       5  2010  0.000003
1690781  Zyquarius   M       5  2010  0.000003
1690782      Zyran   M       5  2010  0.000003
1690783      Zzyzx   M       5  2010  0.000003

1690784 rows × 5 columns

# A Series containing the sum of prop for every (year, sex) group --
# a sanity check that each group's proportions sum to 1
names.groupby(["year", "sex"])["prop"].sum()

 

year  sex
1880  F      1.0
      M      1.0
1881  F      1.0
      M      1.0
1882  F      1.0
            ... 
2008  M      1.0
2009  F      1.0
      M      1.0
2010  F      1.0
      M      1.0
Name: prop, Length: 262, dtype: float64
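The add_prop step can also be written without apply, using groupby(...).transform; a minimal sketch on a made-up two-year frame (all numbers invented), where the proportions again sum to 1 within each year:

```python
import pandas as pd

names = pd.DataFrame({"year": [1880, 1880, 1881, 1881],
                      "births": [25, 75, 50, 50]})
# transform("sum") broadcasts each year's total back to its rows
names["prop"] = names["births"] / names.groupby("year")["births"].transform("sum")
print(names.groupby("year")["prop"].sum().tolist())  # [1.0, 1.0]
```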
# Sort each group by births in descending order and take the first 1000
def get_top1000(group):
    return group.sort_values("births", ascending=False)[:1000]

# Group names by the "year" and "sex" columns
grouped = names.groupby(["year", "sex"])

# Apply get_top1000 to each group
top1000 = grouped.apply(get_top1000)

top1000.head()
                   name sex  births  year      prop
year sex
1880 F   0         Mary   F    7065  1880  0.077643
         1         Anna   F    2604  1880  0.028618
         2         Emma   F    2003  1880  0.022013
         3    Elizabeth   F    1939  1880  0.021309
         4       Minnie   F    1746  1880  0.019188
# Reset top1000's index with reset_index()
top1000 = top1000.reset_index(drop=True)

top1000.head()
        name sex  births  year      prop
0       Mary   F    7065  1880  0.077643
1       Anna   F    2604  1880  0.028618
2       Emma   F    2003  1880  0.022013
3  Elizabeth   F    1939  1880  0.021309
4     Minnie   F    1746  1880  0.019188
# Split the data by sex into two new DataFrames: boys and girls
boys = top1000[top1000["sex"] == "M"]
girls = top1000[top1000["sex"] == "F"]
"""total_births aggregates the data by year and name, filling each cell
with that name's total births in that year."""
total_births = top1000.pivot_table("births", index="year",
                                   columns="name",
                                   aggfunc=sum)
total_births.info()
subset = total_births[["John", "Harry", "Mary", "Marilyn"]]
subset.plot(subplots=True, figsize=(12, 10),
            title="Number of births per year")
<class 'pandas.core.frame.DataFrame'>
Index: 131 entries, 1880 to 2010
Columns: 6868 entries, Aaden to Zuri
dtypes: float64(6868)
memory usage: 6.9 MB
array([<Axes: xlabel='year'>, <Axes: xlabel='year'>,
       <Axes: xlabel='year'>, <Axes: xlabel='year'>], dtype=object)

plt.figure()    # create a new figure window
<Figure size 640x480 with 0 Axes>
table = top1000.pivot_table("prop", index="year",
                            columns="sex", aggfunc=sum)
table.plot(title="Sum of table1000.prop by year and sex",
           yticks=np.linspace(0, 1.2, 13))

 

<Axes: title={'center': 'Sum of table1000.prop by year and sex'}, xlabel='year'>

df = boys[boys["year"] == 2010]    # filter with the condition boys["year"] == 2010
df
           name sex  births  year      prop
260877    Jacob   M   21875  2010  0.011523
260878    Ethan   M   17866  2010  0.009411
260879  Michael   M   17133  2010  0.009025
260880   Jayden   M   17030  2010  0.008971
260881  William   M   16870  2010  0.008887
...         ...  ..     ...   ...       ...
261872   Camilo   M     194  2010  0.000102
261873   Destin   M     194  2010  0.000102
261874   Jaquan   M     194  2010  0.000102
261875   Jaydan   M     194  2010  0.000102
261876   Maxton   M     193  2010  0.000102

1000 rows × 5 columns

prop_cumsum = df["prop"].sort_values(ascending=False).cumsum()
prop_cumsum[:10]    # the first 10 values
prop_cumsum.searchsorted(0.5)    # position where the cumulative sum reaches 0.5
116
# Select the 1900 data and sort the "prop" column in descending order with sort_values
df = boys[boys.year == 1900]
in1900 = df.sort_values("prop", ascending=False).prop.cumsum()
in1900.searchsorted(0.5) + 1    # add 1 because indexing starts at 0
25
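The cumsum/searchsorted trick just used can be illustrated on a tiny invented Series: how many of the top proportions are needed before the cumulative total passes 50%?

```python
import pandas as pd

# Four made-up proportions, already sorted descending
prop = pd.Series([0.4, 0.3, 0.2, 0.1])
# searchsorted finds where 0.5 would land in the cumulative sums;
# +1 converts the 0-based position to a count
n = prop.cumsum().searchsorted(0.5) + 1
print(n)  # 2
```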
# Count how many names a group needs before its cumulative prop reaches quantile q
def get_quantile_count(group, q=0.5):
    group = group.sort_values("prop", ascending=False)
    return group.prop.cumsum().searchsorted(q) + 1

# Group the DataFrame with groupby and apply the function to each group
diversity = top1000.groupby(["year", "sex"]).apply(get_quantile_count)

# Unstack the multi-index Series into a DataFrame
diversity = diversity.unstack()
# Create a new figure object named fig
fig = plt.figure()
<Figure size 640x480 with 0 Axes>
# Look at the first rows of diversity
diversity.head()
diversity.plot(title="Number of popular names in top 50%")
<Axes: title={'center': 'Number of popular names in top 50%'}, xlabel='year'>

def get_last_letter(x):
    return x[-1]

last_letters = names["name"].map(get_last_letter)
last_letters.name = "last_letter"

table = names.pivot_table("births", index=last_letters,
                          columns=["sex", "year"], aggfunc=sum)
subtable = table.reindex(columns=[1910, 1960, 2010], level="year")
subtable.head()
sex                  F                             M
year              1910      1960      2010      1910      1960      2010
last_letter
a             108376.0  691247.0  670605.0     977.0    5204.0   28438.0
b                  NaN     694.0     450.0     411.0    3912.0   38859.0
c                  5.0      49.0     946.0     482.0   15476.0   23125.0
d               6750.0    3729.0    2607.0   22111.0  262112.0   44398.0
e             133569.0  435013.0  313833.0   28655.0  178823.0  129012.0
subtable.sum()
letter_prop = subtable / subtable.sum()
letter_prop
sex                 F                             M
year             1910      1960      2010      1910      1960      2010
last_letter
a            0.273390  0.341853  0.381240  0.005031  0.002440  0.014980
b                 NaN  0.000343  0.000256  0.002116  0.001834  0.020470
c            0.000013  0.000024  0.000538  0.002482  0.007257  0.012181
d            0.017028  0.001844  0.001482  0.113858  0.122908  0.023387
e            0.336941  0.215133  0.178415  0.147556  0.083853  0.067959
f                 NaN  0.000010  0.000055  0.000783  0.004325  0.001188
g            0.000144  0.000157  0.000374  0.002250  0.009488  0.001404
h            0.051529  0.036224  0.075852  0.045562  0.037907  0.051670
i            0.001526  0.039965  0.031734  0.000844  0.000603  0.022628
j                 NaN       NaN  0.000090       NaN       NaN  0.000769
k            0.000121  0.000156  0.000356  0.036581  0.049384  0.018541
l            0.043189  0.033867  0.026356  0.065016  0.104904  0.070367
m            0.001201  0.008613  0.002588  0.058044  0.033827  0.024657
n            0.079240  0.130687  0.140210  0.143415  0.152522  0.362771
o            0.001660  0.002439  0.001243  0.017065  0.012829  0.042681
p            0.000018  0.000023  0.000020  0.003172  0.005675  0.001269
q                 NaN       NaN  0.000030       NaN       NaN  0.000180
r            0.013390  0.006764  0.018025  0.064481  0.031034  0.087477
s            0.039042  0.012764  0.013332  0.130815  0.102730  0.065145
t            0.027438  0.015201  0.007830  0.072879  0.065655  0.022861
u            0.000684  0.000574  0.000417  0.000124  0.000057  0.001221
v                 NaN  0.000060  0.000117  0.000113  0.000037  0.001434
w            0.000020  0.000031  0.001182  0.006329  0.007711  0.016148
x            0.000015  0.000037  0.000727  0.003965  0.001851  0.008614
y            0.110972  0.152569  0.116828  0.077349  0.160987  0.058168
z            0.002439  0.000659  0.000704  0.000170  0.000184  0.001831
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 1, figsize=(10, 8))
letter_prop["M"].plot(kind="bar", rot=0, ax=axes[0], title="Male")
letter_prop["F"].plot(kind="bar", rot=0, ax=axes[1], title="Female",
                      legend=False)
<Axes: title={'center': 'Female'}, xlabel='last_letter'>

plt.subplots_adjust(hspace=0.25)
letter_prop = table / table.sum()

dny_ts = letter_prop.loc[["d", "n", "y"], "M"].T
dny_ts.head()
last_letter         d         n         y
year
1880         0.083055  0.153213  0.075760
1881         0.083247  0.153214  0.077451
1882         0.085340  0.149560  0.077537
1883         0.084066  0.151646  0.079144
1884         0.086120  0.149915  0.080405
plt.close("all")
fig = plt.figure()
dny_ts.plot()
<Axes: xlabel='year'>

all_names = pd.Series(top1000["name"].unique())
lesley_like = all_names[all_names.str.contains("Lesl")]
lesley_like
632     Leslie
2294    Lesley
4262    Leslee
4728     Lesli
6103     Lesly
dtype: object
filtered = top1000[top1000["name"].isin(lesley_like)]
filtered.groupby("name")["births"].sum()
Leslee      1082
Lesley     35022
Lesli        929
Leslie    370429
Lesly      10067
Name: births, dtype: int64
table = filtered.pivot_table("births", index="year",
                             columns="sex", aggfunc="sum")
table = table.div(table.sum(axis="columns"), axis="index")
table.tail()
sex     F   M
year
2006  1.0 NaN
2007  1.0 NaN
2008  1.0 NaN
2009  1.0 NaN
2010  1.0 NaN
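The row-normalization idiom table.div(table.sum(axis="columns"), axis="index") used above can be checked on a tiny made-up table; after division, each row sums to 1:

```python
import pandas as pd

# Invented per-year counts for two groups
table = pd.DataFrame({"F": [30.0, 10.0], "M": [70.0, 90.0]},
                     index=[2009, 2010])
# Divide each row by its own total
normed = table.div(table.sum(axis="columns"), axis="index")
print(normed.loc[2010, "M"])  # 0.9
```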
fig = plt.figure()
table.plot(style={"M": "k-", "F": "k--"})
<AxesSubplot:xlabel='year'>


3. USDA Food Database

The US Department of Agriculture provides a database of food nutrition information. Each food has identifying attributes along with two lists of nutrients and nutrient proportions. Data in this form is not well suited to analysis, so some work is needed to reshape it.

import json
db = json.load(open("datasets/usda_food/database.json"))
len(db)
6636
db[0].keys()
db[0]["nutrients"][0]
nutrients = pd.DataFrame(db[0]["nutrients"])
nutrients.head(7)
     value units                  description        group
0    25.18     g                      Protein  Composition
1    29.20     g            Total lipid (fat)  Composition
2     3.06     g  Carbohydrate, by difference  Composition
3     3.28     g                          Ash        Other
4   376.00  kcal                       Energy       Energy
5    39.28     g                        Water  Composition
6  1573.00    kJ                       Energy       Energy
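The json.load step returns a list of dicts, and each record's nested "nutrients" list converts directly to a DataFrame. A sketch with one inline record (fields invented to mirror the shape of database.json):

```python
import json
import pandas as pd

# One made-up record shaped like an entry of database.json
raw = ('[{"id": 1, "description": "Cheese",'
       ' "nutrients": [{"value": 25.18, "units": "g",'
       ' "description": "Protein"}]}]')
db = json.loads(raw)
# A list of dicts becomes a DataFrame, one row per dict
nutrients = pd.DataFrame(db[0]["nutrients"])
print(nutrients.loc[0, "units"])  # g
```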
info_keys = ["description", "group", "id", "manufacturer"]
info = pd.DataFrame(db, columns=info_keys)
info.head()
info.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6636 entries, 0 to 6635
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   description   6636 non-null   object
 1   group         6636 non-null   object
 2   id            6636 non-null   int64 
 3   manufacturer  5195 non-null   object
dtypes: int64(1), object(3)
memory usage: 207.5+ KB
pd.value_counts(info["group"])[:10]
Vegetables and Vegetable Products    812
Beef Products                        618
Baked Products                       496
Breakfast Cereals                    403
Legumes and Legume Products          365
Fast Foods                           365
Lamb, Veal, and Game Products        345
Sweets                               341
Fruits and Fruit Juices              328
Pork Products                        328
Name: group, dtype: int64
nutrients = []

for rec in db:
    fnuts = pd.DataFrame(rec["nutrients"])
    fnuts["id"] = rec["id"]
    nutrients.append(fnuts)

nutrients = pd.concat(nutrients, ignore_index=True)

nutrients
          value units                         description        group     id
0        25.180     g                             Protein  Composition   1008
1        29.200     g                   Total lipid (fat)  Composition   1008
2         3.060     g         Carbohydrate, by difference  Composition   1008
3         3.280     g                                 Ash        Other   1008
4       376.000  kcal                              Energy       Energy   1008
...         ...   ...                                 ...          ...    ...
389350    0.000   mcg                 Vitamin B-12, added     Vitamins  43546
389351    0.000    mg                         Cholesterol        Other  43546
389352    0.072     g        Fatty acids, total saturated        Other  43546
389353    0.028     g  Fatty acids, total monounsaturated        Other  43546
389354    0.041     g  Fatty acids, total polyunsaturated        Other  43546

389355 rows × 5 columns

nutrients.duplicated().sum()  # number of duplicates
nutrients = nutrients.drop_duplicates()
col_mapping = {"description" : "food",
               "group"       : "fgroup"}
info = info.rename(columns=col_mapping, copy=False)
info.info()
col_mapping = {"description" : "nutrient",
               "group" : "nutgroup"}
nutrients = nutrients.rename(columns=col_mapping, copy=False)
nutrients
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6636 entries, 0 to 6635
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   food          6636 non-null   object
 1   fgroup        6636 non-null   object
 2   id            6636 non-null   int64 
 3   manufacturer  5195 non-null   object
dtypes: int64(1), object(3)
memory usage: 207.5+ KB
          value units                            nutrient     nutgroup     id
0        25.180     g                             Protein  Composition   1008
1        29.200     g                   Total lipid (fat)  Composition   1008
2         3.060     g         Carbohydrate, by difference  Composition   1008
3         3.280     g                                 Ash        Other   1008
4       376.000  kcal                              Energy       Energy   1008
...         ...   ...                                 ...          ...    ...
389350    0.000   mcg                 Vitamin B-12, added     Vitamins  43546
389351    0.000    mg                         Cholesterol        Other  43546
389352    0.072     g        Fatty acids, total saturated        Other  43546
389353    0.028     g  Fatty acids, total monounsaturated        Other  43546
389354    0.041     g  Fatty acids, total polyunsaturated        Other  43546

375176 rows × 5 columns

ndata = pd.merge(nutrients, info, on="id")
ndata.info()
ndata.iloc[30000]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 375176 entries, 0 to 375175
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   value         375176 non-null  float64
 1   units         375176 non-null  object 
 2   nutrient      375176 non-null  object 
 3   nutgroup      375176 non-null  object 
 4   id            375176 non-null  int64  
 5   food          375176 non-null  object 
 6   fgroup        375176 non-null  object 
 7   manufacturer  293054 non-null  object 
dtypes: float64(1), int64(1), object(6)
memory usage: 25.8+ MB
value                                             0.04
units                                                g
nutrient                                       Glycine
nutgroup                                   Amino Acids
id                                                6158
food            Soup, tomato bisque, canned, condensed
fgroup                      Soups, Sauces, and Gravies
manufacturer                                          
Name: 30000, dtype: object
fig = plt.figure()

result = ndata.groupby(["nutrient", "fgroup"])["value"].quantile(0.5)
result["Zinc, Zn"].sort_values().plot(kind="barh")
<AxesSubplot:ylabel='fgroup'>

by_nutrient = ndata.groupby(["nutgroup", "nutrient"])

def get_maximum(x):
    return x.loc[x.value.idxmax()]

max_foods = by_nutrient.apply(get_maximum)[["value", "food"]]

# make the food a little smaller
max_foods["food"] = max_foods["food"].str[:50]
max_foods.loc["Amino Acids"]["food"]
nutrient
Alanine                            Gelatins, dry powder, unsweetened
Arginine                                Seeds, sesame flour, low-fat
Aspartic acid                                    Soy protein isolate
Cystine                 Seeds, cottonseed flour, low fat (glandless)
Glutamic acid                                    Soy protein isolate
Glycine                            Gelatins, dry powder, unsweetened
Histidine                 Whale, beluga, meat, dried (Alaska Native)
Hydroxyproline    KENTUCKY FRIED CHICKEN, Fried Chicken, ORIGINAL RE
Isoleucine        Soy protein isolate, PROTEIN TECHNOLOGIES INTERNAT
Leucine           Soy protein isolate, PROTEIN TECHNOLOGIES INTERNAT
Lysine            Seal, bearded (Oogruk), meat, dried (Alaska Native
Methionine                     Fish, cod, Atlantic, dried and salted
Phenylalanine     Soy protein isolate, PROTEIN TECHNOLOGIES INTERNAT
Proline                            Gelatins, dry powder, unsweetened
Serine            Soy protein isolate, PROTEIN TECHNOLOGIES INTERNAT
Threonine         Soy protein isolate, PROTEIN TECHNOLOGIES INTERNAT
Tryptophan          Sea lion, Steller, meat with fat (Alaska Native)
Tyrosine          Soy protein isolate, PROTEIN TECHNOLOGIES INTERNAT
Valine            Soy protein isolate, PROTEIN TECHNOLOGIES INTERNAT
Name: food, dtype: object
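The get_maximum pattern (pick each group's row with the largest value via idxmax) can be checked on invented rows:

```python
import pandas as pd

# Made-up nutrient/food/value rows
df = pd.DataFrame({"nutrient": ["Zinc", "Zinc", "Iron"],
                   "food": ["Beef", "Oysters", "Spinach"],
                   "value": [4.8, 39.0, 2.7]})
# idxmax returns the row label of the maximum value in each group;
# loc then pulls those rows out whole
max_rows = df.loc[df.groupby("nutrient")["value"].idxmax()]
print(sorted(max_rows["food"]))  # ['Oysters', 'Spinach']
```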

4. 2012 Federal Election Commission Database

The US Federal Election Commission publishes data on political campaign contributions, including donor names, occupations and employers, addresses, and contribution amounts. You can try the following analyses:
Donation statistics by occupation and employer
Donation statistics by amount
Statistics by state

fec = pd.read_csv("datasets/fec/P00000001-ALL.csv", low_memory=False)
fec.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001731 entries, 0 to 1001730
Data columns (total 16 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   cmte_id            1001731 non-null  object 
 1   cand_id            1001731 non-null  object 
 2   cand_nm            1001731 non-null  object 
 3   contbr_nm          1001731 non-null  object 
 4   contbr_city        1001712 non-null  object 
 5   contbr_st          1001727 non-null  object 
 6   contbr_zip         1001620 non-null  object 
 7   contbr_employer    988002 non-null   object 
 8   contbr_occupation  993301 non-null   object 
 9   contb_receipt_amt  1001731 non-null  float64
 10  contb_receipt_dt   1001731 non-null  object 
 11  receipt_desc       14166 non-null    object 
 12  memo_cd            92482 non-null    object 
 13  memo_text          97770 non-null    object 
 14  form_tp            1001731 non-null  object 
 15  file_num           1001731 non-null  int64  
dtypes: float64(1), int64(1), object(14)
memory usage: 122.3+ MB
fec.iloc[123456]
cmte_id                             C00431445
cand_id                             P80003338
cand_nm                         Obama, Barack
contbr_nm                         ELLMAN, IRA
contbr_city                             TEMPE
contbr_st                                  AZ
contbr_zip                          852816719
contbr_employer      ARIZONA STATE UNIVERSITY
contbr_occupation                   PROFESSOR
contb_receipt_amt                        50.0
contb_receipt_dt                    01-DEC-11
receipt_desc                              NaN
memo_cd                                   NaN
memo_text                                 NaN
form_tp                                 SA17A
file_num                               772372
Name: 123456, dtype: object
unique_cands = fec["cand_nm"].unique()
unique_cands
unique_cands[2]
'Obama, Barack'
parties = {"Bachmann, Michelle": "Republican",
           "Cain, Herman": "Republican",
           "Gingrich, Newt": "Republican",
           "Huntsman, Jon": "Republican",
           "Johnson, Gary Earl": "Republican",
           "McCotter, Thaddeus G": "Republican",
           "Obama, Barack": "Democrat",
           "Paul, Ron": "Republican",
           "Pawlenty, Timothy": "Republican",
           "Perry, Rick": "Republican",
           "Roemer, Charles E. 'Buddy' III": "Republican",
           "Romney, Mitt": "Republican",
           "Santorum, Rick": "Republican"}
fec["cand_nm"][123456:123461]
fec["cand_nm"][123456:123461].map(parties)
# Add it as a column
fec["party"] = fec["cand_nm"].map(parties)
fec["party"].value_counts()
Democrat      593746
Republican    407985
Name: party, dtype: int64
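The candidate-to-party lookup relies on Series.map with a dict; in miniature, with a two-entry mapping (entries taken from the parties dict above):

```python
import pandas as pd

cands = pd.Series(["Obama, Barack", "Romney, Mitt"])
parties = {"Obama, Barack": "Democrat", "Romney, Mitt": "Republican"}
# map looks each value up in the dict (unmatched values become NaN)
print(cands.map(parties).tolist())  # ['Democrat', 'Republican']
```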
(fec["contb_receipt_amt"] > 0).value_counts()
True     991475
False     10256
Name: contb_receipt_amt, dtype: int64
fec = fec[fec["contb_receipt_amt"] > 0]

fec_mrbo = fec[fec["cand_nm"].isin(["Obama, Barack", "Romney, Mitt"])]

fec["contbr_occupation"].value_counts()[:10]
RETIRED                                   233990
INFORMATION REQUESTED                      35107
ATTORNEY                                   34286
HOMEMAKER                                  29931
PHYSICIAN                                  23432
INFORMATION REQUESTED PER BEST EFFORTS     21138
ENGINEER                                   14334
TEACHER                                    13990
CONSULTANT                                 13273
PROFESSOR                                  12555
Name: contbr_occupation, dtype: int64
occ_mapping = {
   "INFORMATION REQUESTED PER BEST EFFORTS" : "NOT PROVIDED",
   "INFORMATION REQUESTED" : "NOT PROVIDED",
   "INFORMATION REQUESTED (BEST EFFORTS)" : "NOT PROVIDED",
   "C.E.O.": "CEO"
}

def get_occ(x):
    # If no mapping provided, return x
    return occ_mapping.get(x, x)

fec["contbr_occupation"] = fec["contbr_occupation"].map(get_occ)
emp_mapping = {
   "INFORMATION REQUESTED PER BEST EFFORTS" : "NOT PROVIDED",
   "INFORMATION REQUESTED" : "NOT PROVIDED",
   "SELF" : "SELF-EMPLOYED",
   "SELF EMPLOYED" : "SELF-EMPLOYED",
}

def get_emp(x):
    # If no mapping provided, return x
    return emp_mapping.get(x, x)

fec["contbr_employer"] = fec["contbr_employer"].map(get_emp)
by_occupation = fec.pivot_table("contb_receipt_amt",
                                index="contbr_occupation",
                                columns="party", aggfunc="sum")
over_2mm = by_occupation[by_occupation.sum(axis="columns") > 2000000]
over_2mm
party                  Democrat   Republican
contbr_occupation
ATTORNEY            11141982.97   7477194.43
CEO                  2074974.79   4211040.52
CONSULTANT           2459912.71   2544725.45
ENGINEER              951525.55   1818373.70
EXECUTIVE            1355161.05   4138850.09
HOMEMAKER            4248875.80  13634275.78
INVESTOR              884133.00   2431768.92
LAWYER               3160478.87    391224.32
MANAGER               762883.22   1444532.37
NOT PROVIDED         4866973.96  20565473.01
OWNER                1001567.36   2408286.92
PHYSICIAN            3735124.94   3594320.24
PRESIDENT            1878509.95   4720923.76
PROFESSOR            2165071.08    296702.73
REAL ESTATE           528902.09   1625902.25
RETIRED             25305116.38  23561244.49
SELF-EMPLOYED         672393.40   1640252.54
plt.figure()

over_2mm.plot(kind="barh")
<AxesSubplot:ylabel='contbr_occupation'>

def get_top_amounts(group, key, n=5):
    totals = group.groupby(key)["contb_receipt_amt"].sum()
    return totals.nlargest(n)

grouped = fec_mrbo.groupby("cand_nm")
grouped.apply(get_top_amounts, "contbr_occupation", n=7)
grouped.apply(get_top_amounts, "contbr_employer", n=10)
cand_nm        contbr_employer                       
Obama, Barack  RETIRED                                   22694358.85
               SELF-EMPLOYED                             17080985.96
               NOT EMPLOYED                               8586308.70
               INFORMATION REQUESTED                      5053480.37
               HOMEMAKER                                  2605408.54
               SELF                                       1076531.20
               SELF EMPLOYED                               469290.00
               STUDENT                                     318831.45
               VOLUNTEER                                   257104.00
               MICROSOFT                                   215585.36
Romney, Mitt   INFORMATION REQUESTED PER BEST EFFORTS    12059527.24
               RETIRED                                   11506225.71
               HOMEMAKER                                  8147196.22
               SELF-EMPLOYED                              7409860.98
               STUDENT                                     496490.94
               CREDIT SUISSE                               281150.00
               MORGAN STANLEY                              267266.00
               GOLDMAN SACH & CO.                          238250.00
               BARCLAYS CAPITAL                            162750.00
               H.I.G. CAPITAL                              139500.00
Name: contb_receipt_amt, dtype: float64
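The get_top_amounts pattern (per-group totals by a key, then nlargest) can be exercised on a few invented contribution rows:

```python
import pandas as pd

# Made-up contributions: candidate, employer, amount
fec = pd.DataFrame({"cand_nm": ["A", "A", "A", "B"],
                    "contbr_employer": ["X", "X", "Y", "Z"],
                    "contb_receipt_amt": [100.0, 50.0, 30.0, 20.0]})

def get_top_amounts(group, key, n=5):
    totals = group.groupby(key)["contb_receipt_amt"].sum()
    return totals.nlargest(n)

# Extra positional/keyword args to apply are forwarded to the function
top = fec.groupby("cand_nm").apply(get_top_amounts, "contbr_employer", n=1)
print(top.loc[("A", "X")])  # 150.0
```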
bins = np.array([0, 1, 10, 100, 1000, 10000,
                 100_000, 1_000_000, 10_000_000])
labels = pd.cut(fec_mrbo["contb_receipt_amt"], bins)
labels
411         (10, 100]
412       (100, 1000]
413       (100, 1000]
414         (10, 100]
415         (10, 100]
             ...     
701381      (10, 100]
701382    (100, 1000]
701383        (1, 10]
701384      (10, 100]
701385    (100, 1000]
Name: contb_receipt_amt, Length: 694282, dtype: category
Categories (8, interval[int64, right]): [(0, 1] < (1, 10] < (10, 100] < (100, 1000] < (1000, 10000] < (10000, 100000] < (100000, 1000000] < (1000000, 10000000]]
grouped = fec_mrbo.groupby(["cand_nm", labels])
grouped.size().unstack(level=0)
cand_nm              Obama, Barack  Romney, Mitt
contb_receipt_amt
(0, 1]                         493            77
(1, 10]                      40070          3681
(10, 100]                   372280         31853
(100, 1000]                 153991         43357
(1000, 10000]                22284         26186
(10000, 100000]                  2             1
(100000, 1000000]                3             0
(1000000, 10000000]              4             0
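The pd.cut bucketing and the groupby/unstack that follows can be sketched on invented donation amounts:

```python
import numpy as np
import pandas as pd

# Made-up donations for two candidates
amts = pd.DataFrame({"cand": ["A", "A", "B", "B"],
                     "amt": [5.0, 7.0, 50.0, 500.0]})
bins = np.array([0, 10, 100, 1000])
labels = pd.cut(amts["amt"], bins)  # right-closed intervals like (0, 10]
# Group by candidate and bin; observed=False keeps empty bins in the result
counts = amts.groupby(["cand", labels], observed=False).size().unstack(level=0)
print(counts.loc[pd.Interval(0, 10), "A"])  # 2
```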
plt.figure()

bucket_sums = grouped["contb_receipt_amt"].sum().unstack(level=0)
normed_sums = bucket_sums.div(bucket_sums.sum(axis="columns"),
                              axis="index")
normed_sums
normed_sums[:-2].plot(kind="barh")
<AxesSubplot:ylabel='contb_receipt_amt'>

grouped = fec_mrbo.groupby(["cand_nm", "contbr_st"])
totals = grouped["contb_receipt_amt"].sum().unstack(level=0).fillna(0)
totals = totals[totals.sum(axis="columns") > 100000]
totals.head(10)
cand_nm    Obama, Barack  Romney, Mitt
contbr_st
AK             281840.15      86204.24
AL             543123.48     527303.51
AR             359247.28     105556.00
AZ            1506476.98    1888436.23
CA           23824984.24   11237636.60
CO            2132429.49    1506714.12
CT            2068291.26    3499475.45
DC            4373538.80    1025137.50
DE             336669.14      82712.00
FL            7318178.58    8338458.81
percent = totals.div(totals.sum(axis="columns"), axis="index")
percent.head(10)
cand_nm    Obama, Barack  Romney, Mitt
contbr_st
AK              0.765778      0.234222
AL              0.507390      0.492610
AR              0.772902      0.227098
AZ              0.443745      0.556255
CA              0.679498      0.320502
CO              0.585970      0.414030
CT              0.371476      0.628524
DC              0.810113      0.189887
DE              0.802776      0.197224
FL              0.467417      0.532583

That concludes all the exercises; the answers above are for reference only.
