Python数据分析pandas入门练习题(八)

Python数据分析基础

Preparation

需要数据集可以自行网上寻找(都是公开的数据集)或私聊博主,传到csdn,你们下载要会员,就不传了。下面数据集链接下载不一定能成功。

Exercise 1- US - Baby Names

Introduction:

We are going to use a subset of US Baby Names from Kaggle.
In the file it will be names from 2004 until 2014

Step 1. Import the necessary libraries

代码如下:

import pandas as pd

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called baby_names.

代码如下:

baby_names = pd.read_csv("US_Baby_Names_right.csv")
baby_names.info()

输出结果如下:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1016395 entries, 0 to 1016394
Data columns (total 7 columns):
Unnamed: 0    1016395 non-null int64
Id            1016395 non-null int64
Name          1016395 non-null object
Year          1016395 non-null int64
Gender        1016395 non-null object
State         1016395 non-null object
Count         1016395 non-null int64
dtypes: int64(4), object(3)
memory usage: 54.3+ MB

Step 4. See the first 10 entries

代码如下:

baby_names.head(10)

输出结果如下:

Unnamed: 0 Id Name Year Gender State Count
0 11349 11350 Emma 2004 F AK 62
1 11350 11351 Madison 2004 F AK 48
2 11351 11352 Hannah 2004 F AK 46
3 11352 11353 Grace 2004 F AK 44
4 11353 11354 Emily 2004 F AK 41
5 11354 11355 Abigail 2004 F AK 37
6 11355 11356 Olivia 2004 F AK 33
7 11356 11357 Isabella 2004 F AK 30
8 11357 11358 Alyssa 2004 F AK 29
9 11358 11359 Sophia 2004 F AK 28

Step 5. Delete the column ‘Unnamed: 0’ and ‘Id’

代码如下:

del baby_names['Id']
# OR del baby_names['Unnamed: 0']
baby_names = baby_names.loc[:, ~baby_names.columns.str.contains('^Unnamed')]
baby_names.head()

输出结果如下:

Name Year Gender State Count
0 Emma 2004 F AK 62
1 Madison 2004 F AK 48
2 Hannah 2004 F AK 46
3 Grace 2004 F AK 44
4 Emily 2004 F AK 41

Step 6. Is there more male or female names in the dataset?

代码如下:

# baby_names['Gender'].value_counts()
baby_names.groupby('Gender').Count.sum()

输出结果如下:

Gender
F    16380293
M    19041199
Name: Count, dtype: int64

Step 7. Group the dataset by name and assign to names

代码如下:

del baby_names["Year"]
names = baby_names.groupby("Name").sum()
names.head()
print(names.shape)
names.sort_values("Count", ascending = 0).head()
# names= baby_names.groupby('Name')
# names.head(1)

输出结果如下:

(17632, 1)
Count
Name
Jacob 242874
Emma 214852
Michael 214405
Ethan 209277
Isabella 204798

Step 8. How many different names exist in the dataset?

代码如下:

len(names)

输出结果如下:

17632

Step 9. What is the name with most occurrences?

代码如下:

# names['Count'].sum().argmax()
names.Count.idxmax()  # idxmax()获取pandas中series最大值对应的索引

输出结果如下:

'Jacob'

Step 10. How many different names have the least occurrences?

代码如下:

len(names[names.Count == names.Count.min()])

输出结果如下:

2578

Step 11. What is the median name occurrence?

代码如下:

names[names.Count == names.Count.median()]

输出结果如下:

Count
Name
Aishani 49
Alara 49
Alysse 49
Ameir 49
Anely 49
Antonina 49
Aveline 49
Aziah 49
Baily 49
Caleah 49
Carlota 49
Cristine 49
Dahlila 49
Darvin 49
Deante 49
Deserae 49
Devean 49
Elizah 49
Emmaly 49
Emmanuela 49
Envy 49
Esli 49
Fay 49
Gurshaan
  • 1
    点赞
  • 15
    收藏
    觉得还不错? 一键收藏
  • 2
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 2
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值