Python Data Analysis Basics
- Preparation
- Exercise 1-Student Alcohol Consumption
- Introduction:
- Step 1. Import the necessary libraries
- Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/04_Apply/Students_Alcohol_Consumption/student-mat.csv).
- Step 3. Assign it to a variable called df.
- Step 4. For the purpose of this exercise slice the dataframe from 'school' until the 'guardian' column
- Step 5. Create a lambda function that capitalizes strings.
- Step 6. Capitalize both Mjob and Fjob
- Step 7. Print the last elements of the data set.
- Step 8. Did you notice the original dataframe is still lowercase? Why is that? Fix it and capitalize Mjob and Fjob.
- Step 9. Create a function called majority that returns a boolean value, and store it in a new column called legal_drinker (consider majority as older than 17 years old)
- Step 10. Multiply every number of the dataset by 10.
- Exercise 2-United States - Crime Rates - 1960 - 2014
- Introduction:
- Step 1. Import the necessary libraries
- Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/04_Apply/US_Crime_Rates/US_Crime_Rates_1960_2014.csv).
- Step 3. Assign it to a variable called crime.
- Step 4. What is the type of the columns?
- Step 5. Convert the type of the column Year to datetime64
- Step 6. Set the Year column as the index of the dataframe
- Step 7. Delete the Total column
- Step 8. Group the year by decades and sum the values
- Step 9. What is the most dangerous decade to live in the US?
- Conclusion
Preparation
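The datasets for the exercises are listed below. Download them and work from local copies where possible, because the links in the exercises may not always open.
If you need the datasets, message the author privately or search for them online. They are not uploaded to CSDN, since downloading from there requires a membership.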
Exercise 1-Student Alcohol Consumption
Introduction:
This time you will download a dataset from the UCI Machine Learning Repository.
Step 1. Import the necessary libraries
import pandas as pd
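Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/04_Apply/Students_Alcohol_Consumption/student-mat.csv).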
Step 3. Assign it to a variable called df.
The code is as follows:
df = pd.read_csv("student-mat.csv", sep=',')
df.head()
The output is as follows:
| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |
5 rows × 33 columns
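Because the exercise links may not always open, one option is to prefer a local copy and fall back to the URL only when the file is missing. The following is a minimal sketch of that idea, not part of the original exercise: the helper name `load_csv` is hypothetical, and the URL is the one linked in Step 2.

```python
from pathlib import Path

import pandas as pd

# URL taken from the exercise's Step 2 link.
DATA_URL = (
    "https://raw.githubusercontent.com/guipsamora/pandas_exercises/"
    "master/04_Apply/Students_Alcohol_Consumption/student-mat.csv"
)


def load_csv(local_path, url=DATA_URL):
    """Hypothetical helper: read the local file if it exists, else the URL."""
    path = Path(local_path)
    source = path if path.exists() else url
    return pd.read_csv(source, sep=",")
```

With the file saved next to the notebook, `df = load_csv("student-mat.csv")` would behave like the `pd.read_csv` call above.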
Step 4. For the purpose of this exercise slice the dataframe from ‘school’ until the ‘guardian’ column
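The code is as follows:
# loc slices by row/column labels; iloc slices by integer position.
# .copy() gives an independent dataframe, so later column assignments do not trigger a SettingWithCopyWarning.
stud_alcoh = df.loc[:, 'school':'guardian'].copy()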
stud_alcoh.head()
The output is as follows:
| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | reason | guardian |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | course | mother |
1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | course | father |
2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | other | mother |
3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | home | mother |
4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | home | father |
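To illustrate the difference between label-based and position-based slicing used in this step, here is a small sketch on a made-up dataframe (the `demo` frame and its values are assumptions that reuse a few column names from the dataset). A `loc` label slice includes its end label, while an `iloc` position slice stops before its end index.

```python
import pandas as pd

# Tiny made-up frame that reuses a few column names from the dataset.
demo = pd.DataFrame(
    {
        "school": ["GP", "MS"],
        "sex": ["F", "M"],
        "age": [18, 17],
        "address": ["U", "R"],
    }
)

by_label = demo.loc[:, "school":"age"]  # label slice: 'age' is included
by_position = demo.iloc[:, 0:3]         # position slice: stops before index 3

print(by_label.columns.tolist())     # ['school', 'sex', 'age']
print(by_label.equals(by_position))  # True
```

In the output above, 'guardian' appears to be the 12th column, so `df.iloc[:, :12]` should select the same columns as the `loc` slice.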
Step 5. Create a lambda function that capitalizes strings.
The code is as follows:
# capitalize() uppercases the first character and lowercases the rest; upper() uppercases the whole string.
capitalizer = lambda s: s.capitalize()
print(capitalizer('www'))
The output is as follows:
Www
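As a quick comparison of the string methods mentioned above, the sketch below applies them to one of the job values from the dataset. `title()` is included only for contrast and is not used in the exercise.

```python
job = "at_home"

print(job.capitalize())    # At_home
print(job.upper())         # AT_HOME
print(job.title())         # At_Home
print("wWW".capitalize())  # Www
```

Note that `capitalize()` also lowercases everything after the first character, which is why `'wWW'` becomes `'Www'`.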
Step 6. Capitalize both Mjob and Fjob
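The code is as follows:
# Equivalent loop, for reference:
# for i in df['Mjob']:
#     print(capitalizer(i))
# apply() returns a new Series; stud_alcoh itself is not modified here.
stud_alcoh.Mjob.apply(capitalizer)
stud_alcoh.Fjob.apply(capitalizer)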
The output is as follows (a notebook displays only the last expression, here the Fjob result):
0 Teacher
1 Other
2 Other
3 Services
4 Other
5 Other
6 Other
7 Teacher
8 Other
9 Other
10 Health
11 Other
12 Services
13 Other
14 Other
15 Other
16 Services
17 Other
18 Services
19 Other
20 Other
21 Health
22 Other
23 Other
24 Health
25 Services
26 Other
27 Services
28 Other
29 Teacher
...
365 Other
366 Services
367 Services
368 Services
369 Teacher
370 Services
371 Services
372 At_home
373 Other
374 Other
375 Other
376 Other
377 Services
378 Other
379 Other
380 Teacher
381 Other
382 Services
383 Services
384 Other
385 Other
386 At_home
387 Other
388 Services
389 Other
390 Services
391 Services
392 Other
393 Other
394 At_home
Name: Fjob, Length: 395, dtype: object
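As an alternative to `apply` with a lambda, pandas also provides vectorized string methods through the `.str` accessor. The sketch below uses a small made-up Series (the `jobs` values are assumptions taken from the output above) to show that both approaches give the same values.

```python
import pandas as pd

# Small made-up Series using job values seen in the output above.
jobs = pd.Series(["teacher", "other", "at_home"], name="Fjob")

via_apply = jobs.apply(lambda s: s.capitalize())
via_str = jobs.str.capitalize()

print(via_str.tolist())                       # ['Teacher', 'Other', 'At_home']
print(via_apply.tolist() == via_str.tolist()) # True
```

For this exercise, `stud_alcoh.Fjob.str.capitalize()` would be one way to get the same result without defining a lambda.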
Step 7. Print the last elements of the data set.
The code is as follows:
# df.iloc[394, 32]
stud_alcoh.tail()
The output is as follows:
| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | reason | guardian |
---|---|---|---|---|---|---|---|---|---|---|---|---|
390 | MS | M | 20 | U | LE3 | A | 2 | 2 | services | services | course | other |
391 | MS | M | 17 | U | LE3 | T | 3 | 1 | services | services | course | mother |
392 | MS | M | 21 | R | GT3 | T | 1 | 1 | other | other | course | other |
393 | MS | M | 18 | R | LE3 | T | 3 | 2 | services | other | course | mother |
394 | MS | M | 19 | U | LE3 | T | 1 | 1 | other | at_home | course | father |
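The `tail()` output above still shows lowercase job values because `Series.apply` returns a new Series and leaves the dataframe unchanged unless the result is assigned back. The sketch below demonstrates this on a small made-up dataframe (`demo` and its values are assumptions for illustration).

```python
import pandas as pd

demo = pd.DataFrame({"Mjob": ["at_home", "health"], "Fjob": ["teacher", "other"]})

# The returned Series is discarded, so demo is unchanged.
demo["Mjob"].apply(str.capitalize)
print(demo["Mjob"].tolist())  # ['at_home', 'health']

# Assigning the result back replaces the column.
demo["Mjob"] = demo["Mjob"].apply(str.capitalize)
print(demo["Mjob"].tolist())  # ['At_home', 'Health']
```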
Step 8. Did you notice the original dataframe is still lowercase? Why is that? Fix it and capitalize Mjob and Fjob.
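The code is as follows:
# Assign the results back so the capitalized values are stored in the dataframe.
stud_alcoh['Mjob'] = stud_alcoh['Mjob'].apply(capitalizer)
stud_alcoh['Fjob'] = stud_alcoh['Fjob'].apply(capitalizer)
stud_alcoh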
The output is as follows:
| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | reason | guardian |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | GP | F | 18 | U | GT3 | A | 4 | 4 | At_home | Teacher | course | mother |
1 | GP | F | 17 | U | GT3 | T | 1 | 1 | At_home | Other |