Python数据分析基础
- Preparation
- Exercise 1- US - Baby Names
-
-
- Introduction:
- Step 1. Import the necessary libraries
- Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv).
- Step 3. Assign it to a variable called baby_names.
- Step 4. See the first 10 entries
- Step 5. Delete the column 'Unnamed: 0' and 'Id'
- Step 6. Is there more male or female names in the dataset?
- Step 7. Group the dataset by name and assign to names
- Step 8. How many different names exist in the dataset?
- Step 9. What is the name with most occurrences?
- Step 10. How many different names have the least occurrences?
- Step 11. What is the median name occurrence?
- Step 12. What is the standard deviation of names?
- Step 13. Get a summary with the mean, min, max, std and quartiles.
-
- Exercise 2- Wind Statistics
-
-
- Introduction:
- Step 1. Import the necessary libraries
- Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/Wind_Stats/wind.data)
- Step 3. Assign it to a variable called data and replace the first 3 columns by a proper datetime index.
- Step 4. Year 2061? Do we really have data from this year? Create a function to fix it and apply it.
- Step 5. Set the right dates as the index. Pay attention at the data type, it should be datetime64[ns].
- Step 6. Compute how many values are missing for each location over the entire record.
- Step 7. Compute how many non-missing values there are in total.
- Step 8. Calculate the mean windspeeds of the windspeeds over all the locations and all the times.
- Step 9. Create a DataFrame called loc_stats and calculate the min, max and mean windspeeds and standard deviations of the windspeeds at each location over all the days
- Step 10. Create a DataFrame called day_stats and calculate the min, max and mean windspeed and standard deviations of the windspeeds across all the locations at each day.
- Step 11. Find the average windspeed in January for each location.
- Step 12. Downsample the record to a yearly frequency for each location.
- Step 13. Downsample the record to a monthly frequency for each location.
- Step 14. Downsample the record to a weekly frequency for each location.
- Step 15. Calculate the min, max and mean windspeeds and standard deviations of the windspeeds across all locations for each week (assume that the first week starts on January 2 1961) for the first 52 weeks.
-
- Conclusion
Preparation
需要数据集可以自行网上寻找(都是公开的数据集)或私聊博主,传到csdn,你们下载要会员,就不传了。下面数据集链接下载不一定能成功。
Exercise 1- US - Baby Names
Introduction:
We are going to use a subset of US Baby Names from Kaggle.
In the file it will be names from 2004 until 2014
Step 1. Import the necessary libraries
代码如下:
import pandas as pd
Step 2. Import the dataset from this address.
Step 3. Assign it to a variable called baby_names.
代码如下:
baby_names = pd.read_csv("US_Baby_Names_right.csv")
baby_names.info()
输出结果如下:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1016395 entries, 0 to 1016394
Data columns (total 7 columns):
Unnamed: 0 1016395 non-null int64
Id 1016395 non-null int64
Name 1016395 non-null object
Year 1016395 non-null int64
Gender 1016395 non-null object
State 1016395 non-null object
Count 1016395 non-null int64
dtypes: int64(4), object(3)
memory usage: 54.3+ MB
Step 4. See the first 10 entries
代码如下:
baby_names.head(10)
输出结果如下:
Unnamed: 0 | Id | Name | Year | Gender | State | Count | |
---|---|---|---|---|---|---|---|
0 | 11349 | 11350 | Emma | 2004 | F | AK | 62 |
1 | 11350 | 11351 | Madison | 2004 | F | AK | 48 |
2 | 11351 | 11352 | Hannah | 2004 | F | AK | 46 |
3 | 11352 | 11353 | Grace | 2004 | F | AK | 44 |
4 | 11353 | 11354 | Emily | 2004 | F | AK | 41 |
5 | 11354 | 11355 | Abigail | 2004 | F | AK | 37 |
6 | 11355 | 11356 | Olivia | 2004 | F | AK | 33 |
7 | 11356 | 11357 | Isabella | 2004 | F | AK | 30 |
8 | 11357 | 11358 | Alyssa | 2004 | F | AK | 29 |
9 | 11358 | 11359 | Sophia | 2004 | F | AK | 28 |
Step 5. Delete the column ‘Unnamed: 0’ and ‘Id’
代码如下:
del baby_names['Id']
# OR del baby_names['Unnamed: 0']
baby_names = baby_names.loc[:, ~baby_names.columns.str.contains('^Unnamed')]
baby_names.head()
输出结果如下:
Name | Year | Gender | State | Count | |
---|---|---|---|---|---|
0 | Emma | 2004 | F | AK | 62 |
1 | Madison | 2004 | F | AK | 48 |
2 | Hannah | 2004 | F | AK | 46 |
3 | Grace | 2004 | F | AK | 44 |
4 | Emily | 2004 | F | AK | 41 |
Step 6. Is there more male or female names in the dataset?
代码如下:
# baby_names['Gender'].value_counts()
baby_names.groupby('Gender').Count.sum()
输出结果如下:
Gender
F 16380293
M 19041199
Name: Count, dtype: int64
Step 7. Group the dataset by name and assign to names
代码如下:
del baby_names["Year"]
names = baby_names.groupby("Name").sum()
names.head()
print(names.shape)
names.sort_values("Count", ascending = 0).head()
# names= baby_names.groupby('Name')
# names.head(1)
输出结果如下:
(17632, 1)
Count | |
---|---|
Name | |
Jacob | 242874 |
Emma | 214852 |
Michael | 214405 |
Ethan | 209277 |
Isabella | 204798 |
Step 8. How many different names exist in the dataset?
代码如下:
len(names)
输出结果如下:
17632
Step 9. What is the name with most occurrences?
代码如下:
# names['Count'].sum().argmax()
names.Count.idxmax() # idxmax()获取pandas中series最大值对应的索引
输出结果如下:
'Jacob'
Step 10. How many different names have the least occurrences?
代码如下:
len(names[names.Count == names.Count.min()])
输出结果如下:
2578
Step 11. What is the median name occurrence?
代码如下:
names[names.Count == names.Count.median()]
输出结果如下:
Count | |
---|---|
Name | |
Aishani | 49 |
Alara | 49 |
Alysse | 49 |
Ameir | 49 |
Anely | 49 |
Antonina | 49 |
Aveline | 49 |
Aziah | 49 |
Baily | 49 |
Caleah | 49 |
Carlota | 49 |
Cristine | 49 |
Dahlila | 49 |
Darvin | 49 |
Deante | 49 |
Deserae | 49 |
Devean | 49 |
Elizah | 49 |
Emmaly | 49 |
Emmanuela | 49 |
Envy | 49 |
Esli | 49 |
Fay | 49 |
Gurshaan |