Instance
Suppose you want to know whether money makes people happy, so you download the Better Life Index data from the OECD's website and the statistics on GDP per capita from the IMF's website. Then you join the tables and sort by GDP per capita.
Reference: https://www.bilibili.com/video/BV1iJ411k7Gg
1. Preparation
Code:
assert: check that a condition holds. If it is true, execution continues; otherwise an AssertionError is raised.
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)
# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"
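A side note on the version check above: comparing version strings works character by character, which can mislead for some version numbers. The sketch below shows the difference; `version_tuple` is a hypothetical helper written for this note, not part of scikit-learn.

```python
def version_tuple(version, parts=2):
    """Turn a string such as '0.20.3' into (0, 20), ignoring suffixes like 'rc1'."""
    numbers = []
    for piece in version.split(".")[:parts]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        numbers.append(int(digits or 0))
    return tuple(numbers)


# String comparison: '3' sorts after '2', so "0.3" looks newer than "0.20".
string_check = "0.3" >= "0.20"

# Numeric comparison: (0, 3) is older than (0, 20).
numeric_check = version_tuple("0.3") >= version_tuple("0.20")

print(string_check, numeric_check)
```

For the versions most readers have installed the two checks agree, so the original assert is usually fine in practice.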
os.path.join: build the path to the datasets directory
The last argument "": makes the path end with a path separator (/ or \)
import os
datapath = os.path.join("datasets", "lifesat", "")
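A small sketch of why the trailing empty string matters: the joined path ends with a separator, so a file name can be appended with plain string concatenation, as done later with `datapath+'oecd_bli_2015.csv'`. The variable `csv_path` is introduced here only for illustration.

```python
import os

# The empty last argument makes the result end with the OS path separator.
datapath = os.path.join("datasets", "lifesat", "")

# Because of the trailing separator, simple concatenation gives a valid path.
csv_path = datapath + "oecd_bli_2015.csv"

print(datapath)
print(csv_path)
```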
%matplotlib inline: Show the figures directly within Jupyter
# To plot pretty figures directly within Jupyter
%matplotlib inline
import matplotlib as mpl
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
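2. Load and Prepare the Life Satisfaction and GDP per Capita Data
delimiter='\t': the field delimiter is the tab character (\t)
thousands=',': treat the comma as the thousands separator when parsing numbers
na_values='n/a': treat the string 'n/a' as a missing value (NaN) when reading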
import pandas as pd
oecd_bli = pd.read_csv(datapath + 'oecd_bli_2015.csv', thousands=',')
gdp_per_capita = pd.read_csv(datapath + 'gdp_per_capita.csv', thousands=',', delimiter='\t', encoding='latin1', na_values='n/a')
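To see what these read_csv options do without the real files, here is a minimal sketch on an in-memory, tab-separated sample. The rows and numbers are made up for illustration and are not taken from the IMF file.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample (not the real gdp_per_capita.csv).
raw = (
    "Country\t2015\n"
    "Australia\t50,961.87\n"
    "Brazil\t8,670.00\n"
    "Nowhere\tn/a\n"
)

sample = pd.read_csv(
    io.StringIO(raw),
    delimiter="\t",    # fields are separated by tabs
    thousands=",",     # "50,961.87" is parsed as the number 50961.87
    na_values="n/a",   # the string "n/a" becomes NaN
)

print(sample)
print(sample.dtypes)
```

The "2015" column comes out as a float column with one missing value, rather than a column of strings.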
3. Review the Dataset Structure
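A minimal sketch of how a table's structure can be reviewed with pandas. The frame `toy` is a hypothetical miniature in the same long format as the OECD table (one row per measurement); its values are invented.

```python
import pandas as pd

# Hypothetical miniature of the OECD table, values invented for illustration.
toy = pd.DataFrame({
    "Country": ["Australia", "Australia", "Brazil"],
    "Indicator": ["Life satisfaction", "Life satisfaction", "Life satisfaction"],
    "INEQUALITY": ["TOT", "MN", "TOT"],
    "Value": [7.3, 7.2, 7.0],
})

print(toy.shape)     # (number of rows, number of columns)
print(toy.head(2))   # the first two rows
toy.info()           # column names, dtypes and non-null counts

# How many rows exist for each INEQUALITY code.
inequality_counts = toy["INEQUALITY"].value_counts()
print(inequality_counts)
```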
4. Data Processing
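Keep only the rows where INEQUALITY == "TOT" (the totals)
Use pivot to reshape the table: Country becomes the index and each Indicator becomes a column
Rename the column and set the index
inplace=True: modify the original DataFrame instead of returning a new one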
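The four steps above can be sketched on toy data as follows. The frames `oecd_toy` and `gdp_toy` and all their values are hypothetical; the column names "INEQUALITY", "Indicator", "Value" and "2015" are assumed to match the real files.

```python
import pandas as pd

# Hypothetical long-format table: one row per (country, indicator, inequality).
oecd_toy = pd.DataFrame({
    "Country": ["Australia", "Australia", "Australia", "Brazil", "Brazil"],
    "Indicator": ["Life satisfaction", "Life satisfaction", "Employment rate",
                  "Life satisfaction", "Employment rate"],
    "INEQUALITY": ["TOT", "MN", "TOT", "TOT", "TOT"],
    "Value": [7.3, 7.2, 72.0, 7.0, 67.0],
})

# 1) Keep only the totals.
totals = oecd_toy[oecd_toy["INEQUALITY"] == "TOT"]

# 2) Pivot: one row per country, one column per indicator.
wide = totals.pivot(index="Country", columns="Indicator", values="Value")

# 3) Rename the year column and use the country as the index.
#    inplace=True changes gdp_toy itself and returns None.
gdp_toy = pd.DataFrame({"Country": ["Australia", "Brazil"],
                        "2015": [50961.87, 8670.0]})
gdp_toy.rename(columns={"2015": "GDP per capita"}, inplace=True)
gdp_toy.set_index("Country", inplace=True)

print(wide)
print(gdp_toy)
```

Without the filter in step 1, pivot would fail on the duplicate (Country, Indicator) pairs, which is why the filtering comes first.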
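Merge the two tables into one
left_index: if True, use the index (row labels) of the left DataFrame as its join key. For a DataFrame with a MultiIndex (hierarchical index), the number of levels must match the number of join keys in the right DataFrame.
right_index: the same as left_index, but for the right DataFrame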
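A minimal sketch of an index-on-index merge. Both frames are hypothetical; "Chile" is deliberately missing from the right table to show that the default inner join keeps only countries present in both.

```python
import pandas as pd

# Hypothetical tables, both indexed by country.
left = pd.DataFrame(
    {"Life satisfaction": [7.3, 7.0, 6.5]},
    index=pd.Index(["Australia", "Brazil", "Chile"], name="Country"),
)
right = pd.DataFrame(
    {"GDP per capita": [50961.87, 8670.0]},
    index=pd.Index(["Australia", "Brazil"], name="Country"),
)

# Join on the row labels of both tables (default how="inner").
merged = pd.merge(left=left, right=right, left_index=True, right_index=True)

print(merged)
```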
Sort the values by GDP per capita
The default is ascending order
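A short sketch of the sorting step on hypothetical values, showing the default ascending order and the `ascending=False` alternative.

```python
import pandas as pd

# Hypothetical merged table, values invented for illustration.
stats = pd.DataFrame(
    {"GDP per capita": [50961.87, 8670.0, 13340.9],
     "Life satisfaction": [7.3, 7.0, 6.7]},
    index=pd.Index(["Australia", "Brazil", "Chile"], name="Country"),
)

# Ascending by default: poorest country first.
ascending_stats = stats.sort_values(by="GDP per capita")

# Richest country first.
descending_stats = stats.sort_values(by="GDP per capita", ascending=False)

print(ascending_stats)
print(descending_stats)
```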
Merge the final data
5. Visualize the Data
kind: the plot type, such as 'scatter'; for the available values see https://blog.csdn.net/h_hxx/article/details/90635650
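A minimal sketch of a scatter plot drawn through the DataFrame `plot` method with `kind="scatter"`. The data are hypothetical, and the "Agg" backend is selected here only so the sketch also runs outside Jupyter; in a notebook with `%matplotlib inline` that line is not needed.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, an assumption for running as a script

import pandas as pd

# Hypothetical values, invented for illustration.
stats = pd.DataFrame(
    {"GDP per capita": [8670.0, 13340.9, 50961.87],
     "Life satisfaction": [7.0, 6.7, 7.3]},
    index=pd.Index(["Brazil", "Chile", "Australia"], name="Country"),
)

# kind selects the plot type; x and y name the columns to plot.
ax = stats.plot(kind="scatter", x="GDP per capita", y="Life satisfaction",
                figsize=(5, 3))
ax.set_ylim(0, 10)
```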
Save data to a file
country_stats.to_csv('country_stats.csv')
6. Outlier Processing
Delete the outliers
iloc: select rows and columns by integer position
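A sketch of dropping rows by position with iloc. The table and the positions in `remove_positions` are hypothetical; which rows count as outliers in the real data is not decided here.

```python
import pandas as pd

# Hypothetical table sorted by GDP per capita, values invented for illustration.
stats = pd.DataFrame(
    {"GDP per capita": [8670.0, 9009.3, 13340.9, 37675.0, 50961.87, 101994.1],
     "Life satisfaction": [7.0, 6.7, 6.7, 6.5, 7.3, 6.9]},
    index=["A", "B", "C", "D", "E", "F"],
)

# Integer positions (not labels) of the rows to set aside.
remove_positions = [0, 5]
keep_positions = [i for i in range(len(stats)) if i not in remove_positions]

sample_data = stats.iloc[keep_positions]     # rows kept for training
missing_data = stats.iloc[remove_positions]  # rows set aside

print(sample_data)
print(missing_data)
```

iloc ignores the index labels entirely: position 0 is the first row whatever its label, which is why the table is sorted before the positions are chosen.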
Visualize the outlier data
7. Model Processing
Show valuable data
Model hypothesis
Train the Model
Obtain the fitted intercept and coefficient of the linear model
Input Cyprus's GDP per capita and predict its life satisfaction
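The training and prediction steps can be sketched with scikit-learn's LinearRegression. The training points below are synthetic and lie exactly on a made-up line, so the fitted parameters are easy to check. They are not the real country data, and 22587 is only a placeholder input standing in for Cyprus's GDP per capita.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data on the line y = 5.0 + 4e-5 * x (hypothetical numbers).
X = np.array([[10000.0], [20000.0], [30000.0], [40000.0]])  # shape (n_samples, 1)
y = 5.0 + 4e-5 * X[:, 0]

model = LinearRegression()
model.fit(X, y)

# Fitted parameters of the hypothesis: satisfaction = theta0 + theta1 * GDP.
theta0 = model.intercept_
theta1 = model.coef_[0]
print(theta0, theta1)

# predict expects a 2D array: one row per sample, one column per feature.
X_new = np.array([[22587.0]])  # placeholder GDP per capita value
prediction = model.predict(X_new)[0]
print(prediction)
```

With real data the points do not lie on a line, so the fitted intercept and coefficient are the values that minimise the squared error rather than an exact recovery.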
Visualize all data