Week 1: Importing Datasets
Each line in a CSV file represents one row.
Key feature of a CSV file: values are separated from each other by commas.
Two important properties:
·format
·file path of the dataset (the address)
//read csv
import pandas as pd
url="the address"
df=pd.read_csv(url)
//save csv
path="the address"
df.to_csv(path)
//CSV can also be exported to JSON, Excel, or SQL formats
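The read/save round trip above can be sketched end to end. The file name and data values here are hypothetical; `read_csv` accepts a URL or a local file path alike.

```python
import os
import tempfile
import pandas as pd

# Hypothetical data standing in for a downloaded dataset
df = pd.DataFrame({"make": ["toyota", "audi", "bmw"],
                   "price": [13950, 23875, 16430]})

# Save to CSV; index=False omits the row-index column from the file
path = os.path.join(tempfile.gettempdir(), "cars.csv")
df.to_csv(path, index=False)

# Read it back; pd.read_csv also accepts a URL string
df2 = pd.read_csv(path)
```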
df prints the entire dataframe
df.head(n) shows the first n rows
df.tail(n) shows the bottom n rows
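A quick check of `head`/`tail` on a small hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"price": [5, 6, 12, 13, 25, 29]})

first_two = df.head(2)  # first 2 rows
last_two = df.tail(2)   # last 2 rows
```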
//replace default header (define the list first, then assign it)
headers=["a","b","c"]
df.columns=headers
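Replacing the header in context, using a hypothetical DataFrame that was read without a header row (so pandas assigned integer column names):

```python
import pandas as pd

# Without a header row, columns default to 0, 1, 2, ...
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])

headers = ["a", "b", "c"]
df.columns = headers  # must match the number of columns exactly
```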
before beginning any analysis:
check: data types & data description
locate potential issues with the data
Pandas Type
·object (string)
·int64 (int, numeric)
·float64 (float, numeric)
·datetime64, timedelta[ns] (time data)
//check data type
df.dtypes
//return a statistical summary
df.describe()
//provide full summary statistics (count, mean, standard deviation, minimum, maximum, ...)
df.describe(include="all")
//provide a concise summary of df (column dtypes, non-null counts, memory usage)
df.info()
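The inspection steps above can be run together on a small hypothetical DataFrame. Note that `describe()` covers numeric columns only by default, while `include="all"` adds the object (string) columns:

```python
import pandas as pd

df = pd.DataFrame({
    "make": ["toyota", "audi", "bmw"],
    "price": [13950.0, 23875.0, 16430.0],
})

types = df.dtypes                   # Series mapping column name -> dtype
summary = df.describe()             # numeric columns only
full = df.describe(include="all")   # object columns included too
df.info()                           # prints dtypes / non-null counts to stdout
```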
Week 2: Data Wrangling
Data preprocessing is often called data cleaning or data wrangling.
1. identify and handle missing values ("?", "N/A", 0, or just a blank cell)
· check with the data collection source
· drop the missing values
df.dropna()
df.dropna(subset=["price"],axis=0,inplace=True)
//equivalently:
df=df.dropna(subset=["price"],axis=0)
//axis=0 drops the entire row
//axis=1 drops the entire column
either drop the variable (the whole column)
or drop the data entry (the whole row)
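A runnable sketch of dropping rows with missing values, on hypothetical data. After dropping rows, `reset_index` rebuilds a contiguous index:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"make": ["toyota", "audi", "bmw"],
                   "price": [13950.0, np.nan, 16430.0]})

# axis=0 drops rows; subset limits the NaN check to the "price" column
cleaned = df.dropna(subset=["price"], axis=0)

# drop=True discards the old index instead of keeping it as a column
cleaned = cleaned.reset_index(drop=True)
```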
·replace the missing value
df.replace(missing_value,new_value)
replace it with an average (of similar data points)
mean=df["normalized-losses"].mean()
df["normalized-losses"]=df["normalized-losses"].replace(np.nan,mean)
//replaces the NaN values in the column (assign the result back; requires import numpy as np)
replace it by frequency (for categorical variables: use the most common value, e.g. gasoline for fuel type)
replace it based on other functions (find a related variable and predict the missing value)
·leave it as missing data
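The two replacement strategies above (mean for numeric, most frequent value for categorical) can be sketched together on hypothetical data. `mode()` returns the most frequent value(s), ignoring missing entries:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"normalized-losses": [100.0, np.nan, 150.0],
                   "fuel-type": ["gas", "gas", None]})

# Numeric column: replace NaN with the column mean
mean = df["normalized-losses"].mean()
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, mean)

# Categorical column: replace missing entries with the most frequent value
most_common = df["fuel-type"].mode()[0]
df["fuel-type"] = df["fuel-type"].fillna(most_common)
```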
2. data formatting
bring data into a common standard of expression
//convert "mpg (miles per gallon)" to "L/100km"
df["mpg"]=235/df["mpg"]
df.rename(columns={"mpg":"L/100KM"},inplace=True)
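The unit conversion runs end to end on a hypothetical column (the course rounds the conversion constant 235.215 down to 235):

```python
import pandas as pd

df = pd.DataFrame({"mpg": [20.0, 25.0, 47.0]})

# L/100km = 235 / mpg (approximate conversion)
df["mpg"] = 235 / df["mpg"]
df.rename(columns={"mpg": "L/100KM"}, inplace=True)
```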
//identify data types:
df.dtypes
//convert data types:
df.astype()
//convert data type to integer in column “price”
df["price"]=df["price"].astype("int")
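A minimal sketch of the conversion, assuming a hypothetical "price" column read in as strings:

```python
import pandas as pd

# Prices read from a file often arrive as strings (object dtype)
df = pd.DataFrame({"price": ["13950", "23875", "16430"]})

df["price"] = df["price"].astype("int")  # now usable for arithmetic
```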
3. data normalization(centering/scaling)
make variables have the same impact
(similar value range, similar intrinsic influence on the analytical model)
//Min-max
df["length"]=(df["length"]-df["length"].min())/(df["length"].max()-df["length"].min())
//Z-score
df["length"]=(df["length"]-df["length"].mean())/df["length"].std()
//std returns the standard deviation of the features in the data set
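Both scalings, applied to a hypothetical column (stored in separate variables here so the two results can be compared): min-max maps values into [0, 1], while z-score centers them at mean 0 with standard deviation 1.

```python
import pandas as pd

df = pd.DataFrame({"length": [1.0, 2.0, 3.0, 4.0]})

# Min-max: (x - min) / (max - min), result lies in [0, 1]
minmax = (df["length"] - df["length"].min()) / (df["length"].max() - df["length"].min())

# Z-score: (x - mean) / std, result has mean 0 and (sample) std 1
zscore = (df["length"] - df["length"].mean()) / df["length"].std()
```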
4. data binning
grouping of values into “bins”
converts numeric into categorical variables
eg:
| prices | bin |
|---|---|
| 5 | low |
| 6 | low |
| 12 | mid |
| 13 | mid |
| 25 | high |
| 29 | high |
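Binning is typically done with `pd.cut`. The sketch below uses equal-width bins built with `np.linspace`, which may not reproduce the exact low/mid/high boundaries in the table above (those cut points are not stated):

```python
import numpy as np
import pandas as pd

prices = pd.Series([5, 6, 12, 13, 25, 29])

# 4 edges -> 3 equal-width bins spanning min..max
bins = np.linspace(prices.min(), prices.max(), 4)
labels = ["low", "mid", "high"]

# include_lowest=True makes the first bin closed on the left,
# so the minimum value itself falls into "low"
binned = pd.cut(prices, bins=bins, labels=labels, include_lowest=True)
```

The result is a categorical Series, which can then be used for grouping or plotting histograms of the categories.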