[COURSERA] Data Analysis with Python: Course Notes

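These notes walk through data analysis with Python: importing data, data preprocessing (handling missing values, data formatting, normalization, data binning, numeric variable conversion), exploratory data analysis (descriptive statistics, grouping, correlation analysis, analysis of variance), model development (simple linear regression, polynomial regression, regression plots, pipelines) and model evaluation (model selection, overfitting and underfitting, ridge regression, grid search). They include a series of practical Pandas and Scikit-Learn code examples.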

week1 Importing Datasets

each line in the CSV file represents a row
the feature of a CSV file: properties are separated from each other by commas

two important properties:
·format
·file path of dataset(the address)

# read csv
import pandas as pd
url = "the address"
df = pd.read_csv(url)

# save csv
path = "the address"
df.to_csv(path)
# the same read/save pattern exists for json, excel and sql (e.g. pd.read_json / df.to_json)
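
As a minimal sketch of the read/save round trip, the example below reads CSV text from an in-memory buffer and writes it to a temporary file. The CSV content, the column names and the output file name are all made up for illustration; in practice `url` and `path` would point at a real dataset.

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical CSV content standing in for a real file or URL.
csv_text = "a,b,c\n1,2.5,x\n3,4.5,y\n"

# read_csv accepts a path, a URL, or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# index=False keeps the row index from being written as an extra column.
out_path = os.path.join(tempfile.mkdtemp(), "out.csv")
df.to_csv(out_path, index=False)

# Reading the saved file back should give the same table.
round_trip = pd.read_csv(out_path)
print(round_trip)
```

Without `index=False`, the saved file gains an unnamed first column holding the row index, which then shows up as data when the file is read again.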

df prints the entire dataframe
df.head(n) shows the first n rows
df.tail(n) shows the bottom n rows

# replace default header
headers = ["a", "b", "c"]
df.columns = headers
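
A small sketch of replacing the default header, assuming a file that has no header row (the data below is invented). Passing `header=None` makes pandas treat the first line as data and number the columns 0, 1, 2, which are then overwritten.

```python
import io

import pandas as pd

# Hypothetical headerless data: three columns, four rows.
raw = "1,2,3\n4,5,6\n7,8,9\n10,11,12\n"

# header=None: the first line is data, not column names.
df = pd.read_csv(io.StringIO(raw), header=None)

# The list must have exactly one name per column.
headers = ["a", "b", "c"]
df.columns = headers

first_two = df.head(2)  # first 2 rows
last_two = df.tail(2)   # bottom 2 rows
print(first_two)
print(last_two)
```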

before beginning any analysis:
check: data types & data description
locate potential issues with the data

Pandas Type
·object (string)
·int64 (int, numeric)
·float64 (float, numeric)
·datetime64, timedelta[ns] (time data)

# check data type
df.dtypes

# return a statistical summary (numeric columns only by default)
df.describe()
# provide full summary statistics for all columns (count, mean, standard deviation, maximum, minimum, ...)
df.describe(include="all")

# provide a concise summary of df (index range, column names, non-null counts, dtypes, memory usage)
df.info()
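
The three inspection calls can be compared on a tiny frame. This is only a sketch; the column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical data with one text column and two numeric columns.
df = pd.DataFrame({
    "make": ["audi", "bmw", "audi"],
    "doors": [2, 4, 4],
    "price": [13950.0, 16430.0, 17450.0],
})

types = df.dtypes                           # one dtype per column
numeric_summary = df.describe()             # numeric columns only
full_summary = df.describe(include="all")   # adds unique/top/freq for text columns
df.info()                                   # prints dtypes and non-null counts

print(types)
print(full_summary)
```

In the `include="all"` output, statistics that do not apply to a column (for example the mean of a text column) appear as NaN.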

week2 Data Wrangling

data preprocessing is often called data cleaning or data wrangling.

1. identify and handle missing values ("?", "N/A", 0 or just a blank cell)

· check with the data collection source
· drop the missing values

df.dropna()
df.dropna(subset=["price"], axis=0, inplace=True)
# can also be written as
df = df.dropna(subset=["price"], axis=0)
# axis=0 drops the entire row
# axis=1 drops the entire column

drop the variable
drop the data entry
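
The drop approach can be sketched on a small invented frame. The sketch assumes missing prices are encoded as "?", one of the placeholders listed above, so they are converted to NaN first; `dropna` only recognizes NaN/None as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "?" marks a missing price.
df = pd.DataFrame({
    "make": ["audi", "bmw", "volvo", "saab"],
    "price": ["13950", "?", "16500", "?"],
})

# Turn the "?" placeholder into NaN so pandas treats it as missing.
df = df.replace("?", np.nan)

# Drop every row (axis=0) whose "price" is missing.
cleaned = df.dropna(subset=["price"], axis=0)

# Dropping rows leaves gaps in the index; reset it to 0..n-1.
cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```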

·replace the missing value

df.replace(missing_value,new_value)

replace it with an average (of similar datapoints)

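import numpy as np
mean = df["normalized-losses"].mean()
# replace NaN in the column with the mean; assign the result back, since replace returns a new Series
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, mean)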

replace it by frequency (for categorical variables, e.g. for fuel type use the most common value, such as gasoline)
replace it based on other functions (e.g. predict the missing value from related variables)

·leave it as missing data
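
For the replacement strategies in the list above (the mean for a numeric column, the most frequent value for a categorical one), a minimal sketch follows. The data is invented, and `fillna` is used here in place of `replace(np.nan, ...)`; both fill NaN entries with the given value.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "normalized-losses": [100.0, np.nan, 140.0, 120.0],
    "fuel-type": ["gas", "gas", np.nan, "diesel"],
})

# mean() skips NaN: (100 + 140 + 120) / 3 = 120
mean_loss = df["normalized-losses"].mean()
df["normalized-losses"] = df["normalized-losses"].fillna(mean_loss)

# value_counts() ignores NaN; idxmax() returns the most frequent category.
most_common = df["fuel-type"].value_counts().idxmax()
df["fuel-type"] = df["fuel-type"].fillna(most_common)
print(df)
```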

2. data formatting

bring data into a common standard of expression

# convert "mpg" (miles per gallon) to "L/100KM"
df["mpg"] = 235 / df["mpg"]
df.rename(columns={"mpg": "L/100KM"}, inplace=True)
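
A quick check of the conversion on invented values, using the same rounded factor of 235 as the notes. The values were picked so the results come out as round numbers.

```python
import pandas as pd

# Hypothetical fuel-economy values in miles per gallon.
df = pd.DataFrame({"mpg": [23.5, 47.0, 10.0]})

# L/100km is roughly 235 / mpg, so higher mpg means lower consumption.
df["mpg"] = 235 / df["mpg"]

# Rename so the column label matches the new unit.
df = df.rename(columns={"mpg": "L/100KM"})
print(df)
```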

# identify data types (dtypes is an attribute, not a method):
df.dtypes
# convert data types:
df.astype(dtype)
# convert data type to integer in column "price"
df["price"] = df["price"].astype("int")
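
A short sketch of the type conversion, assuming a `price` column that was read in as text (the values are invented). Note that `astype("int")` raises an error if any entry is missing or non-numeric, so missing values need to be handled first.

```python
import pandas as pd

# Hypothetical prices stored as strings, as often happens after reading a CSV.
df = pd.DataFrame({"price": ["13950", "16500", "17450"]})

before = df["price"].dtype          # a text dtype
df["price"] = df["price"].astype("int")
after = df["price"].dtype           # an integer dtype

# Arithmetic now works on numbers instead of concatenating strings.
total = df["price"].sum()
print(before, after, total)
```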

3. data normalization(centering/scaling)

make variables have the same impact
(similar value range, similar intrinsic influence on the analytical model)

# Min-max
df["length"] = (df["length"] - df["length"].min()) / (df["length"].max() - df["length"].min())

# Z-score
df["length"] = (df["length"] - df["length"].mean()) / df["length"].std()
# std() returns the standard deviation of the column
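
The two formulas can be compared side by side on invented lengths. This sketch keeps the results in separate variables rather than overwriting the column, so both can be inspected.

```python
import pandas as pd

# Hypothetical lengths, chosen so the results are round numbers.
df = pd.DataFrame({"length": [150.0, 175.0, 200.0]})

# Min-max: rescales to the range [0, 1].
min_max = (df["length"] - df["length"].min()) / (df["length"].max() - df["length"].min())

# Z-score: centers on 0; pandas std() uses the sample standard deviation (ddof=1).
z_score = (df["length"] - df["length"].mean()) / df["length"].std()

print(min_max.tolist())
print(z_score.tolist())
```

Min-max output always lies between 0 and 1, while z-scores are unbounded and typically fall between about -3 and 3.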

4. data binning

grouping of values into “bins”
converts numeric into categorical variables
eg:

prices   bins
5        low
6        low
12       mid
13       mid
25       high
29       high
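
One way to produce bins like these is `pd.cut`. The sketch below is not necessarily how the course builds its bins: the bin edges (0, 10, 20, 30) and the column name `price-binned` are assumptions, chosen only so the result matches the example values above.

```python
import pandas as pd

# The example prices from the table.
df = pd.DataFrame({"price": [5, 6, 12, 13, 25, 29]})

# Hypothetical edges giving the intervals (0, 10], (10, 20], (20, 30].
bin_edges = [0, 10, 20, 30]
group_names = ["low", "mid", "high"]

# Each price is assigned the label of the interval it falls into.
df["price-binned"] = pd.cut(df["price"], bins=bin_edges, labels=group_names)
print(df)
```

`pd.cut` needs one more edge than labels, and by default each interval includes its right edge but not its left.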