【COURSERA】Data Analysis with Python Course Notes

week1 Importing Datasets

each line in a CSV file represents one row
the feature of a CSV file: values are separated from each other by commas

two important properties:
·format
·file path of dataset(the address)

//read csv
import pandas as pd
url = "the address"
df = pd.read_csv(url)

//save csv
path = "the address"
df.to_csv(path)
//the same read/save pattern works for other formats: json, excel, sql (read_json/to_json, read_excel/to_excel, read_sql/to_sql)
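
The read and save steps above can be sketched as follows. This is only a minimal sketch: the CSV text and its column names are made up, and an in-memory buffer stands in for "the address" so the example runs without any file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file path or URL.
csv_text = "make,price\naudi,13950\nbmw,16430\n"

# read_csv accepts a path, a URL, or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Write the DataFrame back out as CSV; index=False leaves out the row index.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
round_trip = buffer.getvalue()
```

With a real dataset, `io.StringIO(csv_text)` and `buffer` would simply be replaced by the path string.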

df prints the entire dataframe
df.head(n) shows the first n rows
df.tail(n) shows the last n rows

//replace default header
headers = ["a", "b", "c"]
df.columns = headers
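
A small sketch of replacing the default header and then peeking at the data. The three-column data is invented, and `header=None` is an assumption that applies only when the file has no header row of its own.

```python
import io
import pandas as pd

# Hypothetical data with no header row.
raw = "1,2.5,x\n2,3.5,y\n3,4.5,z\n"

# header=None tells pandas the first line is data, not column names,
# so the columns get the default integer labels 0, 1, 2.
df = pd.read_csv(io.StringIO(raw), header=None)

# Replace the default labels with meaningful names.
headers = ["a", "b", "c"]
df.columns = headers

first_two = df.head(2)  # first 2 rows
last_one = df.tail(1)   # last row
```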

before beginning any analysis:
check: data types & data distribution
locate potential issues with the data

Pandas Types
·object (string)
·int64 (int, numeric)
·float64 (float, numeric)
·datetime64[ns], timedelta64[ns] (time data)

//check data type
df.dtypes

//return a statistical summary (numeric columns only by default)
df.describe()
//provide full summary statistics for all columns, including object-typed ones (count, mean, standard deviation, maximum, minimum, ...)
df.describe(include="all")

//provide a concise summary of df (column names, non-null counts, data types, memory usage)
df.info()
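
The three inspection calls above can be tried on a tiny frame. The columns and values below are invented for illustration only.

```python
import io
import pandas as pd

# Hypothetical mixed-type data: one text column, one integer, one float.
df = pd.DataFrame({
    "make": ["audi", "bmw", "audi"],
    "doors": [4, 2, 4],
    "price": [13950.0, 16430.0, 17450.0],
})

types = df.dtypes                          # one dtype per column
numeric_summary = df.describe()            # numeric columns only
full_summary = df.describe(include="all")  # adds unique/top/freq for text columns

# info() prints to stdout by default; capture it in a buffer here.
buf = io.StringIO()
df.info(buf=buf)
info_text = buf.getvalue()
```

In `full_summary`, statistics that do not apply to a column (for example the mean of "make") show up as NaN.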

week2 Data Wrangling

data preprocessing is often called data cleaning or data wrangling.

1. identify and handle missing values ("?", "N/A", 0 or just a blank cell)

· check with the data collection source
· drop the missing values

df.dropna()
df.dropna(subset=["price"], axis=0, inplace=True)
//equivalently
df = df.dropna(subset=["price"], axis=0)
//axis=0 drops the entire row
//axis=1 drops the entire column

drop the variable (the whole column)
drop the data entry (the single row)
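
A minimal sketch of the two drop options, on made-up data. The "notes" column and the `how="all"` argument are assumptions added here to show a column-wise drop; they are not part of the course example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing price, and a column that is entirely missing.
df = pd.DataFrame({
    "make": ["audi", "bmw", "volvo"],
    "price": [13950.0, np.nan, 16845.0],
    "notes": [np.nan, np.nan, np.nan],
})

# Drop the data entry: remove rows whose "price" is missing.
rows_dropped = df.dropna(subset=["price"], axis=0)

# Drop the variable: remove columns in which every value is missing.
cols_dropped = df.dropna(axis=1, how="all")
```

Neither call modifies `df` itself, because `inplace` is left at its default of `False`.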

·replace the missing value

df.replace(missing_value,new_value)

replace it with an average (of similar datapoints)

mean = df["normalized-losses"].mean()
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, mean)
//replaces the NaN values in the column with the mean (needs: import numpy as np)

replace it by frequency (for categorical data: e.g. for fuel type, use the most common value, such as gasoline)
replace it based on other functions (e.g. predict it from related variables)

·leave it as missing data
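
The replacement options above (by average, by frequency) might look like this. The values and the "fuel-type" column name are assumptions for illustration; `fillna` is used for the categorical column as an alternative to `replace` for missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
df = pd.DataFrame({
    "normalized-losses": [100.0, np.nan, 200.0],
    "fuel-type": ["gas", "gas", np.nan],
})

# Replace by average: mean() skips NaN, so this is (100 + 200) / 2.
mean = df["normalized-losses"].mean()
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, mean)

# Replace by frequency: mode() returns the most common value(s).
most_common = df["fuel-type"].mode()[0]
df["fuel-type"] = df["fuel-type"].fillna(most_common)
```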

2. data formatting

bring data into a common standard of expression

//convert "mpg" (miles per gallon) to "L/100km"
df["mpg"] = 235 / df["mpg"]
df.rename(columns={"mpg": "L/100km"}, inplace=True)

//identify data types (dtypes is an attribute, not a method):
df.dtypes
//convert data types:
df.astype()
//convert data type to integer in column "price"
df["price"] = df["price"].astype("int")
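
Putting the formatting steps together on a tiny made-up frame. The numbers are chosen so the conversion comes out to round values, and the prices are stored as strings to show why `astype` is needed.

```python
import pandas as pd

# Hypothetical data: fuel economy in mpg, prices read in as strings.
df = pd.DataFrame({
    "mpg": [23.5, 47.0],
    "price": ["13950", "16430"],
})

# Unit conversion: L/100km = 235 / mpg (235 is the rounded conversion factor).
df["mpg"] = 235 / df["mpg"]
df = df.rename(columns={"mpg": "L/100km"})

# Type conversion: string prices become integers.
df["price"] = df["price"].astype("int")
```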

3. data normalization(centering/scaling)

make variables have the same impact
(similar value range, similar intrinsic influence on the analytical model)

//Min-max
df["length"] = (df["length"] - df["length"].min()) / (df["length"].max() - df["length"].min())

//Z-score
df["length"] = (df["length"] - df["length"].mean()) / df["length"].std()
//std() returns the standard deviation of the column
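
Both formulas can be checked by hand on three made-up lengths. Note that pandas' `std()` computes the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Hypothetical lengths, chosen so the results are easy to verify by hand.
length = pd.Series([150.0, 175.0, 200.0], name="length")

# Min-max: maps the smallest value to 0 and the largest to 1.
min_max = (length - length.min()) / (length.max() - length.min())

# Z-score: mean is 175 and sample std is 25, so results are centered on 0.
z_score = (length - length.mean()) / length.std()
```

Here `min_max` is 0, 0.5, 1 and `z_score` is -1, 0, 1.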

4. data binning

grouping of values into “bins”
converts numeric into categorical variables
e.g.:

prices  bins
5       low
6       low
12      mid
13      mid
25      high
29      high
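
One possible way to produce the table above is `pd.cut`. The bin edges 0/10/20/30 and the "price-binned" column name are assumptions picked so that the result matches the example; the notes themselves do not give the edges.

```python
import pandas as pd

# The prices from the example table.
df = pd.DataFrame({"price": [5, 6, 12, 13, 25, 29]})

# Assumed bin edges: (0, 10] -> low, (10, 20] -> mid, (20, 30] -> high.
bin_edges = [0, 10, 20, 30]
group_names = ["low", "mid", "high"]

# pd.cut assigns each numeric value to the labeled bin it falls into.
df["price-binned"] = pd.cut(df["price"], bins=bin_edges, labels=group_names)
```

The new column is categorical, so it can be counted with `value_counts()` or used for grouping.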