Week 1: Importing Datasets
Each line in a CSV file represents one row.
Key feature of a CSV file: values are separated from each other by commas.
Two important properties:
·format
·file path of the dataset (the address)
//read csv
import pandas as pd
url="the address"
df=pd.read_csv(url)
//save csv
path="the address"
df.to_csv(path)
//CSV can also be exported to JSON, Excel, or SQL formats
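The read/save round trip above can be sketched end to end. The file name and data values here are hypothetical; `read_csv` accepts a URL or a local file path alike.

```python
import os
import tempfile
import pandas as pd

# Hypothetical data standing in for a downloaded dataset
df = pd.DataFrame({"make": ["toyota", "audi", "bmw"],
                   "price": [13950, 23875, 16430]})

# Save to CSV; index=False omits the row-index column from the file
path = os.path.join(tempfile.gettempdir(), "cars.csv")
df.to_csv(path, index=False)

# Read it back; pd.read_csv also accepts a URL string
df2 = pd.read_csv(path)
```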
df prints the entire dataframe
df.head(n) shows the first n rows
df.tail(n) shows the bottom n rows
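A quick check of `head`/`tail` on a small hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"price": [5, 6, 12, 13, 25, 29]})

first_two = df.head(2)  # first 2 rows
last_two = df.tail(2)   # last 2 rows
```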
//replace default header (define the list first, then assign it)
headers=["a","b","c"]
df.columns=headers
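Replacing the header in context, using a hypothetical DataFrame that was read without a header row (so pandas assigned integer column names):

```python
import pandas as pd

# Without a header row, columns default to 0, 1, 2, ...
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])

headers = ["a", "b", "c"]
df.columns = headers  # must match the number of columns exactly
```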
before beginning any analysis:
check: data types & data description
locate potential issues with the data
Pandas Type
·object (string)
·int64 (int, numeric)
·float64 (float, numeric)
·datetime64, timedelta[ns] (time data)
//check data type
df.dtypes
//return a statistical summary
df.describe()
//provide full summary statistics (count, mean, standard deviation, minimum, maximum, ...)
df.describe(include="all")
//provide a concise summary of df (column dtypes, non-null counts, memory usage)
df.info()
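The inspection steps above can be run together on a small hypothetical DataFrame. Note that `describe()` covers numeric columns only by default, while `include="all"` adds the object (string) columns:

```python
import pandas as pd

df = pd.DataFrame({
    "make": ["toyota", "audi", "bmw"],
    "price": [13950.0, 23875.0, 16430.0],
})

types = df.dtypes                   # Series mapping column name -> dtype
summary = df.describe()             # numeric columns only
full = df.describe(include="all")   # object columns included too
df.info()                           # prints dtypes / non-null counts to stdout
```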
Week 2: Data Wrangling
Data preprocessing is often called data cleaning or data wrangling.
1. identify and handle missing values ("?", "N/A", 0, or just a blank cell)
· check with the data collection source
· drop the missing values
df.dropna()
df.dropna(subset=["price"],axis=0,inplace=True)
//equivalently:
df=df.dropna(subset=["price"],axis=0)
//axis=0 drops the entire row
//axis=1 drops the entire column
either drop the variable (the whole column)
or drop the data entry (the whole row)
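A runnable sketch of dropping rows with missing values, on hypothetical data. After dropping rows, `reset_index` rebuilds a contiguous index:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"make": ["toyota", "audi", "bmw"],
                   "price": [13950.0, np.nan, 16430.0]})

# axis=0 drops rows; subset limits the NaN check to the "price" column
cleaned = df.dropna(subset=["price"], axis=0)

# drop=True discards the old index instead of keeping it as a column
cleaned = cleaned.reset_index(drop=True)
```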
·replace the missing value
df.replace(missing_value,new_value)
replace it with an average (of similar data points)
mean=df["normalized-losses"].mean()
df["normalized-losses"]=df["normalized-losses"].replace(np.nan,mean)
//replaces the NaN values in the column (assign the result back; requires import numpy as np)
replace it by frequency (for categorical variables: use the most common value, e.g. gasoline for fuel type)
replace it based on other functions (find a related variable and predict the missing value)
·leave it as missing data
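The two replacement strategies above (mean for numeric, most frequent value for categorical) can be sketched together on hypothetical data. `mode()` returns the most frequent value(s), ignoring missing entries:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"normalized-losses": [100.0, np.nan, 150.0],
                   "fuel-type": ["gas", "gas", None]})

# Numeric column: replace NaN with the column mean
mean = df["normalized-losses"].mean()
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, mean)

# Categorical column: replace missing entries with the most frequent value
most_common = df["fuel-type"].mode()[0]
df["fuel-type"] = df["fuel-type"].fillna(most_common)
```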
2. data formatting
bring data into a common standard of expression
//convert "mpg (miles per gallon)" to "L/100km"
df["mpg"]=235/df["mpg"]
df.rename(columns={"mpg":"L/100KM"},inplace=True)
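The unit conversion runs end to end on a hypothetical column (the course rounds the conversion constant 235.215 down to 235):

```python
import pandas as pd

df = pd.DataFrame({"mpg": [20.0, 25.0, 47.0]})

# L/100km = 235 / mpg (approximate conversion)
df["mpg"] = 235 / df["mpg"]
df.rename(columns={"mpg": "L/100KM"}, inplace=True)
```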
//identify data types:
df.dtypes
//convert data types:
df.astype()
//convert data type to integer in column “price”
df["price"]=df["price"].astype("int")
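A minimal sketch of the conversion, assuming a hypothetical "price" column read in as strings:

```python
import pandas as pd

# Prices read from a file often arrive as strings (object dtype)
df = pd.DataFrame({"price": ["13950", "23875", "16430"]})

df["price"] = df["price"].astype("int")  # now usable for arithmetic
```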
3. data normalization(centering/scaling)
make variables have the same impact
(similar value range, similar intrinsic influence on the analytical model)
//Min-max
df["length"]=(df["length"]-df["length"].min())/(df["length"].max()-df["length"].min())
//Z-score
df["length"]=(df["length"]-df["length"].mean())/df["length"].std()
//std returns the standard deviation of the features in the data set
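Both scalings, applied to a hypothetical column (stored in separate variables here so the two results can be compared): min-max maps values into [0, 1], while z-score centers them at mean 0 with standard deviation 1.

```python
import pandas as pd

df = pd.DataFrame({"length": [1.0, 2.0, 3.0, 4.0]})

# Min-max: (x - min) / (max - min), result lies in [0, 1]
minmax = (df["length"] - df["length"].min()) / (df["length"].max() - df["length"].min())

# Z-score: (x - mean) / std, result has mean 0 and (sample) std 1
zscore = (df["length"] - df["length"].mean()) / df["length"].std()
```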
4. data binning
grouping of values into “bins”
converts numeric into categorical variables
eg:
| prices | bin |
|---|---|
| 5 | low |
| 6 | low |
| 12 | mid |
| 13 | mid |
| 25 | high |
| 29 | high |
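Binning is typically done with `pd.cut`. The sketch below uses equal-width bins built with `np.linspace`, which may not reproduce the exact low/mid/high boundaries in the table above (those cut points are not stated):

```python
import numpy as np
import pandas as pd

prices = pd.Series([5, 6, 12, 13, 25, 29])

# 4 edges -> 3 equal-width bins spanning min..max
bins = np.linspace(prices.min(), prices.max(), 4)
labels = ["low", "mid", "high"]

# include_lowest=True makes the first bin closed on the left,
# so the minimum value itself falls into "low"
binned = pd.cut(prices, bins=bins, labels=labels, include_lowest=True)
```

The result is a categorical Series, which can then be used for grouping or plotting histograms of the categories.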