Python Study Notes
pandas library
Day #1
import pandas as pd
df = pd.read_csv(' ')  # put the path to the CSV file between the quotes
df.head(n)
df.tail(n)
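df.nlargest(n, 'name')  # the n rows with the largest values in column 'name'
df.nsmallest(n, 'name')  # the n rows with the smallest values in column 'name'
df[df['bla bla'] == 'labalaba']  # keep only the rows where column 'bla bla' equals 'labalaba'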
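A minimal sketch of the Day #1 calls on a tiny in-memory table, so it runs without a CSV file. The data and the column names `name` and `score` are made up for illustration.

```python
import pandas as pd

# Hypothetical data standing in for a CSV file.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "score": [3, 9, 1, 7, 5],
})

first_two = df.head(2)                  # first 2 rows
last_two = df.tail(2)                   # last 2 rows
top_two = df.nlargest(2, "score")       # 2 rows with the largest score
bottom_two = df.nsmallest(2, "score")   # 2 rows with the smallest score
only_b = df[df["name"] == "b"]          # boolean filter: rows where name is "b"

print(top_two)
print(only_b)
```

The filter works in two steps: `df["name"] == "b"` builds a True/False column, and `df[...]` keeps the rows where it is True.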
Day #2
df.shape
df.columns.tolist()
df['name'].unique()
df = df.replace('Orig', 'New')
df.isnull()  # checks whether each value in the data frame is missing
Trick: print(df.isnull().sum())  # returns the number of missing values in each column (feature)
NaN stands for 'not a number'; pandas uses it to mark missing values
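A small sketch of the missing-value check above, assuming a made-up table with a few NaN entries. The column names are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical table with some missing entries.
df = pd.DataFrame({
    "Female": [10, np.nan, np.nan],
    "Total": [20, 40, np.nan],
    "Major": ["Art", "Math", "Physics"],
})

missing_mask = df.isnull()              # True wherever a value is missing
missing_per_column = df.isnull().sum()  # True counts as 1, so the sum is a count

print(missing_mask)
print(missing_per_column)
```

Summing works because True is treated as 1 and False as 0, so each column total is the number of missing cells in that column.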
creating columns:
df['new column name'] = formula
e.g. df['% Female'] = df['Female'] / df['Total']
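A minimal sketch of the column formula, with invented numbers. The division runs row by row, so no loop is needed.

```python
import pandas as pd

# Hypothetical enrollment counts.
df = pd.DataFrame({
    "Major Name": ["Art", "Math"],
    "Female": [30, 20],
    "Total": [40, 80],
})

# Element-wise division: each row's Female count over that row's Total.
df["% Female"] = df["Female"] / df["Total"]

print(df)
```

Note that the result is a fraction between 0 and 1 despite the column name. Multiply by 100 if an actual percentage is wanted.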
df.dtypes
df.select_dtypes(['object'])
measures of center and spread
df['column name'].mean()
df['column name'].median()
df['column name'].mode()
df.groupby(by='column name').agg('mean')
# agg stands for 'aggregate': group the rows by 'column name', then take the mean of each group
df['column name'].max() - df['column name'].min()  # range
df['column name'].std()  # standard deviation
df['column name'].var()  # variance
df.groupby(by='column name').agg('std')
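A sketch of these statistics on a small made-up table, assuming hypothetical columns `group` and `value`. The numeric column is picked out before `agg` here, which is one way to keep text columns out of the aggregation.

```python
import pandas as pd

# Hypothetical data: two groups of numeric values.
df = pd.DataFrame({
    "group": ["a", "a", "a", "a", "b", "b"],
    "value": [1.0, 4.0, 4.0, 7.0, 8.0, 12.0],
})

mean = df["value"].mean()        # arithmetic average
median = df["value"].median()    # middle value of the sorted data
mode = df["value"].mode()        # most frequent value(s), returned as a Series
value_range = df["value"].max() - df["value"].min()
std = df["value"].std()          # sample standard deviation (divides by n - 1)
variance = df["value"].var()     # sample variance, equal to std squared

# One statistic per group instead of one for the whole column.
group_means = df.groupby(by="group")["value"].agg("mean")
group_stds = df.groupby(by="group")["value"].agg("std")

print(mean, median, mode.tolist(), value_range)
print(group_means)
print(group_stds)
```

`mode()` returns a Series rather than a single number because several values can tie for most frequent.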
correlation coefficient
df.corr()  # correlation matrix for every pair of columns
df.corr()['x']['y']  # correlation coefficient between columns 'x' and 'y'
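A minimal sketch of reading one coefficient out of the correlation matrix. The columns `x`, `y` and `z` are invented so that the expected values are easy to see.

```python
import pandas as pd

# Hypothetical columns: y rises with x, z falls as x rises.
df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "z": [4, 3, 2, 1],
})

corr_matrix = df.corr()            # Pearson correlation for every column pair
xy = corr_matrix["x"]["y"]         # perfect positive relationship
xz = df["x"].corr(df["z"])         # same idea for a single pair, without the full matrix

print(corr_matrix)
print(xy, xz)
```

If the table also holds text columns, recent pandas versions may need `df.corr(numeric_only=True)`. Treat that as something to check against the installed version.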
Day #3
Bar Chart
df_top10 = df.nlargest(10, 'Total')
df_top10.plot.bar(x='Major Name', y='Total')
Histograms
df.hist(column='% Female')
Boxplot
df.boxplot(column='% Female', vert=False)
Scatterplots
df.plot.scatter(x='Male', y='Female')
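A sketch that runs the four chart calls on invented data. It assumes matplotlib is installed and uses the non-interactive `Agg` backend so it also works without a display. The numbers are made up.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
import pandas as pd

# Hypothetical enrollment table.
df = pd.DataFrame({
    "Major Name": ["Art", "Math", "Physics"],
    "Male": [10, 60, 45],
    "Female": [30, 20, 15],
})
df["Total"] = df["Male"] + df["Female"]
df["% Female"] = df["Female"] / df["Total"]

ax_bar = df.nlargest(3, "Total").plot.bar(x="Major Name", y="Total")  # one bar per major
axes_hist = df.hist(column="% Female")                # distribution of one column
ax_box = df.boxplot(column="% Female", vert=False)    # horizontal boxplot
ax_scatter = df.plot.scatter(x="Male", y="Female")    # one point per row
```

In a notebook the charts display on their own. In a script, something like `ax_bar.figure.savefig("chart.png")` writes one to a file, where the file name is just an example.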
Simulation
import random
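random.randint(1, 10)  # random integer from 1 to 10, both ends included
for i in range(1, 100):  # runs 99 times; use range(100) for 100 draws
    print(random.randint(1, 10))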
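A small sketch of the same simulation that keeps the draws in a list instead of printing them. The fixed seed is an assumption added here so that repeated runs give the same numbers.

```python
import random

random.seed(0)  # assumption: fixed seed for repeatable results

# 100 draws, each an integer from 1 to 10 inclusive.
rolls = [random.randint(1, 10) for _ in range(100)]

# How often each value came up.
counts = {value: rolls.count(value) for value in range(1, 11)}

print(rolls[:10])
print(counts)
```

Storing the draws makes it possible to count or plot them afterwards, for example to see how even the distribution looks.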
Tricks
exercise 1
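Modify the cities table by adding a new boolean column that is True if and only if both of the following are True:
- The city is named after a saint.
- The city has an area greater than 50 square miles.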
cities['wide and saint'] = (cities['Area square miles'] > 50) & cities['City name'].apply(lambda name: name.startswith('San'))
cities
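A runnable sketch of the exercise. The `cities` table is not shown in these notes, so the rows below are illustrative values, not the original data.

```python
import pandas as pd

# Assumed stand-in for the cities table, with illustrative areas.
cities = pd.DataFrame({
    "City name": ["San Francisco", "San Jose", "Sacramento"],
    "Area square miles": [46.87, 176.53, 97.92],
})

is_wide = cities["Area square miles"] > 50
is_saint = cities["City name"].apply(lambda name: name.startswith("San"))

# & combines the two boolean columns row by row (logical AND).
cities["wide and saint"] = is_wide & is_saint

print(cities)
```

Use `&` rather than `and` when combining boolean columns, and wrap each comparison in parentheses when writing it on one line. The `startswith('San')` test is only a rough proxy for "named after a saint": it would also match a name such as Sandy.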