Pandas按某列排序:pandas.DataFrame.sort_values
Pandas比单纯使用NumPy的好处在于:Pandas允许你每一列使用不同的数据类型
#读取文件并获取一些属性
import pandas as pd
food_info = pd.read_csv("food_info.csv")
first_rows = food_info.head(n) #显示前n行,缺省默认为5(标题行不计入),.tail()用于尾部数据
column_names = food_info.columns #显示列名
dimensions = food_info.shape #获取DataFrame的shape
pandas.DataFrame.iloc
:The iloc method allows us to retrieve rows and columns by position.
pandas.DataFrame.loc
:index using labels instead of positions.
reviews.iloc[0:5,:]
:输出0~4行,参数只能为数值类型,如reviews.iloc[:,1]
reviews.loc[0:5,:]
:输出0~5行,参数可以为字符串类型或列表类型,如reviews.loc[:,"score_phrase"]
A DataFrame stores tabular data, but a Series stores a single column or row of data.
When accessing an individual row, pandas returns a Series object containing the column names and that row’s value for each column:
food_info.loc[0]
:Series object representing the row at index 0 (非标题行,标题行不参与排序.).
If you’re interested in accessing multiple rows of the dataframe, you can pass in either a slice of row labels or a list of row labels and pandas will return a dataframe:
Unlike slicing lists in Python, a slice of a dataframe using .loc[]
will include both the start and the end row.
food_info.loc[3:6]
:DataFrame containing the rows at index 3, 4, 5, and 6 returned.
food_info.loc[[2,5,10]]
:DataFrame containing the rows at index 2, 5, and 10 returned.
When accessing a column in a dataframe, pandas returns a Series object containing the row label and each row’s value for that column:
ndb_col = food_info["NDB_No"]
:Series object representing the “NDB_No” column.
To select multiple columns, pass in a list of strings representing the column names and pandas will return a dataframe containing only the values in those columns:
zinc_copper = food_info[["Zinc_(mg)", "Copper_(mg)"]]
:returns a dataframe containing the “Zinc_(mg)” and “Copper_(mg)” columns.
#找出使用g来计量值的列
col_names = food_info.columns.tolist()
gram_columns = []
for colname in col_names:
if colname.endswith("(g)"):
gram_columns.append(colname)
gram_df = food_info[gram_columns]
plt.plot(women_degrees["Year"], women_degrees["Biology"], c="blue", label="Women")
plt.plot(women_degrees["Year"], 100-women_degrees["Biology"], c="green", label="Men") #DataFrame里的数字是float类型?
plt.legend(loc="upper right")
plt.title("Percentage of Biology Degrees Awarded By Gender")
plt.show()
##### Pandas数据分析实例
import pandas as pd
import numpy as np
df = pd.read_csv('DataAnalyst.csv', encoding = 'gb2312') # 还有read_excel、read_table,table可以读取txt,gb2312编码常见于windows,如果报错,可以尝试utf-8
df.shape
df.info()
len(df.positionId.unique()) #看一下是否有重复的数据,unique函数可以返回唯一值,该数据集中positionId是职位ID,值唯一
df_duplicates = df.drop_duplicates(subset = 'positionId', keep = 'first') # 清洗重复数据,subset参数选择以哪个列为去重基准,keep = 'first'保留第一个,'last'与之对应
# 加工salary薪资字段,既有小写k,也有大小K,还有「k以上」,要用到pandas中的apply。它可以针对DataFrame中的一行或者一行数据进行操作,允许使用自定义函数
def cut_word(word, method):
position = word.find('-') # 查找「-」符号所在的位置
if position != -1:
bottomSalary = word[:position-1] # 截取薪资范围开头至K之间的数字
topSalary = word[position+1:len(word)-1]
else: # 对于「k以上」这类脏数据,find函数会返回-1
bottomSalary = word[:word.upper().find('K')] # upper函数将k都转换为K
topSalary = bottomSalary
if method == 'bottom':
return bottomSalary
else:
return topSalary
df_duplicates['bottomSalary'] = df_duplicates.salary.apply(cut_word, method = 'bottom') # 将word_cut函数应用在salary列的所有行,并将结果写入'bottomSalary'列
df_duplicates['topSalary'] = df_duplicates.salary.apply(cut_word, method='top') # apply中,参数是添加在函数后面,而不是里面的
df_duplicates.bottomSalary = df_duplicates.bottomSalary.astype('int') # 如果转换成功,说明所有的薪资数字都成功截取了
df_duplicates.topSalary = df_duplicates.topSalary.astype('int')
df_duplicates['avgSalary'] = df_duplicates.apply(lambda x:(x.bottomSalary+x.topSalary)/2, axis=1) #这里的apply是针对DataFrame
# 切选出我们想要的内容进行后续分析
df_clean = df_duplicates[['city','companyShortName','companySize','education','positionName','positionLables','workYear','avgSalary']]
df_clean.city.value_counts() # value_counts是计数,统计所有非零元素的个数,以降序的方式输出Series
df_clean.describe() # describe函数能快速生成各类统计指标
#一般分类数据用value_counts,数值数据用describe,这是最常用的两个统计函数