Pandas学习笔记

Pandas按某列排序:pandas.DataFrame.sort_values
Pandas比单纯使用NumPy的好处在于:Pandas允许你每一列使用不同的数据类型

#读取文件并获取一些属性
import pandas as pd
food_info = pd.read_csv("food_info.csv")
first_rows = food_info.head(n)   #显示前n行,缺省默认为5(标题行不计入),.tail()用于尾部数据

column_names = food_info.columns   #显示列名
dimensions = food_info.shape    #获取DataFrame的shape

pandas.DataFrame.iloc:The iloc method allows us to retrieve rows and columns by position.

pandas.DataFrame.loc:index using labels instead of positions.

reviews.iloc[0:5,:]:输出0~4行,参数只能为数值类型,如reviews.iloc[:,1]

reviews.loc[0:5,:]:输出0~5行,参数可以为字符串类型或列表类型,如reviews.loc[:,"score_phrase"]

A DataFrame stores tabular data, but a Series stores a single column or row of data.

When accessing an individual row, pandas returns a Series object containing the column names and that row’s value for each column:
food_info.loc[0]:Series object representing the row at index 0 (非标题行,标题行不参与排序.).

If you’re interested in accessing multiple rows of the dataframe, you can pass in either a slice of row labels or a list of row labels and pandas will return a dataframe
Unlike slicing lists in Python, a slice of a dataframe using .loc[] will include both the start and the end row.
food_info.loc[3:6]:DataFrame containing the rows at index 3, 4, 5, and 6 returned.
food_info.loc[[2,5,10]]:DataFrame containing the rows at index 2, 5, and 10 returned.

When accessing a column in a dataframe, pandas returns a Series object containing the row label and each row’s value for that column:
ndb_col = food_info["NDB_No"]:Series object representing the “NDB_No” column.

To select multiple columns, pass in a list of strings representing the column names and pandas will return a dataframe containing only the values in those columns:
zinc_copper = food_info[["Zinc_(mg)", "Copper_(mg)"]]:returns a dataframe containing the “Zinc_(mg)” and “Copper_(mg)” columns.

#找出使用g来计量值的列
col_names = food_info.columns.tolist()
gram_columns = []
for colname in col_names:
    if colname.endswith("(g)"):
        gram_columns.append(colname)
gram_df = food_info[gram_columns]
plt.plot(women_degrees["Year"], women_degrees["Biology"], c="blue", label="Women")
plt.plot(women_degrees["Year"], 100-women_degrees["Biology"], c="green", label="Men") #DataFrame里的数字是float类型?
plt.legend(loc="upper right")
plt.title("Percentage of Biology Degrees Awarded By Gender")
plt.show()

PANDAS 数据合并与重塑

##### Pandas数据分析实例
import pandas as pd
import numpy as np
df = pd.read_csv('DataAnalyst.csv', encoding = 'gb2312')  # 还有read_excel、read_table,table可以读取txt,gb2312编码常见于windows,如果报错,可以尝试utf-8
df.shape
df.info()
len(df.positionId.unique()) #看一下是否有重复的数据,unique函数可以返回唯一值,该数据集中positionId是职位ID,值唯一
df_duplicates = df.drop_duplicates(subset = 'positionId', keep = 'first') # 清洗重复数据,subset参数选择以哪个列为去重基准,keep = 'first'保留第一个,'last'与之对应
#  加工salary薪资字段,既有小写k,也有大小K,还有「k以上」,要用到pandas中的apply。它可以针对DataFrame中的一行或者一行数据进行操作,允许使用自定义函数
def cut_word(word, method):
    position = word.find('-') # 查找「-」符号所在的位置
    if position != -1:
	    bottomSalary = word[:position-1] # 截取薪资范围开头至K之间的数字
	    topSalary = word[position+1:len(word)-1]
	else:  # 对于「k以上」这类脏数据,find函数会返回-1
		bottomSalary = word[:word.upper().find('K')] # upper函数将k都转换为K
		topSalary = bottomSalary
	if method == 'bottom':
	    return bottomSalary
	else:
        return topSalary
df_duplicates['bottomSalary'] = df_duplicates.salary.apply(cut_word, method = 'bottom') # 将word_cut函数应用在salary列的所有行,并将结果写入'bottomSalary'列
df_duplicates['topSalary'] = df_duplicates.salary.apply(cut_word, method='top') # apply中,参数是添加在函数后面,而不是里面的
df_duplicates.bottomSalary = df_duplicates.bottomSalary.astype('int')  # 如果转换成功,说明所有的薪资数字都成功截取了
df_duplicates.topSalary = df_duplicates.topSalary.astype('int')
df_duplicates['avgSalary'] = df_duplicates.apply(lambda x:(x.bottomSalary+x.topSalary)/2, axis=1) #这里的apply是针对DataFrame
# 切选出我们想要的内容进行后续分析
df_clean = df_duplicates[['city','companyShortName','companySize','education','positionName','positionLables','workYear','avgSalary']]
df_clean.city.value_counts() # value_counts是计数,统计所有非零元素的个数,以降序的方式输出Series
df_clean.describe() # describe函数能快速生成各类统计指标
#一般分类数据用value_counts,数值数据用describe,这是最常用的两个统计函数
  • 2
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值