Pandas学习笔记

最新推荐文章于 2021-10-19 22:22:35 发布

Famir_Tse

最新推荐文章于 2021-10-19 22:22:35 发布

阅读量277

点赞数 2

分类专栏： Python 文章标签： Python Pandas

本文链接：https://blog.csdn.net/famirtse/article/details/80060166

版权

Python 专栏收录该内容

13 篇文章 3 订阅

订阅专栏

Pandas按某列排序：pandas.DataFrame.sort_values
Pandas比单纯使用NumPy的好处在于：Pandas允许你每一列使用不同的数据类型

#读取文件并获取一些属性
import pandas as pd
food_info = pd.read_csv("food_info.csv")
first_rows = food_info.head(n)   #显示前n行,缺省默认为5(标题行不计入)，.tail()用于尾部数据

column_names = food_info.columns   #显示列名
dimensions = food_info.shape    #获取DataFrame的shape

pandas.DataFrame.iloc：The iloc method allows us to retrieve rows and columns by position．

pandas.DataFrame.loc：index using labels instead of positions．

reviews.iloc[0:5,:]：输出0~4行，参数只能为数值类型，如reviews.iloc[:,1]

reviews.loc[0:5,:]：输出0~5行，参数可以为字符串类型或列表类型，如reviews.loc[:,"score_phrase"]

A DataFrame stores tabular data, but a Series stores a single column or row of data．

When accessing an individual row, pandas returns a Series object containing the column names and that row’s value for each column：
food_info.loc[0]：Series object representing the row at index 0　(非标题行，标题行不参与排序.)．

If you’re interested in accessing multiple rows of the dataframe, you can pass in either a slice of row labels or a list of row labels and pandas will return a dataframe：
Unlike slicing lists in Python, a slice of a dataframe using .loc[] will include both the start and the end row．
food_info.loc[3:6]：DataFrame containing the rows at index 3, 4, 5, and 6 returned．
food_info.loc[[2,5,10]]：DataFrame containing the rows at index 2, 5, and 10 returned．

When accessing a column in a dataframe, pandas returns a Series object containing the row label and each row’s value for that column：
ndb_col = food_info["NDB_No"]：Series object representing the “NDB_No” column．

To select multiple columns, pass in a list of strings representing the column names and pandas will return a dataframe containing only the values in those columns：
zinc_copper = food_info[["Zinc_(mg)", "Copper_(mg)"]]：returns a dataframe containing the “Zinc_(mg)” and “Copper_(mg)” columns．

#找出使用g来计量值的列
col_names = food_info.columns.tolist()
gram_columns = []
for colname in col_names:
    if colname.endswith("(g)"):
        gram_columns.append(colname)
gram_df = food_info[gram_columns]

plt.plot(women_degrees["Year"], women_degrees["Biology"], c="blue", label="Women")
plt.plot(women_degrees["Year"], 100-women_degrees["Biology"], c="green", label="Men") #DataFrame里的数字是float类型？
plt.legend(loc="upper right")
plt.title("Percentage of Biology Degrees Awarded By Gender")
plt.show()

PANDAS 数据合并与重塑

##### Pandas数据分析实例
import pandas as pd
import numpy as np
df = pd.read_csv('DataAnalyst.csv', encoding = 'gb2312')  # 还有read_excel、read_table，table可以读取txt，gb2312编码常见于windows，如果报错，可以尝试utf-8
df.shape
df.info()
len(df.positionId.unique()) #看一下是否有重复的数据，unique函数可以返回唯一值，该数据集中positionId是职位ID，值唯一
df_duplicates = df.drop_duplicates(subset = 'positionId', keep = 'first') # 清洗重复数据，subset参数选择以哪个列为去重基准，keep = 'first'保留第一个，'last'与之对应
#  加工salary薪资字段，既有小写k，也有大小K，还有「k以上」，要用到pandas中的apply。它可以针对DataFrame中的一行或者一行数据进行操作，允许使用自定义函数
def cut_word(word, method):
    position = word.find('-') # 查找「-」符号所在的位置
    if position != -1:
	    bottomSalary = word[:position-1] # 截取薪资范围开头至K之间的数字
	    topSalary = word[position+1:len(word)-1]
	else:  # 对于「k以上」这类脏数据，find函数会返回-1
		bottomSalary = word[:word.upper().find('K')] # upper函数将k都转换为K
		topSalary = bottomSalary
	if method == 'bottom':
	    return bottomSalary
	else:
        return topSalary
df_duplicates['bottomSalary'] = df_duplicates.salary.apply(cut_word, method = 'bottom') # 将word_cut函数应用在salary列的所有行,并将结果写入'bottomSalary'列
df_duplicates['topSalary'] = df_duplicates.salary.apply(cut_word, method='top') # apply中，参数是添加在函数后面，而不是里面的
df_duplicates.bottomSalary = df_duplicates.bottomSalary.astype('int')  # 如果转换成功，说明所有的薪资数字都成功截取了
df_duplicates.topSalary = df_duplicates.topSalary.astype('int')
df_duplicates['avgSalary'] = df_duplicates.apply(lambda x:(x.bottomSalary+x.topSalary)/2, axis=1) #这里的apply是针对DataFrame
# 切选出我们想要的内容进行后续分析
df_clean = df_duplicates[['city','companyShortName','companySize','education','positionName','positionLables','workYear','avgSalary']]
df_clean.city.value_counts() # value_counts是计数，统计所有非零元素的个数，以降序的方式输出Series
df_clean.describe() # describe函数能快速生成各类统计指标
#一般分类数据用value_counts，数值数据用describe，这是最常用的两个统计函数

Famir_Tse

关注

2
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Pandas学习笔记

Pandas比单纯使用NumPy的好处在于：Pandas允许你每一列使用不同的数据类型#读取文件并获取一些属性import pandas as pdfood_info = pd.read_csv(&amp;amp;amp;amp;amp;amp;amp;amp;quot;food_info.csv&amp;amp;amp;amp;amp;amp;amp;amp;quot;)first_rows = food_info.head(n) #显示前n行,缺省默认为5(标题行不计入
复制链接

扫一扫