Learning pandas

Article link: https://blog.csdn.net/weixin_41102519/article/details/121354186

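Preface

I will soon be doing some deep learning and NLP work, which means processing a lot of data. I had long heard that pandas works much like Excel, so today I am studying it and taking notes. My reference material is 彭彭's Python basics course, link: link.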

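1. Introduction to pandas

pandas has two main data structures:

Series

DataFrame

A Series is like a single column in an Excel sheet and can be initialized from a Python list.
A DataFrame is like a whole two-dimensional Excel table and can be initialized from a dictionary, whose keys become the column names in the header row.

For example: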

import pandas as pd
data = pd.Series([20,10,15])  # initialize a Series from a list
print(data)
# 0    20
# 1    10
# 2    15
# dtype: int64
print(data.max()) #20
print(data.median()) #15.0
print(data.mean())   #15.0
data = data*2
print(data)
# 0    40
# 1    20
# 2    30
# dtype: int64
print(data == 20)
# 0    False
# 1     True
# 2    False
# dtype: bool
#=====================================================================================
data = pd.DataFrame({
    'name':['Amy','Bob','christ'],
    'score':[100, 50, 87]
})
print(data)
#      name  score
# 0     Amy    100
# 1     Bob     50
# 2  christ     87
# Get a specific column (read vertically)
print(data['name'])
# 0       Amy
# 1       Bob
# 2    christ
# Name: name, dtype: object
# Get a specific row (read horizontally)
print(data.iloc[0])
# name     Amy
# score    100
# Name: 0, dtype: object
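
The relationship between the two structures can be checked directly. The following is a minimal sketch reusing the small name/score table from the example above: a single column and a single row both come back as a Series, the column shares the DataFrame's row index, and the row is indexed by the column names.

```python
import pandas as pd

data = pd.DataFrame({
    'name': ['Amy', 'Bob', 'christ'],
    'score': [100, 50, 87],
})

column = data['name']   # one column of the DataFrame
row = data.iloc[0]      # one row of the DataFrame

print(type(column).__name__)            # Series
print(type(row).__name__)               # Series
print(column.index.equals(data.index))  # True: the column keeps the row index
print(row.index.tolist())               # ['name', 'score']
```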

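2. Series in detail

Data index

The 0, 1, 2 on the left in the example above are the pandas index. Here pandas created it automatically, but we can also define a custom index.
For example: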

data = pd.Series([5,4,-2,3,7])
print(data)
# 0    5
# 1    4
# 2   -2
# 3    3
# 4    7
# dtype: int64
data = pd.Series([5,4,-2,3,7], index = ['a','b','c','d','e'])
print(data)
# a    5
# b    4
# c   -2
# d    3
# e    7
# dtype: int64

Inspecting data

print('dtype:', data.dtype)
print('Size:', data.size)
print('Index:', data.index)
# dtype: int64
# Size: 5
# Index: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

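Accessing data

Values can be accessed by position or by index label. With a custom index, use iloc for positions; a plain data[0] on a label-indexed Series is deprecated in recent pandas versions.

print(data.iloc[0], data['a'])  # 5 5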
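
To make the difference between positional and label-based access explicit, here is a small sketch using the same values and labels as the Series above. It relies only on `iloc` (positions) and `loc` (labels); note that label slices include both endpoints while position slices exclude the end.

```python
import pandas as pd

data = pd.Series([5, 4, -2, 3, 7], index=['a', 'b', 'c', 'd', 'e'])

first_by_position = data.iloc[0]   # position 0, regardless of labels
first_by_label = data.loc['a']     # the value stored under label 'a'
label_slice = data.loc['b':'d']    # label slices include both endpoints
position_slice = data.iloc[1:3]    # position slices exclude the end

print(first_by_position, first_by_label)  # 5 5
print(label_slice.tolist())               # [4, -2, 3]
print(position_slice.tolist())            # [4, -2]
```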

Numeric operations

print('Max:', data.max())
print('Sum:', data.sum())
print('Standard deviation:', data.std())
print('Median:', data.median())
print('Three largest:\n', data.nlargest(3))
print('Two smallest:\n', data.nsmallest(2))
# Max: 7
# Sum: 17
# Standard deviation: 3.361547262794322
# Median: 4.0
# Three largest:
#  e    7
# a    5
# b    4
# dtype: int64
# Two smallest:
#  c   -2
# d    3
# dtype: int64
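
Several of these statistics can also be collected in one call with `describe()`, which returns a Series of summary values (count, mean, std, min, quartiles, max). A short sketch on the same data:

```python
import pandas as pd

data = pd.Series([5, 4, -2, 3, 7], index=['a', 'b', 'c', 'd', 'e'])

summary = data.describe()  # a Series keyed by statistic name
print(summary)

# Individual statistics can be read by name
print(summary['max'])   # 7.0
print(summary['50%'])   # 4.0 (the median)
```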

String operations

data = pd.Series(['您好','Python','Pandas'])
print(data.str.lower())  # convert everything to lower case
# 0        您好
# 1    python
# 2    pandas
# dtype: object
print(data.str.len())  # length of each string
# 0    2
# 1    6
# 2    6
# dtype: int64
print(data.str.cat(sep=','))  # join all strings into one
# 您好,Python,Pandas
print(data.str.contains('P'))  # check whether each string contains the substring
# 0    False
# 1     True
# 2     True
# dtype: bool
print(data.str.replace('您好', 'hello'))  # replace a substring
# 0     hello
# 1    Python
# 2    Pandas
# dtype: object
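
Real data often has missing entries, and depending on the pandas version `str.contains` may propagate a missing value into its result, which cannot then be used directly as a filter. A minimal sketch, assuming a made-up Series with one missing value, that passes `na=False` so missing entries simply count as "no match":

```python
import pandas as pd

# Hypothetical sample with one missing entry
data = pd.Series(['您好', 'Python', None])

# na=False treats missing entries as not matching
mask = data.str.contains('P', na=False)

print(mask.tolist())        # [False, True, False]
print(data[mask].tolist())  # ['Python']
```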

3. DataFrame in detail

Data index

data = pd.DataFrame({
    'name': ['Amy', 'Bob', 'Charles'],
    'salary':[30000, 50000, 100000]
}, index = ['a','b','c'])
print(data)
#       name  salary
# a      Amy   30000
# b      Bob   50000
# c  Charles  100000

Inspecting data

print('Size:', data.size)
print('Shape:', data.shape)
print('Index:', data.index)
# Size: 6
# Shape: (3, 2)
# Index: Index(['a', 'b', 'c'], dtype='object')

Accessing data

print('Get the second row', data.iloc[1], sep = '\n')
print('Get the row labeled c', data.loc['c'], sep = '\n')
# Get the second row
# name        Bob
# salary    50000
# Name: b, dtype: object
# Get the row labeled c
# name      Charles
# salary     100000
# Name: c, dtype: object
print('Get the name column', data['name'], sep= '\n')
# Get the name column
# a        Amy
# b        Bob
# c    Charles
# Name: name, dtype: object
names = data['name']  # a single column is a Series
print('Names in upper case', names.str.upper(), sep='\n')
# Names in upper case
# a        AMY
# b        BOB
# c    CHARLES
# Name: name, dtype: object
salaries = data['salary']
print('Mean salary', salaries.mean())
# Mean salary 60000.0

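Creating a new column

A new column can be created directly from a list, by copying an existing column, or from a computation on existing columns.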

data['revenue'] = [500000, 400000, 300000]
data['years'] = [1, 3, 6]
data['cp'] = data['revenue']/ data['salary']
print(data)
#       name  salary  revenue  years         cp
# a      Amy   30000   500000      1  16.666667
# b      Bob   50000   400000      3   8.000000
# c  Charles  100000   300000      6   3.000000
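
Assigning with `data['cp'] = ...` modifies the DataFrame in place. When the original should stay untouched, `assign` is an alternative that returns a new DataFrame. A small sketch using the same figures as above:

```python
import pandas as pd

data = pd.DataFrame({
    'name': ['Amy', 'Bob', 'Charles'],
    'salary': [30000, 50000, 100000],
    'revenue': [500000, 400000, 300000],
}, index=['a', 'b', 'c'])

# assign returns a new DataFrame; `data` itself is left unchanged
with_cp = data.assign(cp=data['revenue'] / data['salary'])

print(with_cp['cp'].round(2).tolist())  # [16.67, 8.0, 3.0]
print('cp' in data.columns)             # False
```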

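4. Filtering data

Filtering a Series

A Series can be filtered with a list of booleans, or with a boolean Series produced by a condition.

Note that the index does not change after filtering: the surviving values keep their original index.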

data = pd.Series([30,15,20])
condition = data > 18
print(condition)
# 0     True
# 1    False
# 2     True
# dtype: bool
filteredData = data[condition]
print(filteredData) 
# 0    30
# 2    20
# dtype: int64
#=================================================================
data = pd.Series(['您好', 'Python', 'Pandas'])
condition = data.str.contains('P')
print(condition)
# 0    False
# 1     True
# 2     True
# dtype: bool
filteredData = data[condition]
print(filteredData) 
# 1    Python
# 2    Pandas
# dtype: object
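
Because the filtered result keeps its original index, positions and labels no longer line up. If a fresh 0-based index is wanted, `reset_index(drop=True)` renumbers the result. A minimal sketch on the numeric example above:

```python
import pandas as pd

data = pd.Series([30, 15, 20])
filtered = data[data > 18]

print(filtered.index.tolist())    # [0, 2]: original labels are kept

# drop=True discards the old index instead of turning it into a column
renumbered = filtered.reset_index(drop=True)

print(renumbered.index.tolist())  # [0, 1]
print(renumbered.tolist())        # [30, 20]
```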

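Filtering a DataFrame

As with a Series, build a boolean condition from one column. Indexing the DataFrame with that condition keeps every row whose index is marked True, with all of its columns.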

data = pd.DataFrame({
    'name': ['Amy', 'Bob', 'Charles'],
    'salary':[30000, 50000, 100000]
}, index = ['a','b','c'])
condition = data['salary'] == data['salary'].max()
print(condition)
# a    False
# b    False
# c     True
# Name: salary, dtype: bool
filteredData = data[condition]
print(filteredData)
#       name  salary
# c  Charles  100000
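
Conditions can also be combined. The sketch below uses the same table and the element-wise operators `&` (and), `|` (or) and `~` (not); each comparison needs its own parentheses because these operators bind more tightly than `>=` or `<`.

```python
import pandas as pd

data = pd.DataFrame({
    'name': ['Amy', 'Bob', 'Charles'],
    'salary': [30000, 50000, 100000],
}, index=['a', 'b', 'c'])

# Both conditions must hold
both = (data['salary'] >= 50000) & (data['name'].str.startswith('B'))
# Either condition may hold
either = (data['salary'] < 40000) | (data['salary'] > 90000)
# Negate a condition
neither = ~either

print(data[both]['name'].tolist())     # ['Bob']
print(data[either]['name'].tolist())   # ['Amy', 'Charles']
print(data[neither]['name'].tolist())  # ['Bob']
```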

5. Analyzing the Google Play Store dataset

Collecting the data

The Google Play Store dataset is downloaded as a CSV file (googleplaystore.csv in the code below).

Reading the data

import pandas as pd
data = pd.read_csv('googleplaystore.csv')  # read the CSV into a DataFrame

print('Shape:', data.shape)
print('Columns:', data.columns)
# Shape: (10841, 13)
# Columns: Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
#        'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
#        'Android Ver'],
#       dtype='object')

Statistics on ratings

print('Mean', data['Rating'].mean())
print('Median', data['Rating'].median())
print('Mean of the top 100', data['Rating'].nlargest(100).mean())
# Mean 4.193338315362448
# Median 4.3
# Mean of the top 100 5.14

The mean of the top 100 ratings is greater than 5, so the data must contain an error. Find the bad rows and recompute.

print('The offending row is', data[data['Rating'] > 5], sep = '\n')
# App Category  Rating Reviews  \
# 10472  Life Made WI-Fi Touchscreen Photo Frame      1.9    19.0    3.0M   

#          Size Installs Type     Price Content Rating             Genres  \
# 10472  1,000+     Free    0  Everyone            NaN  February 11, 2018   

#       Last Updated Current Ver Android Ver  
# 10472       1.0.19  4.0 and up         NaN  
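# The Rating is 19.0, far greater than 5. The row is missing its Category value,
# so every field is shifted one column to the left (1.9 sits under Category, 19.0 under Rating).
condition = data['Rating'] <= 5  # rows with a NaN Rating also compare False and are dropped
data = data[condition]
print('Mean', data['Rating'].mean())
print('Median', data['Rating'].median())
print('Mean of the top 100', data['Rating'].nlargest(100).mean())
# Mean 4.191757420456978
# Median 4.3
# Mean of the top 100 5.0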
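
One side effect of filtering with `<= 5` is worth seeing on a small example: any comparison with NaN is False, so rows without a rating are removed together with the invalid one. The sketch below uses a made-up four-row table (not the real dataset) and shows one way to keep the unrated rows if that is preferred.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the App and Rating columns
ratings = pd.DataFrame({
    'App': ['A', 'B', 'C', 'D'],
    'Rating': [4.5, np.nan, 19.0, 5.0],
})

# Drops the invalid 19.0 and also the row whose Rating is NaN
valid = ratings[ratings['Rating'] <= 5]

# Keeps unrated rows while still removing the invalid value
keep_missing = ratings[(ratings['Rating'] <= 5) | ratings['Rating'].isna()]

print(valid['App'].tolist())         # ['A', 'D']
print(keep_missing['App'].tolist())  # ['A', 'B', 'D']
```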

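Statistics on install counts

The values in the Installs column are all strings, which pandas can convert to numbers.
Note that the conversion fails if a string contains special characters or letters, so use replace to turn those into empty strings first.

import pandas as pd
# Read the data
data = pd.read_csv('googleplaystore.csv')  # read the CSV into a DataFrame
print(data['Installs'])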
# 0            10,000+
# 1           500,000+
# 2         5,000,000+
# 3        50,000,000+
# 4           100,000+
#             ...     
# 10834           500+
# 10836         5,000+
# 10837           100+
# 10839         1,000+
# 10840    10,000,000+
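# Name: Installs, Length: 9366, dtype: object
# (Length 9366 and the gaps in the index come from the filtered data of the previous step;
#  a freshly read file has 10841 rows, as in the output after conversion.)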
data['Installs'] = pd.to_numeric(data['Installs'].str.replace('[,+]', '', regex = True).replace('Free', ''))
print(data['Installs'])
# 0           10000.0
# 1          500000.0
# 2         5000000.0
# 3        50000000.0
# 4          100000.0
#             ...    
# 10836        5000.0
# 10837         100.0
# 10838        1000.0
# 10839        1000.0
# 10840    10000000.0
# Name: Installs, Length: 10841, dtype: float64

print('Mean', data['Installs'].mean())
condition = data['Installs'] > 100000
print('Number of apps with more than 100000 installs', data[condition].shape[0])
# Mean 15464338.882564576
# Number of apps with more than 100000 installs 4950
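
Replacing each known bad value by hand (such as 'Free') works when the bad values are known in advance. An alternative sketch, on a made-up sample that mimics the Installs column, is to strip the punctuation and let `pd.to_numeric` turn anything it cannot parse into NaN via `errors='coerce'`:

```python
import pandas as pd

# Hypothetical sample mimicking the Installs column, including the stray 'Free'
installs = pd.Series(['10,000+', '500+', 'Free', '1,000,000+'])

# Remove commas and plus signs, then coerce unparsable strings to NaN
cleaned = installs.str.replace('[,+]', '', regex=True)
numeric = pd.to_numeric(cleaned, errors='coerce')

print(numeric.tolist())            # [10000.0, 500.0, nan, 1000000.0]
print(int(numeric.isna().sum()))   # 1
print(int((numeric > 1000).sum())) # 2
```

The NaN entries are skipped by `mean()` and compare False in conditions such as `> 100000`, so they do not distort the statistics.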

Counting by keyword

keyword = input('Enter a keyword: ')
condition = data['App'].str.contains(keyword, case = False)
print('Number of apps containing the keyword', data[condition].shape[0])
# Enter a keyword: game
# Number of apps containing the keyword 257
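
By default `str.contains` interprets the keyword as a regular expression, so user input containing characters such as `(` or `+` may match differently than intended. A hedged sketch, using a made-up list of app names, that matches the keyword literally with `regex=False` and treats missing names as non-matches with `na=False`:

```python
import pandas as pd

# Hypothetical app names; the real column is data['App']
apps = pd.Series(['Chess (Free)', 'Free Fire', 'Photo Editor', None], name='App')

keyword = '(free)'  # parentheses would form a regex group if regex=True

# Literal, case-insensitive match; missing names count as no match
literal = apps.str.contains(keyword, case=False, regex=False, na=False)

print(literal.tolist())        # [True, False, False, False]
print(apps[literal].shape[0])  # 1
```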