pandas Basics 02

Tracey_Chen

Category: Tracey's Python Programming

Article link: https://blog.csdn.net/qq_41048228/article/details/111567424


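I. Reading and Writing Files

1. Reading files
pandas provides many file-reading functions and supports a wide range of data formats.

In practice, the most commonly read formats are csv, Excel, and txt.

Read a local file: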

import pandas

file=pandas.read_csv('/Users/l/Desktop/sample.csv')

print(file.head())

Read a txt file:

df=pandas.read_table('/Users/l/Desktop/result3.txt',sep='\t',names=['ap_categories','ssid','bssid','uuid_count','time'])  # sep sets the delimiter, names sets the column names

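Read an Excel file:

file = pandas.read_excel('/Users/l/Desktop/user_ap_bs_poi/all.xlsx',nrows=200)  # read only the first 200 rows

Reading Excel files is somewhat slower than reading CSV files, possibly because an Excel file can contain multiple sheets.
Parameters for reading files:
header=None: do not treat the first row as column names
index_col: the column or columns to use as the index
usecols: the set of columns to read; all columns are read by default
parse_dates: the columns to convert to datetime
nrows: the number of data rows to read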
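
The parameters listed above can be tried without any file on disk. The following is a minimal sketch that assumes a small made-up CSV held in memory (the column names id, day, and value are hypothetical and do not come from the files used in this article):

```python
import io
import pandas

# Hypothetical in-memory CSV used in place of a file on disk.
csv_text = "id,day,value\n1,2020-11-23,10\n2,2020-11-24,20\n3,2020-11-25,30\n"

df = pandas.read_csv(
    io.StringIO(csv_text),
    index_col="id",                   # use the id column as the row index
    usecols=["id", "day", "value"],   # columns to read
    parse_dates=["day"],              # convert the day column to datetime
    nrows=2,                          # read only the first two data rows
)
print(df)
print(df.dtypes)

# header=None treats the first line as data and numbers the columns 0, 1, 2.
raw = pandas.read_csv(io.StringIO(csv_text), header=None)
print(raw.shape)
```

With header=None the header line is counted as a data row, so raw has one more row than the CSV has records.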

Writing files:
Once the data has been processed, it usually needs to be saved as a new file in a local folder:

pivot_file.to_csv('/Users/l/Desktop/数据验证/pivot_table.csv')

To leave the row index out of the saved file:

pivot_file.to_csv('/Users/l/Desktop/数据验证/pivot_table.csv',index=False)

Save as txt:

pivot_file.to_csv('/Users/l/Desktop/数据验证/pivot_table.txt',sep='\t',index=False)  # still to_csv, with the delimiter set explicitly
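
To see what the index and sep arguments change, to_csv can be called without a path, in which case it returns the text instead of writing a file. This is a small sketch; the table below is a hypothetical stand-in for pivot_file:

```python
import io
import pandas

# Hypothetical small table standing in for pivot_file.
table = pandas.DataFrame({"a": [0, 1], "b": [3, 4]})

with_index = table.to_csv()                           # row index written as an unnamed first column
without_index = table.to_csv(index=False)             # row index omitted
tab_separated = table.to_csv(sep="\t", index=False)   # tab-delimited text

print(with_index)
print(without_index)

# Reading the tab-separated text back reproduces the original table.
restored = pandas.read_csv(io.StringIO(tab_separated), sep="\t")
print(restored.equals(table))
```

Saving with the default index=True and reading the file back later adds an extra unnamed column, which is why index=False is often the more convenient choice.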

Convert to markdown or LaTeX:
to_markdown depends on the tabulate package, so install it first:

import pandas
data=pandas.read_csv('/Users/l/Desktop/test.csv')

print(data.to_markdown())

print(data.to_latex())

2. Basic data structures
Series and DataFrame

Series:
A Series is made up of four parts:

1/data
2/index
3/dtype
4/name

a=pandas.Series(data=[1,2,'q',[4,5],{'name':'chen'}],index=[1,2,3,4,5],
                   dtype='object',name='chen')
print(a)


1                   1
2                   2
3                   q
4              [4, 5]
5    {'name': 'chen'}
Name: chen, dtype: object

The attributes of a Series can be accessed with dot notation:

print(a.values)
print(a.index)
print(a.dtype)
print(a.name)


[1 2 'q' list([4, 5]) {'name': 'chen'}]
Int64Index([1, 2, 3, 4, 5], dtype='int64')
object
chen

Get the shape (length) of a Series:

print(a.shape)
(5,)

Get a value by its index label:

a=pandas.Series(data=[1,2,'q',[4,5],{'name':'chen'}],index=[1,2,3,4,5],
                   dtype='object',name='chen')
print(a[3])

q

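DataFrame
A DataFrame is essentially a Series extended with a column dimension. Its main constructor parameters are data, index, and columns.

Create a DataFrame:

import numpy as np
import pandas

data=np.arange(12).reshape(3,4)
print(data)
a=pandas.DataFrame(data=data,
                   index=['row_%d'%i for i in range(3)],
                   columns=['a','b','c','d'])
print(a)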

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

       a  b   c   d
row_0  0  1   2   3
row_1  4  5   6   7
row_2  8  9  10  11

Construct one from a mapping of column names to data:

b=pandas.DataFrame(data={'a':[0,1,2],'b':[3,4,5],'c':[6,7,8]},
                   index=[0,1,2])
print(b)

   a  b  c
0  0  3  6
1  1  4  7
2  2  5  8

Select the columns you want from a DataFrame:

b=pandas.DataFrame(data={'a':[0,1,2],'b':[3,4,5],'c':[6,7,8]},
                   index=[0,1,2])

print(b[['a','b']])


   a  b
0  0  3
1  1  4
2  2  5

Selecting a single column returns a Series; selecting multiple columns returns a DataFrame.
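
The difference in return type depends on whether a single label or a list of labels is passed. A brief sketch, reusing the same three-column frame as above:

```python
import pandas

b = pandas.DataFrame({"a": [0, 1, 2], "b": [3, 4, 5], "c": [6, 7, 8]})

one_column = b["a"]            # single label: a Series
one_column_frame = b[["a"]]    # list holding one label: still a DataFrame
two_columns = b[["a", "b"]]    # list of labels: a DataFrame

print(type(one_column).__name__)
print(type(one_column_frame).__name__)
print(type(two_columns).__name__)
```

Passing a one-element list is a handy way to keep a single column in DataFrame form.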

Get the attributes of a DataFrame:

b=pandas.DataFrame(data={'a':[0,1,2],'b':[3,4,5],'c':[6,7,8]},
                   index=[0,1,2])

print(b.values)
print(b.index)
print(b.columns)
print(b.dtypes)


[[0 3 6]
 [1 4 7]
 [2 5 8]]
Int64Index([0, 1, 2], dtype='int64')
Index(['a', 'b', 'c'], dtype='object')
a    int64
b    int64
c    int64
dtype: object

Check the shape of a DataFrame:

b=pandas.DataFrame(data={'a':[0,1,2],'b':[3,4,5],'c':[6,7,8]},
                   index=[0,1,2])
print(b.shape)

(3, 3)

Transpose a DataFrame:

b=pandas.DataFrame(data={'a':[0,1,2],'b':[3,4,5],'c':[6,7,8]},
                   index=[0,1,2])
                   
print(b)
print(b.T)

3. Common basic functions

Overview: head, tail

View the first 3 rows and the last 4 rows of the data:

import pandas
data=pandas.read_csv('/Users/liubingfeng/Desktop/test.csv')
print(data.head(3))
print(data.tail(4))

         long        lat start_time_format  end_time_format
0  116.864643  38.310846   2020/11/23 6:35  2020/11/23 7:25
1  116.864762  38.311371   2020/11/23 7:08  2020/11/23 7:25
2  116.831244  38.309362   2020/11/23 8:23  2020/11/23 8:40
          long        lat start_time_format   end_time_format
10  116.831652  38.308645   2020/11/23 8:42   2020/11/23 9:43
11  116.831674  38.308702   2020/11/23 9:29   2020/11/23 9:45
12  116.831652  38.308645   2020/11/23 9:29   2020/11/23 9:43
13  116.864731  38.311345  2020/11/23 10:11  2020/11/23 10:36

info and describe give an overview of the data and some basic statistics:

print(data.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   long               14 non-null     float64
 1   lat                14 non-null     float64
 2   start_time_format  14 non-null     object 
 3   end_time_format    14 non-null     object 
dtypes: float64(2), object(2)
memory usage: 576.0+ bytes
None

This shows the number of non-null values (and therefore any missing data) and the dtype of each column.

print(data.describe())
             long        lat
count   14.000000  14.000000
mean   116.838654  38.309448
std      0.014123   0.001027
min    116.831244  38.308645
25%    116.831534  38.308649
50%    116.831652  38.309032
75%    116.831696  38.309702
max    116.864762  38.311371

This shows descriptive statistics for each numeric column.

Summary statistics functions:
sum, mean, median, var, std, max, min, and so on

d1=data[['long','lat']]

print(d1.max())

long    116.864762
lat      38.311371
dtype: float64

print(d1.mean())

long    116.838654
lat      38.309448
dtype: float64

print(d1.min())

long    116.831244
lat      38.308645
dtype: float64

As with the NumPy matrix operations covered earlier, axis=0 or axis=1 selects whether the statistic is computed per column or per row.
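
A short sketch of the two axis settings, using made-up numbers rather than the test.csv data so the results are easy to check by hand:

```python
import pandas

# Hypothetical values, unrelated to the test.csv data above.
d = pandas.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]})

per_column = d.sum(axis=0)  # one result per column (the default)
per_row = d.sum(axis=1)     # one result per row

print(per_column)
print(per_row)
```

With axis=0 the result is indexed by column name; with axis=1 it is indexed by the row labels.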

quantile函数：

d1=data[['long','lat']]
print(d1.quantile(0.25))  # first quartile (0.25 quantile)
print(d1.quantile(0.8))  # 0.8 quantile

long    116.831534
lat      38.308649
Name: 0.25, dtype: float64


long    116.844877
lat      38.310206
Name: 0.8, dtype: float64

print(d1.count())  # number of non-missing values

long    14
lat     14
dtype: int64

print(d1.idxmax())  # index label of the maximum value

long    1
lat     1
dtype: int64

print(d1.mean(axis=1).head())  # mean of each row

0    77.587744
1    77.588067
2    77.570303
3    77.570390
4    77.570390
dtype: float64

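Unique-value functions:
unique: array of the unique values
nunique: number of unique values

print(data['long'].unique())  # unique values of the long column
print(data['lat'].nunique())  # number of unique values in the lat column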


[116.864643 116.864762 116.831244 116.831309 116.831495 116.8317
 116.831652 116.831686 116.831674 116.864731]

10

print(data['long'].value_counts())

116.831652    4
116.831309    2
116.831700    1
116.831495    1
116.864643    1
116.831244    1
116.864762    1
116.831686    1
116.864731    1
116.831674    1
Name: long, dtype: int64
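
The three functions are closely related, which a small example makes visible. This sketch uses a short made-up Series instead of the long column:

```python
import pandas

s = pandas.Series([3, 1, 3, 2, 3, 1])  # hypothetical values

uniques = s.unique()        # distinct values, in order of first appearance
n_unique = s.nunique()      # how many distinct values there are
counts = s.value_counts()   # occurrences of each distinct value, most frequent first

print(uniques)
print(n_unique)
print(counts)
```

The length of unique() matches nunique() here, and the counts from value_counts() add up to the length of the Series, because there are no missing values.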

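Removing duplicates:

# Results are stored under new names so that data keeps all of its rows for the later examples.
keep_first=data.drop_duplicates(['long','lat'],keep='first')  # 'first' keeps the first row of each group of duplicates
keep_last=data.drop_duplicates(['long'],keep='last')  # 'last' keeps the last row of each group of duplicates
keep_none=data.drop_duplicates(['long','lat'],keep=False)  # False drops every row that has a duplicate

You can choose which columns are used to identify duplicates.

duplicated works much like drop_duplicates, but it returns a boolean Series indicating whether each row is a duplicate; the keep parameter behaves the same way in both. In the returned Series, duplicated rows are True and all others are False. drop_duplicates is equivalent to removing the rows for which duplicated returns True.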

print(data.duplicated(['long','lat']).head())

0    False
1    False
2    False
3    False
4     True
dtype: bool
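
The equivalence between the two functions can be checked directly. A minimal sketch on a hypothetical five-row frame (the values are invented for illustration):

```python
import pandas

# Hypothetical coordinates; rows 0 and 1 are exact duplicates.
df = pandas.DataFrame({"long": [1.0, 1.0, 2.0, 2.0, 3.0],
                       "lat":  [5.0, 5.0, 6.0, 7.0, 8.0]})

flags = df.duplicated(["long", "lat"], keep="first")   # True for repeats after the first
via_mask = df[~flags]                                  # drop the flagged rows by hand
via_drop = df.drop_duplicates(["long", "lat"], keep="first")
print(via_mask.equals(via_drop))

# keep=False flags every member of a duplicated group, including the first one.
all_flags = df.duplicated(["long", "lat"], keep=False)
print(all_flags.tolist())
```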

Replacement functions:

replace:

print(data['end_time_format'].replace({'2020/11/23 7:25':'early','2020/11/23 8:40':'late'}).head())

0    early
1    early
2     late
3     late
4     late
Name: end_time_format, dtype: object

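replace also supports a special directional replacement: with the method parameter set to ffill, each matched value is replaced by the nearest preceding value that has not been replaced; with bfill, it is replaced by the nearest following value that has not been replaced.

m=pandas.Series([1,2,'s','t',5])
print(m)
print(m.replace([1,2],method='ffill'))
print(m.replace([1,2],method='bfill'))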

0    1
1    2
2    s
3    t
4    5
dtype: object
0    1
1    1
2    s
3    t
4    5
dtype: object
0    s
1    s
2    s
3    t
4    5
dtype: object
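
The method keyword of replace is reported as deprecated in newer pandas releases (from 2.1 onward, according to the release notes), so the calls above may warn or fail depending on the installed version. One possible alternative, offered here as an assumption rather than as the official replacement, is to blank out the matched values with mask and then fill with ffill or bfill:

```python
import pandas

m = pandas.Series([1, 2, 's', 't', 5])

# Turn the values to be replaced into missing values.
masked = m.mask(m.isin([1, 2]))

filled_back = masked.bfill()      # fill from the nearest following value
filled_forward = masked.ffill()   # fill from the nearest preceding value

print(filled_back.tolist())
print(filled_forward.tolist())
```

The backward fill matches the bfill output shown above. The forward fill differs at the start of the Series: positions with no earlier value to copy stay missing instead of keeping their original values.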

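Logical replacement covers where and mask, which are exactly symmetric: where replaces the entries where the condition is False, while mask replaces the entries where the condition is True. When no replacement value is given, the entries are replaced with missing values (NaN).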

where:

a=pandas.Series([0,1,2,3,4,5])
print(a.where(a<3))

0    0.0
1    1.0
2    2.0
3    NaN
4    NaN
5    NaN
dtype: float64

a=pandas.Series([0,1,2,3,4,5])
print(a.where(a<3,3))
print(a.mask(a<3,3))


0    0
1    1
2    2
3    3
4    3
5    3
dtype: int64
0    3
1    3
2    3
3    3
4    4
5    5
dtype: int64
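
The symmetry means that where with a condition gives the same result as mask with the negated condition. A short sketch (the replacement value -1 is an arbitrary choice for illustration):

```python
import pandas

a = pandas.Series([0, 1, 2, 3, 4, 5])
cond = a < 3

kept_where = a.where(cond, -1)   # keep values where cond is True, replace the rest
kept_mask = a.mask(~cond, -1)    # replace values where the negated condition is True

print(kept_where.tolist())
print(kept_where.equals(kept_mask))
```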

Numeric replacement covers the round, abs, and clip methods, which round values, take absolute values, and clip values to a range, respectively:

a=pandas.Series([0.1,1,-2,3.14159,4,5])
print(a.abs())
print(a.round())
print(a.clip(0,2))  # clip values to the range 0 to 2


0    0.10000
1    1.00000
2    2.00000
3    3.14159
4    4.00000
5    5.00000
dtype: float64
0    0.0
1    1.0
2   -2.0
3    3.0
4    4.0
5    5.0
dtype: float64
0    0.1
1    1.0
2    0.0
3    2.0
4    2.0
5    2.0
dtype: float64

Sorting functions:

print(data.sort_values('long').head())  # sort by value

         long        lat start_time_format  end_time_format
2  116.831244  38.309362   2020/11/23 8:23  2020/11/23 8:40
3  116.831309  38.309470   2020/11/23 8:23  2020/11/23 8:40
4  116.831309  38.309470   2020/11/23 8:23  2020/11/23 8:40
5  116.831495  38.309779   2020/11/23 8:30  2020/11/23 8:40
7  116.831652  38.308645   2020/11/23 8:42  2020/11/23 9:29

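Sorting by index works exactly like sorting by value, except that the sort keys are in the index. For a multi-level index, use the level parameter to specify the level name or level number. Note also that strings are sorted alphabetically. The data here has a single-level default integer index, so level is not needed:

print(data.sort_index(ascending=False).head())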

          long        lat start_time_format   end_time_format
13  116.864731  38.311345  2020/11/23 10:11  2020/11/23 10:36
12  116.831652  38.308645   2020/11/23 9:29   2020/11/23 9:43
11  116.831674  38.308702   2020/11/23 9:29   2020/11/23 9:45
10  116.831652  38.308645   2020/11/23 8:42   2020/11/23 9:43
9   116.831652  38.308645   2020/11/23 8:42   2020/11/23 8:57
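
The level parameter only becomes relevant once the index has several levels. The following sketch builds a hypothetical two-level index (city and year are invented names) with set_index and sorts on it:

```python
import pandas

# Hypothetical data with a two-level index built from the city and year columns.
df = pandas.DataFrame({
    "city": ["b", "a", "b", "a"],
    "year": [2020, 2021, 2021, 2020],
    "value": [1, 2, 3, 4],
}).set_index(["city", "year"])

# Sort by city ascending, then by year descending within each city.
by_both = df.sort_index(level=["city", "year"], ascending=[True, False])
print(by_both)

# A single level can be given by name (or by number, here level=1).
by_year = df.sort_index(level="year")
print(by_year)
```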

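The apply method:

apply is commonly used to iterate over the rows or columns of a DataFrame. Its axis argument has the same meaning as in the statistical aggregation functions above, and the function passed to apply usually takes a Series as input.

def col_sum(x):  # named col_sum to avoid shadowing the built-in sum
    return x.sum()

print(data.apply(col_sum))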

long                                                           1635.74
lat                                                            536.332
start_time_format    2020/11/23 6:352020/11/23 7:082020/11/23 8:232...
end_time_format      2020/11/23 7:252020/11/23 7:252020/11/23 8:402...
dtype: object
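
The same function can be applied per row by passing axis=1. A small sketch with made-up coordinates (value_range is a hypothetical helper, not part of pandas):

```python
import pandas

# Hypothetical values chosen so the results are easy to verify.
d = pandas.DataFrame({"long": [116.0, 117.0], "lat": [38.0, 39.0]})

def value_range(x):
    # x is a Series: one column when axis=0, one row when axis=1
    return x.max() - x.min()

per_column = d.apply(value_range)          # axis=0 is the default
per_row = d.apply(value_range, axis=1)

print(per_column)
print(per_row)
```

For simple aggregations like this, the built-in methods (for example d.max() - d.min()) are usually faster than apply; apply is most useful when no vectorized equivalent exists.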

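Window objects:

pandas has three kinds of windows: the sliding window rolling, the expanding window expanding, and the exponentially weighted window ewm.

To use a sliding-window function, first call .rolling on a Series to obtain a rolling object; its most important parameter is the window size, window: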

a=pandas.Series([0.1,1,-2,3.14159,4,5])

r=a.rolling(window=2)
print(r)
print(r.mean())
print(r.sum())

Rolling [window=2,center=False,axis=0]

0         NaN
1    0.550000
2   -0.500000
3    0.570795
4    3.570795
5    4.500000
dtype: float64


0        NaN
1    1.10000
2   -1.00000
3    1.14159
4    7.14159
5    9.00000
dtype: float64

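shift, diff, and pct_change are a group of window-like functions. They share the parameter periods=n (default 1) and respectively return the value n positions earlier, the difference from the value n positions earlier (unlike in NumPy, where n means the n-th order difference), and the growth rate relative to the value n positions earlier. n can be negative, which performs the same operation in the opposite direction.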

a=pandas.Series([0.1,1,-2,3.14159,4,5])
print(a.shift(2))
print(a.diff(3))
print(a.pct_change())


0        NaN
1        NaN
2    0.10000
3    1.00000
4   -2.00000
5    3.14159
dtype: float64
0        NaN
1        NaN
2        NaN
3    3.04159
4    3.00000
5    7.00000
dtype: float64
0         NaN
1    9.000000
2   -3.000000
3   -2.570795
4    0.273241
5    0.250000
dtype: float64

print(a.rolling(2).apply(lambda x:list(x)[0]))

print(a.rolling(3).apply(lambda x:list(x)[1]-list(x)[-1]))

0        NaN
1    0.10000
2    1.00000
3   -2.00000
4    3.14159
5    4.00000
dtype: float64
0        NaN
1        NaN
2    3.00000
3   -5.14159
4   -0.85841
5   -1.00000
dtype: float64
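
The relationship between the three functions can be written out explicitly: diff and pct_change can both be expressed in terms of shift. A minimal sketch with n=2 on the same Series:

```python
import pandas

a = pandas.Series([0.1, 1, -2, 3.14159, 4, 5])
n = 2

shifted = a.shift(n)            # value n positions earlier
diff_manual = a - shifted       # what diff(n) computes
pct_manual = a / shifted - 1    # what pct_change(n) computes

# Both comparisons pass silently when the results agree.
pandas.testing.assert_series_equal(a.diff(n), diff_manual)
pandas.testing.assert_series_equal(a.pct_change(n), pct_manual)

# A negative n looks in the opposite direction, so the NaN moves to the end.
print(a.shift(-1))
```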

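Expanding windows:
An expanding window, also called a cumulative window, can be thought of as a window of dynamic length: its size runs from the start of the sequence to the current position, and the aggregation function is applied to each of these progressively growing windows.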

a=pandas.Series([0.1,1,-2,3.14159,4,5])
print(a.expanding().mean())


0    0.100000
1    0.550000
2   -0.300000
3    0.560397
4    1.248318
5    1.873598
dtype: float64
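
Because each window starts at the beginning of the Series, expanding aggregations line up with the cumulative functions. A brief sketch that compares them on the same Series (running_count is a helper introduced here for illustration):

```python
import pandas

a = pandas.Series([0.1, 1, -2, 3.14159, 4, 5])

expanding_sum = a.expanding().sum()
expanding_mean = a.expanding().mean()

# Number of elements in each window: 1, 2, 3, ...
running_count = pandas.Series(range(1, len(a) + 1), index=a.index)

# Both comparisons pass silently when the results agree.
pandas.testing.assert_series_equal(expanding_sum, a.cumsum())
pandas.testing.assert_series_equal(expanding_mean, a.cumsum() / running_count)

print(expanding_mean)
```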