pandas data structures and some basic usage

章逸佳

Article link: https://blog.csdn.net/weixin_43161647/article/details/92964449

1. Data structures
(1) Series
Below are some ways to create a Series with pandas; see the link for more:
http://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data=[1,2,3,4]
index=['a','b','c','d']
s=pd.Series(data,index,name='something')#create a Series named 'something'; a, b, c, d are the index labels and 1, 2, 3, 4 are the values
print(s)
print(s.index)

dic={'a':0,'b':1,'c':3}
s2=pd.Series(dic)#a Series can also be created from a dict
print(s2)
print(s['a'])
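
When a dict is passed together with an explicit index, pandas looks each requested label up in the dict. The following is a minimal sketch of that behavior; the name `s3` and the chosen labels are illustrative and not part of the original post.

```python
import pandas as pd

dic = {'a': 0, 'b': 1, 'c': 3}
# 'd' is not a key of dic, so its value becomes NaN;
# 'c' is left out because it is not among the requested labels
s3 = pd.Series(dic, index=['a', 'b', 'd'])
print(s3)
print(s3.isna().tolist())
```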

(2) DataFrame
- A DataFrame can be created from dicts, lists, records, and so on.

df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })
print(df2)
index=df2.index
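columns=df2.columns#the column names appear in the first row of the printout
print(index)#the index labels appear in the first column of the printout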
print(columns)
>>>
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
Int64Index([0, 1, 2, 3], dtype='int64')
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

df=pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]))
df2=pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]),orient='index',columns=[1,2,3])
##By default A and B become the column names; with orient='index' they become the index instead, and new column names can be supplied.
print(df)
print(df2)
>>>
   A  B
0  1  4
1  2  5
2  3  6
    1  2  3
A  1  2  3
B  4  5  6

date=pd.date_range('20190101','20190110')#build a range of dates to use as the index
print(date)
df=pd.DataFrame(np.random.rand(10,4),index=date,columns=['1','2','3','4'])
print(df)

2. DataFrame operations
Below are some basic DataFrame operations.

print('The first five rows of df are:\n',df.head())
print('The last three rows of df are:\n',df.tail(3))
print('The values contained in df are:\n',df.values)
print('The columns of df are:\n',df.columns)
print('The index of df is:\n',df.index)
print(df.describe())#print descriptive statistics
print(df.T)#print the transposed DataFrame
df1=df.sort_index(axis=0,ascending=False)#sort by index, descending
df2=df.sort_values(by='1')#sort by the values in column '1'
print(df1)
print(df2)

3. Selecting data
Direct indexing, loc, and iloc.

Simplest form of indexing
print(df['1'])
print(df['20190102':'20190107'])

Selection by label
print(df.loc['20190102':'20190104',['1','2']])

Selection by position
print(df.iloc[[3,4,5],[1,2]])
print(df.iloc[3,3])
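
The difference between label-based and position-based selection shows up most clearly at slice endpoints. The sketch below uses a small hypothetical frame `demo` (not from the original post) with known values: a `loc` slice includes both end labels, while an `iloc` slice excludes the stop position.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame: 5 dates, 2 columns holding 0..9
dates = pd.date_range('20190101', periods=5)
demo = pd.DataFrame(np.arange(10).reshape(5, 2), index=dates, columns=['1', '2'])

by_label = demo.loc['20190102':'20190104', ['1']]  # both end labels are included
by_position = demo.iloc[1:4, [0]]                  # stop position 4 is excluded

print(by_label)
print(by_position)
print(by_label.equals(by_position))
```

Both selections return the same three rows, so the label slice `'20190102':'20190104'` corresponds to the positional slice `1:4`, not `1:3`.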

4. Assignment
Assignment mainly covers adding a new column, assigning by label, assigning by position, and assigning with a where-style condition.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#some assignment operations
date=pd.date_range('20190101',periods=6)
df=pd.DataFrame(np.random.rand(6,4),index=date,columns=['A','B','C','D'])
print(date)
print(df)
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20190101', periods=6))
df['E']=s1#append a new column
df.loc[date[0:3],'A']=1#assign by label
df.iloc[1:3,1:2]=5#assign by position
print(df)

df2=df.copy()
df2[df2>0]=-df2
print(df2)
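
The boolean-mask assignment `df2[df2>0]=-df2` can also be expressed with `DataFrame.where`, which keeps entries where the condition holds and takes the replacement from its second argument elsewhere. A minimal sketch on a hypothetical frame `demo` (the data is made up for illustration):

```python
import pandas as pd

demo = pd.DataFrame({'A': [1.0, -2.0, 3.0], 'B': [-4.0, 5.0, -6.0]})

# Keep entries that are already <= 0; replace positive entries with their negation
negated = demo.where(demo <= 0, -demo)
print(negated)
```

Unlike the in-place mask assignment, `where` returns a new DataFrame and leaves `demo` unchanged.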

5. Applying functions to a DataFrame
Functions can be applied to a DataFrame to compute results.

date=pd.date_range('20190101',periods=6)
df=pd.DataFrame(np.random.rand(6,4),index=date,columns=['A','B','C','D'])
print(df)
print(df.apply(np.sum))#sum of each column

#print(df.apply(lambda x:x.max()-x.min()))#maximum minus minimum of each column
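
By default `apply` passes each column to the function; with `axis=1` it passes each row instead. The sketch below illustrates both directions on a hypothetical frame `demo` with fixed values, so the results are easy to check by hand.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 40]})

col_sums = demo.apply(np.sum)                        # axis=0 (default): one result per column
row_sums = demo.apply(lambda row: row.sum(), axis=1) # axis=1: one result per row
col_range = demo.apply(lambda x: x.max() - x.min())  # maximum minus minimum of each column

print(col_sums)
print(row_sums)
print(col_range)
```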

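6. String methods
Series provides a set of string-processing methods under its str attribute, which make it easy to operate on each element of the array.
- Splitting: series.str.split; see help() for more usage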

s = pd.Series(['lower', 'CAPITALS', 'this is a sentence', 'SwApCaSe'])
print(s)
print(s.str.split(' '))#split each string on spaces
print(s.str.split('_'))#split each string on underscores; none of the strings contain one, so each stays whole
>>>
0                 lower
1              CAPITALS
2    this is a sentence
3              SwApCaSe
dtype: object
0                    [lower]
1                 [CAPITALS]
2    [this, is, a, sentence]
3                 [SwApCaSe]
dtype: object
0                 [lower]
1              [CAPITALS]
2    [this is a sentence]
3              [SwApCaSe]
dtype: object
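
If the pieces are needed as separate columns rather than lists, `str.split` accepts `expand=True`. A short sketch with the same strings as above (the name `parts` is illustrative):

```python
import pandas as pd

s = pd.Series(['lower', 'CAPITALS', 'this is a sentence', 'SwApCaSe'])

# expand=True returns a DataFrame with one column per piece;
# rows with fewer pieces are padded with missing values
parts = s.str.split(' ', expand=True)
print(parts)
print(parts.shape)
```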

- Concatenation: series.str.cat; see help() for more usage

s = pd.Series(['a', 'b', np.nan, 'd'])
print(s)
print(s.str.cat(['A', 'B', 'C', 'D'], sep=',', na_rep='-'))
s1 = pd.Series(['a', 'b', np.nan, 'd'])
print(s1.str.cat(sep=' ',na_rep='?'))
>>>
0      a
1      b
2    NaN
3      d
dtype: object
0    a,A
1    b,B
2    -,C
3    d,D
dtype: object
a b ? d

- Case conversion

s = pd.Series(['lower', 'CAPITALS', 'this is a sentence', 'SwApCaSe'])
print(s)
print(s.str.lower())
print(s.str.upper())
print(s.str.title())#capitalize the first letter of every word
print(s.str.capitalize())#capitalize only the first letter of the string
print(s.str.swapcase())#swap upper and lower case

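7. Merging and appending DataFrames

 ##Use append to add new rows after an existing DataFrame (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0)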
    >>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
    >>> df
       A  B
    0  1  2
    1  3  4
    >>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
    >>> df.append(df2)
       A  B
    0  1  2
    1  3  4
    0  5  6
    1  7  8
    With `ignore_index` set to True:
    >>> df.append(df2, ignore_index=True)
       A  B
    0  1  2
    1  3  4
    2  5  6
    3  7  8
#use concat to combine DataFrames
df = pd.DataFrame(np.random.randn(10, 4))
print(pd.concat([df[:3], df[3:7], df[7:]]))
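
On pandas versions where `DataFrame.append` is unavailable, the two append results shown earlier in this section can be reproduced with `pd.concat`. A minimal sketch reusing the same two small frames; the names `stacked` and `renumbered` are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

stacked = pd.concat([df, df2])                        # keeps the original row labels 0, 1, 0, 1
renumbered = pd.concat([df, df2], ignore_index=True)  # relabels the rows 0..3

print(stacked)
print(renumbered)
```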

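8. Grouping
- Grouping
By "group by" we mean a process involving one or more of the following steps:
Splitting: dividing the data into groups based on some criteria
Applying: applying a function to each group independently
Combining: combining the results into a data structure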

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],'B': ['one', 'one', 'two', 'three',
 'two', 'two', 'one', 'three'],'C': np.random.randn(8),'D': np.random.randn(8)})
print(df)
print(df.groupby('A').sum(numeric_only=True))#group by A and sum the numeric columns
print(df.groupby(['A','B']).sum())#group by A first, then by B
>>>
            C         D
A                      
bar  0.666157  2.072271
foo  4.958815 -1.054723
                  C         D
A   B                        
bar one   -0.766606  0.863325
    three  0.095483  0.645818
    two    1.337280  0.563128
foo one    5.175418  0.493768
    three -0.004776 -0.646949
    two   -0.211826 -0.901542
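
To make the three steps concrete, the sketch below performs the split, apply, and combine steps by hand and compares the result with the built-in aggregation. It uses a hypothetical frame `demo` with fixed values instead of random numbers so the sums can be verified.

```python
import pandas as pd

demo = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                     'C': [1.0, 2.0, 3.0, 4.0]})

# Splitting: iterating over groupby yields (key, sub-frame) pairs
# Applying: sum column C inside each sub-frame
pieces = {key: group['C'].sum() for key, group in demo.groupby('A')}

# Combining: collect the per-group results into one Series
manual = pd.Series(pieces, name='C')
builtin = demo.groupby('A')['C'].sum()

print(manual)
print(builtin)
```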

9. Reading and writing data

##Save the descriptive statistics to a CSV file and to an xlsx file
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],'B': ['one', 'one', 'two', 'three',
 'two', 'two', 'one', 'three'],'C': np.random.randn(8),'D': np.random.randn(8)})

describe=df.describe()
print(describe)
print(type(describe))
describe.to_csv('descriptive statistics.csv')
describe.to_excel('descriptive statistics.xlsx',sheet_name='sheet1')
##Read the data back from the CSV file; Excel files can also be read directly with pd.read_excel.
pd.read_csv('descriptive statistics.csv')
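
When a frame with a meaningful index is written to CSV, the index is stored as the first column, so it has to be restored on reading. The sketch below shows a round trip through an in-memory buffer; `io.StringIO` stands in for a file on disk, and the frame `demo` is hypothetical.

```python
import io
import pandas as pd

demo = pd.DataFrame({'C': [1.0, 2.0, 3.0], 'D': [4.0, 5.0, 6.0]})
summary = demo.describe()

buffer = io.StringIO()  # in-memory stand-in for a CSV file on disk
summary.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the statistic names (count, mean, ...) as the row index
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Without `index_col=0`, the statistic names would come back as an ordinary unnamed column next to a fresh integer index.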
