老男孩 (Oldboy) Data Analysis 02: pandas Basic Operations

Article link: https://blog.csdn.net/lemonguess/article/details/117642767

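Why learn pandas

NumPy already handles numeric data well; pandas builds on it and can also process many other kinds of data, such as strings.

What is pandas?

1. First, meet the two most commonly used classes in pandas: Series and DataFrame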

>>> import numpy as np
>>> import pandas as pd
>>> from pandas import Series

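Series
- A one-dimensional, array-like object made up of two parts:
- values: a set of data (an ndarray)
- index: the index labels associated with the data
DataFrame (key topic)
- Can be created from a list or a NumPy array
- Can be created from a dictionary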

2. Series indexes and data sources

  >>> from pandas import Series
  >>> s = Series(data=[1,2,3])
  >>> s
  0    1
  1    2
  2    3
  dtype: int64
  # 0, 1, 2 is the default index: the implicit index

  >>> s = Series(data=[7,8,9],index=['a','b','c'])
  >>> s['a']
  7
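  >>> s.iloc[0]
  7
  # a, b, c form the explicit index; it does not replace the implicit (positional) index
  # Older pandas also accepted s[0] here; newer releases deprecate that positional fallback, so s.iloc[0] is used
  # An explicit index makes the data easier to read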

NumPy arrays have no explicit index.
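The difference between the two kinds of index can be sketched as follows. This is a minimal illustration with made-up values; `.loc` and `.iloc` are the standard pandas accessors for label-based and position-based lookup.

```python
import pandas as pd

# Hypothetical data: three values with explicit labels.
s = pd.Series([7, 8, 9], index=["a", "b", "c"])

by_label = s.loc["a"]      # explicit (label-based) lookup
by_position = s.iloc[0]    # implicit (position-based) lookup

# A plain NumPy array only supports positional access.
arr = s.to_numpy()
first = arr[0]

print(by_label, by_position, first)
```

Using `.loc` and `.iloc` makes it unambiguous which index is meant, which helps when the labels themselves are integers.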

# Can two-dimensional data be used as the data source of a Series?
No: a Series can only hold one-dimensional data.

>>> s = Series(data=np.random.randint(0,100,size=(3,4)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\SoftwareSpace\python\lib\site-packages\pandas\core\series.py", line 327, in __init__
    data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True)
  File "D:\SoftwareSpace\python\lib\site-packages\pandas\core\construction.py", line 496, in sanitize_array
    raise Exception("Data must be 1-dimensional")
Exception: Data must be 1-dimensional
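When the data really is two-dimensional, a few workarounds are possible. The sketch below assumes a small made-up array (`np.arange` instead of random data, so the result is reproducible) and shows flattening, taking a single row, or switching to a DataFrame. The exact exception class raised by `Series` may differ between pandas versions, so the example catches `Exception` broadly.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x4 array standing in for the random data used above.
data = np.arange(12).reshape(3, 4)

try:
    pd.Series(data)
    error_message = None
except Exception as exc:  # the exact exception class varies by pandas version
    error_message = str(exc)

flat = pd.Series(data.ravel())   # flatten to one dimension (12 elements)
one_row = pd.Series(data[0])     # a single row is already one-dimensional
table = pd.DataFrame(data)       # keep the 2-D structure in a DataFrame

print(error_message)
print(flat.shape, one_row.shape, table.shape)
```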

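# A dictionary as the data source: the keys become the index

>>> # Keys are subject names: 语文 (Chinese), 数学 (math), 英语 (English)
>>> dic={'语文':100,'数学':120,'英语':120}
>>> s = Series(data=dic)
>>> s
语文    100
数学    120
英语    120
dtype: int64

3. Indexing and slicing a Series

Indexing:

>>> s['数学']
120
>>> s.iloc[1]
120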

Slicing:

>>> s=Series(data = np.random.randint(0,10,size=(5,)))
>>> s
0    1
1    7
2    0
3    9
4    6
dtype: int32
>>> s[0:6]
0    1
1    7
2    0
3    9
4    6
dtype: int32
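Slicing behaves differently depending on which index is used. The following sketch, with made-up labels `a` to `e`, shows that a positional slice excludes its end, a label slice includes it, and an out-of-range end (like `s[0:6]` above) is simply clipped.

```python
import pandas as pd

# Same values as above, but with hypothetical explicit labels.
s = pd.Series([1, 7, 0, 9, 6], index=["a", "b", "c", "d", "e"])

by_position = s.iloc[0:2]    # positions 0 and 1; the end is excluded
by_label = s.loc["a":"c"]    # labels a, b and c; the end is included
clipped = s.iloc[0:100]      # an out-of-range end is clipped to the length

print(by_position.tolist(), by_label.tolist(), len(clipped))
```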

4. Common Series attributes

- shape (the shape)
- size (number of elements)
- index (the index)
- values (the elements)

 >>> s.shape
 (5,)
 >>> s.size
 5
 >>> s.index
 RangeIndex(start=0, stop=5, step=1)
 >>> s.values
 array([1, 7, 0, 9, 6])

5. Common Series methods

- head(), tail()
- unique()
- isnull()
- notnull()
- add() / sub() / mul() / div()

 # Show the first 3 rows
 >>> s = Series(data=[1,1,1,2,2,3,4,5,6,7])
 >>> s.head(3)
 0    1
 1    1
 2    1
 dtype: int64
 
 # Show the last 3 rows
 >>> s.tail(3)
 7    5
 8    6
 9    7
 dtype: int64
 
 # Distinct values (duplicates removed)
 >>> s.unique()
 array([1, 2, 3, 4, 5, 6, 7], dtype=int64)
 
 # Number of distinct values
 >>> s.nunique()
 7

 # Test each element for a missing value
 >>> s.isnull()
 0    False
 1    False
 2    False
 3    False
 4    False
 5    False
 6    False
 7    False
 8    False
 9    False
 dtype: bool

 # Test each element for a non-missing value
 >>> s.notnull()
 0    True
 1    True
 2    True
 3    True
 4    True
 5    True
 6    True
 7    True
 8    True
 9    True
 dtype: bool
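The arithmetic methods in the list above are not demonstrated in the transcript, so here is a minimal sketch with made-up data. It shows that Series arithmetic aligns on index labels, that labels present on only one side produce NaN (which is where `isnull()` becomes useful), and that the `fill_value` parameter of `add()` can supply a default for the missing side.

```python
import pandas as pd

# Two hypothetical Series whose labels only partly overlap.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

total = s1 + s2                     # same as s1.add(s2); aligned by label
missing = total.isnull()            # True where only one side had the label
filled = s1.add(s2, fill_value=0)   # treat the absent side as 0 instead

doubled = s1.mul(2)                 # sub(), mul() and div() work the same way

print(total)
print(filled)
```

Because alignment is by label rather than by position, the order of the elements in the two Series does not matter.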

DataFrame

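1. Introduction

A DataFrame is a tabular data structure made up of several columns arranged in a fixed order. It was designed to extend the Series from one dimension to several. A DataFrame has both a row index and a column index.

Row index: index
Column index: columns
Values: values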

2. Creating a DataFrame

From a nested list or an ndarray

From a dictionary

>>> from pandas import DataFrame
>>> DataFrame(data=[[1,2,3],[4,5,6]])
   0  1  2
0  1  2  3
1  4  5  6
>>> DataFrame(data=np.random.randint(0,100,size=(6,8)))
    0   1   2   3   4   5   6   7
0   1  53  15  54  93  87  24  53
1  86  46  91  59  77  49  10  20
2  74  98  41  49  26  77  58  10
3  50  68  46  67  41   6  78  26
4   1  60  29  80  55  37  89  10
5  23  21  51  14  16  31  57  20
>>> # Columns are student names; rows are height (身高) in cm and weight (体重) in kg
>>> dic = {'张三':[172,69],'李四':[168,64],'王五':[184,79]}
>>> df = DataFrame(data=dic,index=['身高/cm','体重/kg'])
>>> df
        张三   李四   王五
身高/cm  172  168  184
体重/kg   69   64   79
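When a DataFrame is built from a dictionary, the keys become the column labels and each column is itself a Series. The sketch below reuses the same data to illustrate this, and uses the `.T` (transpose) attribute to put one person on each row instead.

```python
import pandas as pd

# Same data as above: keys are student names, rows are height and weight.
dic = {"张三": [172, 69], "李四": [168, 64], "王五": [184, 79]}
df = pd.DataFrame(dic, index=["身高/cm", "体重/kg"])

column = df["张三"]   # one column, returned as a Series
people = df.T         # transpose: one row per person

print(type(column).__name__)
print(people)
```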

3. DataFrame attributes

values, columns, index, shape

>>> df
        张三   李四   王五
身高/cm  172  168  184
体重/kg   69   64   79
>>> df.values
array([[172, 168, 184],
       [ 69,  64,  79]], dtype=int64)
>>> # values is an attribute, not a method, so calling it fails
>>> df.values()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'numpy.ndarray' object is not callable
>>> df.columns
Index(['张三', '李四', '王五'], dtype='object')
>>> df.index
Index(['身高/cm', '体重/kg'], dtype='object')
>>> df.shape
(2, 3)

4. Indexing and slicing

>>> df = DataFrame(data=np.random.randint(0,100,size=(5,4)),index=['a','b','c','d','e'],columns=['A','B','C','D'])
>>> df
    A   B   C   D
a  98  21  68  44
b  46  36  91   4
c  61  87  14  42
d  11  59  85  92
e  46  16  50  71

# Select columns by indexing

>>> df['A']
a    98
b    46
c    61
d    11
e    46
>>> df[['A','B']]
    A   B
a  98  21
b  46  36
c  61  87
d  11  59
e  46  16

# Select rows by indexing
df.loc['a']   # loc works on the explicit (label) index
df.iloc[0]    # iloc works on the implicit (positional) index

>>> df.iloc[[0,1]]
    A   B   C   D
a  98  21  68  44
b  46  36  91   4
>>> df.loc[['a','b']]
    A   B   C   D
a  98  21  68  44
b  46  36  91   4

Select a single element by indexing

>>> df.loc['b','A']
46
>>> df.iloc[1,1]
36

Row slicing

>>> df[0:2]
    A   B   C   D
a  98  21  68  44
b  46  36  91   4
>>> df['a':'b']
    A   B   C   D
a  98  21  68  44
b  46  36  91   4

Column slicing

>>> df.loc[:,'A':'B']
    A   B
a  98  21
b  46  36
c  61  87
d  11  59
e  46  16
>>> df.iloc[:,0:2]
    A   B
a  98  21
b  46  36
c  61  87
d  11  59
e  46  16
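Rows and columns can also be sliced in a single step. The sketch below uses a made-up 5x4 frame with the same labels as above (`np.arange` instead of random data, so the result is reproducible) and shows that `loc` slices include the end label while `iloc` slices exclude the end position.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same row and column labels as above.
df = pd.DataFrame(
    np.arange(20).reshape(5, 4),
    index=list("abcde"),
    columns=list("ABCD"),
)

by_label = df.loc["a":"b", "A":"B"]    # end labels are included
by_position = df.iloc[0:2, 0:2]        # end positions are excluded

print(by_label)
print(by_label.equals(by_position))
```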

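Summary:

- Indexing and slicing a df
  - Indexing:
    - df[col]: select a column
    - df.loc[index]: select a row
    - df.iloc[row, col]: select an element
  - Slicing:
    - df[index1:index3]: slice rows
    - df.iloc[:, col1:col3]: slice columns
- DataFrame arithmetic
  - Works the same way as for Series
- Converting a column to a datetime type
  - pd.to_datetime(col)
- Using a column as the row index
  - df.set_index()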
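The last two items of the summary are not shown in the transcripts above, so here is a minimal sketch. The column names `date` and `price` and all values are made up for illustration; `pd.to_datetime` and `DataFrame.set_index` are the pandas functions named in the summary.

```python
import pandas as pd

# Hypothetical data: the column names and values are invented for this example.
df = pd.DataFrame({
    "date": ["2021-06-01", "2021-06-02", "2021-06-03"],
    "price": [10.5, 11.0, 10.8],
})

# Convert the string column to a datetime type.
df["date"] = pd.to_datetime(df["date"])

# set_index returns a new DataFrame; the original df is left unchanged.
indexed = df.set_index("date")

price = indexed.loc[pd.Timestamp("2021-06-02"), "price"]
print(indexed.index.name, price)
```

Once the dates form the row index, rows can be looked up by timestamp with `loc`, just like any other explicit index.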

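Exercises:

1. Suppose ddd holds the midterm scores and ddd2 the final scores. Create ddd2 with values of your choice, add it to ddd, and compute the average of the midterm and final scores.
2. 张三 was caught cheating on the midterm math exam, so that score must be recorded as 0. How can this be done?
3. 李四 is rewarded for reporting 张三: every midterm subject for 李四 gets 100 extra points. How can this be done?
4. The teacher later found that one question was wrong and, to make up for it, gives every student 10 extra points in every subject. How can this be done?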

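>>> # Rows are subjects: 数学 (math) and 历史 (history). The ddd2 values are free to choose.
>>> ddd = DataFrame(data={'张三':[172,69],'李四':[168,64],'王五':[184,79]},index=['数学','历史'])
>>> ddd2 = DataFrame(data={'张三':[163,69],'李四':[168,64],'王五':[184,79]},index=['数学','历史'])
>>> # 1. Element-wise average of the midterm and final scores
>>> (ddd + ddd2) / 2
       张三     李四     王五
数学  167.5  168.0  184.0
历史   69.0   64.0   79.0
>>> # 2. Set 张三's midterm math score to 0
>>> ddd.loc['数学','张三'] = 0
>>> ddd
    张三   李四   王五
数学   0  168  184
历史  69   64   79
>>> # 3. Add 100 to every subject for 李四 (assign back so the change is kept)
>>> ddd['李四'] += 100
>>> ddd
    张三   李四   王五
数学   0  268  184
历史  69  164   79
>>> # 4. Add 10 to every score
>>> ddd += 10
>>> ddd
    张三   李四   王五
数学  10  278  194
历史  79  174   89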