Data Analysis: From Beginner to Mastery (Part 4)

今晚务必早点睡

Published 2022-04-06 21:18:56


Categories: Data Analysis, Pandas. Tags: python, data analysis, big data


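This article uses worked examples to show how to create, manipulate, and analyze data with Pandas and NumPy. It creates scores for 20 students in a Pandas course, then randomly samples some of those students to create scores for the ndarray and numpy courses. Using Pandas Series objects, it covers addition, index relabeling, and concatenation, and also looks at missing values and sorting. Finally, it combines two Series into a DataFrame and shows how to select the entries that are NaN.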


import numpy as np
import pandas as pd 
from pandas import Series

Create the scores of 20 students in the Pandas course, indexed by student ID, with IDs starting at "0001"

# Padding: right-justify "1" to a width of 4, filling the left side with "0"
"1".rjust(4,"0")

[out]:

'0001'
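
As a side note, `rjust` is not the only way to build zero-padded IDs. The sketch below compares it with two standard alternatives, `str.zfill` and an f-string format spec. The variable names are made up for illustration.

```python
# Three equivalent ways to left-pad a number with zeros to width 4.
n = 1
padded_rjust = str(n).rjust(4, "0")  # right-justify, fill the left with "0"
padded_zfill = str(n).zfill(4)       # zero-fill shortcut for strings
padded_fmt = f"{n:04d}"              # format spec: width 4, zero padded

print(padded_rjust, padded_zfill, padded_fmt)
```

All three produce the same string here, so the choice is mostly a matter of taste.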

[in]:

# Create a one-dimensional array of scores
data = np.random.randint(1,150,size=20)  # randint(low, high=None, size=None, dtype='l'); high is exclusive
# Create the index
# f"{i}" is an f-string that formats i as text
index = [f"{i}".rjust(4,"0") for i in range(1,21)]  
# Equivalent: index = [str(i).rjust(4,'0') for i in range(1,21)]
# Create the Pandas course scores
pandas = Series(data,index,name='Pandas')
pandas

[out]:

0001     58
0002      3
0003     88
0004    123
0005     27
0006    141
0007     12
0008    124
0009     61
0010    135
0011     36
0012    139
0013    118
0014    121
0015     55
0016     16
0017     85
0018     86
0019     59
0020      4
Name: Pandas, dtype: int32

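# Create scores for the elective ndarray course; students are drawn at random from those who took the Pandas exam, and the index is their student ID
# Pick student IDs
# np.random.choice(index)  picks one ID at random
# np.random.choice(index,size=5)  picks five IDs at random (with replacement, so duplicates are possible)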
np.random.choice(index,size=5)

[out]:

array(['0019', '0010', '0018', '0014', '0013'], dtype='<U4')

# Pick five IDs at random and sort them
np.sort(np.random.choice(index,size=5))

[out]:

array(['0003', '0006', '0012', '0014', '0020'], dtype='<U4')
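
Sorting the IDs as strings only gives the expected order because they are zero-padded to the same width. The short sketch below, with made-up arrays, shows what happens without the padding.

```python
import numpy as np

# String sorting is lexicographic, so "10" sorts before "2" ...
unpadded = np.sort(np.array(["10", "2", "1"]))
# ... while zero-padded IDs of equal width sort in numeric order.
padded = np.sort(np.array(["0010", "0002", "0001"]))

print(unpadded)
print(padded)
```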

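# Create scores for the elective ndarray course; students are drawn at random from those who took the Pandas exam, and the index is their student ID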
ndarray =Series(
    np.random.randint(1,150,size=5),
    np.sort(np.random.choice(index,size=5)),
    name="ndarray"
)
ndarray

[out]:

0001    35
0004    87
0007    43
0014     8
0016    31
Name: ndarray, dtype: int32

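[in]:

# Create scores for the elective numpy course; students are drawn at random from those who took the Pandas exam, and the index is their student ID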
numpy =Series(
    np.random.randint(1,150,size=5),
    np.sort(np.random.choice(index,size=5)),
    name ="numpy"

)
display(ndarray,numpy)

[out]:

0001    35
0004    87
0007    43
0014     8
0016    31
Name: ndarray, dtype: int32



0001    84
0004    35
0013    76
0015    65
0016    64
Name: numpy, dtype: int32

[in]:

# Sorting
np.sort(np.array(list({1,5,3,2,7,0})))

[out]:

array([0, 1, 2, 3, 5, 7])

# np.random.choice samples with replacement, so the Series created above can contain repeated student IDs.
# Wrap the creation in a function that guarantees unique IDs.
def create_series(name,size=5,low=0,high=100):
    data_ = np.random.randint(low,high,size)  # Shift+Tab in Jupyter shows the signature: randint(low, high=None, size=None, dtype='l')
    # A set is used to remove duplicates
    index_ = set()
    while len(index_) < size:
        index_.add(np.random.choice(index))
    return Series(data_,np.sort(list(index_)),name=name)

# Set the random seed so the generated random numbers are reproducible
np.random.seed(5)
ndarray = create_series('ndarray',high=150)
numpy = create_series('numpy',high=150)
display(ndarray,numpy)

[out]:

0005     99
0008    118
0013    144
0016     73
0017      8
Name: ndarray, dtype: int32



0002    113
0010     80
0011     27
0016     44
0019     65
Name: numpy, dtype: int32
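
The set-based loop works, but NumPy can also sample without replacement in a single call. The following is a minimal sketch of that alternative, not the approach used in this post. `sample_scores` and its `seed` parameter are hypothetical names, and because it uses a different random generator it will not reproduce the numbers shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical helper (not from the original post): draws unique student IDs
# in one call instead of looping over a set.
def sample_scores(name, index, size=5, low=0, high=100, seed=None):
    rng = np.random.default_rng(seed)                  # local generator
    scores = rng.integers(low, high, size=size)        # high is exclusive
    ids = rng.choice(index, size=size, replace=False)  # no duplicate IDs
    return pd.Series(scores, index=np.sort(ids), name=name)

index = [str(i).zfill(4) for i in range(1, 21)]
sample = sample_scores("ndarray", index, high=150, seed=5)
print(sample)
```

With `replace=False`, NumPy raises an error if `size` exceeds the number of available IDs, which is arguably safer than a loop that would never terminate.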

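Adding two Series

ndarray+numpy         # Indexes align automatically: values with the same label are added; a label present in only one Series is treated as NaN there, so the sum is NaN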

[out]:

0002      NaN
0005      NaN
0008      NaN
0010      NaN
0011      NaN
0013      NaN
0016    117.0
0017      NaN
0019      NaN
dtype: float64
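
To make the alignment rule easier to see, here is a small sketch on hand-written data (the values are invented for illustration). The result covers the union of both indexes, and the dtype becomes float64 because NaN cannot be stored in an integer Series.

```python
import pandas as pd

a = pd.Series([10, 20, 30], index=["0001", "0002", "0003"])
b = pd.Series([1, 2], index=["0003", "0004"])

total = a + b  # only "0003" exists in both, so only it gets a real sum
print(total)
```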

# fill_value=0 substitutes 0 for a value that is missing on one side before adding
# add() is the method form of the + operator on Series
ndarray.add(numpy,fill_value=0)

[out]:

0002    113.0
0005     99.0
0008    118.0
0010     80.0
0011     27.0
0013    144.0
0016    117.0
0017      8.0
0019     65.0
dtype: float64
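
One detail worth knowing is that `fill_value` only helps when a value is missing on one side. If a label is NaN in both Series, the result stays NaN. The sketch below illustrates this with invented data.

```python
import numpy as np
import pandas as pd

a = pd.Series([10.0, np.nan], index=["0001", "0002"])
b = pd.Series([np.nan, 5.0], index=["0002", "0003"])

filled = a.add(b, fill_value=0)
# "0001": only in a              -> 10.0 + 0
# "0002": NaN in both a and b    -> stays NaN
# "0003": only in b              -> 0 + 5.0
print(filled)
```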

Renaming index labels

ndarray

[out]:

0005     99
0008    118
0013    144
0016     73
0017      8
Name: ndarray, dtype: int32

ndarray.rename(index ={'0005':'0004'}) # signature: rename(index=None, **kwargs)
ndarray  # ndarray itself is unchanged: rename returns a modified copy, and that copy was not assigned to anything

[out]:

0005     99
0008    118
0013    144
0016     73
0017      8
Name: ndarray, dtype: int32

ndarray.rename(index={'0005':'0004'},inplace=True) 
# s.rename(index={'old label': 'new label'}, inplace=True); inplace=True modifies the original object instead of returning a copy
ndarray

[out]:

0004     99
0008    118
0013    144
0016     73
0017      8
Name: ndarray, dtype: int32
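
As an alternative to `inplace=True`, the result of `rename` can simply be assigned to a name. The sketch below uses invented data and variable names to show that the original Series is left untouched when the result is stored separately.

```python
import pandas as pd

original = pd.Series([99, 118], index=["0005", "0008"], name="ndarray")

# rename returns a new Series; labels not in the mapping are kept as-is
renamed = original.rename(index={"0005": "0004"})

print(list(original.index))  # labels of the original, unchanged
print(list(renamed.index))   # labels of the renamed copy
```

Reassigning (`s = s.rename(...)`) has the same visible effect as `inplace=True` and tends to be easier to chain with other calls.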

# fill_value=np.nan behaves like the default (fill_value=None): labels missing on one side yield NaN
ndarray.add(numpy,fill_value=np.nan)

[out]:

0002      NaN
0004      NaN
0008      NaN
0010      NaN
0011      NaN
0013      NaN
0016    117.0
0017      NaN
0019      NaN
dtype: float64

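Combining two Series into a DataFrame

When concatenating along columns (axis=1, side by side), the result is a DataFrame (a two-dimensional table).
When concatenating along rows (axis=0, stacked top to bottom), the result is still a Series.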

pd.concat((ndarray,numpy))

[out]:

0004     99
0008    118
0013    144
0016     73
0017      8
0002    113
0010     80
0011     27
0016     44
0019     65
dtype: int32

pd.concat((ndarray,numpy),axis=1,sort=False)   # axis=1: place the Series side by side as columns; axis=0: stack them as rows


# sort=True sorts the row index of the result
course = pd.concat((ndarray,numpy),axis=1,sort=True) 
course

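
The difference between the two axes is easier to see on a tiny example. The sketch below uses invented scores. Note that stacking along axis=0 keeps repeated labels, while axis=1 aligns on the labels and fills the gaps with NaN.

```python
import pandas as pd

x = pd.Series([1, 2], index=["0001", "0002"], name="ndarray")
y = pd.Series([3, 4], index=["0002", "0003"], name="numpy")

stacked = pd.concat([x, y])                   # axis=0: still a Series, "0002" appears twice
table = pd.concat([x, y], axis=1, sort=True)  # axis=1: DataFrame, one column per Series

print(stacked)
print(table)
```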

Viewing the first or last few rows

head()
tail()

pandas.head()  # same as pandas.head(n=5): the first 5 rows by default


pandas.tail()  # same as pandas.tail(n=5): the last 5 rows by default

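
Both methods accept a row count, and a negative count means "everything except that many rows". A small sketch with invented data:

```python
import pandas as pd

s = pd.Series(range(10, 20), index=[str(i).zfill(4) for i in range(1, 11)])

first_three = s.head(3)        # first 3 rows
last_two = s.tail(2)           # last 2 rows
all_but_last_two = s.head(-2)  # negative n: drop the last 2 rows

print(first_three, last_two, all_but_last_two, sep="\n\n")
```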

Does a Series support sorting?

pandas


pandas.sort_index()


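# ascending selects ascending (True, the default) or descending (False) order
pandas.sort_values()  
# Series.sort_values(axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ...)
# The `by` argument exists only on DataFrame.sort_values, so passing a column name such as 'numpy' to a Series fails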

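
To summarize the sorting options, the sketch below sorts a Series by label and by value, and sorts a DataFrame by a named column. The data is invented for illustration. Only the DataFrame version takes `by`.

```python
import pandas as pd

s = pd.Series([58, 3, 88], index=["0002", "0003", "0001"], name="Pandas")
by_label = s.sort_index()                       # sort by student ID
by_score_desc = s.sort_values(ascending=False)  # sort by score, highest first

df = pd.DataFrame(
    {"ndarray": [99, 8, 73], "numpy": [44, 80, 27]},
    index=["0004", "0017", "0016"],
)
df_by_numpy = df.sort_values(by="numpy")        # a DataFrame needs a column name

print(by_label, by_score_desc, df_by_numpy, sep="\n\n")
```

None of these calls modify the original object; each returns a sorted copy.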

Finding which index labels hold NaN values

course   # Each column is a Series; here there are two: the ndarray column and the numpy column


course['ndarray']

0002      NaN
0004     99.0
0008    118.0
0010      NaN
0011      NaN
0013    144.0
0016     73.0
0017      8.0
0019      NaN
Name: ndarray, dtype: float64

ndarray_sum = course['ndarray']   # dataframe['column name'] selects one column as a Series
ndarray_sum

0002      NaN
0004     99.0
0008    118.0
0010      NaN
0011      NaN
0013    144.0
0016     73.0
0017      8.0
0019      NaN
Name: ndarray, dtype: float64

ndarray_sum.isna()  # Tests each entry for NaN and returns a Series of bool values

0002     True
0004    False
0008    False
0010     True
0011     True
0013    False
0016    False
0017    False
0019     True
Name: ndarray, dtype: bool

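# [Very important] A boolean Series can be used as an indexer to pick out non-contiguous rows.
# isna() and isnull() are equivalent; both mark the NaN entries
# notna() and notnull() are likewise equivalent; both mark the non-NaN entries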
ndarray_sum[ndarray_sum.isna()]

0002   NaN
0010   NaN
0011   NaN
0019   NaN
Name: ndarray, dtype: float64
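
The same boolean-mask idea extends to the whole DataFrame. The sketch below rebuilds a small stand-in for `course` with invented values (it is not the data above) and shows three common variations: listing the IDs with missing scores, keeping students who took both courses, and dropping the NaN entries.

```python
import numpy as np
import pandas as pd

# Small stand-in for the course table; values are made up for illustration
course = pd.DataFrame(
    {"ndarray": [np.nan, 99.0, 73.0], "numpy": [113.0, np.nan, 44.0]},
    index=["0002", "0004", "0016"],
)

# IDs of students with no ndarray score
missing_ids = course.index[course["ndarray"].isna()].tolist()
# Rows where both scores are present: combine masks with &
took_both = course[course["ndarray"].notna() & course["numpy"].notna()]
# Shortcut for s[s.notna()]
ndarray_scores = course["ndarray"].dropna()

print(missing_ids)
print(took_both)
print(ndarray_scores)
```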
