Series: series_custom, reindex & sort_index

mmい

Category: Data Mining (dataquest)

Article link: https://blog.csdn.net/zm714981790/article/details/51212626

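pandas has historically had three main data structures:

Series (collection of values)
DataFrame (collection of Series objects)
Panel (collection of DataFrame objects; deprecated in pandas 0.20 and removed in 0.25)

This post focuses on Series.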

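A Series stores its data in a NumPy array for fast computation; it is built on NumPy but extends it. An ndarray is addressed only by integer position, whereas a Series index can also hold strings or other labels. A Series may also hold mixed-type data, and it can contain NaN to mark missing values.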
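
The difference can be sketched as follows. The film labels and scores here are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, for illustration only.
arr = np.array([74.0, 85.0, np.nan])
labeled = pd.Series(arr, index=["Film A", "Film B", "Film C"])

# An ndarray is addressed by integer position...
first_by_position = arr[0]
# ...while a Series can also be addressed by label.
first_by_label = labeled["Film A"]

# NaN marks the missing value; isna() flags it.
missing_count = int(labeled.isna().sum())

# Mixed Python types are stored with the generic object dtype.
mixed = pd.Series([1, "two", 3.0])

print(first_by_position, first_by_label, missing_count, mixed.dtype)
```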

A Series can store the following data types (dtypes):

float - for representing float values
int - for representing integer values
bool - for representing Boolean values
datetime64[ns] - for representing date & time, without time-zone
datetime64[ns, tz] - for representing date & time, with time-zone
timedelta64[ns] - for representing differences in dates & times (seconds, minutes, etc.)
category - for representing categorical values
object - for representing arbitrary Python objects, most commonly String values (a column of strings typically shows dtype object; pandas 1.0 and later also offer a dedicated string dtype)
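
To see these dtypes in practice, here is a small sketch that builds one Series per kind and prints its dtype. All values are hypothetical, and the exact datetime and timedelta resolution printed may vary with the pandas version.

```python
import pandas as pd

# Hypothetical values chosen only to show each dtype.
floats = pd.Series([1.5, 2.5])
ints = pd.Series([1, 2])
bools = pd.Series([True, False])
dates = pd.Series(pd.to_datetime(["2015-05-01", "2015-06-01"]))
deltas = dates - dates.iloc[0]  # subtracting datetimes gives timedeltas
cats = pd.Series(["good", "bad", "good"], dtype="category")

examples = [
    ("floats", floats),
    ("ints", ints),
    ("bools", bools),
    ("dates", dates),
    ("deltas", deltas),
    ("cats", cats),
]
for name, series in examples:
    print(name, series.dtype)
```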

Dataset

The dataset used in this post is fandango_score_comparison.csv, which collects movie ratings from critics and users on several websites.

Its columns include:

FILM - film name
RottenTomatoes - Rotten Tomatoes critics average score
RottenTomatoes_User - Rotten Tomatoes user average score
RT_norm - Rotten Tomatoes critics average score (normalized to a 0 to 5 point system)
RT_user-norm - Rotten Tomatoes user average score (normalized to a 0 to 5 point system)
Metacritic - Metacritic critics average score
Metacritic_User - Metacritic user average score

Integer Index

Index it like a NumPy array (slice with a colon; select a DataFrame column by its label).

import pandas as pd

fandango = pd.read_csv('fandango_score_comparison.csv')
series_film = fandango['FILM']
series_rt = fandango['RottenTomatoes']
print(series_film[0:5])
print(series_rt[0:5])
'''
0    Avengers: Age of Ultron (2015)
1                 Cinderella (2015)
2                    Ant-Man (2015)
3            Do You Believe? (2015)
4     Hot Tub Time Machine 2 (2015)
Name: FILM, dtype: object
0    74
1    85
2    80
3    18
4    14
Name: RottenTomatoes, dtype: int64
'''

Custom Index

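Custom index: the previous example produced one column of film names and one column of scores. To look up a film's score by name, we would first have to find the film's integer index and then use it to fetch the score, which is cumbersome. We therefore want an object that returns a score directly from a film name.

Series(rt_scores, index=film_names) builds a new Series object from the given values and index (series_custom below).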

# Import the Series object from pandas
from pandas import Series

film_names = series_film.values
rt_scores = series_rt.values
series_custom = Series(rt_scores , index=film_names)
print(series_custom[['Minions (2015)', 'Leviathan (2014)']])
'''
Minions (2015)      54
Leviathan (2014)    99
dtype: int64
'''
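
The lookup styles available on a label-indexed Series can be sketched on a tiny example. The names and scores below are hypothetical stand-ins for series_film and series_rt.

```python
import pandas as pd

# Hypothetical stand-ins for the film names and scores.
film_names = ["Film C", "Film A", "Film B"]
rt_scores = [54, 99, 80]
scores = pd.Series(rt_scores, index=film_names)

one = scores["Film A"]                  # single label -> scalar
several = scores[["Film C", "Film B"]]  # list of labels -> Series
by_position = scores.iloc[0]            # position-based access still works
has_film = "Film Z" in scores           # membership tests the index, not the values

print(one, several.tolist(), by_position, has_film)
```

Using .iloc for positions keeps the intent explicit once the index is no longer the default integer range.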

Reindexing

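Reorder series_custom by its sorted index; note that its index now holds film names. reindex() returns a new Series whose rows follow the order of the labels passed in.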

original_index = series_custom.index.tolist()
sorted_index = sorted(original_index)
sorted_by_index = series_custom.reindex(sorted_index)
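
A minimal sketch of reindex() on hypothetical data, including what happens when a requested label is not in the original index:

```python
import pandas as pd

# Hypothetical scores, for illustration only.
scores = pd.Series([54, 99, 80], index=["Film C", "Film A", "Film B"])

# Rows are rearranged to follow the order of the labels passed in.
sorted_scores = scores.reindex(sorted(scores.index))

# A label absent from the original index yields NaN, and the integer
# values are upcast to float so the NaN can be stored.
with_missing = scores.reindex(["Film A", "Film Z"])

print(sorted_scores)
print(with_missing)
```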

Sorting

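pandas provides two methods for sorting a Series:
- sort_index(): sorts by the index and returns a new Series
- sort_values(): sorts by value (ascending by default) and also returns a new Series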

sc2 = series_custom.sort_index()
sc3 = series_custom.sort_values()
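
As a small illustration on hypothetical data, both methods accept ascending=False to reverse the order, and neither modifies the original Series:

```python
import pandas as pd

# Hypothetical scores, for illustration only.
scores = pd.Series([54, 99, 80], index=["Film C", "Film A", "Film B"])

by_label = scores.sort_index()                       # A, B, C
by_value_desc = scores.sort_values(ascending=False)  # highest score first

print(by_label)
print(by_value_desc)
print(scores)  # unchanged: sorting returned new objects
```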

Vectorized Operations

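Series objects support vectorized operations. pandas is built on NumPy, whose vectorized operations are implemented in low-level C and heavily optimized, so they run much faster than an explicit Python loop. Prefer vectorized operations whenever possible.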

series_normalized = (series_custom/100)*5
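
To make the contrast concrete, the sketch below computes the same normalization twice on hypothetical scores: once with a vectorized expression and once with an explicit Python loop. The results match; the vectorized form is shorter and avoids per-element Python overhead.

```python
import numpy as np
import pandas as pd

# Hypothetical scores on a 0-100 scale.
scores = pd.Series([74, 85, 80], index=["Film A", "Film B", "Film C"])

# Vectorized: the arithmetic is applied to every element at once.
normalized = scores / 100 * 5

# Equivalent explicit loop, shown only for comparison.
looped = pd.Series([s / 100 * 5 for s in scores], index=scores.index)

print(np.allclose(normalized, looped))
```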

Comparing And Filtering

series_greater_than_50 = series_custom[series_custom > 50]
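
The comparison itself produces a Boolean Series (a mask), and indexing with that mask keeps only the rows where it is True. A minimal sketch on hypothetical data, including how two conditions might be combined with & (each condition needs its own parentheses):

```python
import pandas as pd

# Hypothetical scores, for illustration only.
scores = pd.Series([74, 18, 99, 50],
                   index=["Film A", "Film B", "Film C", "Film D"])

mask = scores > 50     # Boolean Series, one True/False per film
above_50 = scores[mask]

# Combine conditions element-wise with & (and) or | (or).
between = scores[(scores > 50) & (scores < 90)]

print(mask.tolist())
print(above_50)
print(between)
```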

Alignment

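Alignment means that pandas matches two Series by index label, not by position or length, before applying the standard arithmetic operators (+, -, *, /). A label present in only one of the two Series produces NaN in the result. In the example below both Series share the same FILM index, so every film gets a mean.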

rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
rt_mean = (rt_critics + rt_users)/2

print(rt_mean)
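
pandas lines up Series on their index labels before doing arithmetic. The sketch below uses hypothetical data with labels in a different order and only partly overlapping, to show what that matching does:

```python
import pandas as pd

# Hypothetical critic and user scores; labels differ in order and content.
critics = pd.Series([74, 85, 80], index=["Film A", "Film B", "Film C"])
users = pd.Series([90, 86, 70], index=["Film C", "Film A", "Film D"])

# Values are paired by label: Film A with Film A, Film C with Film C.
# Film B and Film D each appear in only one Series, so they become NaN.
mean = (critics + users) / 2

print(mean)
```

If NaN for unmatched labels is not wanted, one option is to restrict both Series to their shared labels first, or to call dropna() on the result.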
