pandas docs 01: Intro to data structures (Series)

https://pandas.pydata.org/docs/user_guide/dsintro.html#dataframe

Intro to data structures

We’ll start with a quick, non-comprehensive overview of the fundamental data structures in pandas. The fundamental behavior of data types, indexing, and axis labeling / alignment applies across all of the objects. To get started, import NumPy and load pandas into your namespace:

import numpy as np

import pandas as pd

Here is a basic tenet to keep in mind: data alignment is intrinsic. The link between labels and data will not be broken unless you break it explicitly.

We’ll give a brief intro to the data structures, then consider all of the broad categories of functionality and methods in separate sections.

1. Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic way to create a Series is to call:

s = pd.Series(data, index=index)

Here, data can be many different things:

a Python dict

an ndarray

a scalar value (like 5)

The passed index is a list of axis labels. The behavior therefore separates into a few cases depending on what data is:

From ndarray
If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values [0, …, len(data) - 1].

s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
s
a    0.151755
b    1.446566
c    0.770684
d    0.363329
e    2.245000
dtype: float64
s.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
pd.Series(np.random.randn(5))
0    0.080765
1    0.118849
2    0.128212
3    0.674339
4    1.515013
dtype: float64
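
The length rule above can be checked directly. The following is a minimal sketch, not part of the original guide; the variable names (values, ok, default_index, mismatch_error) are illustrative only. It shows the default index and what happens when the index length does not match the data.

```python
import numpy as np
import pandas as pd

values = np.arange(3)

# Matching lengths: three values, three labels.
ok = pd.Series(values, index=["a", "b", "c"])

# No index passed: pandas creates the labels 0, 1, 2.
default_index = pd.Series(values).index.tolist()

# Mismatched lengths: three values but only two labels raises ValueError.
try:
    pd.Series(values, index=["a", "b"])
    mismatch_error = None
except ValueError as err:
    mismatch_error = err

print(ok.index.tolist())
print(default_index)
print(type(mismatch_error).__name__)
```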

Note
pandas supports non-unique index values. If an operation that does not support duplicate index values is attempted, an exception will be raised at that time. The reason for this lazy checking is almost entirely performance (there are many instances in computations, like parts of GroupBy, where the index is not used).
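
As a rough illustration of this lazy behavior (a sketch with made-up data, not taken from the original guide): a Series with a duplicated label can be created, selected from, and summed, while an operation that needs unique labels, here reindex, fails only when it is called.

```python
import pandas as pd

# The label "a" appears twice; construction succeeds without complaint.
dup = pd.Series([1, 2, 3], index=["a", "a", "b"])

is_unique = dup.index.is_unique  # False
selected = dup["a"]              # a Series holding both "a" rows
total = dup.sum()                # the index is not needed for this

# reindex needs unique labels, so the error appears only at this point.
try:
    dup.reindex(["a", "b", "c"])
    reindex_error = None
except ValueError as err:
    reindex_error = err

print(is_unique, len(selected), total)
print(type(reindex_error).__name__)
```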

From dict
Series can be instantiated from dicts:

d = {"b": 1, "a": 0, "c": 2}
pd.Series(d)
b    1
a    0
c    2
dtype: int64
pd.Series(d).index
Index(['b', 'a', 'c'], dtype='object')

Note
When the data is a dict and an index is not passed, the Series index follows the dict’s insertion order, provided you’re using Python >= 3.6 and pandas >= 0.23.

If you’re using Python < 3.6 or pandas < 0.23, and an index is not passed, the Series index will be the lexically ordered list of dict keys.

In the example above, on a Python version lower than 3.6 or a pandas version lower than 0.23, the Series would be ordered by the lexical order of the dict keys (i.e. ['a', 'b', 'c'] rather than ['b', 'a', 'c']).
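
If the lexical order is wanted on a current version, one option is to sort the index explicitly. This is a small sketch that goes slightly beyond the text above: sort_index is a real Series method, but its use here is a suggestion, not something the original note prescribes.

```python
import pandas as pd

d = {"b": 1, "a": 0, "c": 2}

# Default on modern versions: labels follow the dict's insertion order.
by_insertion = pd.Series(d)

# Explicit sort: labels come out in lexical order.
by_label = pd.Series(d).sort_index()

print(by_insertion.index.tolist())
print(by_label.index.tolist())
```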

If an index is passed, the values in data corresponding to the labels in the index will be pulled out.

d = {"a": 0.0, "b": 1.0, "c": 2.0}
pd.Series(d)
a    0.0
b    1.0
c    2.0
dtype: float64
pd.Series(d, index=["b", "c", "d", "a"])
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

Note
NaN (not a number) is the standard missing data marker used in pandas.
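
To find which labels ended up missing, the isna and count methods can be used. The sketch below reuses the dict from the example above; the names with_gap, missing_labels, and n_present are illustrative.

```python
import pandas as pd

d = {"a": 0.0, "b": 1.0, "c": 2.0}

# "d" is not a key of the dict, so its value becomes NaN.
with_gap = pd.Series(d, index=["b", "c", "d", "a"])

# Boolean mask of missing entries, then the labels where it is True.
missing_labels = with_gap.index[with_gap.isna()].tolist()

# count() ignores NaN values.
n_present = int(with_gap.count())

print(missing_labels, n_present)
```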

From scalar value
If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.

pd.Series(5.0, index=["a", "b", "c", "d", "e"])
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

1.1 Series is ndarray-like

Series acts very similarly to an ndarray, and is a valid argument to most NumPy functions. However, operations such as slicing will also slice the index.

s.iloc[0]
0.1517549715605409
s[:3]
a    0.151755
b    1.446566
c    0.770684
dtype: float64
s[s > s.median()]
b    1.446566
e    2.245000
dtype: float64
s.iloc[[4, 3, 1]]
e    2.245000
d    0.363329
b    1.446566
dtype: float64
np.exp(s)
a    1.163875
b    4.248501
c    2.161245
d    1.438110
e    9.440419
dtype: float64

Note
We will address array-based indexing, such as selecting by a list of positions like [4, 3, 1], in the section on indexing.

Like a NumPy array, a pandas Series has a dtype.

s.dtype
dtype('float64')

This is often a NumPy dtype. However, pandas and 3rd-party libraries extend NumPy’s type system in a few places, in which case the dtype would be an ExtensionDtype. Some examples within pandas are Categorical data and Nullable integer data type. See dtypes for more.
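
The distinction can be seen with a quick check. This is a minimal sketch assuming the "category" and "Int64" dtype strings, which pandas accepts for the categorical and nullable integer types; the variable names are illustrative.

```python
import pandas as pd
from pandas.api.extensions import ExtensionDtype

floats = pd.Series([1.0, 2.0])                         # plain NumPy dtype
cats = pd.Series(["x", "y", "x"], dtype="category")    # categorical data
nullable = pd.Series([1, 2, None], dtype="Int64")      # nullable integer

floats_is_ext = isinstance(floats.dtype, ExtensionDtype)
cats_is_ext = isinstance(cats.dtype, ExtensionDtype)
nullable_is_ext = isinstance(nullable.dtype, ExtensionDtype)

print(floats.dtype, floats_is_ext)
print(cats.dtype, cats_is_ext)
print(nullable.dtype, nullable_is_ext)
```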

If you need the actual array backing a Series, use Series.array.

s.array
<PandasArray>
[0.1517549715605409, 1.4465661803667178, 0.7706844578901566,
 0.3633294353602837,  2.245000315489751]
Length: 5, dtype: float64

Accessing the array can be useful when you need to do some operation without the index (to disable automatic alignment, for example).

Series.array will always be an ExtensionArray. Briefly, an ExtensionArray is a thin wrapper around one or more concrete arrays like a numpy.ndarray. pandas knows how to take an ExtensionArray and store it in a Series or a column of a DataFrame. See dtypes for more.
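
A small sketch of what "disabling automatic alignment" means in practice, using two made-up Series (left and right) whose labels are in opposite order: adding the Series matches values by label, while adding their backing arrays matches values by position.

```python
import numpy as np
import pandas as pd

left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["c", "b", "a"])

# Series + Series: values are matched by label (a with a, b with b, ...).
aligned = left + right

# array + array: the index is gone, so values are matched by position.
positional = left.array + right.array

print(aligned.to_dict())
print(np.asarray(positional).tolist())
```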

While Series is ndarray-like, if you need an actual ndarray, then use Series.to_numpy().

s.to_numpy()
array([0.15175497, 1.44656618, 0.77068446, 0.36332944, 2.24500032])

1.2 Series is dict-like

A Series is like a fixed-size dict in that you can get and set values by index label:

s["a"]
0.1517549715605409
s["e"] = 12.0
s
a     0.151755
b     1.446566
c     0.770684
d     0.363329
e    12.000000
dtype: float64
"e" in s
True
"f" in s
False

If a label is not contained in the index, an exception is raised:

s["f"]
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

D:\d_programe\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:


D:\d_programe\Anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()


D:\d_programe\Anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()


pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()


pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()


KeyError: 'f'


The above exception was the direct cause of the following exception:


KeyError                                  Traceback (most recent call last)

C:\Users\ADMINI~1\AppData\Local\Temp/ipykernel_12116/3429377182.py in <module>
----> 1 s["f"]


D:\d_programe\Anaconda3\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
    940 
    941         elif key_is_scalar:
--> 942             return self._get_value(key)
    943 
    944         if is_hashable(key):


D:\d_programe\Anaconda3\lib\site-packages\pandas\core\series.py in _get_value(self, label, takeable)
   1049 
   1050         # Similar to Index.get_value, but we do not fall back to positional
-> 1051         loc = self.index.get_loc(label)
   1052         return self.index._get_values_for_loc(self, loc, label)
   1053 


D:\d_programe\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:


KeyError: 'f'

Using the get method, a missing label returns None or the specified default:

s.get("f")
s.get("f", np.nan)
nan

1.3 Vectorized operations and label alignment with Series

When working with raw NumPy arrays, looping through value-by-value is usually not necessary. The same is true when working with Series in pandas. Series can also be passed into most NumPy methods expecting an ndarray.

s
a     0.151755
b     1.446566
c     0.770684
d     0.363329
e    12.000000
dtype: float64
s + s
a     0.303510
b     2.893132
c     1.541369
d     0.726659
e    24.000000
dtype: float64
s * 2
a     0.303510
b     2.893132
c     1.541369
d     0.726659
e    24.000000
dtype: float64
np.exp(s)
a         1.163875
b         4.248501
c         2.161245
d         1.438110
e    162754.791419
dtype: float64

A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.

s[1:] + s[:-1]
a         NaN
b    2.893132
c    1.541369
d    0.726659
e         NaN
dtype: float64

The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one Series or the other, the result will be marked as missing (NaN). Being able to write code without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research. The integrated data alignment features of the pandas data structures set pandas apart from the majority of related tools for working with labeled data.

Note
In general, we chose to make the default result of operations between differently indexed objects yield the union of the indexes in order to avoid loss of information. Having an index label, though the data is missing, is typically important information as part of a computation. You of course have the option of dropping labels with missing data via the dropna method.
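
The union behavior and the dropna option can be sketched as follows, using two made-up Series (x and y) that overlap on only two labels. The last line uses Series.add with fill_value, which the text above does not mention; it is shown only as one possible alternative when a missing side should count as zero.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
y = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# The result carries the union of labels; "a" and "d" have no partner.
total = x + y

# Keep only the labels present in both operands.
overlap_only = total.dropna()

# Alternative (an addition to the text): treat the missing side as 0.
filled = x.add(y, fill_value=0)

print(total.index.tolist())
print(overlap_only.to_dict())
print(filled.to_dict())
```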

1.4 Name attribute

Series can also have a name attribute:

s = pd.Series(np.random.randn(5), name="something")
s
0    0.679310
1   -0.483042
2   -0.340507
3   -0.750765
4   -0.469822
Name: something, dtype: float64
s.name
'something'

The Series name will be assigned automatically in many cases, in particular when taking 1D slices of a DataFrame.
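
A brief sketch of that automatic naming, using a small made-up DataFrame (the column and row labels are illustrative): selecting a column yields a Series named after the column, and selecting a row with loc yields a Series named after the row label.

```python
import pandas as pd

df = pd.DataFrame({"price": [1.5, 2.5], "qty": [3, 4]}, index=["x", "y"])

col = df["price"]   # 1D slice along a column
row = df.loc["x"]   # 1D slice along a row

print(col.name)
print(row.name)
```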

You can rename a Series with the pandas.Series.rename() method.

s2 = s.rename("different")
s2.name
'different'

Note that s and s2 refer to different objects.

s == s2
0    True
1    True
2    True
3    True
4    True
dtype: bool
s is s2
False
