https://pandas.pydata.org/docs/user_guide/index.html#user-guide
1. DataFrame
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:
Dict of 1D ndarrays, lists, dicts, or Series
2-D numpy.ndarray
Structured or record ndarray
A Series
Another DataFrame
Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.
If axis labels are not passed, they will be constructed from the input data based on common sense rules.
Note
When the data is a dict, and columns is not specified, the DataFrame columns will be ordered by the dict’s insertion order, if you are using Python version >= 3.6 and pandas >= 0.23.
If you are using Python < 3.6 or pandas < 0.23, and columns is not specified, the DataFrame columns will be the lexically ordered list of dict keys.
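The column-ordering rule in the note above can be checked with a small sketch. The dict keys and values here are made up for illustration, and the result assumes a current Python and pandas (Python >= 3.6, pandas >= 0.23):

```python
import pandas as pd

# Keys are inserted in a deliberately non-alphabetical order.
data = {"zeta": [1, 2], "alpha": [3, 4], "mid": [5, 6]}

# Without a columns argument, the column order follows insertion order.
df_default = pd.DataFrame(data)
print(list(df_default.columns))

# Passing columns explicitly selects and orders the columns instead.
df_explicit = pd.DataFrame(data, columns=["alpha", "mid", "zeta"])
print(list(df_explicit.columns))
```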
1.1 From dict of Series or dicts
The resulting index will be the union of the indexes of the various Series. If there are any nested dicts, these will first be converted to Series. If no columns are passed, the columns will be the ordered list of dict keys.
import pandas as pd
d = {
"one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
"two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(d)
df
| | one | two |
---|---|---|
a | 1.0 | 1.0 |
b | 2.0 | 2.0 |
c | 3.0 | 3.0 |
d | NaN | 4.0 |
pd.DataFrame(d, index=["d", "b", "a"])
| | one | two |
---|---|---|
d | NaN | 4.0 |
b | 2.0 | 2.0 |
a | 1.0 | 1.0 |
pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "three"])
| | two | three |
---|---|---|
d | 4.0 | NaN |
b | 2.0 | NaN |
a | 1.0 | NaN |
The row and column labels can be accessed respectively by accessing the index and columns attributes:
Note
When a particular set of columns is passed along with a dict of data, the passed columns override the keys in the dict.
df.index
Index(['a', 'b', 'c', 'd'], dtype='object')
df.columns
Index(['one', 'two'], dtype='object')
1.2 From dict of ndarrays / lists
The ndarrays must all be the same length. If an index is passed, it must also be the same length as the arrays. If no index is passed, the index will be range(n), where n is the array length.
import pandas as pd
import numpy as np
d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}
pd.DataFrame(d)
| | one | two |
---|---|---|
0 | 1.0 | 4.0 |
1 | 2.0 | 3.0 |
2 | 3.0 | 2.0 |
3 | 4.0 | 1.0 |
pd.DataFrame(d, index=["a", "b", "c", "d"])
| | one | two |
---|---|---|
a | 1.0 | 4.0 |
b | 2.0 | 3.0 |
c | 3.0 | 2.0 |
d | 4.0 | 1.0 |
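The length rules for this case can be illustrated with a minimal sketch. The variable names are hypothetical, and the sketch assumes a recent pandas, which reports both mismatches as a ValueError:

```python
import pandas as pd

# Arrays of unequal length cannot be aligned, so construction fails.
try:
    pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [4.0, 3.0]})
    unequal_error = None
except ValueError as exc:
    unequal_error = exc

# An explicit index must have one label per array element.
try:
    pd.DataFrame({"one": [1.0, 2.0, 3.0]}, index=["a", "b"])
    index_error = None
except ValueError as exc:
    index_error = exc

# Without an index, row labels default to 0..n-1.
df = pd.DataFrame({"one": [1.0, 2.0, 3.0]})
print(list(df.index))
```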
1.3 From structured or record array
This case is handled identically to a dict of arrays.
import numpy as np
data = np.zeros((2,), dtype=[("A", "i4"), ("B", "f4"), ("C", "S10")])
data
array([(0, 0., b''), (0, 0., b'')],
dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
data[:] = [(1, 2.0, "Hello"), (2, 3.0, "World")]
data
array([(1, 2., b'Hello'), (2, 3., b'World')],
dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
pd.DataFrame(data)
| | A | B | C |
---|---|---|---|
0 | 1 | 2.0 | b'Hello' |
1 | 2 | 3.0 | b'World' |
pd.DataFrame(data, index=["first", "second"])
| | A | B | C |
---|---|---|---|
first | 1 | 2.0 | b'Hello' |
second | 2 | 3.0 | b'World' |
pd.DataFrame(data, columns=["C", "A", "B"])
| | C | A | B |
---|---|---|---|
0 | b'Hello' | 1 | 2.0 |
1 | b'World' | 2 | 3.0 |
Note
DataFrame is not intended to work exactly like a 2-dimensional NumPy ndarray.
1.4 From a list of dicts
data2 = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
pd.DataFrame(data2)
| | a | b | c |
---|---|---|---|
0 | 1 | 2 | NaN |
1 | 5 | 10 | 20.0 |
pd.DataFrame(data2, index=["first", "second"])
| | a | b | c |
---|---|---|---|
first | 1 | 2 | NaN |
second | 5 | 10 | 20.0 |
pd.DataFrame(data2, columns=["a", "b"])
| | a | b |
---|---|---|
0 | 1 | 2 |
1 | 5 | 10 |
1.5 From a dict of tuples
You can automatically create a MultiIndexed frame by passing a dictionary keyed by tuples. Tuple keys in the nested dicts likewise produce a MultiIndex on the rows.
pd.DataFrame(
{
("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
("a", "c"): {("A", "B"): 5, ("A", "C"): 6},
("b", "a"): {("A", "C"): 7, ("A", "B"): 8},
("b", "b"): {("A", "D"): 9, ("A", "B"): 10},
}
)
| | | a | a | a | b | b |
---|---|---|---|---|---|---|
| | | b | a | c | a | b |
A | B | 1.0 | 4.0 | 5.0 | 8.0 | 10.0 |
A | C | 2.0 | 3.0 | 6.0 | 7.0 | NaN |
A | D | NaN | NaN | NaN | NaN | 9.0 |
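As a rough sketch of what the tuple keys produce, the following builds a smaller frame of the same shape and inspects both axes. The data is a reduced, reordered subset of the example above; `sub` and `cell` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame(
    {
        ("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
        ("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
        ("b", "a"): {("A", "C"): 7, ("A", "B"): 8},
    }
)

# Both axes are MultiIndex objects with two levels each.
print(type(df.columns).__name__, df.columns.nlevels)
print(type(df.index).__name__, df.index.nlevels)

# Selecting on the outer column level returns the sub-frame under "a".
sub = df["a"]

# A full (row, column) pair of tuples addresses a single cell.
cell = df.loc[("A", "B"), ("a", "a")]
print(cell)
```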
1.6 From a Series
The result will be a DataFrame with the same index as the input Series, and with one column whose name is the original name of the Series (only if no other column name is provided).
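A minimal sketch of this case, using a made-up Series named "ser". The use of Series.to_frame to supply a different column name is one option chosen for illustration, not the only way to do it:

```python
import pandas as pd

ser = pd.Series(range(3), index=list("abc"), name="ser")

# The Series name becomes the single column label; the index is kept.
df_named = pd.DataFrame(ser)
print(df_named)

# An unnamed Series gets the default column label 0.
df_unnamed = pd.DataFrame(pd.Series([10, 20, 30]))
print(list(df_unnamed.columns))

# Series.to_frame is one way to supply a different column name.
df_renamed = ser.to_frame(name="renamed")
print(list(df_renamed.columns))
```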
1.7 From a list of namedtuples
The field names of the first namedtuple in the list determine the columns of the DataFrame. The remaining namedtuples (or tuples) are simply unpacked and their values are fed into the rows of the DataFrame. If any of those tuples is shorter than the first namedtuple then the later columns in the corresponding row are marked as missing values. If any are longer than the first namedtuple, a ValueError is raised.
from collections import namedtuple
import pandas as pd
Point = namedtuple("Point", "x y")
Point
__main__.Point
pd.DataFrame([Point(0, 0), Point(0, 3), (2, 3)])
| | x | y |
---|---|---|
0 | 0 | 0 |
1 | 0 | 3 |
2 | 2 | 3 |
Point3D = namedtuple("Point3D", "x y z")
pd.DataFrame([Point3D(0, 0, 0), Point3D(0, 3, 5), Point(2, 3)])
| | x | y | z |
---|---|---|---|
0 | 0 | 0 | 0.0 |
1 | 0 | 3 | 5.0 |
2 | 2 | 3 | NaN |
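The two length rules described above (shorter rows are padded, longer rows are rejected) can be sketched as follows. The names `padded` and `too_long_error` are illustrative, and the error type assumes a recent pandas:

```python
from collections import namedtuple

import pandas as pd

Point = namedtuple("Point", "x y")
Point3D = namedtuple("Point3D", "x y z")

# A shorter row is padded with a missing value in the trailing column.
padded = pd.DataFrame([Point3D(0, 0, 0), Point(2, 3)])
print(padded)

# A row longer than the first namedtuple is rejected.
try:
    pd.DataFrame([Point(0, 0), (1, 2, 3)])
    too_long_error = None
except ValueError as exc:
    too_long_error = exc
print(type(too_long_error).__name__)
```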
1.8 From a list of data classes
Data classes, as introduced in PEP 557, can be passed into the DataFrame constructor. Passing a list of dataclasses is equivalent to passing a list of dictionaries.
Be aware that all values in the list should be dataclasses; mixing types in the list results in a TypeError.
from dataclasses import make_dataclass
Point = make_dataclass("Point", [("x", int), ("y", int)])
pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)])
| | x | y |
---|---|---|
0 | 0 | 0 |
1 | 0 | 3 |
2 | 2 | 3 |
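The stated equivalence with a list of dictionaries can be checked with a short sketch. It uses the @dataclass decorator rather than make_dataclass, and dataclasses.asdict to build the dictionaries; both are choices made here for illustration:

```python
from dataclasses import asdict, dataclass

import pandas as pd


@dataclass
class Point:
    x: int
    y: int


points = [Point(0, 0), Point(0, 3), Point(2, 3)]

# Build one frame from the dataclass instances and one from plain dicts.
from_dataclasses = pd.DataFrame(points)
from_dicts = pd.DataFrame([asdict(p) for p in points])

# The two frames should hold the same labels and values.
print(from_dataclasses.equals(from_dicts))
```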
Missing data
Much more will be said on this topic in the Missing data section. To construct a DataFrame with missing data, we use np.nan to represent missing values. Alternatively, you may pass a numpy.ma.MaskedArray as the data argument to the DataFrame constructor, and its masked entries will be considered missing.
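Both routes can be sketched briefly. The values and mask below are made up, and the example assumes float data so that missing entries appear as NaN:

```python
import numpy as np
import pandas as pd

# np.nan marks a missing value directly in the input.
df_nan = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})
print(df_nan.isna())

# Masked entries of a numpy.ma.MaskedArray are treated as missing.
masked = np.ma.masked_array(
    [[1.0, 2.0], [3.0, 4.0]],
    mask=[[False, True], [False, False]],
)
df_masked = pd.DataFrame(masked, columns=["a", "b"])
print(df_masked.isna())
```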
1.9 Alternate constructors
DataFrame.from_dict
DataFrame.from_dict takes a dict of dicts or a dict of array-like sequences and returns a DataFrame. It operates like the DataFrame constructor except for the orient parameter, which is "columns" by default but can be set to "index" to use the dict keys as row labels.
pd.DataFrame.from_dict(dict([("A", [1, 2, 3]), ("B", [4, 5, 6])]))
| | A | B |
---|---|---|
0 | 1 | 4 |
1 | 2 | 5 |
2 | 3 | 6 |
If you pass orient="index", the keys will be the row labels. In this case, you can also pass the desired column names:
pd.DataFrame.from_dict(
dict([("A", [1, 2, 3]), ("B", [4, 5, 6])]),
orient="index",
columns=["one", "two", "three"],
)
| | one | two | three |
---|---|---|---|
A | 1 | 2 | 3 |
B | 4 | 5 | 6 |
DataFrame.from_records
DataFrame.from_records takes a list of tuples or an ndarray with structured dtype. It works analogously to the normal DataFrame constructor, except that the resulting DataFrame index may be a specific field of the structured dtype. For example:
data
array([(1, 2., b'Hello'), (2, 3., b'World')],
dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
pd.DataFrame.from_records(data, index="C")
| | A | B |
---|---|---|
C | | |
b'Hello' | 1 | 2.0 |
b'World' | 2 | 3.0 |
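The list-of-tuples input mentioned above is not shown in the example, so here is a minimal sketch of it. The `records` list and the column names are assumptions chosen to mirror the structured array used earlier:

```python
import pandas as pd

records = [(1, 2.0, "Hello"), (2, 3.0, "World")]

# Plain tuples carry no field names, so columns are supplied explicitly;
# index="C" then moves that column into the row index.
df = pd.DataFrame.from_records(records, columns=["A", "B", "C"], index="C")
print(df)
```

Unlike the structured array, the strings here are ordinary Python str values, so the index labels are not bytes.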