pandas: From Series and DataFrame to Slicing | A Closer Look at Several Common Methods

Article link: https://blog.csdn.net/m0_74963458/article/details/146510462

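The data types used most often in pandas are DataFrame and Series, and transforming data with them involves many useful methods. Yet we usually only pay attention to the most basic and most common way of calling those methods. So today, let's look at what their parameters and matching examples can teach us.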

1. Data: how is it constructed?

We can start by creating a Series. First, here is how pd.Series is declared:

class Series(
    data: Sequence[str],
    index: Axes | None = ...,
    *,
    dtype: Dtype = ...,
    name: Hashable = ...,
    copy: bool = ...
)

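The parameters in this declaration of the Series class mean the following:

data: annotated here as Sequence[str], which is only the overload the editor displayed; in practice it can be a list, a tuple, a dict, a NumPy array, or a scalar value
index: the index labels; if omitted, a default integer index starting from 0 is used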
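
As a minimal sketch of the data and index parameters described above (the variable names are made up for illustration), the following builds a Series from a list, from a list with explicit labels, and from a dict:

```python
import pandas as pd

# From a list: pandas assigns a default integer index 0, 1, 2, ...
s_list = pd.Series(['I', 'Love', 'Python'])

# From a list plus explicit labels passed through `index`
s_labeled = pd.Series(['I', 'Love', 'Python'],
                      index=['who', 'how', 'what'], name='words')

# From a dict: the keys become the index labels
s_dict = pd.Series({'who': 'I', 'how': 'Love', 'what': 'Python'})

print(list(s_list.index))     # [0, 1, 2]
print(list(s_labeled.index))  # ['who', 'how', 'what']
print(s_labeled['how'])       # Love
print(list(s_dict.index))     # ['who', 'how', 'what']
```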

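Here I noticed that the parameter name is declared as "name: Hashable = ...". Hashable is a Python type hint indicating that an object is hashable. A hashable object can be used as a dictionary key or stored in a set. Immutable built-in objects such as strings and numbers are hashable; a tuple is hashable only if all of its elements are.

...: in a type stub, the ellipsis stands in for a default value that the stub does not spell out, so the parameter is optional. In ordinary runtime code, writing "= ..." makes the literal Ellipsis object the default.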

from typing import Hashable

def process_name(name: Hashable=...):
    print(f"The name is: {name}")

# Call the function
process_name("Alice")    # The name is: Alice
process_name(123)        # The name is: 123
process_name((1, 2, 3))  # The name is: (1, 2, 3)
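
To make the idea of hashability more concrete, here is a small sketch using only the standard library. The helper is_hashable is a hypothetical name introduced for illustration; it simply tries to call hash() on the object.

```python
from collections.abc import Hashable

def is_hashable(obj):
    """Return True if hash(obj) succeeds (hypothetical helper for illustration)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

nested = (1, [2, 3])  # a tuple that contains a list

print(is_hashable("Alice"))          # True
print(is_hashable((1, 2, 3)))        # True
print(is_hashable([1, 2, 3]))        # False
print(isinstance(nested, Hashable))  # True: only checks that tuple defines __hash__
print(is_hashable(nested))           # False: hashing fails on the inner list
```

The last two lines show why an isinstance check against Hashable is weaker than actually hashing the object.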

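1.1 Exceptions: a way to get more detail

Here is a fun example. We deliberately introduce an unexpected error so that the program raises an exception, and then follow those clues to see how pandas handles the arguments we pass in.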


import pandas as pd

# The dtype below deliberately does not match the data
data = ['I','Love','Python','What about you']
index = ['who','how','what','then']
dtype = 'int64'
iscopy = True
s_new = pd.Series(data=data,index=index,dtype=dtype,copy=iscopy)
print(s_new)

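--------Look at the code above first--------
--------Where will the error occur?--------

Sharp-eyed readers will have spotted it already: the dtype I specified is "int64", but the data passed in are strings.
The error: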

ValueError: invalid literal for int() with base 10: 'I'
Looking at the Python traceback, the first entry points to: "data = sanitize_array(data, index, dtype, copy)"
Jumping into sanitize_array shows what it does: Sanitize input data to an ndarray or ExtensionArray, copy if specified, coerce to the dtype if specified.

So the dtype we pass is not just a label: pandas actively tries to coerce the data to it, which is why the string 'I' ends up being handed to int().

Where the exception is finally raised:

   2038 try:
   2039     if not isinstance(arr, np.ndarray):
-> 2040         casted = np.array(arr, dtype=dtype, copy=copy)
   2041     else:
   2042         casted = arr.astype(dtype, copy=copy)

This code lives in pandas\core\dtypes\cast.py, in the function "maybe_cast_to_integer_array". It takes data and a target dtype and returns the cast data; if the data is incompatible with an integer or unsigned integer dtype, it raises an exception.
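
If you would rather inspect the failure than let it stop the script, the same experiment can be wrapped in try/except. This is only a sketch; the variable names error_message and s_ok are introduced here for illustration, and the exact wording of the message may differ between pandas versions.

```python
import pandas as pd

data = ['I', 'Love', 'Python', 'What about you']
index = ['who', 'how', 'what', 'then']

error_message = None
try:
    # Strings cannot be coerced to int64, so this is expected to fail
    pd.Series(data=data, index=index, dtype='int64')
except ValueError as exc:
    error_message = str(exc)
    print('Conversion failed:', error_message)

# Without an explicit dtype, pandas infers a suitable one for the strings
s_ok = pd.Series(data=data, index=index)
print(s_ok.dtype)
```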

1.2 Building richer data: DataFrame

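# Create data with DataFrame
'''
A single pd.Series holds only one column of data, so we use pd.DataFrame to build multi-column data. When the data comes from a dict, Python versions before 3.7 did not guarantee the order of the keys, so passing columns makes the column order explicit.
'''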

arts = pd.DataFrame({
    'Name':['fangao','dafenqi','bijiasuo'],
    'Age':[62,73,84],
    'Sex':['man','woman','man'],
    'Died':['1920-07-20','1990-09-25','1987-5-12']
},index=['ONE','TWO','THREE'],
columns=['Name','Age','Sex','Died'])
print(arts)

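Summary:

All elements of a Series share a single dtype
A Series is a one-dimensional container; each column of a DataFrame is a Series
When a list of mixed types is passed in, the general-purpose dtype "object" is used
A DataFrame can be viewed as a dict of Series objects: the keys are the column names and the values are the column contents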
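
The points in this summary can be checked with a short sketch. It reuses two rows from the arts example; the variable names mixed and age_column are introduced here for illustration.

```python
import pandas as pd

# A list of mixed types falls back to the general-purpose object dtype
mixed = pd.Series([1, 'two', 3.0])
print(mixed.dtype)  # object

arts = pd.DataFrame({'Name': ['fangao', 'dafenqi'], 'Age': [62, 73]},
                    index=['ONE', 'TWO'])

# Selecting one column by its key returns a Series, much like a dict lookup
age_column = arts['Age']
print(type(age_column))
print(age_column.dtype)   # int64
print(list(arts.keys()))  # ['Name', 'Age']
```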

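2. Slicing data: methods and attributes

Reading data, inspecting its attributes, and computing descriptive statistics: read_csv, index, values, dtypes, describe
In this part I introduce several ways to slice, how to check a variable's type, and how to access the attributes of a Series: its index and its values.

(1) Loading the data: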

# Separator string passed to print() between outputs so they are easier to read
END_STRING = '\n'+'--------------------------------'+'\n'
scientists = pd.read_csv('../data/scientists.csv')
print(scientists)

OUTPUT:
                   Name        Born        Died  Age          Occupation
0     Rosaline Franklin  1920-07-25  1958-04-16   37             Chemist
1        William Gosset  1876-06-13  1937-10-16   61        Statistician
2  Florence Nightingale  1820-05-12  1910-08-13   90               Nurse
3           Marie Curie  1867-11-07  1934-07-04   66             Chemist
4         Rachel Carson  1907-05-27  1964-04-14   56           Biologist
5             John Snow  1813-03-15  1858-06-16   45           Physician
6           Alan Turing  1912-06-23  1954-06-07   41  Computer Scientist
7          Johann Gauss  1777-04-30  1855-02-23   77       Mathematician

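(2) Slices and variables

# Take a slice: one full row, returned as a Series
william = scientists.iloc[1,:]  # row 1 is William Gosset; row 0 is Rosaline Franklin
print(type(william))
print(END_STRING,william)
print(END_STRING,william.values)
print(scientists.index)
print(scientists.dtypes)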
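
Since scientists.csv may not be at hand, here is a self-contained sketch of the same row slice. The small people DataFrame is a stand-in built from the first two rows of the table above.

```python
import pandas as pd

# Stand-in for the first two rows of scientists.csv
people = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset'],
    'Age': [37, 61],
})

row = people.iloc[1, :]   # second row, all columns
print(type(row))          # a row slice is a Series
print(list(row.index))    # ['Name', 'Age']: column names become the row's index
print(row['Name'])        # William Gosset
print(row.values)         # the underlying values as a NumPy array
print(people.index)       # row labels of the DataFrame
print(people.dtypes)      # one dtype per column
```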

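Slice out the Age column: ages = scientists['Age']
Vectorized operation: use a boolean mask to pick the elements of the original data that meet a condition
Read a value from ages_describe, the Series returned by describe(); for example, the 25% quantile of the ages:

ages_describe = ages.describe()
print(type(ages_describe),ages_describe.index,
ages_describe['25%'],END_STRING)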

ages = scientists['Age']
ages_describe = ages.describe() #<class 'pandas.core.series.Series'>
print(ages,type(ages),END_STRING)
print(ages.describe(),END_STRING)
# describe() returns a Series, so its statistics can be read by label
print(type(ages_describe),ages_describe.index,
      ages_describe['25%'],END_STRING)
# Boolean mask: keep only the ages above the mean
print(ages[ages >ages.mean()],END_STRING)
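
The boolean selection and the describe() lookup can be reproduced without the CSV file. The sketch below types in the Age column from the table above by hand; the names mask and above_mean are introduced for illustration.

```python
import pandas as pd

# Ages copied from the scientists table shown earlier
ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name='Age')

mask = ages > ages.mean()   # element-wise comparison gives a boolean Series
above_mean = ages[mask]     # keeps only the rows where the mask is True

print(ages.mean())             # 59.125
print(list(above_mean.index))  # [1, 2, 3, 7]
print(list(above_mean))        # [61, 90, 66, 77]

ages_describe = ages.describe()
print(ages_describe['25%'])    # 44.0
```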

3. Going further: a closer look at functions

This part digs into Series.sort_index() and pd.to_datetime().

(1) Sorting: sort_index

What it does -> Sort Series by index labels. Returns a new Series sorted by label if inplace argument is False, otherwise updates the original series and returns None.

Function declaration:

def sort_index(
    *,
    axis: Axis = 0,
    level: Level | list[int] | list[str] | None = None,
    ascending: bool | Sequence[bool] = True,
    kind: SortKind = 'quicksort',
    na_position: NaPosition = 'last',
    sort_remaining: bool = True,
    ignore_index: bool = False,
    inplace: bool = False,
    key: ((...) -> Any) | None = None
) -> Series

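Note: sort_index is not limited to Series; DataFrame has the same method for sorting by index labels.

Key parameters:

axis: defaults to 0, meaning sort by the row index
level: for a multi-level index, the level(s) to sort on
ascending: defaults to True, meaning ascending order
inplace: defaults to False, which returns a new Series
key: a function applied to the index values before sorting; its result is only used to decide the order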
Here I will also introduce a way to generate random data, which makes it quick to test our guesses about DataFrame behavior on a relatively larger dataset.

# Generate random elements
import random
import string

import pandas as pd

def generate_random_str(length):
    # Random string made of ASCII letters and digits
    return ''.join(random.choices(string.ascii_letters + string.digits,k=length))

# Build the column data
num_rows = 5
num_cols = 3
data = {f'col_{i}':[generate_random_str(6)
                    for _ in range(num_rows) ]
            for i in range(num_cols)}

# Create the DataFrame; pass the labels as a flat list
# (wrapping them in another list would build a MultiIndex instead)
df = pd.DataFrame(data,index=random.sample(string.digits,num_rows))
df2 = df.copy()
df2.sort_index(axis=0,inplace=True,ascending=True)
df.rename(columns = lambda x: x.upper(),inplace=True)
print('Randomly generated data:','\n',df)
print(END_STRING,df2)

The result looks like this:
Randomly generated data:
    COL_0   COL_1   COL_2
1  aJKfgQ  1vWd4T  giYCte
7  FnUfa4  ZmTOnY  Q8iGJS
8  rTaYQc  TUTjRR  2MYwcL
0  LFkzWO  ofA3BA  OUK33S
5  kW5yec  RzvzQb  FkIDk8

========================
Sorted data:
    col_0   col_1   col_2
0  LFkzWO  ofA3BA  OUK33S
1  aJKfgQ  1vWd4T  giYCte
5  kW5yec  RzvzQb  FkIDk8
7  FnUfa4  ZmTOnY  Q8iGJS
8  rTaYQc  TUTjRR  2MYwcL

(2) Renaming: rename

rename works on row labels as well as column names. Take the following signature as an example:

(method) def rename(
    mapper: Renamer | None = ...,
    *,
    index: Renamer | None = ...,
    columns: Renamer | None = ...,
    axis: Axis | None = ...,
    copy: bool = ...,
    inplace: Literal[True],
    level: Level | None = ...,
    errors: IgnoreRaise = ...
) -> None

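The first parameter of rename is mapper, a dict-like object or a function that is applied to the labels of the axis selected by axis: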

df2.rename(mapper=lambda f:f.split('_')[1],inplace=True,
           axis='columns',copy=True)
print(df2)

Result:

        0       1       2
0  Otx7E4  02PaEY  tJqW3P
4  NtwVaC  hXy6TI  HZ4i4a
6  SlKL72  gH9hJs  fnbKuK
8  EEDHtP  UQ3HCh  35BBdp
9  jsPN7u  4k6EPE  26OYWa
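
Besides mapper with axis, rename also accepts the index and columns keywords directly, and either a dict or a function can serve as the mapping. A minimal sketch on a made-up two-row DataFrame:

```python
import pandas as pd

df_small = pd.DataFrame({'col_0': [1, 2], 'col_1': [3, 4]}, index=['a', 'b'])

# A dict renames only the listed labels; a function is applied to every label
renamed = df_small.rename(columns={'col_0': 'first'}, index=str.upper)
print(list(renamed.columns))  # ['first', 'col_1']
print(list(renamed.index))    # ['A', 'B']

# mapper plus axis is equivalent to using the columns keyword
by_mapper = df_small.rename(lambda name: name.split('_')[1], axis='columns')
print(list(by_mapper.columns))  # ['0', '1']

# Without inplace=True the original is left untouched
print(list(df_small.columns))   # ['col_0', 'col_1']
```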

(3) Time series: to_datetime

# Parse the date strings into datetime64 columns
born_datetime = pd.to_datetime(scientists['Born'],format = '%Y-%m-%d')
died_datetime = pd.to_datetime(scientists['Died'],format = '%Y-%m-%d')
scientists['born_dt'],scientists['died_dt'] = (born_datetime,died_datetime)
print(scientists.head())
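
Once the columns are real datetimes, they support date arithmetic and the .dt accessor. The sketch below uses two rows typed in from the scientists table; the names lifespan_days and parsed are introduced for illustration, and errors='coerce' is shown as one way to handle strings that do not match the format.

```python
import pandas as pd

born = pd.to_datetime(pd.Series(['1920-07-25', '1876-06-13']), format='%Y-%m-%d')
died = pd.to_datetime(pd.Series(['1958-04-16', '1937-10-16']), format='%Y-%m-%d')

print(list(born.dt.year))   # [1920, 1876]

# Subtracting two datetime Series gives a timedelta Series
lifespan_days = (died - born).dt.days
print(lifespan_days)

# errors='coerce' turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(pd.Series(['1958-04-16', 'not a date']),
                        format='%Y-%m-%d', errors='coerce')
print(parsed.isna().tolist())  # [False, True]
```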