Python Cool Libraries Tour: Third-Party Library Pandas (061)

神奇夜光杯

Published 2024-08-03 08:06:56

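Column: Myelsa's Python Cool Libraries Tour. Tags: python, pandas, programming languages, artificial intelligence, excel, third-party libraries, learning and growth

Article link: https://blog.csdn.net/ygb_1024/article/details/140856450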

I. Usage in Detail

236. pandas.Series.explode method

236-1. Syntax

# 236. pandas.Series.explode method
pandas.Series.explode(ignore_index=False)
Transform each element of a list-like to a row.

Parameters:
ignore_index
bool, default False
If True, the resulting index will be labeled 0, 1, …, n - 1.

Returns:
Series
Exploded lists to rows; index will be duplicated for these rows.

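236-2. Parameters

236-2-1. ignore_index (optional, default False): a boolean. If False, the exploded Series keeps the original index, so every element that came from the same list repeats the index label of its source row. If True, the original index is discarded and the result is labeled with a new integer index 0, 1, …, n - 1.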
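
To make the effect of ignore_index concrete, here is a minimal sketch. The Series `s`, its string labels, and the variable names are made up for illustration and do not come from the original examples.

```python
import pandas as pd

# Hypothetical sample data: two rows, each holding a list
s = pd.Series([[1, 2], [3]], index=["a", "b"])

# Default (ignore_index=False): each element keeps the label of its source row
kept = s.explode()

# ignore_index=True: the labels are replaced by 0, 1, ..., n - 1
renumbered = s.explode(ignore_index=True)

print(kept.index.tolist())        # ['a', 'a', 'b']
print(renumbered.index.tolist())  # [0, 1, 2]
```

Keeping the duplicated labels is useful when the exploded rows later need to be joined back to the rows they came from.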

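236-3. Function

Expands a Series whose elements are lists, tuples, or similar list-like objects so that each inner element gets its own row in the new Series. Put simply, it turns a Series of lists into a flat Series in which every list element occupies one row.

236-4. Return value

Returns a new Series. Its index is either the original index with labels duplicated for the expanded rows (ignore_index=False) or a newly generated integer index (ignore_index=True). Every item inside a list-like element becomes a new row; an element that is not list-like is left unchanged.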
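
The behavior for elements that are not list-like can be checked with a short sketch. The mixed Series below is a made-up example; it also includes an empty list, which pandas turns into a single NaN row.

```python
import pandas as pd

# Hypothetical data mixing a list, a plain string, an empty list, and another list
mixed = pd.Series([[1, 2], "x", [], [3]])

result = mixed.explode()

# The string is kept as-is (strings are not treated as list-like),
# and the empty list yields one row holding NaN.
print(result)
```

Because the result mixes integers, a string, and NaN, its dtype stays object.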

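236-5. Notes

Use cases:

236-5-1. Handling nested list data: data imported from JSON, databases, or other sources often has a list nested inside a single cell. explode() expands these nested lists into separate rows for further analysis. Example: e-commerce order data, where each order contains several items.

236-5-2. Data cleaning and preprocessing: during cleaning it is often necessary to split several values held in one cell into multiple rows before further processing. Example: user tag data, where each user may have several tags.

236-5-3. Text analysis: in natural language processing and text analysis, text is commonly split into words or phrases that are then analyzed individually. explode() expands the tokenized lists into one row per token. Example: tokenized text data.

236-5-4. Time series processing: a single time point may correspond to several events or values. explode() expands such multi-valued time points into one row per event for further analysis. Example: several events at one time point.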
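
As a rough sketch of the text-analysis use case above, the following splits sentences into tokens, explodes them, and counts word frequencies. The sample sentences and variable names are invented for illustration.

```python
import pandas as pd

# Hypothetical documents, one string per row
docs = pd.Series(["red green red", "green blue"])

# Split on whitespace to get one list of tokens per row, then flatten
tokens = docs.str.split().explode()

# Count how often each token appears across all documents
counts = tokens.value_counts()

print(tokens.index.tolist())  # [0, 0, 0, 1, 1]
print(counts["red"], counts["green"], counts["blue"])  # 2 2 1
```

The duplicated index on `tokens` records which document each token came from, so per-document statistics remain possible after flattening.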

236-6. Usage

236-6-1. Data preparation

None.

236-6-2. Code examples

# 236. pandas.Series.explode method
# 236-1. Handling nested list data
import pandas as pd
# Sample data: one list of items per order
orders = pd.Series([['item1', 'item2'], ['item3'], ['item4', 'item5', 'item6']])
# Expand the item lists with explode
exploded_orders = orders.explode()
print(exploded_orders, end='\n\n')

# 236-2. Data cleaning and preprocessing
import pandas as pd
# Sample data: one list of tags per user
user_tags = pd.Series([['tag1', 'tag2'], ['tag3'], ['tag4', 'tag5', 'tag6']])
# Expand the tag lists with explode
exploded_tags = user_tags.explode()
print(exploded_tags, end='\n\n')

# 236-3. Text analysis
import pandas as pd
# Sample data: one list of tokens per text
texts = pd.Series([['word1', 'word2', 'word3'], ['word4'], ['word5', 'word6']])
# Expand the tokenized lists with explode
exploded_texts = texts.explode()
print(exploded_texts, end='\n\n')

# 236-4. Time series processing
import pandas as pd
# Sample data: one list of events per time point
time_series = pd.Series([['event1', 'event2'], ['event3'], ['event4', 'event5', 'event6']])
# Expand the per-time-point event lists with explode
exploded_time_series = time_series.explode()
print(exploded_time_series)

236-6-3. Output

# 236. pandas.Series.explode method
# 236-1. Handling nested list data
# 0    item1
# 0    item2
# 1    item3
# 2    item4
# 2    item5
# 2    item6
# dtype: object

# 236-2. Data cleaning and preprocessing
# 0    tag1
# 0    tag2
# 1    tag3
# 2    tag4
# 2    tag5
# 2    tag6
# dtype: object

# 236-3. Text analysis
# 0    word1
# 0    word2
# 0    word3
# 1    word4
# 2    word5
# 2    word6
# dtype: object

# 236-4. Time series processing
# 0    event1
# 0    event2
# 1    event3
# 2    event4
# 2    event5
# 2    event6
# dtype: object

237. pandas.Series.searchsorted method

237-1. Syntax

# 237. pandas.Series.searchsorted method
pandas.Series.searchsorted(value, side='left', sorter=None)
Find indices where elements should be inserted to maintain order.

Find the indices into a sorted Series self such that, if the corresponding elements in value were inserted before the indices, the order of self would be preserved.

Note

The Series must be monotonically sorted, otherwise wrong locations will likely be returned. Pandas does not check this for you.

Parameters:
value
array-like or scalar
Values to insert into self.

side
{‘left’, ‘right’}, optional
If ‘left’, the index of the first suitable location found is given. If ‘right’, return the last such index. If there is no suitable index, return either 0 or N (where N is the length of self).

sorter
1-D array-like, optional
Optional array of integer indices that sort self into ascending order. They are typically the result of np.argsort.

Returns:
int or array of int
A scalar or array of insertion points with the same shape as value.

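237-2. Parameters

237-2-1. value (required): a scalar or array-like; the value or values whose insertion positions are to be found.

237-2-2. side (optional, default 'left'): {'left', 'right'}. Decides where the insertion point falls when elements equal to value already exist. 'left' returns the position before the first equal element; 'right' returns the position after the last equal element.

237-2-3. sorter (optional, default None): an optional array-like of integer indices that sort the Series into ascending order, typically the result of argsort.

237-3. Function

Finds the position at which a value, or each of a set of values, should be inserted into a sorted Series so that the order is preserved. It is useful for binary search, ordered insertion, and positional lookups.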
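
One way to read "insertion that preserves order" is sketched below: the position returned by searchsorted is passed to np.insert to build a new sorted Series. Combining it with np.insert is my own illustration, not something the original text prescribes, and the variable names are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical sorted data and a new value to place into it
sorted_values = pd.Series([10, 20, 30, 40])
new_value = 25

# Position at which new_value must go to keep ascending order
pos = sorted_values.searchsorted(new_value)

# Build a new Series with the value inserted at that position
updated = pd.Series(np.insert(sorted_values.to_numpy(), pos, new_value))

print(pos)               # 2
print(updated.tolist())  # [10, 20, 25, 30, 40]
```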

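237-4. Return value

Returns an integer for a scalar value, or an array of integers with the same shape as value, giving the insertion positions.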
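
A short sketch of the two return shapes, using a made-up sorted Series: a scalar argument gives a single integer, while a list gives a NumPy array with one position per value.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Scalar in, single integer position out
single = s.searchsorted(3)

# Array-like in, array of positions out; values outside the range
# map to 0 (before everything) or len(s) (after everything)
many = s.searchsorted([0, 3, 6])

print(single)         # 2
print(many.tolist())  # [0, 2, 5]
```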

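237-5. Notes

None.

237-6. Usage

237-6-1. Data preparation

None.

237-6-2. Code examples

# 237. pandas.Series.searchsorted method
# 237-1. Basic usage
import pandas as pd
# Create a sorted Series
s = pd.Series([1, 2, 3, 4, 5])
# Find the insertion position for the value
index = s.searchsorted(3)
print(index, end='\n\n')

# 237-2. Using the 'side' parameter
import pandas as pd
# Create a sorted Series that contains duplicates
s = pd.Series([1, 2, 3, 3, 4, 5])
# Find the insertion position (left of the equal elements)
index_left = s.searchsorted(3, side='left')
print(index_left)
# Find the insertion position (right of the equal elements)
index_right = s.searchsorted(3, side='right')
print(index_right, end='\n\n')

# 237-3. Handling an unsorted Series
import pandas as pd
# Create an unsorted Series
s = pd.Series([5, 1, 4, 2, 3])
# Get the indices that would sort the Series
sorter = s.argsort()
# Find the insertion position; it refers to the sorted order, not the original order
index = s.searchsorted(3, sorter=sorter)
print(index)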

237-6-3. Output

# 237. pandas.Series.searchsorted method
# 237-1. Basic usage
# 2

# 237-2. Using the 'side' parameter
# 2
# 4

# 237-3. Handling an unsorted Series
# 2

238. pandas.Series.ravel method

238-1. Syntax

# 238. pandas.Series.ravel method
pandas.Series.ravel(order='C')
Return the flattened underlying data as an ndarray or ExtensionArray.

Deprecated since version 2.2.0: Series.ravel is deprecated. The underlying array is already 1D, so ravel is not necessary. Use to_numpy() for conversion to a numpy array instead.

Returns:
numpy.ndarray or ExtensionArray
Flattened data of the Series.

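238-2. Parameters

238-2-1. order (optional, default 'C'): a string, one of:

'C': flatten in C-style row-major order (read along rows first, then columns).
'F': flatten in Fortran-style column-major order (read along columns first, then rows).
'A': flatten in row-major order if the data is stored row-major in memory, and in column-major order if it is stored column-major.
'K': keep the order in which the elements are stored in memory as far as possible.

Because the data underlying a Series is already one-dimensional, every choice of order gives the same result.

238-3. Function

Returns the underlying data of the Series as a flattened one-dimensional array.

238-4. Return value

Returns a one-dimensional NumPy array (or an ExtensionArray for extension dtypes) containing all the data of the original Series.

238-5. Notes

Series.ravel is deprecated since pandas 2.2.0. It still works in that version, but the documentation recommends pandas.Series.to_numpy for conversion to a NumPy array.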
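
Since to_numpy() is the documented replacement, a minimal sketch of the equivalent conversion looks like this (the Series contents are the same toy values used elsewhere in this section):

```python
import numpy as np
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5])

# Recommended replacement for the deprecated data.ravel()
arr = data.to_numpy()

print(type(arr).__name__, arr.ndim)  # ndarray 1
print(arr.tolist())                  # [1, 2, 3, 4, 5]
```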

238-6. Usage

238-6-1. Data preparation

None.

238-6-2. Code example

# 238. pandas.Series.ravel method
import pandas as pd
# Create a pandas Series
data = pd.Series([1, 2, 3, 4, 5])
# Call ravel(); on pandas 2.2.0 and later this emits a deprecation warning
flattened_data_C = data.ravel(order='C')
flattened_data_F = data.ravel(order='F')
print("Flattened data (C order):", flattened_data_C)
print("Flattened data (F order):", flattened_data_F)

238-6-3. Output

# 238. pandas.Series.ravel method
# Flattened data (C order): [1 2 3 4 5]
# Flattened data (F order): [1 2 3 4 5]

239. pandas.Series.repeat method

239-1. Syntax

# 239. pandas.Series.repeat method
pandas.Series.repeat(repeats, axis=None)
Repeat elements of a Series.

Returns a new Series where each element of the current Series is repeated consecutively a given number of times.

Parameters:
repeats
int or array of ints
The number of repetitions for each element. This should be a non-negative integer. Repeating 0 times will return an empty Series.

axis
None
Unused. Parameter needed for compatibility with DataFrame.

Returns:
Series
Newly created Series with repeated elements.

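239-2. Parameters

239-2-1. repeats (required): an integer or an array of integers. A single integer repeats every element of the Series that many times; an integer array of the same length as the Series repeats each element by the count at the corresponding position.

239-2-2. axis (optional, default None): unused. A Series is one-dimensional, so the parameter has no effect and exists only for compatibility with DataFrame.

239-3. Function

Repeats each element of a Series consecutively a given number of times. It is useful for expanding data or increasing the number of rows.

239-4. Return value

Returns a new pandas Series in which each element has been repeated the specified number of times. Repeated elements keep the index label of the element they were copied from.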
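
Two edge behaviors are worth a quick sketch: repeating 0 times gives an empty Series, and the duplicated index labels can be replaced with reset_index(drop=True) when a clean 0..n-1 index is wanted. The sample data and names below are illustrative only.

```python
import pandas as pd

data = pd.Series(["a", "b"])

# Repeating 0 times returns an empty Series
empty = data.repeat(0)

# Repeated rows keep the label of the element they came from
doubled = data.repeat(2)

# Optional: replace the duplicated labels with a fresh 0..n-1 index
renumbered = doubled.reset_index(drop=True)

print(len(empty))                 # 0
print(doubled.index.tolist())     # [0, 0, 1, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```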

239-5. Notes

None.

239-6. Usage

239-6-1. Data preparation

None.

239-6-2. Code example

# 239. pandas.Series.repeat method
import pandas as pd
# Create a pandas Series
data = pd.Series([1, 2, 3])
# Repeat every element 3 times
repeated_data_1 = data.repeat(3)
# Repeat each element by the matching count in the given array
repeated_data_2 = data.repeat([1, 2, 3])
print("Repeated data (3 times):")
print(repeated_data_1)
print("\nRepeated data (1, 2, 3 times respectively):")
print(repeated_data_2)

239-6-3. Output

# 239. pandas.Series.repeat method
# Repeated data (3 times):
# 0    1
# 0    1
# 0    1
# 1    2
# 1    2
# 1    2
# 2    3
# 2    3
# 2    3
# dtype: int64
# 
# Repeated data (1, 2, 3 times respectively):
# 0    1
# 1    2
# 1    2
# 2    3
# 2    3
# 2    3
# dtype: int64

240. pandas.Series.squeeze method

240-1. Syntax

# 240. pandas.Series.squeeze method
pandas.Series.squeeze(axis=None)
Squeeze 1 dimensional axis objects into scalars.

Series or DataFrames with a single element are squeezed to a scalar. DataFrames with a single column or a single row are squeezed to a Series. Otherwise the object is unchanged.

This method is most useful when you don’t know if your object is a Series or DataFrame, but you do know it has just a single column. In that case you can safely call squeeze to ensure you have a Series.

Parameters:
axis
{0 or ‘index’, 1 or ‘columns’, None}, default None
A specific axis to squeeze. By default, all length-1 axes are squeezed. For Series this parameter is unused and defaults to None.

Returns:
DataFrame, Series, or scalar
The projection after squeezing axis or all the axes.

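240-2. Parameters

240-2-1. axis (optional, default None): {0 or 'index', 1 or 'columns', None}. For a Series the parameter is unused; for a DataFrame the options are:

None: the default; every axis of length 1 is squeezed.
0 or 'index': squeeze the index axis if it has only one entry.
1 or 'columns': squeeze the columns axis if it has only one entry.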
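
The three axis options can be compared on a 1x1 DataFrame, where both axes have length 1. This is a small sketch with a made-up frame named `cell`.

```python
import pandas as pd

# Hypothetical 1x1 DataFrame: one row, one column
cell = pd.DataFrame({"A": [1]})

# axis=None: both length-1 axes are squeezed, leaving a scalar
as_scalar = cell.squeeze()

# axis=0: only the index axis is squeezed, leaving a Series indexed by column names
by_rows = cell.squeeze(axis=0)

# axis=1: only the columns axis is squeezed, leaving a Series indexed by row labels
by_cols = cell.squeeze(axis=1)

print(as_scalar)               # 1
print(by_rows.index.tolist())  # ['A']
print(by_cols.index.tolist())  # [0]
```

Passing an explicit axis is a way to guarantee a Series even when the frame happens to contain a single cell.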

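240-3. Function

Removes axes of length 1. A Series with a single element becomes a scalar, and a DataFrame with a single column or a single row becomes a Series. It is commonly applied to single-column or single-row results extracted from a DataFrame to make the result more compact.

240-4. Return value

Returns the object with its length-1 axes removed (a scalar, Series, or DataFrame). If no axis has length 1, the object is returned unchanged.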
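
For a Series itself the rule reduces to two cases, sketched here with made-up values: a single-element Series becomes a scalar, and anything longer comes back as a Series.

```python
import pandas as pd

one = pd.Series([42])
several = pd.Series([1, 2, 3])

# Single element: squeezed down to a scalar
scalar = one.squeeze()

# More than one element: no length-1 axis, so the data stays a Series
unchanged = several.squeeze()

print(scalar)                          # 42
print(type(unchanged).__name__)        # Series
```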

240-5. Notes

None.

240-6. Usage

240-6-1. Data preparation

None.

240-6-2. Code examples

# 240. pandas.Series.squeeze method
# 240-1. Extracting a single row or column from a DataFrame
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({
    'A': [10, 20, 30],
    'B': [15, 25, 35]
})
# Extract a single column (still a DataFrame) and squeeze it to a Series
single_column = df[['A']]
squeezed_column = single_column.squeeze()
# Extract a single row (still a DataFrame) and squeeze it to a Series
single_row = df.iloc[[0]]
squeezed_row = single_row.squeeze()
print("Original single column DataFrame:")
print(single_column)
print("Squeezed Series from single column:")
print(squeezed_column)
print("Original single row DataFrame:")
print(single_row)
print("Squeezed Series from single row:")
print(squeezed_row, end='\n\n')

# 240-2. Working with grouped results
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({
    'Category': ['A', 'A', 'B'],
    'Value': [10, 20, 30]
})
# Group by 'Category' and compute the mean
grouped = df.groupby('Category').mean()
# Select one category (a 1x1 DataFrame) and squeeze it to a scalar
single_category_mean = grouped.loc[['A']]
squeezed_category_mean = single_category_mean.squeeze()
print("Grouped mean DataFrame:")
print(single_category_mean)
print("Squeezed mean for single category:")
print(squeezed_category_mean, end='\n\n')

# 240-3. Memory usage before and after squeeze
import pandas as pd
# Create a large DataFrame
large_df = pd.DataFrame({'Value': range(1000000)})
# Extract the single column and squeeze it to a Series
squeezed_series = large_df[['Value']].squeeze()
# Compare memory usage; squeeze changes the container type, not the data size
print("Memory usage of original DataFrame:", large_df.memory_usage(deep=True).sum())
print("Memory usage of squeezed Series:", squeezed_series.memory_usage(deep=True), end='\n\n')

# 240-4. Passing data to functions
import matplotlib.pyplot as plt
# Define a plotting function that expects a Series
def plot_series(series):
    series.plot(kind='line', title='Series Plot')
    plt.show()
# Extract the data (df is the DataFrame from example 240-2) and pass it to the function
data = df[['Value']].iloc[0:3]  # single-column DataFrame
plot_series(data.squeeze())

# 240-5. Simplifying output
# Compute the mean (a one-element Series) and squeeze it to a scalar
processed_result = df[['Value']].mean().squeeze()
def display_result(result):
    print(f"Processed Result: {result}")
# Display the squeezed scalar
display_result(processed_result)

# 240-6. Data cleaning and conversion
import pandas as pd
# Create a DataFrame whose cells hold one-element lists (a redundant dimension)
redundant_df = pd.DataFrame({'Value': [[10], [20], [30]]})
# Clean the data with apply and squeeze
cleaned_series = redundant_df['Value'].apply(lambda x: pd.Series(x).squeeze())
print("Original DataFrame with redundant dimension:")
print(redundant_df)
print("Cleaned Series:")
print(cleaned_series, end='\n\n')

# 240-7. Mathematical and statistical calculations
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({'Value': [10, 20, 30]})
# Compute the sum (a one-element Series) and squeeze it to a scalar
total_sum = df[['Value']].sum().squeeze()
print("Total sum of values:", total_sum)

240-6-3. Output

# 240. pandas.Series.squeeze method
# 240-1. Extracting a single row or column from a DataFrame
# Original single column DataFrame:
#     A
# 0  10
# 1  20
# 2  30
# Squeezed Series from single column:
# 0    10
# 1    20
# 2    30
# Name: A, dtype: int64
# Original single row DataFrame:
#     A   B
# 0  10  15
# Squeezed Series from single row:
# A    10
# B    15
# Name: 0, dtype: int64

# 240-2. Working with grouped results
# Grouped mean DataFrame:
#           Value
# Category       
# A          15.0
# Squeezed mean for single category:
# 15.0

# 240-3. Memory usage before and after squeeze
# Memory usage of original DataFrame: 8000132
# Memory usage of squeezed Series: 8000132

# 240-4. Passing data to functions
# See Figure 1

# 240-5. Simplifying output
# Processed Result: 20.0

# 240-6. Data cleaning and conversion
# Original DataFrame with redundant dimension:
#   Value
# 0  [10]
# 1  [20]
# 2  [30]
# Cleaned Series:
# 0    10
# 1    20
# 2    30
# Name: Value, dtype: int64

# 240-7. Mathematical and statistical calculations
# Total sum of values: 60

Figure 1: line plot of the Value Series, titled "Series Plot".