Python Cool Libraries Tour - Third-Party Library Pandas (107) - CSDN Blog

466. pandas.DataFrame.eq method

466-1. Syntax

# 466. pandas.DataFrame.eq method
pandas.DataFrame.eq(other, axis='columns', level=None)
Get Equal to of dataframe and other, element-wise (binary operator eq).

Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters:
other
scalar, sequence, Series, or DataFrame
Any single or multiple element data structure, or list-like object.

axis
{0 or ‘index’, 1 or ‘columns’}, default ‘columns’
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).

level
int or label
Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns:
DataFrame of bool
Result of the comparison.

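466-2. Parameters

466-2-1. other (required): a scalar, sequence, Series, or DataFrame to compare against the DataFrame. If other is a scalar, every element of the DataFrame is compared with it; if it is a Series or another DataFrame, the labels are aligned first and the comparison is then made element by element.

466-2-2. axis (optional, default 'columns'): {0 or 'index', 1 or 'columns'}. Selects the axis on which other is aligned. With 0 or 'index', the index of other is matched against the row labels; with 1 or 'columns', it is matched against the column labels. It mainly matters when other is a Series or another list-like object.

466-2-3. level (optional, default None): int or label. Used with a MultiIndex: the comparison is broadcast across the given level, matching index values on that level of the MultiIndex.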
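
As a rough illustration of how axis and level steer the alignment, the sketch below compares a small DataFrame with a Series along each axis and then with a MultiIndex DataFrame. The data and the names row_targets, col_targets, and df_multi are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])

# axis='index': the Series index is matched against the row labels
row_targets = pd.Series([1, 5, 3], index=['x', 'y', 'z'])
by_rows = df.eq(row_targets, axis='index')

# axis='columns' (the default): the Series index is matched against the column labels
col_targets = pd.Series({'A': 1, 'B': 5})
by_cols = df.eq(col_targets, axis='columns')

# level=1: df is broadcast across the second level of the MultiIndex
df_multi = pd.DataFrame(
    {'A': [1, 2, 3, 0, 0, 0], 'B': [4, 5, 6, 0, 0, 0]},
    index=pd.MultiIndex.from_product([['Q1', 'Q2'], ['x', 'y', 'z']]),
)
by_level = df.eq(df_multi, level=1)

print(by_rows)
print(by_cols)
print(by_level)
```

In this sketch the Q1 rows of df_multi repeat df and the Q2 rows are all zeros, so by_level is True for Q1 and False for Q2.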

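466-3. Function

Compares each element of the DataFrame with another object (a scalar, Series, DataFrame, and so on) and tests whether they are equal. It is commonly used to check how closely two datasets match or to verify data consistency.

466-4. Return value

A DataFrame of bool in which each element tells whether the corresponding element of the original DataFrame equals the corresponding element of other. That is, the result at row i, column j is True when the element of df at that position equals the matching element of other (or other itself, when it is a scalar), and False otherwise.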
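
One detail worth keeping in mind: eq follows the usual == semantics, so a missing value never compares equal, not even to another missing value. The following is a minimal sketch with made-up data that contrasts this with DataFrame.equals, which treats NaNs in the same location as equal.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'A': [1.0, np.nan], 'B': ['u', 'v']})
right = left.copy()

# Element-wise comparison: the NaN position comes out False
elementwise = left.eq(right)

# Whole-frame test: NaNs in the same place count as equal
same_frame = left.equals(right)

print(elementwise)
print(same_frame)
```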

466-5. Notes

None

466-6. Usage

466-6-1. Data preparation

None

466-6-2. Code example

# 466. pandas.DataFrame.eq method
import pandas as pd
# Create a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})
# Test whether each element of the DataFrame equals the scalar 4
result = df.eq(4)
print(result)

466-6-3. Output

# 466. pandas.DataFrame.eq method
#        A      B
# 0  False   True
# 1  False  False
# 2  False  False

467. pandas.DataFrame.combine method

467-1. Syntax

# 467. pandas.DataFrame.combine method
pandas.DataFrame.combine(other, func, fill_value=None, overwrite=True)
Perform column-wise combine with another DataFrame.

Combines a DataFrame with other DataFrame using func to element-wise combine columns. The row and column indexes of the resulting DataFrame will be the union of the two.

Parameters:
other
DataFrame
The DataFrame to merge column-wise.

func
function
Function that takes two series as inputs and return a Series or a scalar. Used to merge the two dataframes column by columns.

fill_value
scalar value, default None
The value to fill NaNs with prior to passing any column to the merge func.

overwrite
bool, default True
If True, columns in self that do not exist in other will be overwritten with NaNs.

Returns:
DataFrame
Combination of the provided DataFrames.

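467-2. Parameters

467-2-1. other (required): the other DataFrame to combine with the current one. The indexes and columns of the two DataFrames are aligned with each other.

467-2-2. func (required): the function used to combine the two DataFrames. It must accept two Series (the aligned columns with the same name from self and other) and return a Series or a scalar. It is called once per column, not once per pair of elements.

467-2-3. fill_value (optional, default None): a scalar. When given, NaNs in each column are filled with this value before the column is passed to func.

467-2-4. overwrite (optional, default True): a bool. If True, columns of self that do not exist in other are overwritten with NaN; if False, those columns keep their values from self.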
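
To make the roles of func and overwrite more concrete, here is a small sketch modeled on the kind of example used in the pandas documentation. take_smaller is an illustrative helper, not part of pandas: it receives two whole columns and returns the one with the smaller sum.

```python
import pandas as pd

def take_smaller(s1, s2):
    # func is called with two aligned Series (one column from each frame)
    return s1 if s1.sum() < s2.sum() else s2

df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})
df2 = pd.DataFrame({'B': [3, 3], 'C': [-10, 1]}, index=[1, 2])

# overwrite=True (default): column 'A' does not exist in df2, so it becomes NaN
overwritten = df1.combine(df2, take_smaller)

# overwrite=False: column 'A' keeps the values it had in df1
kept = df1.combine(df2, take_smaller, overwrite=False)

print(overwritten)
print(kept)
```

In both results the row index is the union [0, 1, 2] and the columns are the union A, B, C, which is why rows or columns present in only one frame show NaN.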

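467-3. Function

Combines two DataFrames column by column. The supplied func defines how each pair of aligned columns is merged, which allows flexible operations between two DataFrames such as taking the element-wise maximum or minimum, or any other custom rule.

467-4. Return value

A new DataFrame in which each column is the result of calling func on the corresponding columns of self and other. Its row and column indexes are the union of those of the two inputs.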

467-5. Notes

None

467-6. Usage

467-6-1. Data preparation

None

467-6-2. Code example

# 467. pandas.DataFrame.combine method
import pandas as pd
import numpy as np
# Create two sample DataFrames
df1 = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, np.nan, 6]})
df2 = pd.DataFrame({'A': [3, 4, 5], 'B': [np.nan, 2, 1]})
# Define a combining function that takes the element-wise maximum of two columns
def comb_func(x, y):
    return np.maximum(x, y)
# Combine the frames; NaNs are replaced by 0 before comb_func is called
result = df1.combine(df2, comb_func, fill_value=0)
print(result)

467-6-3. Output

# 467. pandas.DataFrame.combine method
#      A    B
# 0  3.0  4.0
# 1  4.0  2.0
# 2  5.0  6.0

468. pandas.DataFrame.combine_first method

468-1. Syntax

# 468. pandas.DataFrame.combine_first method
pandas.DataFrame.combine_first(other)
Update null elements with value in the same location in other.

Combine two DataFrame objects by filling null values in one DataFrame with non-null values from other DataFrame. The row and column indexes of the resulting DataFrame will be the union of the two. The resulting dataframe contains the ‘first’ dataframe values and overrides the second one values where both first.loc[index, col] and second.loc[index, col] are not missing values, upon calling first.combine_first(second).

Parameters:
other
DataFrame
Provided DataFrame to use to fill null values.

Returns:
DataFrame
The result of combining the provided DataFrame with the other object.

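468-2. Parameters

468-2-1. other (required): the other DataFrame used to fill null values in the current one. The two DataFrames are aligned on index and columns.

468-3. Function

Merges two DataFrames position by position, giving priority to the non-null values of the first DataFrame. Where the value in the first DataFrame is missing (NaN), the value at the same location in the second DataFrame is used to fill it.

468-4. Return value

A new DataFrame whose row and column indexes are the union of those of the two inputs. Each element is the non-null value from the first DataFrame when there is one, and otherwise the value at the same location in the second DataFrame.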
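
When the two frames do not share the same labels, the result covers the union of both. The sketch below uses small made-up frames in the style of the pandas documentation to show this.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [None, 0], 'B': [4, None]})
df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2])

# Row 2 and column 'C' exist only in df2; they still appear in the result
result = df1.combine_first(df2)
print(result)
```

Positions that are missing in both frames, such as column C in row 0, stay NaN.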

468-5. Notes

None

468-6. Usage

468-6-1. Data preparation

None

468-6-2. Code example

# 468. pandas.DataFrame.combine_first method
import pandas as pd
import numpy as np
# Create two sample DataFrames
df1 = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, np.nan, 6]})
df2 = pd.DataFrame({'A': [np.nan, 2, 5], 'B': [7, 8, np.nan]})
# Fill the gaps in df1 with values from df2
result = df1.combine_first(df2)
print(result)

468-6-3. Output

# 468. pandas.DataFrame.combine_first method
#      A    B
# 0  1.0  4.0
# 1  2.0  8.0
# 2  3.0  6.0

469. pandas.DataFrame.apply method

469-1. Syntax

# 469. pandas.DataFrame.apply method
pandas.DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), by_row='compat', engine='python', engine_kwargs=None, **kwargs)
Apply a function along an axis of the DataFrame.

Objects passed to the function are Series objects whose index is either the DataFrame’s index (axis=0) or the DataFrame’s columns (axis=1). By default (result_type=None), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.

Parameters:
func
function
Function to apply to each column or row.

axis
{0 or ‘index’, 1 or ‘columns’}, default 0
Axis along which the function is applied:

0 or ‘index’: apply function to each column.

1 or ‘columns’: apply function to each row.

raw
bool, default False
Determines if row or column is passed as a Series or ndarray object:

False : passes each row or column as a Series to the function.

True : the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance.

result_type
{‘expand’, ‘reduce’, ‘broadcast’, None}, default None
These only act when axis=1 (columns):

‘expand’ : list-like results will be turned into columns.

‘reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.

‘broadcast’ : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.

The default behaviour (None) depends on the return value of the applied function: list-like results will be returned as a Series of those. However if the apply function returns a Series these are expanded to columns.

args
tuple
Positional arguments to pass to func in addition to the array/series.

by_row
False or “compat”, default “compat”
Only has an effect when func is a listlike or dictlike of funcs and the func isn’t a string. If “compat”, will if possible first translate the func into pandas methods (e.g. Series().apply(np.sum) will be translated to Series().sum()). If that doesn’t work, will try call to apply again with by_row=True and if that fails, will call apply again with by_row=False (backward compatible). If False, the funcs will be passed the whole Series at once.

New in version 2.1.0.

engine
{‘python’, ‘numba’}, default ‘python’
Choose between the python (default) engine or the numba engine in apply.

The numba engine will attempt to JIT compile the passed function, which may result in speedups for large DataFrames. It also supports the following engine_kwargs :

nopython (compile the function in nopython mode)

nogil (release the GIL inside the JIT compiled function)

parallel (try to apply the function in parallel over the DataFrame)

Note: Due to limitations within numba/how pandas interfaces with numba, you should only use this if raw=True

Note: The numba compiler only supports a subset of valid Python/numpy operations.

Please read more about the supported python features and supported numpy features in numba to learn what you can or cannot use in the passed function.

New in version 2.2.0.

engine_kwargs
dict
Pass keyword arguments to the engine. This is currently only used by the numba engine, see the documentation for the engine argument for more information.

**kwargs
Additional keyword arguments to pass as keywords arguments to func.

Returns:
Series or DataFrame
Result of applying func along the given axis of the DataFrame.

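469-2. Parameters

469-2-1. func (required): the function to apply to the DataFrame. It can be a user-defined function or a NumPy ufunc (universal function).

469-2-2. axis (optional, default 0): {0 or 'index', 1 or 'columns'}. The direction in which the function is applied: with axis=0 or 'index' the function is applied to each column; with axis=1 or 'columns' it is applied to each row.

469-2-3. raw (optional, default False): a bool. If True, the function receives ndarray objects instead of Series objects, which can improve performance, especially for NumPy reduction functions.

469-2-4. result_type (optional, default None): {'expand', 'reduce', 'broadcast', None}. Controls the shape of the result and only takes effect when axis=1:

expand: list-like results are turned into columns, so the result is a DataFrame.
reduce: a Series is returned if possible instead of expanding list-like results; the opposite of expand.
broadcast: results are broadcast to the original shape of the DataFrame, and the original index and columns are retained.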

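469-2-5. args (optional, default ()): a tuple of additional positional arguments passed to func.

469-2-6. by_row (optional, default 'compat'): False or 'compat'. Added in pandas 2.1.0. It only has an effect when func is a list-like or dict-like of functions and the function is not a string. With 'compat', pandas first tries to translate each function into a pandas method where possible and otherwise falls back to calling apply again; with False, each function is passed the whole Series at once.

469-2-7. engine (optional, default 'python'): {'python', 'numba'}. Added in pandas 2.2.0. Selects the execution engine: 'python' is the default, and 'numba' attempts to JIT-compile the function, which may speed things up on large DataFrames and should only be used with raw=True.

469-2-8. engine_kwargs (optional, default None): a dict of keyword arguments passed to the engine. Currently only the numba engine uses it.

469-2-9. **kwargs (optional): additional keyword arguments passed to func.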
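
The following sketch shows how args, **kwargs, and raw reach the applied function. The helpers scale_and_shift and value_range and the sample data are invented for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

def scale_and_shift(col, factor, offset=0):
    # col is a Series holding one column (axis=0 is the default)
    return col * factor + offset

# factor is supplied through args, offset through **kwargs
shifted = df.apply(scale_and_shift, args=(2,), offset=1)

def value_range(values):
    # with raw=True, values is a NumPy ndarray rather than a Series
    return values.max() - values.min()

ranges = df.apply(value_range, raw=True)

print(shifted)
print(ranges)
```

Because value_range only uses ndarray methods, it works with raw=True; a function that relies on Series labels (for example row['A']) would not.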

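469-3. Function

Applies the given function along the rows or the columns of a DataFrame: pandas iterates over each column (or each row) and calls the function on it. The method is well suited to data transformation and cleaning.

469-4. Return value

A Series or a DataFrame, depending on what func returns and on result_type. If func reduces each column or row to a scalar, the result is a Series; if func returns a Series, the results are assembled into a DataFrame. With axis=1, list-like results are returned as a Series of lists unless result_type='expand' is given.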
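
A short sketch of how the return type follows from func and result_type, using a made-up two-column frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [4, 9], 'B': [16, 25]})

# One scalar per column -> Series indexed by column label
col_totals = df.apply(np.sum, axis=0)

# One Series per column -> DataFrame with the original shape
roots = df.apply(np.sqrt)

# With axis=1, list-like results stay packed in a Series by default
packed = df.apply(lambda row: [row['A'], row['B']], axis=1)

# result_type='expand' spreads the list-like results into columns
expanded = df.apply(lambda row: [row['A'], row['B']], axis=1, result_type='expand')

# result_type='broadcast' keeps the original shape, index and columns
broadcast = df.apply(lambda row: [1, 2], axis=1, result_type='broadcast')

for name, obj in [('col_totals', col_totals), ('roots', roots),
                  ('packed', packed), ('expanded', expanded),
                  ('broadcast', broadcast)]:
    print(name, type(obj).__name__)
```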

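469-5. Notes

Use cases:

469-5-1. Data transformation and cleaning: data science and machine learning projects often need to transform or clean data, for example handling missing values, standardizing data, or changing data formats. This method makes it easy to apply the same transformation function to every row or every column.

469-5-2. Feature engineering: feature engineering is a key step when building machine learning models. This method can be used to generate new features or transform existing ones.

469-5-3. Data aggregation and summarization: when data needs to be aggregated, this method can compute a value from each row or each column.

469-5-4. Conditional logic: data can be filtered or transformed according to conditions.

469-5-5. Complex calculations: in some complex scenarios the built-in functions are not flexible enough, and this method lets you write more elaborate calculation logic.

469-5-6. Merging and concatenation: when several columns within a DataFrame need to be combined, this method is also a common tool.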

469-6. Usage

469-6-1. Data preparation

None

469-6-2. Code example

# 469. pandas.DataFrame.apply method
# 469-1. Data transformation and cleaning
import pandas as pd
import numpy as np
data = {
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Fill missing values in each column with that column's mean
df = df.apply(lambda x: x.fillna(x.mean()), axis=0)
print("\nDataFrame after filling missing values with mean:")
print(df)

# 469-2. Feature engineering (these snippets call Series.apply on a single column)
import pandas as pd
data = {'text': ['Python is great', 'Pandas makes things easy', 'Machine learning is fun']}
df = pd.DataFrame(data)
def extract_keywords(text):
    # A simple keyword extraction function; real use may call for a more sophisticated approach
    return text.split()
df['keywords'] = df['text'].apply(extract_keywords)
print("\nDataFrame with extracted keywords:")
print(df)
data = {'date': pd.to_datetime(['2021-01-01', '2022-05-03', '2023-07-19'])}
df = pd.DataFrame(data)
df['year'] = df['date'].apply(lambda x: x.year)
df['month'] = df['date'].apply(lambda x: x.month)
df['day'] = df['date'].apply(lambda x: x.day)
print("\nDataFrame with extracted year, month, and day:")
print(df)

# 469-3. Data aggregation and summarization
import pandas as pd
data = {
    'price': [100, 250, 150],
    'quantity': [1, 2, 3]
}
df = pd.DataFrame(data)
df['total_spent'] = df.apply(lambda x: x['price'] * x['quantity'], axis=1)
print("\nDataFrame with total spent:")
print(df)

# 469-4. Conditional logic
import pandas as pd
data = {
    'price': [70, 150, 90]
}
df = pd.DataFrame(data)
df['high_value'] = df.apply(lambda x: 'Yes' if x['price'] > 100 else 'No', axis=1)
print("\nDataFrame with high_value column based on price:")
print(df)

# 469-5. Complex calculations
import pandas as pd
data = {
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
}
df = pd.DataFrame(data)
def custom_metric(row):
    return (row['A'] * 2 + row['B'] ** 2) / row['C']
df['metric'] = df.apply(custom_metric, axis=1)
print("\nDataFrame with custom metric:")
print(df)

# 469-6. Merging and concatenation
import pandas as pd
data = {
    'first_name': ['John', 'Jane', 'Jim'],
    'last_name': ['Doe', 'Doe', 'Beam']
}
df = pd.DataFrame(data)
df['full_name'] = df.apply(lambda x: x['first_name'] + ' ' +  x['last_name'], axis=1)
print("\nDataFrame with full_name:")
print(df)

469-6-3. Output

# 469. pandas.DataFrame.apply method
# 469-1. Data transformation and cleaning
# Original DataFrame:
#      A    B   C
# 0  1.0  5.0   9
# 1  2.0  NaN  10
# 2  NaN  NaN  11
# 3  4.0  8.0  12
#
# DataFrame after filling missing values with mean:
#           A    B   C
# 0  1.000000  5.0   9
# 1  2.000000  6.5  10
# 2  2.333333  6.5  11
# 3  4.000000  8.0  12

# 469-2. Feature engineering
# DataFrame with extracted keywords:
#                        text                       keywords
# 0           Python is great            [Python, is, great]
# 1  Pandas makes things easy  [Pandas, makes, things, easy]
# 2   Machine learning is fun   [Machine, learning, is, fun]
#
# DataFrame with extracted year, month, and day:
#         date  year  month  day
# 0 2021-01-01  2021      1    1
# 1 2022-05-03  2022      5    3
# 2 2023-07-19  2023      7   19

# 469-3. Data aggregation and summarization
# DataFrame with total spent:
#    price  quantity  total_spent
# 0    100         1          100
# 1    250         2          500
# 2    150         3          450

# 469-4. Conditional logic
# DataFrame with high_value column based on price:
#    price high_value
# 0     70         No
# 1    150        Yes
# 2     90         No

# 469-5. Complex calculations
# DataFrame with custom metric:
#    A  B  C    metric
# 0  1  4  7  2.571429
# 1  2  5  8  3.625000
# 2  3  6  9  4.666667

# 469-6. Merging and concatenation
# DataFrame with full_name:
#   first_name last_name full_name
# 0       John       Doe  John Doe
# 1       Jane       Doe  Jane Doe
# 2        Jim      Beam  Jim Beam
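
Several of the row-wise examples in 469-6-2 can also be written with vectorized operations, which avoid calling a Python function once per row. The sketch below reuses the same sample values; it is an alternative under the assumption that only simple arithmetic, a single condition, or date parts are needed, not a replacement for apply in general.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [100, 250, 150], 'quantity': [1, 2, 3]})

# Same values as df.apply(lambda x: x['price'] * x['quantity'], axis=1)
df['total_spent'] = df['price'] * df['quantity']

# Same values as applying a 'Yes'/'No' lambda row by row
df['high_value'] = np.where(df['price'] > 100, 'Yes', 'No')

dates = pd.DataFrame({'date': pd.to_datetime(['2021-01-01', '2022-05-03'])})
# The .dt accessor exposes year, month and day for the whole column at once
dates['year'] = dates['date'].dt.year

print(df)
print(dates)
```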

470. pandas.DataFrame.applymap method

470-1. Syntax

# 470. pandas.DataFrame.applymap method
pandas.DataFrame.applymap(func, na_action=None, **kwargs)
Apply a function to a Dataframe elementwise.

Deprecated since version 2.1.0: DataFrame.applymap has been deprecated. Use DataFrame.map instead.

This method applies a function that accepts and returns a scalar to every element of a DataFrame.

Parameters:
func
callable
Python function, returns a single value from a single value.

na_action
{None, ‘ignore’}, default None
If ‘ignore’, propagate NaN values, without passing them to func.

**kwargs
Additional keyword arguments to pass as keywords arguments to func.

Returns:
DataFrame
Transformed DataFrame.

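470-2. Parameters

470-2-1. func (required): the function applied to every element of the DataFrame. It should take a single value and return a single value.

470-2-2. na_action (optional, default None): how NaN (missing values) are handled. The default None passes NaN to func like any other value; 'ignore' skips them, so they are propagated as NaN without being passed to func.

470-2-3. **kwargs (optional): additional keyword arguments passed to func.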
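
Since applymap is deprecated as of pandas 2.1.0 in favor of DataFrame.map, code that has to run on several pandas versions can pick whichever method is available. The sketch below does that and also shows na_action='ignore'; the variable name elementwise and the sample data are chosen for this example only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['a', np.nan], 'B': ['bb', 'ccc']})

# Use DataFrame.map where it exists (pandas >= 2.1), otherwise fall back to applymap
elementwise = df.map if hasattr(df, 'map') else df.applymap

# na_action='ignore' leaves NaN untouched instead of passing it to len,
# which would otherwise raise a TypeError for the float NaN
lengths = elementwise(len, na_action='ignore')
print(lengths)
```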

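470-3. Function

Applies the given function to every element of the DataFrame. Because it works element by element, it fits cases where every element needs the same treatment.

470-4. Return value

A new DataFrame with the same structure (row and column labels) as the original, in which every element has been processed by the supplied function.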

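470-5. Notes

Use cases:

470-5-1. Data cleaning and preprocessing

470-5-1-1. Bulk format conversion: when every element of a DataFrame has to be converted to some format (for example all strings to lowercase, or all numbers to a particular unit), applymap is a natural choice.

470-5-1-2. Data repair: for example, converting every None or empty string in a DataFrame to NaN, or stripping a particular character (such as $) from the data.

470-5-2. Feature engineering

470-5-2-1. Uniform feature processing: feature engineering for machine learning often applies the same transformation to every element of a dataset, such as standardization, normalization, or taking logarithms.

470-5-2-2. Derived features: applymap can run a mathematical or logical operation on every element to produce new features.

470-5-3. String operations

470-5-3-1. Bulk string processing: if every element of the DataFrame is a string, applymap can perform operations such as trimming whitespace, replacing characters, or converting formats.

470-5-4. Numerical computation

470-5-4-1. Mathematical transforms: when the same transform (multiplication, exponentiation, logarithm, and so on) must be applied to every numeric element, applymap does it with very little code.

470-5-4-2. Outlier handling: applymap can identify and handle outliers, for example by replacing every element above a threshold with NaN or another value.

470-5-5. Conditional logic

470-5-5-1. Conditional conversion: applymap can convert each element according to a condition, such as replacing all negative numbers with 0 or assigning elements to groups.

470-5-6. Data formatting

470-5-6-1. Output formatting: when the values of a DataFrame need formatting, such as showing every number with two decimal places or turning booleans into the strings "True"/"False", applymap handles this well.

470-5-7. Preparation before visualization

470-5-7-1. Format adjustment: before a DataFrame is used for visualization, its data may need formatting or conversion to make it easier to read or to fit the chart's requirements; applymap is useful here as well.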

470-6. Usage

470-6-1. Data preparation

None

470-6-2. Code example

# 470. pandas.DataFrame.applymap method
# 470-1. Data cleaning and preprocessing
# 470-1-1. Bulk format conversion
import pandas as pd
df = pd.DataFrame({
    'Name': ['Myelsa', 'Bryce', 'Jimmy'],
    'City': ['New York', 'Los Angeles', 'Chicago']
})
# Convert all strings to lowercase
df = df.applymap(lambda x: x.lower() if isinstance(x, str) else x)
print(df)

# 470-1-2. Data repair
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': [1, None, 3],
    'B': ['x', None, 'z']
})
# Replace None with NaN
df = df.applymap(lambda x: np.nan if x is None else x)
print(df)

# 470-2. Feature engineering
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Feature1': [1, 10, 100],
    'Feature2': [0.5, 5, 50]
})
# Take the natural logarithm of each element
df = df.applymap(lambda x: np.log(x))
print(df)

# 470-3. String operations
import pandas as pd
df = pd.DataFrame({
    'A': ['  hello', 'world  ', '  pandas  '],
    'B': ['  data  ', 'science  ', '  rocks!  ']
})
# Strip whitespace from both ends of each string
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
print(df)

# 470-4. Numerical computation
import pandas as pd
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})
# Multiply each element by 10
df = df.applymap(lambda x: x * 10)
print(df)

# 470-5. Conditional logic
import pandas as pd
df = pd.DataFrame({
    'A': [1, -2, 3],
    'B': [-4, 5, -6]
})
# Replace negative numbers with 0
df = df.applymap(lambda x: 0 if x < 0 else x)
print(df)

# 470-6. Data formatting
import pandas as pd
df = pd.DataFrame({
    'A': [1.2345, 2.3456, 3.4567],
    'B': [4.5678, 5.6789, 6.7890]
})
# Format each number as a string with two decimal places
df = df.applymap(lambda x: f"{x:.2f}")
print(df)

# 470-7. Preparation before visualization
import pandas as pd
df = pd.DataFrame({
    'A': [True, False, True],
    'B': [False, True, False]
})
# Convert booleans to strings
df = df.applymap(lambda x: "True" if x else "False")
print(df)

470-6-3. Output

# 470. pandas.DataFrame.applymap method
# 470-1. Data cleaning and preprocessing
# 470-1-1. Bulk format conversion
#      Name         City
# 0  myelsa     new york
# 1   bryce  los angeles
# 2   jimmy      chicago

# 470-1-2. Data repair
#      A    B
# 0  1.0    x
# 1  NaN  NaN
# 2  3.0    z

# 470-2. Feature engineering
#    Feature1  Feature2
# 0  0.000000 -0.693147
# 1  2.302585  1.609438
# 2  4.605170  3.912023

# 470-3. String operations
#         A        B
# 0   hello     data
# 1   world  science
# 2  pandas   rocks!

# 470-4. Numerical computation
#     A   B
# 0  10  40
# 1  20  50
# 2  30  60

# 470-5. Conditional logic
#    A  B
# 0  1  0
# 1  0  5
# 2  3  0

# 470-6. Data formatting
#       A     B
# 0  1.23  4.57
# 1  2.35  5.68
# 2  3.46  6.79

# 470-7. Preparation before visualization
#        A      B
# 0   True  False
# 1  False   True
# 2   True  False
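
For purely numeric transforms such as those in 470-4 and 470-5, a vectorized form often exists and is typically faster than calling a function on every element. A minimal sketch with the same sample values follows; the variable names are made up for this example.

```python
import numpy as np
import pandas as pd

nums = pd.DataFrame({'A': [1, -2, 3], 'B': [-4, 5, -6]})

# Instead of an element-wise lambda x: x * 10
times_ten = nums * 10

# Instead of an element-wise lambda x: 0 if x < 0 else x
non_negative = nums.clip(lower=0)

positive = pd.DataFrame({'Feature1': [1, 10, 100], 'Feature2': [0.5, 5, 50]})
# Instead of an element-wise lambda x: np.log(x)
logged = np.log(positive)

print(times_ten)
print(non_negative)
print(logged)
```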