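An in-depth summary of the Haitong Securities financial engineering research report "Finding Effective Information in Tick-by-Tick Trades" (《寻找逐笔交易中的有效信息》), with Python code for replicating the strategy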

Article link: https://blog.csdn.net/zhangyunchou2015/article/details/147248630

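OK, let's go through an in-depth summary of this Haitong Securities financial engineering research report and then provide Python code for replicating the strategy.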

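Report source and title

Source: Haitong Securities (海通证券)
Series: Stock Selection Factor Series (No. 66)
Title: Finding Effective Information in Tick-by-Tick Trades (《寻找逐笔交易中的有效信息》)
Publication date: June 21, 2020
Authors: Feng Jiarui (冯佳睿), Yu Haomiao (余浩淼)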

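Summary of the report's core views
Main research content and findings
- 2.1 Analysis of the large-order amount ratio factors
  - 2.1.1 Large-buy / large-sell factors and orthogonalization
  - 2.1.2 Further-split large-order factors
  - 2.1.3 Factor performance in index-enhancement portfolios
- 2.2 K-line factors rebuilt from large-order information
  - 2.2.1 Tick filtering and K-line reconstruction method
  - 2.2.2 Stock-selection performance of the reconstructed K-line factors
  - 2.2.3 Combining K-line factors with large-order factors
Strategy replication approach and Python code
- 3.1 Data preparation (assumed)
- 3.2 Large-order factor calculation
- 3.3 K-line reconstruction and factor calculation
- 3.4 Code notes and caveats
Risk warning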

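1. Summary of the report's core views

The report's core goal is to extract useful stock-selection information from tick-by-tick trade data, focusing on the behavior of "large orders". Its findings:

Large buy orders have predictive power: the "large-buy amount ratio" computed from tick data (the amount traded by large buy orders as a share of the day's total traded amount) has a significantly positive stock-selection effect, i.e. stocks with a high ratio tend to rise afterwards.
Buying and selling have asymmetric effects: large buy orders predict better than large sell orders. The large-sell amount ratio has weak predictive power, even after it is orthogonalized against common style factors, and at times it even shows a weak positive relation. This reflects how differently the buying and selling intentions of large capital affect prices.
K-line reconstruction improves factor performance: filtering the raw tick data by large-order participation (for example, keeping only the trades in which a large order took part) and then rebuilding the minute K-lines improves some K-line-based high-frequency factors (average per-trade outflow amount ratio, large-order net inflow rate, large-order-driven return), especially within CSI 500 constituents.
Combining and applying the factors: adding the effective large-order factors (such as the orthogonalized large-buy ratio or the large-sell ratio that excludes large-buy counterparties) or the reconstructed K-line factors to a traditional multi-factor model (such as Barra style factors) improves the portfolio's stock-selection ability and its index-enhancement results.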

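2. Main research content and findings

2.1 Analysis of the large-order amount ratio factors

2.1.1 Large-buy / large-sell factors and orthogonalization:
- Definition: the report first defines a "large order". One approach is based on the day's tick-level trade amount exceeding the mean plus k standard deviations (the report tests k = 0, 1, 2, 3).
- Factor construction: compute the Large Buy Amount Ratio (LBAR) and the Large Sell Amount Ratio (LSAR).
- Testing: the factors are industry-neutralized and orthogonalized against nine common style factors (size, valuation, momentum, volatility, turnover, and others).
- Findings: after orthogonalization, LBAR keeps a significant positive stock-selection ability (high mean IC, good ICIR), while LSAR is very weak, with an IC close to 0. Large buy orders therefore carry more incremental information. As the large-order threshold (k) rises, the factors become somewhat less stable.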
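The orthogonalization step above can be sketched as a cross-sectional regression whose residual becomes the new factor. This is a minimal sketch under assumptions, not the report's implementation: the function name `orthogonalize`, the two style columns, and the three industries are hypothetical, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def orthogonalize(factor: pd.Series, styles: pd.DataFrame, industry: pd.Series) -> pd.Series:
    """Residual of a cross-sectional OLS of `factor` on style exposures and industry dummies."""
    data = pd.concat([factor.rename("y"), styles, industry.rename("industry")], axis=1).dropna()
    dummies = pd.get_dummies(data["industry"], dtype=float)
    # The industry dummies sum to one, so they already span the intercept.
    X = np.column_stack([data[styles.columns].to_numpy(dtype=float), dummies.to_numpy()])
    y = data["y"].to_numpy(dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return pd.Series(y - X @ beta, index=data.index, name=f"{factor.name}_orth")

# Synthetic single-day cross-section (hypothetical names and values).
rng = np.random.default_rng(0)
stocks = [f"S{i:02d}" for i in range(40)]
styles = pd.DataFrame(rng.normal(size=(40, 2)), index=stocks, columns=["size", "momentum"])
industry = pd.Series(rng.choice(["bank", "tech", "energy"], size=40), index=stocks)
lbar = pd.Series(0.3 + 0.05 * styles["size"] + rng.normal(scale=0.02, size=40),
                 index=stocks, name="LBAR")

lbar_orth = orthogonalize(lbar, styles, industry)
print(lbar_orth.head())
```

By construction the residual has zero correlation with every style column and a zero mean within every industry, which is what "industry-neutral and orthogonal to style factors" means in this sketch.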
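2.1.2 Further-split large-order factors:
- To examine the balance of power between buyers and sellers, the report further splits trades by type:
  - Large-buy amount ratio excluding large sells: trades in which the buyer is a large order and the seller is a small order.
  - Large-sell amount ratio excluding large buys: trades in which the seller is a large order and the buyer is a small order. (The report finds this factor has significant negative stock-selection ability.)
  - Large-buy and large-sell amount ratio: trades in which both sides are large orders. (The report finds this factor has significant positive stock-selection ability.)
- Findings: large selling shows significant negative predictive power only when the counterparty is a small order, whereas trades between large orders (large buy vs. large sell) actually point to future gains. This gives a clearer picture of how complex the information behind large-order behavior is.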
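The three-way split above can be sketched as follows. This is only an illustration under loud assumptions: it presumes each trade record carries hypothetical `buy_order_id` and `sell_order_id` columns, rebuilds order-level amounts by summing trades per order, and applies the mean + k × std threshold separately to buy orders and sell orders. The report's exact construction may differ.

```python
import pandas as pd

def split_large_order_ratios(trades: pd.DataFrame, k: float = 0.0) -> dict:
    """Share of traded amount by (large buyer, large seller) combination for one stock-day."""
    amt = trades["amount"]
    total = amt.sum()

    def large_side(order_ids: pd.Series) -> pd.Series:
        # Order-level amount, broadcast back onto every trade of that order.
        per_trade = amt.groupby(order_ids).transform("sum")
        per_order = amt.groupby(order_ids).sum()
        threshold = per_order.mean() + k * per_order.std(ddof=0)
        return per_trade > threshold

    big_buy = large_side(trades["buy_order_id"])
    big_sell = large_side(trades["sell_order_id"])
    return {
        "big_buy_only": amt[big_buy & ~big_sell].sum() / total,
        "big_sell_only": amt[big_sell & ~big_buy].sum() / total,
        "big_buy_and_big_sell": amt[big_buy & big_sell].sum() / total,
    }

# Hand-checkable toy data (hypothetical order IDs).
trades = pd.DataFrame({
    "buy_order_id":  [1, 1, 2, 3, 4],
    "sell_order_id": [10, 11, 11, 12, 12],
    "amount":        [500.0, 500.0, 100.0, 100.0, 800.0],
})
ratios = split_large_order_ratios(trades, k=0)
print(ratios)
```

In the toy data, buy orders 1 and 4 (1000 and 800) exceed the buy-side mean of 500, and only sell order 12 (900) exceeds the sell-side mean of about 667, so the shares come out to 0.5, 0.05, and 0.4 of the 2000 total.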
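2.1.3 Factor performance in index-enhancement portfolios:
- The better-performing large-order factors (the orthogonalized large-buy ratio, the large-sell ratio excluding large buys, and the large-buy and large-sell ratio) are each added to a model containing the nine base style factors to build CSI 500 and CSI 300 index-enhancement portfolios.
- Findings: with these large-order factors added, both the IC and the ICIR of the composite factor improve, more clearly so on the CSI 500. The "orthogonalized large-buy amount ratio" and the "large-sell amount ratio excluding large buys" stand out when building long-only portfolios.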

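2.2 K-line factors rebuilt from large-order information

2.2.1 Tick filtering and K-line reconstruction method:
- Idea: not every trade carries equally useful information, and trades in which a large order takes part may matter more.
- Method:
  1. Filter: from all of the day's trades, keep those in which a "large buy order" or a "large sell order" took part (the large-order threshold here uses k=0, i.e. above the day's mean).
  2. Rebuild: use the filtered trades to re-synthesize 1-minute K-line data (OHLC, volume, amount, number of trades).
2.2.2 Stock-selection performance of the reconstructed K-line factors:
- Factors tested:
  - Average per-trade outflow amount ratio (Avg Outflow Amt Ratio): computed from minute K-lines, it measures the relative size of the average trade amount in falling minutes.
  - Large-order net inflow rate (Large Order Net Inflow Rate): from the minute K-lines, take the top 10% of the day's bars by average per-trade amount as "large-order K-lines" and compute their net inflow rate.
  - Large-order-driven return (Large Order Momentum): again using the "large-order K-lines", compute the compounded (multiplicative) return over those bars.
- Comparison: compare the stock-selection performance (IC, ICIR, group returns) of the three factors when computed from the raw minute K-lines versus the reconstructed ones.
- Findings: the three factors computed from K-lines rebuilt after the "large buy or large sell order participation" filter generally select stocks better than those computed from raw K-lines, especially in ICIR and long-only portfolio performance. The improvement is more pronounced within the CSI 500 than within the CSI 300.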
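For reference, the three factors can be written out as they are computed by the sample code in Section 3.3. The notation is introduced here for illustration and is an assumption rather than the report's own: for minute bar t, O_t and C_t are the open and close, A_t the traded amount, N_t the number of trades, r_t the close-to-close return, D the set of falling bars, and L the set of "large-order" bars (top 10% of the day by A_t / N_t). The report's sign convention and normalization may differ.

```latex
\begin{align*}
\text{AvgOutflow} &= -\,\frac{\sum_{t \in D} A_t \,\big/\, \sum_{t \in D} N_t}
                            {\sum_{t} A_t \,\big/\, \sum_{t} N_t},
  \qquad D = \{\, t : C_t < O_t \,\} \\[4pt]
\text{NetInflowRate} &= \frac{\sum_{t \in L,\; r_t > 0} A_t \;-\; \sum_{t \in L,\; r_t < 0} A_t}
                             {\sum_{t \in L} A_t} \\[4pt]
\text{Momentum} &= \prod_{t \in L} \left(1 + r_t\right) \;-\; 1
\end{align*}
```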
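2.2.3 Combining K-line factors with large-order factors:
- Testing: the reconstructed K-line factors are orthogonalized against the well-performing tick-level large-order factors described earlier, and then added to the nine-factor model for portfolio tests.
- Findings: the reconstructed K-line factors are somewhat correlated with the tick-level large-order factors (more so than the raw K-line factors), which suggests that part of their improvement comes from capturing large-order information. Adding the reconstructed K-line factors to a model that already contains the large-order factors still brings a marginal improvement, most visibly for the "average per-trade outflow amount ratio" factor.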

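3. Strategy replication approach and Python code

Replicating the report's strategy requires processing both tick data and minute K-line data. Below are the approach and a Python code example based on assumed data.

3.1 Data preparation (assumed)

Tick data: must include timestamp, ticker, price, volume, trade direction (aggressive buy / aggressive sell), and traded amount. pandas DataFrame structure: ['datetime', 'ticker', 'price', 'volume', 'amount', 'bs_flag'] (bs_flag: 'B' for buy, 'S' for sell).
Minute bar data: standard OHLCV plus traded amount and number of trades. pandas DataFrame structure: ['datetime', 'ticker', 'open', 'high', 'low', 'close', 'volume', 'amount', 'trades'].
Trading-date data: used to process the data day by day.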

import pandas as pd
import numpy as np
from tqdm import tqdm

# --- Hypothetical data loaders (implement these for your own data source) ---
def load_tick_data(ticker, date):
    """Load tick data for one stock on one date (simulated here)."""
    # Placeholder: in practice, load from a database or from files.
    # Simulated schema: datetime (index), ticker, price, volume, amount, bs_flag
    print(f"Simulating tick data for {ticker} on {date}...")
    # Real data also needs timestamp parsing and cleaning.
    dt_range = pd.date_range(f"{date} 09:30:00", f"{date} 15:00:00", freq='1s')
    count = len(dt_range)
    df = pd.DataFrame({
        'datetime': dt_range,
        'ticker': ticker,
        'price': np.random.uniform(9.8, 10.2, count).round(2),
        'volume': np.random.randint(100, 10000, count),
        'bs_flag': np.random.choice(['B', 'S'], count, p=[0.5, 0.5])
    })
    df['amount'] = df['price'] * df['volume']
    # Keep only the trading sessions (simplified): 09:30-11:30 and 13:00-15:00
    t = df['datetime'].dt.time
    morning = (t >= pd.to_datetime('09:30:00').time()) & (t <= pd.to_datetime('11:30:00').time())
    afternoon = (t >= pd.to_datetime('13:00:00').time()) & (t <= pd.to_datetime('15:00:00').time())
    df = df[morning | afternoon]
    return df.set_index('datetime')


def load_minute_data(ticker, date):
    """Load raw 1-minute K-line data for one stock on one date (simulated here)."""
    # Placeholder: in practice, load from a database or from files.
    # Simulated schema: datetime (index), ticker, open, high, low, close, volume, amount, trades
    print(f"Simulating raw minute bars for {ticker} on {date}...")
    # Real data also needs timestamp parsing and cleaning.
    dt_range = pd.date_range(f"{date} 09:31:00", f"{date} 15:00:00", freq='1min')
    count = len(dt_range)
    open_ = np.random.uniform(9.9, 10.1, count).round(2)
    close = (open_ + np.random.uniform(-0.05, 0.05, count)).round(2)
    # Build high/low around open and close so that
    # low <= min(open, close) and high >= max(open, close).
    high = (np.maximum(open_, close) + np.random.uniform(0, 0.1, count)).round(2)
    low = (np.minimum(open_, close) - np.random.uniform(0, 0.1, count)).round(2)
    df = pd.DataFrame({
        'datetime': dt_range,
        'ticker': ticker,
        'open': open_,
        'high': high,
        'low': low,
        'close': close,
        'volume': np.random.randint(10000, 100000, count),
        'trades': np.random.randint(50, 500, count)
    })
    # Rough estimate of the traded amount from the bar's average price
    df['amount'] = df['volume'] * df[['open', 'high', 'low', 'close']].mean(axis=1)
    # Keep only the trading sessions (simplified): 09:31-11:30 and 13:01-15:00
    t = df['datetime'].dt.time
    morning = (t >= pd.to_datetime('09:31:00').time()) & (t <= pd.to_datetime('11:30:00').time())
    afternoon = (t >= pd.to_datetime('13:01:00').time()) & (t <= pd.to_datetime('15:00:00').time())
    df = df[morning | afternoon]

    return df.set_index('datetime')

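# --- Trading dates and tickers (example) ---
trading_dates = ['2020-06-18', '2020-06-19']
tickers = ['stock_A', 'stock_B']

# --- DataFrame that stores the factor values, indexed by (date, ticker) ---
factor_columns = [
    'LBAR', 'LSAR', 'LOR',
    'AvgOutflow_Orig', 'NetInflow_Orig', 'Momentum_Orig',
    'AvgOutflow_Recon', 'NetInflow_Recon', 'Momentum_Recon'
]
factor_results = pd.DataFrame(
    index=pd.MultiIndex.from_product([trading_dates, tickers], names=['date', 'ticker']),
    columns=factor_columns, dtype=float)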

3.2 Large-order factor calculation

def calculate_large_order_ratios(tick_data, k=0):
    """
    Compute the large-buy, large-sell, and overall large-order amount ratios.
    Args:
        tick_data (pd.DataFrame): tick data for one stock on one day
                                  (must contain the amount and bs_flag columns)
        k (int): large-order threshold, in standard deviations above the mean

    Returns:
        dict: LBAR, LSAR, and LOR (Large Order Ratio)
    """
    nan_result = {'LBAR': np.nan, 'LSAR': np.nan, 'LOR': np.nan}
    if tick_data.empty:
        return nan_result

    total_amount = tick_data['amount'].sum()
    if total_amount == 0:
        return nan_result

    # Large-order threshold: mean + k * std of the day's trade amounts
    mean_amount = tick_data['amount'].mean()
    std_amount = tick_data['amount'].std()
    if pd.isna(std_amount):  # a single trade has no sample standard deviation
        std_amount = 0.0
    large_order_threshold = mean_amount + k * std_amount

    # Flag large trades with a local mask so the caller's DataFrame is not modified
    is_large = tick_data['amount'] > large_order_threshold

    # Sum the large-trade amounts by aggressor side
    large_buy_amount = tick_data.loc[(tick_data['bs_flag'] == 'B') & is_large, 'amount'].sum()
    large_sell_amount = tick_data.loc[(tick_data['bs_flag'] == 'S') & is_large, 'amount'].sum()
    large_order_amount = tick_data.loc[is_large, 'amount'].sum()

    return {
        'LBAR': large_buy_amount / total_amount,
        'LSAR': large_sell_amount / total_amount,
        'LOR': large_order_amount / total_amount,
    }

# --- Example loop ---
for date in tqdm(trading_dates, desc="Calculating Large Order Ratios"):
    for ticker in tickers:
        try:
            ticks = load_tick_data(ticker, date)
            ratios = calculate_large_order_ratios(ticks, k=0)  # use the k=0 threshold
            factor_results.loc[(date, ticker), 'LBAR'] = ratios['LBAR']
            factor_results.loc[(date, ticker), 'LSAR'] = ratios['LSAR']
            factor_results.loc[(date, ticker), 'LOR'] = ratios['LOR']
        except Exception as e:
            print(f"Error calculating large order ratios for {ticker} on {date}: {e}")
            factor_results.loc[(date, ticker), ['LBAR', 'LSAR', 'LOR']] = np.nan

print("\n--- Large-order amount ratio factors ---")
print(factor_results[['LBAR', 'LSAR', 'LOR']].head())

3.3 K-line reconstruction and factor calculation

def reconstruct_minute_bars(tick_data, k=0):
    """
    Rebuild 1-minute K-lines from the trades that pass the large-order filter.
    Args:
        tick_data (pd.DataFrame): tick data for one stock on one day, indexed by datetime
                                  (must contain the amount, price, volume, ticker columns)
        k (int): large-order threshold, in standard deviations above the mean

    Returns:
        pd.DataFrame: reconstructed 1-minute K-line data
    """
    if tick_data.empty:
        return pd.DataFrame()

    # Large-order threshold (mean + k * std), kept as a local mask
    mean_amount = tick_data['amount'].mean()
    std_amount = tick_data['amount'].std()
    if pd.isna(std_amount):  # a single trade has no sample standard deviation
        std_amount = 0.0
    large_order_threshold = mean_amount + k * std_amount
    is_large = tick_data['amount'] > large_order_threshold

    # The report filters on "a large buy or large sell order took part";
    # this simplification keeps the trades that are themselves large.
    filtered_ticks = tick_data[is_large].copy()

    if filtered_ticks.empty:
        return pd.DataFrame()

    # Fall back to an implied price if the price column is missing
    if 'price' not in filtered_ticks.columns:
        implied_price = filtered_ticks['amount'] / filtered_ticks['volume']
        filtered_ticks['price'] = implied_price.ffill().bfill()

    # Resample to 1-minute bars: OHLC from price, sums for volume and amount,
    # and a count of trades per minute
    resampled = filtered_ticks.resample('1min')
    reconstructed_bars = resampled['price'].ohlc()
    reconstructed_bars['volume'] = resampled['volume'].sum()
    reconstructed_bars['amount'] = resampled['amount'].sum()
    reconstructed_bars['ticker'] = resampled['ticker'].first()
    reconstructed_bars['trades'] = resampled['price'].count()

    # Drop minutes with no qualifying trades
    reconstructed_bars = reconstructed_bars.dropna(subset=['open'])

    return reconstructed_bars

def calculate_avg_outflow_ratio(minute_bars):
    """Compute the average per-trade outflow amount ratio."""
    if minute_bars.empty or 'trades' not in minute_bars.columns or minute_bars['trades'].sum() == 0:
        return np.nan

    # Falling bars: close below open
    falling_bars = minute_bars[minute_bars['close'] < minute_bars['open']]
    if falling_bars.empty or falling_bars['trades'].sum() == 0:
        return 0.0  # no outflow

    # Average amount per trade within the falling bars
    sum_amount_falling = falling_bars['amount'].sum()
    sum_trades_falling = falling_bars['trades'].sum()
    avg_tick_amount_falling = sum_amount_falling / sum_trades_falling

    # Average amount per trade over the whole day
    total_amount = minute_bars['amount'].sum()
    total_trades = minute_bars['trades'].sum()
    avg_tick_amount_total = total_amount / total_trades

    if pd.isna(avg_tick_amount_total) or avg_tick_amount_total == 0:
        return np.nan

    # Follows the structure of the report's formula, which appears to be
    # (average trade amount in falling bars) / (average trade amount for the day) * (-1)
    return -(avg_tick_amount_falling / avg_tick_amount_total)


def calculate_large_kline_factors(minute_bars, top_n_pct=0.1):
    """Compute the large-order net inflow rate and the large-order-driven return."""
    if minute_bars.empty or 'trades' not in minute_bars.columns or minute_bars['trades'].sum() == 0:
        return {'NetInflowRate': np.nan, 'Momentum': np.nan}

    minute_bars = minute_bars.copy()  # do not modify the caller's DataFrame
    # Average amount per trade in each bar (bars with zero trades become NaN)
    minute_bars['avg_tick_amount'] = minute_bars['amount'] / minute_bars['trades'].replace(0, np.nan)
    minute_bars['return'] = minute_bars['close'].pct_change().fillna(0)  # minute-to-minute return

    # Identify "large-order K-lines" by average amount per trade
    threshold = minute_bars['avg_tick_amount'].quantile(1 - top_n_pct)
    minute_bars['is_large_kline'] = minute_bars['avg_tick_amount'] >= threshold

    large_klines = minute_bars[minute_bars['is_large_kline']]
    if large_klines.empty:
        return {'NetInflowRate': 0.0, 'Momentum': 0.0}  # no large-order K-lines identified

    # Large-order net inflow rate: rising-bar amount minus falling-bar amount,
    # as a share of the amount in all large-order K-lines
    inflow = large_klines[large_klines['return'] > 0]['amount'].sum()
    outflow = large_klines[large_klines['return'] < 0]['amount'].sum()
    total_large_kline_amount = large_klines['amount'].sum()

    net_inflow_rate = (inflow - outflow) / total_large_kline_amount if total_large_kline_amount else 0.0

    # Large-order-driven return: prod(1 + r_i * I(bar i is a large-order K-line)) - 1,
    # so returns only compound over the large-order K-lines
    momentum_returns = 1 + minute_bars['return'] * minute_bars['is_large_kline']
    large_order_momentum = momentum_returns.prod() - 1

    return {'NetInflowRate': net_inflow_rate, 'Momentum': large_order_momentum}

# --- Example loop ---
for date in tqdm(trading_dates, desc="Calculating Reconstructed K-line Factors"):
    for ticker in tickers:
        try:
            # Factors based on the raw K-lines
            original_min_bars = load_minute_data(ticker, date)
            avg_outflow_orig = calculate_avg_outflow_ratio(original_min_bars)
            large_kline_factors_orig = calculate_large_kline_factors(original_min_bars)

            factor_results.loc[(date, ticker), 'AvgOutflow_Orig'] = avg_outflow_orig
            factor_results.loc[(date, ticker), 'NetInflow_Orig'] = large_kline_factors_orig['NetInflowRate']
            factor_results.loc[(date, ticker), 'Momentum_Orig'] = large_kline_factors_orig['Momentum']

            # Factors based on the reconstructed K-lines
            ticks = load_tick_data(ticker, date)  # reload, or reuse the ticks loaded earlier
            reconstructed_min_bars = reconstruct_minute_bars(ticks, k=0)  # rebuild with the k=0 threshold
            avg_outflow_recon = calculate_avg_outflow_ratio(reconstructed_min_bars)
            large_kline_factors_recon = calculate_large_kline_factors(reconstructed_min_bars)

            factor_results.loc[(date, ticker), 'AvgOutflow_Recon'] = avg_outflow_recon
            factor_results.loc[(date, ticker), 'NetInflow_Recon'] = large_kline_factors_recon['NetInflowRate']
            factor_results.loc[(date, ticker), 'Momentum_Recon'] = large_kline_factors_recon['Momentum']

        except Exception as e:
            print(f"Error calculating k-line factors for {ticker} on {date}: {e}")
            factor_results.loc[(date, ticker), [
                'AvgOutflow_Orig', 'NetInflow_Orig', 'Momentum_Orig',
                'AvgOutflow_Recon', 'NetInflow_Recon', 'Momentum_Recon'
            ]] = np.nan


print("\n--- K-line factors (raw vs. reconstructed) ---")
print(factor_results[[
    'AvgOutflow_Orig', 'AvgOutflow_Recon',
    'NetInflow_Orig', 'NetInflow_Recon',
    'Momentum_Orig', 'Momentum_Recon'
]].head())

# --- Show all computed factors ---
print("\n--- All computed factor results ---")
print(factor_results)

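3.4 Code notes and caveats

Data assumptions: the code above relies on simulated/hypothetical data loaders (load_tick_data, load_minute_data). In practice you need to implement them for your own data source. The quality and fields of the tick data, especially the buy/sell flag bs_flag, are critical to the results.
Large-order definition: calculate_large_order_ratios and reconstruct_minute_bars define large orders from the mean and standard deviation of the day's tick-level trade amounts (with k=0), which matches one of the definitions tested in the report. You can adjust k.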
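To get a feel for how the choice of k changes what counts as "large", the threshold and the resulting share of trades can be tabulated for k = 0, 1, 2, 3. This is a minimal sketch on synthetic, heavy-tailed amounts; the helper name `large_share_by_k` and the lognormal data are assumptions for illustration, not taken from the report.

```python
import numpy as np
import pandas as pd

def large_share_by_k(amounts: pd.Series, ks=(0, 1, 2, 3)) -> pd.DataFrame:
    """For each k, report the threshold, the number of large trades, and their amount share."""
    mean, std = amounts.mean(), amounts.std()
    rows = []
    for k in ks:
        threshold = mean + k * std
        large = amounts > threshold
        rows.append({
            "k": k,
            "threshold": threshold,
            "n_large": int(large.sum()),
            "amount_share": amounts[large].sum() / amounts.sum(),
        })
    return pd.DataFrame(rows).set_index("k")

# Synthetic trade amounts for one stock-day (hypothetical distribution).
rng = np.random.default_rng(42)
amounts = pd.Series(rng.lognormal(mean=10, sigma=1, size=5000))
table = large_share_by_k(amounts)
print(table)
```

As k rises, fewer trades qualify, so the ratio factors are built from a thinner sample of the day's trades.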
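K-line reconstruction filter: reconstruct_minute_bars keeps the individual trades whose amount exceeds the threshold and rebuilds the bars from them. The report's wording is "large buy or large sell order participation", which may require more detailed order-book data or matching logic to replicate exactly. The current code is a simplification based on the trades themselves.
Factor calculation details: calculate_avg_outflow_ratio and calculate_large_kline_factors attempt to reproduce the factor logic described in the report. Pay attention to the formula details, in particular the cumulative product used for the large-order-driven return.
Performance: tick data is expensive to process, and looping over every day and every stock can be slow. In practice you may need parallel computation or more efficient data storage and querying (for example Dask, Polars, ClickHouse, or DolphinDB).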
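One way to reduce the per-stock, per-day Python loop is to compute the ratios for a whole panel with pandas group operations. The sketch below assumes all ticks sit in one DataFrame with hypothetical `date` and `ticker` columns next to `amount` and `bs_flag`; it uses the same trade-level mean + k × std rule as the sample code and is an illustration, not a tuned implementation.

```python
import pandas as pd

def large_order_ratios_panel(ticks: pd.DataFrame, k: float = 0.0) -> pd.DataFrame:
    """LBAR / LSAR / LOR for every (date, ticker) group, without an explicit Python loop."""
    keys = ["date", "ticker"]
    grouped = ticks.groupby(keys)["amount"]
    # Per-group threshold broadcast back to each trade; single-trade groups get std = 0.
    threshold = grouped.transform("mean") + k * grouped.transform("std").fillna(0.0)
    is_large = ticks["amount"] > threshold
    parts = pd.DataFrame({
        "date": ticks["date"],
        "ticker": ticks["ticker"],
        "total": ticks["amount"],
        "LBAR": ticks["amount"].where(is_large & (ticks["bs_flag"] == "B"), 0.0),
        "LSAR": ticks["amount"].where(is_large & (ticks["bs_flag"] == "S"), 0.0),
        "LOR": ticks["amount"].where(is_large, 0.0),
    })
    sums = parts.groupby(keys).sum()
    return sums[["LBAR", "LSAR", "LOR"]].div(sums["total"], axis=0)

# Hand-checkable toy panel (hypothetical tickers).
ticks = pd.DataFrame({
    "date":    ["2020-06-18"] * 6,
    "ticker":  ["A", "A", "A", "A", "B", "B"],
    "amount":  [100.0, 300.0, 200.0, 400.0, 50.0, 150.0],
    "bs_flag": ["S", "B", "B", "S", "B", "B"],
})
ratios = large_order_ratios_panel(ticks, k=0)
print(ratios)
```

For ticker A the mean amount is 250, so the 300 buy and the 400 sell are large and the ratios are 0.3, 0.4, and 0.7. For ticker B only the 150 buy is large, giving an LBAR of 0.75.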
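Factor testing: the code only computes factor values. To verify that a factor works, you still need to test it, including:
- Neutralization: regress the factor on industry and style factors (size, valuation, and so on) and take the residual.
- Standardization: remove outliers and standardize within each cross-section.
- IC analysis: compute the cross-sectional correlation (IC) between factor values and future returns (for example next-period returns), then look at the mean IC, its standard deviation, the ICIR, and the share of periods with IC > 0.
- Quantile backtest: group stocks by factor value, build long-short or long-only portfolios, and examine their performance (cumulative return, annualized return, Sharpe ratio, maximum drawdown, and so on). Libraries such as Alphalens can be used for this analysis.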
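The standardization and IC steps listed above can be sketched as follows. This is a minimal illustration on a synthetic panel: the function names, the MAD-based clipping rule, and the column names `LBAR_orth` and `fwd_ret` are assumptions, and the synthetic returns are built to depend on the factor so that the IC is positive by design.

```python
import numpy as np
import pandas as pd

def winsorize_zscore(x: pd.Series, n_mad: float = 3.0) -> pd.Series:
    """Clip one cross-section to median +/- n_mad scaled MADs, then z-score it."""
    med = x.median()
    mad = (x - med).abs().median() * 1.4826  # scale the MAD to be comparable to a std
    clipped = x.clip(lower=med - n_mad * mad, upper=med + n_mad * mad)
    return (clipped - clipped.mean()) / clipped.std()

def ic_series(panel: pd.DataFrame, factor_col: str, ret_col: str) -> pd.Series:
    """Per-date Pearson correlation between the standardized factor and the forward return."""
    ics = {}
    for date, day in panel.groupby(level="date"):
        ics[date] = winsorize_zscore(day[factor_col]).corr(day[ret_col])
    return pd.Series(ics, name="IC")

def ic_summary(ic: pd.Series) -> dict:
    """Mean IC, IC standard deviation, ICIR, and share of periods with IC > 0."""
    return {"IC_mean": ic.mean(), "IC_std": ic.std(),
            "ICIR": ic.mean() / ic.std(), "hit_rate": (ic > 0).mean()}

# Synthetic panel: 20 dates x 50 stocks, forward return = signal + noise (not real data).
rng = np.random.default_rng(7)
dates = pd.date_range("2020-01-01", periods=20, freq="B")
stocks = [f"S{i:02d}" for i in range(50)]
index = pd.MultiIndex.from_product([dates, stocks], names=["date", "ticker"])
factor = rng.normal(size=len(index))
panel = pd.DataFrame({
    "LBAR_orth": factor,
    "fwd_ret": 0.01 * factor + rng.normal(scale=0.02, size=len(index)),
}, index=index)

ic = ic_series(panel, "LBAR_orth", "fwd_ret")
summary = ic_summary(ic)
print(summary)
```

With real data, `fwd_ret` would be the return over the period after the factor is observed, and the factor column would first go through the neutralization step.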
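Complex factors: the report's "large buy excluding large sell", "large sell excluding large buy", and "large buy and large sell" factors depend on knowing whether each side of a trade is a large order. This usually requires richer data (such as Level-2 snapshots linked to tick-by-tick trades) or dedicated processing logic to implement accurately, so the sample code in Section 3 does not include them.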