Script update: cell niches in high-resolution spatial transcriptomics

Author: Evil Genius
Today's goal: cell niches

The method implemented here reflects a simple observation: in cell niche analysis, different cell types have different sizes, so the distance range that defines a niche should be adjusted accordingly.

Let's implement it as a complete Python script.
The Linux, R, and Python basics are the foundation of all analysis and are fairly easy; going through them carefully once is enough. The 2025 extra training on Linux, R, and Python (https://mp.weixin.qq.com/s/8E1vYJMNhe5m0DXieBHfzA?payreadticket=HM3XzM3dAiSdVyi83oiJ3UWd-zPIOKO1EMMnYwy9ASu_scWj_WEje8Xwiwq_z8JvgcLDfd0&scene=1&click_id=2) is still in progress: the Linux part has just finished, and R and Python have not started yet.
Writing and running encapsulated (class- and function-wrapped) scripts is a particular focus.
As always: going through the basics carefully once is enough. Spend more of your time on analysis ideas and scripting methods, and don't let the basics trip you up.
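Before the full script, here is a minimal sketch of where the default neighbourhood radius of 0.034 comes from, assuming the point coordinates are stored in millimetres and using typical cell sizes (roughly a 4 µm lymphocyte radius and a 30 µm diameter for a large macrophage); both numbers are illustrative assumptions, not values taken from the script.

# Illustrative only: derive the niche radius from assumed cell sizes
lymphocyte_radius_um = 4.0       # assumed typical lymphocyte radius (µm)
macrophage_diameter_um = 30.0    # assumed diameter of a large macrophage (µm)

radius_um = lymphocyte_radius_um + macrophage_diameter_um  # 34 µm
radius_mm = radius_um / 1000.0                             # 0.034, the script's default
print(radius_mm)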
Full-pipeline Python script
#!/usr/bin/env python
####zhaoyunfei
####20251102

"""
Main Components:
- `Neighbourhood`: A class to aggregate spatial marks based on proximity using KDTree.
- `run_neighbourhood_aggregation`: Function to apply neighbourhood aggregation for multiple WSI UUIDs.
- `cluster`: Function to perform MiniBatchKMeans clustering on aggregated neighbourhood data.
- `run_clustering`: Function to execute the clustering process for different entities.
- `run`: Main function that coordinates the neighbourhood aggregation and clustering workflow.
"""

import pathlib

import duckdb
import numpy as np
import pandas as pd
from scipy.spatial import KDTree
from sklearn.cluster import MiniBatchKMeans
from tqdm import tqdm

from src.utils import get_geometries


class Neighbourhood:
    """
    This class represents a Neighbourhood, which aggregates marks from spatial points within a certain radius.
    
    Attributes:
        radius (float): The radius within which neighbours are aggregated.
    """
    def __init__(self, radius=0.034):
        """
        Initializes the Neighbourhood with a specified radius.

        Args:
            radius (float): The distance within which to search for neighbours. The default of 0.034
                (assumed to be in mm, i.e. 34 µm) corresponds to the radius of a lymphocyte plus the
                diameter of a large macrophage.
        """

        self.radius = radius

    def __call__(self, points, marks):
        """
        Aggregate marks for each point based on neighbouring points within the defined radius.

        Args:
            points (pd.DataFrame): Spatial points keyed by 'wsi_uuid' and 'polygon_uuid' (as columns or index levels), with a 'geom' column of shapely Point objects.
            marks (pd.DataFrame): DataFrame containing marks corresponding to the points.

        Returns:
            pd.DataFrame: Aggregated marks with neighbourhood features.
        """
        # Merge points and marks on 'wsi_uuid' and 'polygon_uuid'
        df = points.merge(marks, on=["wsi_uuid", "polygon_uuid"], validate="one_to_one")
        # Extract x, y coordinates from geometry
        xy = np.c_[df.geom.apply(lambda point: point.x), df.geom.apply(lambda point: point.y)]

        # Drop geometry from the dataframe
        multihot = df.drop("geom", axis=1)
        # Build KDTree to find neighbours within the radius
        neighbours = KDTree(xy).query_ball_point(xy, r=self.radius)
        # Aggregate the marks for each point based on its neighbours
        aggregated = np.stack([multihot.iloc[n].sum(0) for n in neighbours])
        aggregated = pd.DataFrame(
            aggregated,
            index=multihot.index,
            columns=multihot.columns
        )
        return aggregated
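# A minimal usage sketch for Neighbourhood (commented out; the names and mark columns below are
# illustrative assumptions). It assumes 'geom' holds shapely Point objects, which is what
# __call__ relies on via .x / .y:
#
#   from shapely.geometry import Point
#   idx = pd.MultiIndex.from_tuples(
#       [("w1", "p1"), ("w1", "p2"), ("w1", "p3")], names=["wsi_uuid", "polygon_uuid"]
#   )
#   pts = pd.DataFrame({"geom": [Point(0.0, 0.0), Point(0.01, 0.0), Point(1.0, 1.0)]}, index=idx)
#   mks = pd.DataFrame({"tumour": [1, 0, 0], "lymphocyte": [0, 1, 1]}, index=idx)
#   Neighbourhood(0.034)(pts, mks)   # p1 and p2 aggregate each other; p3 only sees itself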


def run_neighbourhood_aggregation():
    """
    Runs the neighbourhood aggregation for each whole slide image (WSI) UUID.
    
    This function reads geometries (marks, points, regions) and iteratively applies the Neighbourhood class 
    to aggregate marks for each point, saving the results to a parquet file.
    """
    # Retrieve geometry data (marks, points, regions)
    marks, points, regions = get_geometries()
    # Get unique WSI UUIDs
    wsi_uuids = points.wsi_uuid.drop_duplicates()

    outcome = []
    # Loop through each WSI UUID and aggregate the neighbourhood data
    for wsi_uuid in tqdm(wsi_uuids, desc="Aggregating neighbourhood of each cell"):
        outcome.append(Neighbourhood(0.034)(
            points.query("wsi_uuid==@wsi_uuid").set_index(["polygon_uuid", "wsi_uuid"]),
            marks.query("wsi_uuid==@wsi_uuid").set_index(["wsi_uuid", "polygon_uuid"])
        ))
    # Concatenate all results and save to parquet
    outcome = pd.concat(outcome)
    outcome.reset_index().to_parquet("/data/cell_neighbourhoods", partition_cols=["wsi_uuid"])
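# Note: with partition_cols=["wsi_uuid"], pandas/pyarrow writes a Hive-partitioned dataset, e.g.
#   /data/cell_neighbourhoods/wsi_uuid=<uuid>/<fragment>.parquet
# (layout shown for illustration; the fragment file names are chosen by pyarrow).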


def gethist(df):
    """
    Log-compresses and normalizes the densities in the dataframe for further processing.
    
    Args:
        df (pd.DataFrame): The input DataFrame to be normalized and log-compressed.
    
    Returns:
        np.ndarray: The normalized and log-compressed values from the input DataFrame.
    """
    df = df.fillna(0) # Fill missing values with 0, if no cell is observed we call it 0 density
    df = np.log1p(df) # Apply log compression to mitigate distribution skew typically observed in densities
    vals = df.values # Keep an array of values only 
    vals = vals / (vals.sum(-1, keepdims=True) + 1e-6)  # Normalize (sum to 1)
    return vals
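# A quick worked example of gethist on a single row [0, 1, 3] (illustrative values):
#   log1p      -> [0.0000, 0.6931, 1.3863]
#   normalised -> [0.0000, 0.3333, 0.6667]   (each row sums to ~1)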


def columns2strings(df):
    """
    Convert column names of the DataFrame to string type.

    Args:
        df (pd.DataFrame): DataFrame whose columns need conversion to string.

    Returns:
        pd.DataFrame: DataFrame with columns converted to strings.
    """
    df.columns = df.columns.astype("str")
    return df


def cluster(df):
    """
    Perform clustering on the neighbourhoods using MiniBatchKMeans.
    
    Args:
        df (pd.DataFrame): The input DataFrame representing the neighbourhood features.

    Returns:
        pd.Series: Cluster assignments for each neighbourhood.
        pd.DataFrame: Prototypical neighbourhood features for each cluster.
    """
    # log-compress and normalize (sum to one)
    hists = gethist(df)
    # Initialize MiniBatchKMeans for clustering
    kms = MiniBatchKMeans(
        n_clusters=10,
        random_state=0,
        n_init="auto",
        verbose=10,
        batch_size=8000,
        max_no_improvement=200
    )
    # Fit the model and get cluster IDs
    cluster_ids = pd.Series(kms.fit_predict(hists), index=df.index)
    # Extract prototypical neighbourhoods (cluster centers)
    prototypes = pd.DataFrame(kms.cluster_centers_, columns=df.columns)
    return cluster_ids, prototypes
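# Sketch of how the two outputs relate (assuming `nbh` is an aggregated-neighbourhood DataFrame):
#   cluster_ids, prototypes = cluster(nbh)
#   cluster_ids.value_counts()   # number of cells assigned to each of the 10 niches
#   prototypes.loc[3]            # average (log-compressed, normalised) composition of niche 3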


def run_clustering():
    """
    Executes the clustering process for each entity ('LUAD', 'LUSC') by joining metadata and neighbourhood data.
    
    Saves the cluster assignments, niche loading, and prototypical neighbourhoods as parquet files.
    """
    for entity in ["LUAD", "LUSC"]:
        # metadata contains the entity
        # join with neighbourhoods subset ("LUAD/LUSC")
        
        # Join metadata with neighbourhood data based on WSI UUID and entity
        neighbourhoods = duckdb.execute("""
        with wsi_uuid_to_entity as (select parent, first(ENTITY) as entity
        from read_csv('/data/tier1/metadata/he_to_mif.csv')
        group by parent)
        select cn.*
        from read_parquet('/data/cell_neighbourhoods/*/*.parquet') cn
        join wsi_uuid_to_entity w2e on cn.wsi_uuid = w2e.parent
        where entity=$entity
        """, parameters=dict(entity=entity))
        neighbourhoods = neighbourhoods.df().set_index(["polygon_uuid", "wsi_uuid"])

        # Perform clustering
        cluster_ids, prototypes = cluster(neighbourhoods)
        # Store the results
        # For each cell, store which niche it was assigned to
        (  
            cluster_ids
            .to_frame("niche_id")
            .reset_index()
            .to_parquet(f"/data/cell_niche_assignment", partition_cols=["wsi_uuid"])
        )
        # For each slide (wsi_uuid), count how many cells were assigned to each niche
        (
            cluster_ids
            .to_frame("niche_id")
            .groupby("wsi_uuid")
            .niche_id
            .value_counts()
            .unstack()
            .fillna(0)
            .pipe(columns2strings)  # parquet requires string column names
            .reset_index()
            .to_parquet("/data/spot_niche_loading", partition_cols=["wsi_uuid"])
        )
        # Save the prototypical neighbourhoods for each niche
        (  
            prototypes
            .to_parquet(f"/data/{entity}_niche_prototypes.parquet")
        )
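# Reading the partitioned outputs back is straightforward (illustrative; the wsi_uuid column is
# recovered from the partition directories):
#   pd.read_parquet("/data/cell_niche_assignment")
#   pd.read_parquet("/data/spot_niche_loading")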


def run():
    """
    Main function to execute the full pipeline: 
    - Neighbourhood aggregation
    - Clustering
    """
    # Ensure the output directory exists
    pathlib.Path("/data/").mkdir(parents=True, exist_ok=True)
    # Run neighbourhood aggregation
    run_neighbourhood_aggregation()
    # Run clustering on the aggregated neighbourhoods
    run_clustering()


if __name__ == "__main__":
    run()

Life is good, and even better with you.
<think>嗯,用户这次聚焦在小生境方法中应用K-means的具体机制,结合之前提供的三个引用来看,需要突出“如何通过聚类划分小生境”这一核心。用户似乎正在研究进化算法中的多样性保持策略,可能是研究生或算法工程师。 注意到用户特别强调“参考站内引用”,说明ta需要理论支撑。引用[3]最相关,提到K-means将种群划分为K个子种群;引用[1]定义了聚类中心即小生境区域;引用[2]虽未直接关联,但暗示初始中心选择的重要性——这三者正好构成逻辑链:定义(小生境即簇)→ 实现(聚类过程)→ 优化(初始中心策略)。 需要重点解释三个技术细节:1) 个体到簇中心的分配如何对应小生境形成;2) 目标函数值作为特征向量的具体含义;3) 子种群独立进化的优势。用户隐含需求可能是想复现类似CSCD-MMODE算法的设计,因此要强调工程实现要点,比如特征向量的构建方式。 突然想到个潜在问题:用户可能混淆了“聚类中心”和“局部最优点”的概念。在回答里需要区分——聚类中心只是搜索空间区域的代表点,未必是最优点,这点在引用[1]最后一句有暗示但不够明确。补充说明这点能避免误解。 最后生成的问题要引导用户深入技术实现层面,比如K值选取对多峰优化的影响(引用[1]提到的多峰问题)、聚类频率等实操细节。用户显然已超出基础概念阶段,问题可以更具挑战性。 </think>在小生境方法中利用K-means聚类主要通过以下步骤实现,该方法常用于进化算法中以维持种群多样性和处理多峰优化问题: --- ### **核心方法流程** 1. **种群表示** 将进化算法中的每个个体视为一个数据点,其特征向量通常包含: - **解的位置信息**:决策变量值$(x_1, x_2, \ldots, x_n)$ - **适应度特征**:目标函数值$(f_1, f_2)$(多目标优化时)[^3] 2. **聚类划分小生境** 应用K-means算法将种群划分为$K$个子群(小生境): - **输入**:$N$个个体(数据点),预设小生境数量$K$ - **输出**:$K$个簇$C_1, C_2, \ldots, C_K$,每个簇对应一个小生境 - **簇中心计算**: $$ \mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} \mathbf{x}_i $$ 其中$\mu_k$为第$k$个小生境的中心[^1][^3] 3. **小生境内独立进化** - 每个子群在自身小生境内执行选择、交叉、变异等操作 - 关键策略: - **资源分配**:根据小生境规模或质量分配计算资源 - **隔离进化**:避免不同小生境个体直接竞争,保留局部最优解特征[^1] 4. **动态更新机制** 每经过$G$代重新聚类,适应环境变化: ```python if generation % G == 0: niches = kmeans(population, K) # 重新划分小生境 reset_subpopulations(niches) # 重置子群 ``` --- ### **技术优势与作用** | **优势** | **作用机制** | |-------------------------|-----------------------------------------------------------------------------| | **保持多样性** | 强制将种群分散到$K$个区域,避免聚集到单一峰值[^1] | | **多峰优化能力** | 每个小生境独立探索局部最优解,增加找到全局最优概率 | | **解空间覆盖** | 簇中心$\mu_k$自动定位潜在最优区域[^3] | | **计算效率** | 仅需周期性聚类(如每10代),避免全程高开销 | --- ### **实例应用(多目标优化)** 在CSCD-MMODE算法中: 1. 使用K-means将种群按目标函数值$(f_1, f_2)$聚类 2. 每个子群专注探索Pareto前沿的一个区域 3. 结果:提升解集多样性和收敛性[^3] > **关键设计**:聚类时**目标函数值作为特征向量**,使小生境对应Pareto前沿的不同区段。 --- ### **注意事项** 1. **K值选择**:需平衡探索能力($K$较大)与计算成本($K$较小),可通过轮廓系数或肘部法则确定 2. **初始中心敏感**:采用K-means++初始化提升稳定性[^2] 3. **高维问题**:特征维度较高时建议先降维(如PCA) 该方法通过空间分区机制,有效解决了进化算法早熟收敛问题,特别适用于多模态函数优化和多目标优化[^1][^3]。 --- **相关问题** 1. K-means聚类在小生境方法中应该如何选择$K$值? 2. 如何处理K-means对初始聚类中心敏感的问题? 3. 相比基于距离的小生境技术,基于聚类的方案有何优势?