Topological Clustering Algorithm

Identifying homogeneous subgroups of patients and important features: a topological machine learning approach

The algorithm described in this post comes from the paper above.

If you are interested in topological data analysis in general, you can also take a look at my other article on the topic.

Contents

Step 1: Setting up the environment

What you will need:

Configuring the Linux environment:

Step 2: Walking through the code

prepare_inputs.py: full code

prepare_inputs.py: step-by-step breakdown

test_single_graph.py: full code

test_single_graph.py: breakdown and modifications

process_outputs.py: code


Step 1: Setting up the environment

What you will need:

  1. A Linux server (it does not need to be powerful if you only plan to test this code).
  2. A good shell/SSH client is recommended (FinalShell is free).

Configuring the Linux environment:

GitHub - kcl-bhi/mapper-pipeline: Identifying homogeneous subgroups of patients and important features: a topological machine learning approach

Go to the paper's open-source repository above, download the code, unzip it, and upload it to your Linux server.

I put the code in a folder called tda that I created myself.

Of course, you also need Python installed on the Linux server, plus a virtual environment tool (handy for keeping different projects separate).

# After installing Python, install virtualenv
pip install virtualenv

# From the project directory you downloaded, create a virtual environment (tda_env is just the name I chose)
virtualenv tda_env
# Once created, a tda_env folder will appear in the project directory

# Activate the virtual environment
source tda_env/bin/activate

# To leave the virtual environment later, run
deactivate

Once you are inside the virtual environment, install the required Python packages.

From the project directory you downloaded, simply run:

pip install -r requirements.txt

Servers inside mainland China will probably run into trouble downloading some of these packages.

If some of these packages fail to install, you will need to obtain them yourself, upload them to the Linux server, and install them locally. A concrete example:

# Clone the package into your project directory
git clone https://github.com/MathieuCarriere/statmapper.git

# Enter the package directory
cd statmapper

# Install it locally with pip
pip install .

At this point, if nothing has gone wrong, the environment is set up and you should be able to run the code. If you hit problems, feel free to message me.

Step 2: Walking through the code

The core of the algorithm consists of four code files: prepare_inputs.py, test_single_graph.py, process_outputs.py, and functions.py.

Because the dataset used in the paper has not been made public, it cannot be shared with us, so the authors wrote prepare_inputs.py to simulate the data. (The extra comments in the code are my own additions.)

prepare_inputs.py: full code

# Title:        Prepare pipeline inputs for different parameter combinations
# Author:       Ewan Carr (@ewancarr)
# Updated:      2021-02-10

import os
import shutil
import pickle
import csv
import pandas as pd
import numpy as np
from sklearn.manifold import MDS
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_multilabel_classification
import gower
import urllib.request

# Debug flag
debug=1

# Check if input dataset is available
data_avail = os.path.exists('input.csv')

# Load dataset if available; otherwise simulate one (i.e. randomly generate an input)
# n_samples defaults to 430
if data_avail:
    data = pd.read_csv('input.csv').dropna()
else:
    sim = make_multilabel_classification(n_samples = 430,
                                         n_features = 137,
                                         n_classes = 8)
    data = pd.DataFrame(sim[0])
    data.columns = ['X' + str(i + 1) for i in data.columns]
    data['fila'] = np.random.uniform(-5, 5, len(data.index))
    data['filb'] = np.random.uniform(-5, 5, len(data.index))
    data['ycont'] = np.random.uniform(30, 80, len(data.index))
    data['ybin'] = data['ycont'] > 60

if debug==1:
    print(data)

# Identify categorical items

"""
For computing the Gower matrix (this file) and descriptive statistics
(later) we need to specify which variables in the input dataset should
be treated as categorical. This can be specified with a CSV file
containing the columns 'index' and 'categorical':

In other words, categorical_items.csv tells the pipeline which variables are categorical.

index:          A column of variable names matching those in 'input.csv'
categorical:    A column of 0 or 1 indicating whether each variable should
                be treated as categorical.
"""

if data_avail:
    categ = pd.read_csv('categorical_items.csv')[['index', 'categorical']]
else:
    categ = pd.DataFrame({'index': list(data),
                          'categorical': np.concatenate([np.repeat(1, 137),
                                                         np.array([0, 0, 0, 1])])})
    categ.to_csv('categorical_items.csv')

# A value of 1 in the CSV means the variable is treated as categorical
is_categ = categ['categorical'].values == 1

if debug==1:
    print(is_categ)

# Compute Gower distance matrix, i.e. the distance between every pair of samples
gmat = gower.gower_matrix(data, cat_features=is_categ)

if debug==1:
    print(np.shape(gmat))


# ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
# ┃                                                                           ┃
# ┃                         Define parameter settings                         ┃
# ┃                                                                           ┃
# ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

# Define filters ==============================================================
fil = {}

# MDS, with Gower distance matrix, first two components: reduce the distance matrix to 2 dimensions
# n_components sets the target dimensionality
fil['mds'] = MDS(n_components=2,
                 dissimilarity='precomputed').fit_transform(gmat)


# MADRS, BDI, PRS: the three filters below are continuous variables
fil['fila'] = data['fila'].values
fil['filb'] = data['filb'].values
fil['ycont'] = data['ycont'].values

# Create combinations of the above
for j in ['fila', 'filb', 'ycont']:
    fil['mds' + '_' + j] = np.column_stack([fil['mds'], fil[j]])

if debug==1:
    print("fil:")
    print(fil)
    print("fill.item:")
    print(fil.items())

# fil now holds: mds, fila, filb, ycont, mds_fila, mds_filb, mds_ycont

# Define clustering algorithm =================================================
# Use the DBSCAN density-based clustering algorithm
cluster = [DBSCAN(eps=i, metric='precomputed')
           for i in[0.5, 1, 5, 10, 50, 100]]
cluster.append(AgglomerativeClustering(affinity='precomputed',
                                       linkage='average'))
cluster.append('auto')

# Define resolutions ==========================================================
resolutions = [(res, c)
               for res in [1, 3, 5, 10, 30, 50]
               for c in cluster]
resolutions.append((np.nan, None))

# if debug==1:
#     print("resolutions:")
#     print(resolutions)

# Create dictionary containing all combinations of input parameters ===========
# Build every combination of the input parameters
params = [{'fil': f,
           'res': r,
           'gain': gain}
          for f in fil.items()
          for r in resolutions
          for gain in [0.1, 0.2, 0.3, 0.4]]

# ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
# ┃                                                                           ┃
# ┃                 Generate input files for all combinations                 ┃
# ┃                                                                           ┃
# ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

# Set folder to store input files
inp = 'inputs'
out = 'outputs'

# Create folder for inputs; delete if exists already
if os.path.isdir(inp):
    shutil.rmtree(inp) 
os.mkdir(inp)

if not os.path.isdir(out):
    os.mkdir(out)

for i, p in enumerate(params, 1):
    gid = f'{i:04}'
    # Save params as pickle
    with open(os.path.join(inp, gid + '.pickle'), 'wb') as f:
        pickle.dump(p, f)  # store each parameter set as a binary object
    # Also record the combination in an index CSV
    with open(os.path.join(inp, 'index.csv'), 'a') as f:
        writer = csv.writer(f)
        writer.writerow([gid,
                         str(p['fil'][0]),
                         str(p['gain']),
                         str(p['res'][0]),
                         str(p['res'][1])])
    # Add to job list for GNU parallel
    with open('jobs', 'a+') as f:
        f.write('python test_single_graph.py ' + gid + ' inputs outputs\n')

# ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
# ┃                                                                           ┃
# ┃                      Export other required datasets                       ┃
# ┃                                                                           ┃
# ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

# Binary 12-week outcome ------------------------------------------------------
data['ybin'].to_csv(os.path.join(inp, 'ybin.csv'), index=False, header=False)
data['ycont'].to_csv(os.path.join(inp, 'ycont.csv'), index=False, header=False)

# Baseline clinical features (excluding any outcomes) -------------------------
to_drop = ['fila', 'filb', 'ycont', 'ybin']
data.drop(columns=to_drop, axis=1) \
    .to_csv(os.path.join(inp, 'X.csv'), index=False, header=True)

# Gower distance matrix
pd.DataFrame(gmat).to_csv(os.path.join(inp, 'gower.csv'),
                          index=False,
                          header=False)

All of the code above has a single purpose: to generate the test data that the pipeline will run on.

After it runs, you get 1372 pickle files and 5 CSV files in the inputs folder, plus a jobs file for GNU parallel.

Each pickle file holds one combination from the full grid of filters and parameters, stored in pickle form so that the later steps can load it directly.

prepare_inputs.py: step-by-step breakdown

if data_avail:
    data = pd.read_csv('input.csv').dropna()
else:
    sim = make_multilabel_classification(n_samples = 430,
                                         n_features = 137,
                                         n_classes = 8)
    data = pd.DataFrame(sim[0])
    data.columns = ['X' + str(i + 1) for i in data.columns]
    data['fila'] = np.random.uniform(-5, 5, len(data.index))
    data['filb'] = np.random.uniform(-5, 5, len(data.index))
    data['ycont'] = np.random.uniform(30, 80, len(data.index))
    data['ybin'] = data['ycont'] > 60

If input.csv is not present, this block automatically generates a dataset with 430 rows and 137 + 4 columns, where each row is a sample and each column is an attribute.
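A quick sanity check I added right after this block confirms the shape of the simulated frame:

# 430 samples, 137 simulated items plus the four extra columns
print(data.shape)               # (430, 141)
print(list(data.columns[-4:]))  # ['fila', 'filb', 'ycont', 'ybin']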

if data_avail:
    categ = pd.read_csv('categorical_items.csv')[['index', 'categorical']]
else:
    categ = pd.DataFrame({'index': list(data),
                          'categorical': np.concatenate([np.repeat(1, 137),
                                                         np.array([0, 0, 0, 1])])})
    categ.to_csv('categorical_items.csv')

# A value of 1 in the CSV means the variable is treated as categorical
is_categ = categ['categorical'].values == 1

if debug==1:
    print(is_categ)

# Compute Gower distance matrix, i.e. the distance between every pair of samples
gmat = gower.gower_matrix(data, cat_features=is_categ)

if debug==1:
    print(np.shape(gmat))

This block then tells the gower function (a distance calculator) which attributes are categorical and which are not. The result is a 430 × 430 distance matrix.
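If Gower distance is new to you, here is a tiny self-contained sketch (my own toy data, not from the pipeline) showing how gower_matrix mixes a categorical and a continuous column; the result is always a symmetric matrix with zeros on the diagonal:

import numpy as np
import pandas as pd
import gower

toy = pd.DataFrame({'colour': ['red', 'blue', 'red'],   # categorical
                    'score':  [0.1, 0.9, 0.2]})         # continuous
d = gower.gower_matrix(toy, cat_features=np.array([True, False]))
print(d.shape)   # (3, 3)
print(d)         # symmetric, zero diagonal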

# Define filters ==============================================================
fil = {}

# MDS, with Gower distance matrix, first two components: reduce the distance matrix to 2 dimensions
# n_components sets the target dimensionality
fil['mds'] = MDS(n_components=2,
                 dissimilarity='precomputed').fit_transform(gmat)


# MADRS, BDI, PRS: the three filters below are continuous variables
fil['fila'] = data['fila'].values
fil['filb'] = data['filb'].values
fil['ycont'] = data['ycont'].values

# Create combinations of the above
for j in ['fila', 'filb', 'ycont']:
    fil['mds' + '_' + j] = np.column_stack([fil['mds'], fil[j]])

As I understand it, this block stores the outputs of the filter functions (in the paper's terminology); here those outputs are computed from the randomly generated data. The resulting fil dictionary looks like this:

key:       mds  fila  filb  ycont  mds_fila  mds_filb  mds_ycont
columns:   2    1     1     1      3         3         3
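You can confirm those column counts by printing the shape of every filter (a small check I added):

for name, values in fil.items():
    print(name, np.shape(values))
# e.g. mds (430, 2), fila (430,), mds_fila (430, 3), ...
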
# Define clustering algorithm =================================================
# Use the DBSCAN density-based clustering algorithm
cluster = [DBSCAN(eps=i, metric='precomputed')
           for i in[0.5, 1, 5, 10, 50, 100]]
cluster.append(AgglomerativeClustering(affinity='precomputed',
                                       linkage='average'))
cluster.append('auto')

This defines the clustering algorithms to sweep over: DBSCAN with several eps values, average-linkage agglomerative clustering, and 'auto' (DBSCAN is worth reading up on if it is new to you).
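Here metric='precomputed' means DBSCAN is given a distance matrix rather than raw coordinates, which is exactly what Mapper will feed it. A minimal standalone illustration (my own toy matrix):

from sklearn.cluster import DBSCAN
import numpy as np

# Toy 4x4 distance matrix: points 0/1 are close, 2/3 are close, the groups are far apart
dist = np.array([[0.0, 0.2, 5.0, 5.0],
                 [0.2, 0.0, 5.0, 5.0],
                 [5.0, 5.0, 0.0, 0.3],
                 [5.0, 5.0, 0.3, 0.0]])
labels = DBSCAN(eps=0.5, min_samples=2, metric='precomputed').fit_predict(dist)
print(labels)   # two clusters: [0 0 1 1]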

# Define resolutions ==========================================================
resolutions = [(res, c)
               for res in [1, 3, 5, 10, 30, 50]
               for c in cluster]
resolutions.append((np.nan, None))

These two loops form the Cartesian product of the res values (the resolution in the paper) and the clusterers defined above: 6 × 8 = 48 pairs, plus a final (nan, None) entry, for 49 in total.

params = [{'fil': f,
           'res': r,
           'gain': gain}
          for f in fil.items()
          for r in resolutions
          for gain in [0.1, 0.2, 0.3, 0.4]]

These three loops then combine every filter, every (resolution, clusterer) pair, and every gain value (the cover overlap in the paper), giving 7 × 49 × 4 = 1372 parameter sets. The paper explains clearly why this exhaustive sweep is needed.

The code that follows writes each of these parameter combinations to its own pickle file and exports the remaining datasets to their corresponding files.
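Each of the 1372 input pickles can be opened and inspected directly. A small sketch (my addition), assuming prepare_inputs.py has already been run:

import pickle

with open('inputs/0001.pickle', 'rb') as f:
    p = pickle.load(f)

print(p.keys())                     # dict_keys(['fil', 'res', 'gain'])
fil_label, fil_values = p['fil']    # filter name and its values
res, clusterer = p['res']           # resolution and clustering object ('auto' or None are possible)
print(fil_label, res, p['gain'])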

test_single_graph.py: full code

# Title:        Test single Mapper graph on Rosalind
# Author:       Ewan Carr
# Started:      2020-05-06

import matplotlib.pyplot as plt
import os
import sys
import pickle
import pandas as pd
import numpy as np
from sklearn_tda import MapperComplex
from functions import (representative_features,
                       identify_membership,
                       count_features)

from functions import dg
debug = False

'''
This script takes three inputs:
    1. An index number (0001-9999).
    2. An input directory (e.g. 'inputs')
    3. An output directory (e.g. 'outputs')

It then:
    1. Loads the corresponding data/parameters (e.g. from inputs/0001.pickle).
    2. Runs Mapper with these parameters.
    3. Identifies significant, representative topological features.
    4. Extracts information on the number and type of features.
'''

# ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
# ┃                                                                           ┃
# ┃                            STEP 1: Load inputs                            ┃
# ┃                                                                           ┃
# ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

print("Loading inputs...")
if debug:
    i = '0333'
    # inputs = 'inputs/'
    inputs = "./inputs/"
    outputs = './outputs/'
else:
    i = sys.argv[1]
    inputs = sys.argv[2] + '/'
    outputs = sys.argv[3] + '/'
if debug:
    print(os.getcwd())

# Load inputs/datasets
current = pickle.load(open(inputs + i + '.pickle', 'rb'))
X = pd.read_csv(inputs + 'X.csv')
ybin = pd.read_csv(inputs + 'ybin.csv', header=None, names=['hdremit.all'])
gower = pd.read_csv(inputs + 'gower.csv', header=None)

print(current)

# ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
# ┃                                                                           ┃
# ┃                            STEP 2: Run Mapper                             ┃
# ┃                                                                           ┃
# ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

print("Running Mapper...")
fil_label, fil = current['fil']
print('fil_label:',fil_label)
print('len(fil):',len(fil))


if fil.ndim == 1:
    fil = fil.reshape(-1, 1)
n_filters = np.shape(fil)[1]
res, clust = current['res']
gain = current['gain']
params = {'inp': 'distance matrix',  
          'filters': fil,
          'filter_bnds': np.array(np.nan * n_filters),
          'colors': fil,
          'resolutions': np.array([res] * n_filters),
          'gains': np.array([gain] * n_filters)}
if clust == 'auto':
    m = MapperComplex(**params).fit(gower.values)
else:
    m = MapperComplex(**params,
                      clustering=clust).fit(gower.values)
M = {'fil': fil,
     'X': gower.values,
     'map': m,
     'params': params}

# print('m:')
# print(m)
# plt.plot(m)
# plt.show()
print('Description of m:')
dg(m)

# ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
# ┃                                                                           ┃
# ┃                   STEP 3: Identify significant features                   ┃
# ┃                                                                           ┃
# ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

# confidence defaults to 0.90, bootstrap to 100
print("Identifying significant features...")
# print("M['params'].columns", M['params'].columns)
# print(M['params'])

print("初始M['params'].keys():",M['params'].keys())
print("初始len(M['X']): ",len(M['X']))
print("初始len(M['params']['filters']):",len(M['params']['filters']))


rep_feat = representative_features(M,
                                   confidence=0.9,
                                   bootstrap=100,
                                   inp='distance matrix')
# print("rep_feat:")
# print(rep_feat)
print("初始len(rep_feat['params']['filters']:",len(rep_feat['params']['filters']))

# membership = identify_membership(rep_feat) 

#test
membership = identify_membership(rep_feat, clust) 
# membership = identify_membership(M) 
# m = MapperComplex(**params).fit(gower.values)
# print('Description of m without a specified clusterer:')
# dg(m)
# m = MapperComplex(**params, clustering=clust).fit(gower.values)
# print('Description of m with the clusterer specified:')
# dg(m)
#test
print(np.shape(membership))

###test
# print('membership: ')
# for kkk in range(len(membership['nodes'])):
#     print(membership['nodes'][kkk])
#     if membership['D1'] is not None:
#         print('D1:', membership['D1'][kkk])
#         print('U1:', membership['U1'][kkk])
# print(membership)
# print('D1.sum:', membership['D1'].sum())
# print('U1.sum:', membership['U1'].sum())
###test

# ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
# ┃                                                                           ┃
# ┃                     STEP 4: Extract required outputs                      ┃
# ┃                                                                           ┃
# ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

feature_counts = count_features(rep_feat)
feature_labels = ['n_feat',
                  'n_sig',
                  'downbranch',
                  'upbranch',
                  'conn',
                  'loop',
                  'sig_downbranch',
                  'sig_upbranch',
                  'sig_conn',
                  'sig_loop']
results = {}
for c, l in zip(feature_counts, feature_labels):
    results[l] = c
results['n_nodes'] = len(M['map'].node_info_.items())
results['memb'] = membership
results['params'] = current

with open(os.path.join(outputs, i + '.pickle'), 'wb') as f:
    pickle.dump(results, f)

The purpose of this script is to load one of the previously pickled parameter sets together with the shared datasets, run the Mapper algorithm with them, identify the significant features in the resulting graph, and finally compute some statistics on those features.

test_single_graph.py: breakdown and modifications

print("Loading inputs...")
if debug:
    i = '0333'
    # inputs = 'inputs/'
    inputs = "./inputs/"
    outputs = './outputs/'
else:
    i = sys.argv[1]
    inputs = sys.argv[2] + '/'
    outputs = sys.argv[3] + '/'
if debug:
    print(os.getcwd())

# Load inputs/datasets
current = pickle.load(open(inputs + i + '.pickle', 'rb'))
X = pd.read_csv(inputs + 'X.csv')
ybin = pd.read_csv(inputs + 'ybin.csv', header=None, names=['hdremit.all'])
gower = pd.read_csv(inputs + 'gower.csv', header=None)

This block loads the saved inputs; the command-line arguments decide which pickle file is read. For example:

python3 test_single_graph.py '0333' 'inputs' 'outputs'

reads 0333.pickle from the inputs folder and stores the result in the outputs folder.
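prepare_inputs.py also writes a jobs file containing one such command per parameter combination (1372 lines). The authors run it with GNU parallel; if you only want to replay a handful of jobs, a rough Python alternative (my own sketch) is:

import subprocess

with open('jobs') as f:
    jobs = [line.split() for line in f if line.strip()]

for cmd in jobs[:3]:   # run just the first three combinations as a smoke test
    subprocess.run(cmd, check=True)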

print("Running Mapper...")
fil_label, fil = current['fil']
print('fil_label:',fil_label)
print('len(fil):',len(fil))


if fil.ndim == 1:
    fil = fil.reshape(-1, 1)
n_filters = np.shape(fil)[1]
res, clust = current['res']
gain = current['gain']
params = {'inp': 'distance matrix',  
          'filters': fil,
          'filter_bnds': np.array(np.nan * n_filters),
          'colors': fil,
          'resolutions': np.array([res] * n_filters),
          'gains': np.array([gain] * n_filters)}
if clust == 'auto':
    m = MapperComplex(**params).fit(gower.values)
else:
    m = MapperComplex(**params,
                      clustering=clust).fit(gower.values)
M = {'fil': fil,
     'X': gower.values,
     'map': m,
     'params': params}

# print('m:')
# print(m)
# plt.plot(m)
# plt.show()
print('Description of m:')
dg(m)

This block extracts the parameters from the loaded data and then runs the Mapper algorithm.
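Beyond the dg() printout, you can poke at the fitted graph directly: node_info_ maps every Mapper node to the rows of the input that fall inside it, which is exactly how the rest of the pipeline uses it. A small inspection sketch I added:

# m is the MapperComplex fitted in the block above
print('number of nodes:', len(m.node_info_))
for node_id, info in list(m.node_info_.items())[:3]:
    print(node_id, '->', len(info['indices']), 'samples in this node')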

# confidence defaults to 0.90, bootstrap to 100
print("Identifying significant features...")
# print("M['params'].columns", M['params'].columns)
# print(M['params'])

print("初始M['params'].keys():",M['params'].keys())
print("初始len(M['X']): ",len(M['X']))
print("初始len(M['params']['filters']):",len(M['params']['filters']))


rep_feat = representative_features(M,
                                   confidence=0.9,
                                   bootstrap=100,
                                   inp='distance matrix')
# print("rep_feat:")
# print(rep_feat)
print("初始len(rep_feat['params']['filters']:",len(rep_feat['params']['filters']))

# membership = identify_membership(rep_feat) 

#test
membership = identify_membership(rep_feat, clust) 
# membership = identify_membership(M) 
# m = MapperComplex(**params).fit(gower.values)
# print('Description of m without a specified clusterer:')
# dg(m)
# m = MapperComplex(**params, clustering=clust).fit(gower.values)
# print('Description of m with the clusterer specified:')
# dg(m)
#test
print(np.shape(membership))

###test
# print('membership: ')
# for kkk in range(len(membership['nodes'])):
#     print(membership['nodes'][kkk])
#     if membership['D1'] is not None:
#         print('D1:', membership['D1'][kkk])
#         print('U1:', membership['U1'][kkk])
# print(membership)
# print('D1.sum:', membership['D1'].sum())
# print('U1.sum:', membership['U1'].sum())
###test

To be honest, I did not fully understand this block, and the paper does not really explain it either; roughly speaking, it extracts the significant (representative) topological features. I also suspect the original code has a problem: run unmodified, the later part of the script never produces any results. After my modification it does produce output, although I am not sure it is entirely correct. The change is in functions.py (and the corresponding call in test_single_graph.py has to pass an extra argument); it is marked with the rows of exclamation marks below.

# original version
# def identify_membership(pick):
#########################!!!!!!!!!!!!!!!
def identify_membership(pick, clust):
    """
    Given a set of input parameters, re-run Mapper and create a dataframe that
    identifies which participants belong to which features. Note that this
    function retains significant features only.
    """
    # Re-run Mapper
    print('Re-run Mapper...')
    print("functions中pick['params'].keys():",pick['params'].keys())
    print("functions中len(pick['X']):",len(pick['X']))
    print("functions中len(pick['params']['filters']:",len(pick['params']['filters']))
    # original version!!!!!!!!!!!!!
    # mapper = MapperComplex(**pick['params']).fit(pick['X'])
    # end of original version!!!!!!!!!!!!!!
    #test!!!!!!!!!!!!!!!!
    mapper = MapperComplex(**pick['params'], clustering=clust).fit(pick['X'])
    #test!!!!!!!!!!!!!!!
    # Get nodes for each feature
    print('Get nodes for each feature...')
    nodes = {k: v['indices'] for k, v in mapper.node_info_.items()}
    print('Description of the re-run mapper:')
    dg(mapper)
    print('nodes: ')
    print(nodes)
    # Get node memberships for each participant
    pid = {}
    for i in range(np.shape(pick['X'])[0]):
        pid[i] = {}
        pid[i]['nodes'] = []
        for k, v in nodes.items():
            if i in v:
                pid[i]['nodes'].append(k)
    # Get feature memberships
    feature_types = ['downbranch', 'upbranch', 'connected_component', 'loop']
    for i in feature_types:
        id = 1
        for f in pick['sbnd'][i]:
            if not all([len(i) == 0 for i in pick['sbnd'][i]]):
                # Only compute feature types that exist
                feature = i[0].upper() + str(id)
                id += 1
                for pk, pv in pid.items():
                    if len(set(f).intersection(pv['nodes'])) > 0:
                        pid[pk][feature] = True
                    else:
                        pid[pk][feature] = False
    membership = pd.DataFrame.from_dict(pid, orient='index')
    return(membership)
Back in test_single_graph.py, the final block counts and saves the results:

feature_counts = count_features(rep_feat)
feature_labels = ['n_feat',
                  'n_sig',
                  'downbranch',
                  'upbranch',
                  'conn',
                  'loop',
                  'sig_downbranch',
                  'sig_upbranch',
                  'sig_conn',
                  'sig_loop']
results = {}
for c, l in zip(feature_counts, feature_labels):
    results[l] = c
results['n_nodes'] = len(M['map'].node_info_.items())
results['memb'] = membership
results['params'] = current

with open(os.path.join(outputs, i + '.pickle'), 'wb') as f:
    pickle.dump(results, f)

The final block saves the statistics computed for the features in the previous step to a pickle in the outputs folder.
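Once a graph has been processed, the corresponding output pickle can be inspected in the same way; a short sketch (my addition), assuming graph 0333 has been run:

import pickle

with open('outputs/0333.pickle', 'rb') as f:
    results = pickle.load(f)

print(results['n_feat'], 'features,', results['n_sig'], 'significant')
print(results['n_nodes'], 'nodes in the Mapper graph')
print(results['memb'].head())   # per-participant feature membership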

process_outputs.py: code

# Title:        Process outputs each Mapper graph
# Author:       Ewan Carr

import os
import pickle
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn_tda import MapperComplex
from functions import gini
from statsmodels.stats.proportion import proportion_confint
import statsmodels.stats.api as sms
from joblib import Parallel, delayed, dump, load
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score, make_scorer
import xgboost as xgb
from scipy.stats import ks_2samp, chi2_contingency
from sklearn.preprocessing import scale
import xlsxwriter
import tda.helpers as th
import tda.mapper as tm
import tda.plotting as tp

debug=1


# Load baseline/outcome variables ---------------------------------------------
X = pd.read_csv(os.path.join('inputs', 'X.csv'))
gower = pd.read_csv(os.path.join('inputs', 'gower.csv'), header=None)
ycont = pd.read_csv(os.path.join('inputs', 'ycont.csv'),
                    header=None, names=['ycont'])
ybin = pd.read_csv(os.path.join('inputs', 'ybin.csv'),
                   header=None, names=['ybin'])

# Load all Mapper graphs ------------------------------------------------------
outdir = 'outputs'
res_all = {}
for f in tqdm(os.listdir(outdir)):
    p = os.path.join(outdir, f)
    res_all[f[0:4]] = pickle.load(open(p, 'rb'))

# print('res_all:')
# print(res_all)

# Remove graphs with no significant features ----------------------------------
print('All graphs:', len(res_all))
res = {k: v for k, v in res_all.items() if v['n_sig'] > 0}
print('Graphs with >0 significant features:', len(res))

###test
# print('res.item:')
# print(res.items())
# for k, v in res.items():
#     feat = v['memb'].columns[1:]
#     print('feat: ',feat)
    # print('D1.sum:', v['memb']['D1'].sum())
    # print('U1.sum:', v['memb']['U1'].sum())
    # print("v['memb']['D1']:")
    # print(v['memb']['D1'])
###test

# Split graphs into separate features -----------------------------------------
features = []
for k, v in res.items():
    feat = v['memb'].columns[1:]
    for f in feat:
        other_features = [x for x in feat
                          if x != f
                          if v['memb'][x].sum() > 21]
        if v['params']['res'][1] == 'auto':
            clust = 'DBSCAN (auto)'
        else:
            clust = type(v['params']['res'][1]).__name__
        features.append({'graph': k,
                         'feat': f,
                         'memb': v['memb'][f],
                         'other': v['memb'][other_features],
                         'n': v['memb'][f].sum(),
                         'fil': v['params']['fil'][0],
                         'res': v['params']['res'][0],
                         'gain': v['params']['gain'],
                         'cluster': clust})
print('Number of features =', len(features))

# Remove features with <5% or >95% of the sample ------------------------------
features = [f for f in features
            if (f['n'] > (0.05 * len(f['memb'])))
            if (f['n'] < (0.95 * len(f['memb'])))]
print('Number of features(<5% or >95%) =', len(features))
# print(features)

# Calculate homogeneity -------------------------------------------------------

# Homogeneity for sample
gini_samp = gini(ybin.mean())[0]
print('gini_samp: ', gini_samp)
std_samp = ycont.std()[0]  # sample standard deviation
print('std_samp: ', std_samp)

# Homogeneity for each feature
for f in tqdm(features):
    if debug:
        print('f')
        print(f)
    # Binary outcome (ybin)
    p = pd.concat([f['memb'], ybin], axis=1) \
        .groupby(f['feat'])['ybin'] \
        .mean()[1]
    gini_feat = gini(p)
    gini_pct = ((gini_samp - gini_feat) / gini_samp) * 100
    # Continuous outcome (ycont)
    std_feat = pd.concat([f['memb'],
                          ycont],
                         axis=1).groupby(f['feat'])['ycont'].std()[1]
    std_pct = ((std_samp - std_feat) / std_samp) * 100
    # Store
    f['homog'] = {'gini': gini_feat,
                  'gini_pct': gini_pct,
                  'std': std_feat,
                  'std_pct': std_pct}


# Calculate differences -------------------------------------------------------
def calculate_distances(f, y):
    d = {}
    # Get mean/proportion of sample and feature
    d_feat = float(y[f['memb']].mean())
    d_restofsamp = float(y[~f['memb']].mean())
    d['mean_feat'] = d_feat
    d['mean_samp'] = d_restofsamp
    # Calculate difference
    d['diff_samp'] = float(d_feat - d_restofsamp)
    # Calculate difference to each other feature
    for o in f['other']:
        d['diff_' + o] = float(d_feat - y[f['other'][o]].mean())
    # Calculate maximum distance to all other features
    dmax = 0
    for k in ['diff_' + x for x in list(f['other'])]:
        if abs(d[k]) > dmax:
            dmax = abs(d[k])
    d['diff_max'] = dmax
    return(d)


def get_dist(f):
    dist = {}
    dist['ybin'] = calculate_distances(f, ybin)
    dist['ycont'] = calculate_distances(f, ycont)
    dist['n_other'] = len(list(f['other']))
    f['dist'] = dist
    return(f)


features = Parallel(n_jobs=8)(delayed(get_dist)(f) for f in features)
dump(features, 'features.joblib')

# Remove duplicate features ---------------------------------------------------
for f, i in zip(features, range(len(features))):
    f['n_other'] = f['dist']['n_other']
    f['max_ybin'] = f['dist']['ybin']['diff_max']
    f['max_ycont'] = f['dist']['ycont']['diff_max']
    f['gini'] = f['homog']['gini']
    f['gini_pct'] = f['homog']['gini_pct']
    f['std'] = f['homog']['std']
    f['std_pct'] = f['homog']['std_pct']
    f['id'] = i
feature_summary = pd.DataFrame.from_dict(features)[['id',
                                                    'graph', 'feat',
                                                    'n', 'gini', 'std',
                                                    'max_ybin',
                                                    'max_ycont']]
unique = feature_summary.drop_duplicates(subset=['n', 'gini', 'std'])
unique_features = [f for f in features if f['id'] in unique['id']]
print('features:', len(features))
print('unique:', len(unique_features))

# Function to create summary table listing all features -----------------------
def tab(lof, sort=None, ascending=False):
    cols = ['id', 'graph', 'feat', 'n', 'gini', 'gini_pct',
            'std', 'std_pct', 'max_ybin', 'max_ycont', 'n_other',
            'fil', 'res', 'gain', 'cluster']
    cols = cols + ['overlap' + str(i) for i in range(5)]
    sr = []
    for l in lof:
        row = []
        for k in cols:
            row.append(l.get(k))
        sr.append(row)
    results_table = pd.DataFrame(sr, columns=cols)
    to_round = ['gini', 'gini_pct', 'std',
                'std_pct', 'max_ybin', 'max_ycont']
    results_table[to_round] = results_table[to_round].round(2)
    results_table.set_index(['graph', 'feat'], inplace=True)
    results_table.columns = pd.MultiIndex.from_product([['summary'],
                                                        results_table.columns])
    if sort:
        return(results_table.sort_values(('summary', sort),
                                         ascending=ascending))
    else:
        return(results_table)

# ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
# ┃                                                                           ┃
# ┃                           Select top N features                           ┃
# ┃                                                                           ┃
# ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

# Outcome 1: continuous -------------------------------------------------------
gini = [f['homog']['gini'] for f in features]
gini_pct = [f['homog']['gini_pct'] for f in features]
# NOTE: gini_pct refers to the percentage *reduction* in Gini

# Sort by homogeneity
top20 = np.sort(gini_pct)[::-1][np.min([len(gini_pct) - 1, 20])]
top_ybin = [f for f in features if f['homog']['gini_pct'] >= top20]

# Outcome 2: binary -----------------------------------------------------------
std = [f['homog']['std'] for f in features]
std_pct = [f['homog']['std_pct'] for f in features]

# Sort by homogeneity
top20 = np.sort(std_pct)[::-1][np.min([len(gini_pct) - 1, 20])]
top_ycont = [f for f in features if f['homog']['std_pct'] >= top20]

tab(features, 'gini_pct')
tab(top_ybin, 'gini_pct')
tab(top_ycont, 'std_pct')

# Make table summarising all features -----------------------------------------
af = tab(features, 'gini_pct')
af.columns = [i[1] for i in af.columns]
af['gini_samp'] = gini_samp
af['std_samp'] = std_samp
af.to_excel('all_features.xlsx', merge_cells=False)

# Tabulate distances for each feature -----------------------------------------
diffs = []
for f in features:
    tf = []
    for o in ['ybin', 'ycont']:
        d = pd.DataFrame({'graph': f['graph'],
                          'feat': f['feat'],
                          **f['dist'][o]}, index=[0])
        d.set_index(['graph', 'feat'], inplace=True)
        d.columns = pd.MultiIndex.from_product([[o], d.columns])
        tf.append(d)
    diffs.append(pd.concat(tf, axis=1))
diffs = pd.concat(diffs, axis=0)


# Get mean differences in baseline clinical variables -------------------------
def get_stat(x):
    if len(np.unique(x)) > 2:
        low, high = sms.DescrStatsW(x).tconfint_mean()
        return('{:.2f}'.format(np.mean(x)) + ' [' +
               '{:.2f}'.format(low) + ', ' +
               '{:.2f}'.format(high) + ']')
    else:
        x = x - (np.max(x) - 1)
        low, high = proportion_confint(sum(x), len(x))
        return('{:.1f}'.format(np.mean(x) * 100) + '% [' +
               '{:.1f}'.format(low * 100) + ', ' +
               '{:.1f}'.format(high * 100) + ']')


def chi2(df, col1, col2):
    return chi2_contingency([
        [
            len(df[(df[col1] == cat) & (df[col2] == cat2)])
            for cat2 in range(int(df[col1].min()), int(df[col1].max()) + 1)
        ]
        for cat in range(int(df[col2].min()), int(df[col2].max()) + 1)
    ])


def get_fi(X, y):
    clf = xgb.XGBClassifier(eval_metric='logloss',
                            use_label_encoder=False)
    clf.fit(scale(X), y)
    auc = np.mean(cross_val_score(clf, X, y, cv=10,
                                  scoring=make_scorer(roc_auc_score)))
    return((auc,
            pd.Series(clf.feature_importances_,
                      name='xgb_fi', index=list(X))))


def draw_graph(i, path):
    print('Running Mapper for ' + i + '...')
    inputs = res_all[i]['params']
    fil_label, fil = inputs['fil']
    if fil.ndim == 1:
        fil = fil.reshape(-1, 1)
    n_filters = np.shape(fil)[1]
    res, clust = inputs['res']
    print(fil_label, np.shape(fil))
    gain = inputs['gain']
    params = {'inp': 'distance matrix',
              'filters': fil,
              'filter_bnds': np.array(np.nan * n_filters),
              'colors': fil,
              'resolutions': np.array([res] * n_filters),
              'gains': np.array([gain] * n_filters)}
    if clust == 'auto':
        m = MapperComplex(**params).fit(gower.values)
    else:
        m = MapperComplex(**params, clustering=clust).fit(gower.values)
    M = {'fil': fil,
         'X': gower.values,
         'map': m,
         'params': params}
    bootstrap = th.representative_features(M,
                                           confidence=0.9,
                                           bootstrap=100,
                                           inp="distance matrix")
    M['bootstrap'] = bootstrap
    P = tp.graph_features(M['bootstrap'])
    P.graph_attr['overlap'] = 'scale'
    P.graph_attr['font'] = 'Calibri'
    P.graph_attr['sep'] = 1
    P.graph_attr['splines'] = 'true',
    P.draw(path, prog="neato")


# Combine distances with feature summaries ------------------------------------
opts = {'ybin': {'top': top_ybin,
                 'sort': 'gini_pct'},
        'ycont': {'top': top_ycont,
                  'sort': 'std_pct'}}

# Load CSV specifying which baseline variables are categorical ----------------
categ = pd.read_csv('categorical_items.csv')[['index', 'categorical']]
categ.set_index('index', inplace=True)
categ = categ['categorical'].apply(lambda x: True if x == 1 else False)

# Get required summaries for each of the top-ranked features ------------------
for k, v in opts.items():
    # Summary of this set of features
    v['sumstat'] = tab(v['top'], v['sort']) \
        .merge(diffs,
               left_index=True,
               right_index=True,
               how='inner').round(2).reset_index().fillna('--') \
        .sort_values(('summary', v['sort']), ascending=False)
    # For each feature in this top 20
    for f in tqdm(v['top']):
        pd.concat([f['memb'], ybin, ycont], axis=1)
        # Get means/proportions for baseline variables
        baseline_diffs = pd.concat([f['memb'], X], axis=1) \
            .groupby(f['feat']) \
            .agg(get_stat).T
        # Add means of outcomes (ybin, ycont)
        oc = pd.concat([f['memb'], ybin, ycont], axis=1)
        oc = oc.groupby(oc.iloc[:, 0])[['ybin', 'ycont']] \
            .agg(get_stat).T
        baseline_diffs = pd.concat([baseline_diffs, oc])
        # Get two-sample KS test (continuous)/ chi-square test (categorical)
        XM = pd.concat([X, f['memb'], ybin, ycont], axis=1)
        XM.rename(columns={f['feat']: 'memb'}, inplace=True)
        for i in XM:
            if i != 'memb':
                if categ[i]:
                    c, p = chi2_contingency(pd.crosstab(XM[i],
                                                        XM['memb']))[0:2]
                    cell = 'Chi2 = {:.5f}; pval={:.5f}'.format(c, p)
                    baseline_diffs.loc[i, 'ks_chi'] = cell
                else:
                    d1 = XM[~XM.memb][i].values
                    d2 = XM[XM.memb][i].values
                    stat, pval = ks_2samp(d1, d2)
                    cell = 'KS = {:.5f}; pval={:.5f}'.format(stat, pval)
                    baseline_diffs.loc[i, 'ks_chi'] = cell
        # Get XGB AUC and feature importances
        auc, fi = get_fi(X, f['memb'])
        f['auc'] = auc
        # Combine with baseline differences
        fi = pd.concat([fi], axis=1)
        fi.columns = ['fi']
        with_fi = baseline_diffs \
            .merge(fi,
                   left_index=True,
                   right_index=True,
                   how='left').reset_index()
        f['with_fi'] = with_fi
        # Get list of feature members (based on row number of input dataset)
        f['memb_col'] = pd.DataFrame({'row ID': f['memb'].index[f['memb']]})
    # Sort the features
    v['top'] = sorted(v['top'], key=lambda k: k[v['sort']], reverse=True)

# Generate Excel documents ----------------------------------------------------
if not os.path.exists('examples'):
    os.mkdir('examples')
for k, v in opts.items():
    wb = xlsxwriter.Workbook(os.path.join('examples', 
                                          k + '.xlsx'),
                             {'nan_inf_to_errors': True})
    bigfont = wb.add_format()
    bigfont.set_font_size(18)
    bold = wb.add_format()
    bold.set_bold()
    for n, f in enumerate(v['top']):
        sid = '_'.join([f['graph'], f['feat']])
        ws = wb.add_worksheet(str(n + 1) + ' | ' + sid)
        ws.write(0, 0, sid, bigfont)
        ws.set_row(0, 25)
        for i, c in enumerate(f['with_fi']):
            ws.write_column(3, i, f['with_fi'][c])
        ws.write_row(2, 0, list(with_fi))
        # Write AUC
        ws.write('E2', 'AUC: ' + '{:.10f}'.format(f['auc']), bold)
        # Add feature summaries
        ss = v['sumstat']
        for s, r in zip(['summary', 'ybin', 'ycont'],
                        [2, 25, 48]):
            ws.write_column(r + 1, 7, ss['graph'])
            ws.write_column(r + 1, 8, ss['feat'])
            for i, col in enumerate(ss[(s, )]):
                ws.write_column(r + 1, 8 + i, ss[s][col])
            ws.write_row(r, 8, list(ss[(s, )]))
        # Add formatting/labels
        ws.conditional_format('E4:E141', {'type': '3_color_scale'})
        ws.set_column('A:E', 15)
        ws.set_column('D:E', 25)
        ws.write('A2', 'Baseline differences', bold)
        ws.write('H2', 'Feature summaries', bold)
        ws.write('H25', 'ybin: Distances to other features, means',
                 bold)
        ws.write('H48', 'ycont: Distance to other features, means', bold)
        ws.write_row(2, 0,
                     [' ', 'Non-member',
                      'Member', 'KS/Chi2',
                      'XGB gain'],
                     bold)
        # Add Mapper graph
        figdir = os.path.join('figures', 'for_excel')
        if not os.path.exists(figdir):
            os.mkdir(figdir)
        p = os.path.join(figdir, f['graph'] + '.png')
        if not os.path.exists(p):
            draw_graph(f['graph'], p)
        ws.insert_image('H71', p)
        # Add column with IDs of cluster members
        ws.write(2, 28, 'Cluster members (based on row numbers of input dataset)')
        ws.write_column(3, 28, f['memb_col'].values)
    wb.close()

Each part of the code above is already commented section by section, so I will not break it down further.
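For reference, the files this script produces are features.joblib (a cache of the per-feature summaries), all_features.xlsx, one Excel workbook per outcome under examples/ (ybin.xlsx and ycont.xlsx), and the Mapper graph images under figures/for_excel/. Reloading the cache is straightforward (my own sketch):

from joblib import load

features = load('features.joblib')
print(len(features), 'features cached')
print(sorted(features[0].keys()))   # e.g. 'graph', 'feat', 'memb', 'homog', 'dist', ...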
