MachineLearning 31. Machine Learning: RNA-seq-Based Gene Feature Selection (GeneSelectR)

Latest recommended article published on 2024-05-15 11:18:06

桓峰基因

Tags: machine learning, artificial intelligence

Article link: https://blog.csdn.net/weixin_41368414/article/details/137096336

Introduction

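RNA-seq datasets pose a considerable challenge when it comes to identifying biologically relevant features for downstream analysis and data mining. The standard approach is differential gene expression (DGE) analysis, but because it is univariate, its effectiveness can be limited by the data. For complex datasets, an alternative is to use machine learning (ML) tools, which try to capture non-linear relationships between features and focus on generalizability rather than statistical significance. This approach yields multiple feature lists that may perform similarly on classification metrics. There is therefore a need for a cohesive workflow that integrates robust feature selection across different ML methods while also assessing the biological relevance of the resulting feature lists. Considering both sets of criteria makes it possible to prioritize the best-performing lists.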

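This post introduces the GeneSelectR package, which combines machine learning with bioinformatics data-mining methods to improve feature selection. With GeneSelectR, features can be selected from a normalized RNA-seq dataset using a variety of ML methods and user-defined parameters. Biological relevance is then assessed through Gene Ontology (GO) enrichment analysis, followed by semantic similarity analysis of the resulting GO terms. In addition, similarity coefficients and fractions of the GO terms of interest are computed.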

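GeneSelectR thus optimizes ML performance and rigorously assesses the biological relevance of the various lists, offering a way to prioritize feature lists with respect to the biological question. When applied to the TCGA-BRCA dataset, the GeneSelectR workflow produced several feature lists using different ML methods and DGE analysis. The functions in GeneSelectR allow these lists to be evaluated on both ML performance and biological relevance. This comprehensive evaluation helps select the best-performing lists: those that combine strong ML performance with high relevance to the biological question while keeping the number of features manageable and highly specific.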

Feature Selection Workflow

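Core functionality: GeneSelectR performs gene selection with a variety of methods and evaluates their performance through cross-validation. It also supports hyperparameter tuning, permutation feature importance calculation, and more.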

Package Installation

First install the GeneSelectR package:

# install.packages("devtools")
devtools::install_github("dzhakparov/GeneSelectR")

Install Anaconda3 on Windows 11

Download: https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2019.10-Windows-x86_64.exe

Configure the GeneSelectR Environment

Quite a few Python packages have to be installed, which takes a while, so be patient.

GeneSelectR::configure_environment()

Conda is installed.
The conda environment GeneSelectR_env does not exist. Do you want to create it? 

1: yes
2: no

Selection: 1
Creating conda environment and installing required packages
+ "D:/Program Files/Anaconda3/condabin/conda.bat" "create" "--yes" "--name" "GeneSelectR_env" "python=3.8" "numpy <= 1.19" "scikit-learn <= 0.22.1" "pandas" "boruta_py" "scikit-optimize" "--quiet" "-c" "conda-forge"
WARNING: A space was detected in your requested environment path
'D:\Program Files\Anaconda3\envs\GeneSelectR_env'
Spaces in paths can sometimes be problematic.
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: D:\Program Files\Anaconda3\envs\GeneSelectR_env

  added / updated specs:
    - boruta_py
    - numpy[version='<=1.19']
    - pandas
    - python=3.8
    - scikit-learn[version='<=0.22.1']
    - scikit-optimize

  The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    boruta_py-0.3              |             py_0          51 KB  conda-forge
    brotli-1.1.0               |       hcfcfb64_1          19 KB  conda-forge
    brotli-bin-1.1.0           |       hcfcfb64_1          20 KB  conda-forge
    bzip2-1.0.8                |       hcfcfb64_5         122 KB  conda-forge
    ca-certificates-2024.2.2   |       h56e8100_0         152 KB  conda-forge
    certifi-2024.2.2           |     pyhd8ed1ab_0         157 KB  conda-forge
    cycler-0.12.1              |     pyhd8ed1ab_0          13 KB  conda-forge
    fonttools-4.50.0           |   py38h91455d4_0         1.8 MB  conda-forge
    freetype-2.12.1            |       hdaf720e_2         498 KB  conda-forge
    intel-openmp-2024.0.0      |   h57928b3_49841         2.2 MB  conda-forge
    joblib-1.3.2               |     pyhd8ed1ab_0         216 KB  conda-forge
    kiwisolver-1.4.5           |   py38hb1fd069_1          54 KB  conda-forge
    lcms2-2.16                 |       h67d730c_0         496 KB  conda-forge
    lerc-4.0.0                 |       h63175ca_0         190 KB  conda-forge
    libblas-3.9.0              |     21_win64_mkl         4.8 MB  conda-forge
    libbrotlicommon-1.1.0      |       hcfcfb64_1          69 KB  conda-forge
    libbrotlidec-1.1.0         |       hcfcfb64_1          32 KB  conda-forge
    libbrotlienc-1.1.0         |       hcfcfb64_1         241 KB  conda-forge
    libcblas-3.9.0             |     21_win64_mkl         4.8 MB  conda-forge
    libdeflate-1.20            |       hcfcfb64_0         152 KB  conda-forge
    libffi-3.4.2               |       h8ffe710_5          41 KB  conda-forge
    libhwloc-2.9.3             |default_haede6df_1009         2.5 MB  conda-forge
    libiconv-1.17              |       hcfcfb64_2         621 KB  conda-forge
    libjpeg-turbo-3.0.0        |       hcfcfb64_1         804 KB  conda-forge
    liblapack-3.9.0            |     21_win64_mkl         4.8 MB  conda-forge
    libpng-1.6.43              |       h19919ed_0         339 KB  conda-forge
    libsqlite-3.45.2           |       hcfcfb64_0         849 KB  conda-forge
    libtiff-4.6.0              |       hddb2be6_3         769 KB  conda-forge
    libwebp-base-1.3.2         |       hcfcfb64_0         263 KB  conda-forge
    libxcb-1.15                |       hcd874cb_0         947 KB  conda-forge
    libxml2-2.12.6             |       hc3477c8_1         1.6 MB  conda-forge
    libzlib-1.2.13             |       hcfcfb64_5          54 KB  conda-forge
    m2w64-gcc-libgfortran-5.3.0|                6         342 KB  conda-forge
    m2w64-gcc-libs-5.3.0       |                7         520 KB  conda-forge
    m2w64-gcc-libs-core-5.3.0  |                7         214 KB  conda-forge
    m2w64-gmp-6.1.0            |                2         726 KB  conda-forge
    m2w64-libwinpthread-git-5.0.0.4634.697f757|                2          31 KB  conda-forge
    matplotlib-base-3.5.1      |   py38h1f000d6_0         7.3 MB  conda-forge
    mkl-2024.0.0               |   h66d3029_49657       103.5 MB  conda-forge
    msys2-conda-epoch-20160418 |                1           3 KB  conda-forge
    munkres-1.1.4              |     pyh9f0ad1d_0          12 KB  conda-forge
    numpy-1.19.0               |   py38h72c728b_0         4.9 MB  conda-forge
    openjpeg-2.5.2             |       h3d672ee_0         232 KB  conda-forge
    openssl-3.2.1              |       hcfcfb64_1         7.8 MB  conda-forge
    packaging-24.0             |     pyhd8ed1ab_0          49 KB  conda-forge
    pandas-1.4.1               |   py38h5d928e2_0        11.0 MB  conda-forge
    pillow-10.2.0              |   py38hc375fad_0        39.7 MB  conda-forge
    pip-24.0                   |     pyhd8ed1ab_0         1.3 MB  conda-forge
    pthread-stubs-0.4          |    hcd874cb_1001           6 KB  conda-forge
    pthreads-win32-2.9.1       |       hfa6e2cd_3         141 KB  conda-forge
    pyaml-23.12.0              |     pyhd8ed1ab_0          26 KB  conda-forge
    pyparsing-3.1.2            |     pyhd8ed1ab_0          87 KB  conda-forge
    python-3.8.19              |h4de0772_0_cpython        15.3 MB  conda-forge
    python-dateutil-2.9.0      |     pyhd8ed1ab_0         218 KB  conda-forge
    python_abi-3.8             |           4_cp38           7 KB  conda-forge
    pytz-2024.1                |     pyhd8ed1ab_0         184 KB  conda-forge
    pyyaml-6.0.1               |   py38h91455d4_1         148 KB  conda-forge
    scikit-learn-0.22.1        |   py38h7208079_1         6.2 MB  conda-forge
    scikit-optimize-0.9.0      |     pyhd8ed1ab_1          74 KB  conda-forge
    scipy-1.8.0                |   py38ha1292f7_1        27.2 MB  conda-forge
    setuptools-69.2.0          |     pyhd8ed1ab_0         460 KB  conda-forge
    six-1.16.0                 |     pyh6c4a22f_0          14 KB  conda-forge
    tbb-2021.11.0              |       h91493d7_1         158 KB  conda-forge
    tk-8.6.13                  |       h5226925_1         3.3 MB  conda-forge
    ucrt-10.0.22621.0          |       h57928b3_0         1.2 MB  conda-forge
    unicodedata2-15.1.0        |   py38h91455d4_0         362 KB  conda-forge
    vc-14.3                    |      hcf57466_18          17 KB  conda-forge
    vc14_runtime-14.38.33130   |      h82b7239_18         732 KB  conda-forge
    vs2015_runtime-14.38.33130 |      hcb4865c_18          17 KB  conda-forge
    wheel-0.43.0               |     pyhd8ed1ab_0          57 KB  conda-forge
    xorg-libxau-1.0.11         |       hcd874cb_0          50 KB  conda-forge
    xorg-libxdmcp-1.1.3        |       hcd874cb_0          66 KB  conda-forge
    xz-5.2.6                   |       h8d14728_0         213 KB  conda-forge
    yaml-0.2.5                 |       h8ffe710_2          62 KB  conda-forge
    zstd-1.5.5                 |       h12be248_0         335 KB  conda-forge
    ------------------------------------------------------------
                                           Total:       263.7 MB

The below NEW packages will be INSTALLED:

Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Please restart your R session for the changes to take effect.

Every analysis session must begin by setting the correct conda environment before GeneSelectR is loaded.

GeneSelectR::set_reticulate_python()
# Set RETICULATE_PYTHON to D:\Program
# Files\Anaconda3\envs\GeneSelectR_env/python.exe for the current R session.
library(GeneSelectR)
# rest of your code

When the GeneSelectR package is loaded, it reports that additional dependencies are required; choose 1 (Yes) to install them.

The following Bioconductor packages are required for full functionality of GeneSelectR: simplifyEnrichment.
Do you want to install them now? 

1: Yes
2: No
Selection: 1
'getOption("repos")' replaces Bioconductor standard repositories, see 'help("repositories",
package = "BiocManager")' for details.
Replacement repositories:
    CRAN: https://mirrors.tuna.tsinghua.edu.cn/CRAN/
Bioconductor version 3.18 (BiocManager 1.30.22), R 4.3.1 (2023-06-16 ucrt)
Installing package(s) 'simplifyEnrichment'
also installing the dependencies ‘NLP’, ‘tm’, ‘org.Hs.eg.db’, ‘slam’, ‘proxyC’

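Make Sure You Have Enough Resources

I ran this on a workstation with all cores in use, and it took roughly one hour to produce results, even though this is only a test dataset. If you plan to use this package, make sure you have sufficient computing resources.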

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:   44.2s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  2.2min finished
Performing Permuation Importance Calculation
Fitting the data split: 5 
Fitting pipeline for Lasso feature selection method
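Since njobs = -1 occupies every core, it can be useful to check how many cores are available before starting. The following is a minimal sketch using the parallel package that ships with R. The helper pick_njobs() and the choice to leave one core free are illustrative assumptions, not part of GeneSelectR, and it assumes a positive njobs value is treated as the number of parallel workers, as the log above suggests.

```r
# Hypothetical helper (not part of GeneSelectR): choose a value for `njobs`
# that leaves `reserve` cores free for the rest of the system.
pick_njobs <- function(reserve = 1L) {
  n_cores <- parallel::detectCores()   # may return NA on some platforms
  if (is.na(n_cores)) {
    return(1L)                         # conservative fallback
  }
  max(1L, n_cores - reserve)           # never go below one worker
}

njobs <- pick_njobs()
cat("Detected cores:", parallel::detectCores(), "- suggested njobs:", njobs, "\n")
```

The resulting value could then be passed as njobs instead of -1 when the machine must stay responsive during the run.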

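Loading the Data

The data matrix should be a data frame with samples as rows and genes as columns.

The dataset is a bulk RNA-seq dataset obtained from blood samples of 149 African children, who were grouped into children with atopic dermatitis (AD) and healthy controls (HC). The full dataset also contains a stratification variable for the children's location (urban or rural). This subset contains only the Urban samples. Columns represent genes and rows represent samples.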

load("./GeneSelectR-master/tests/testthat/fixtures/UrbanRandomSubset.rda")
head(UrbanRandomSubset[, 1:10])
##         treatment ENSG00000174371__EXO1 ENSG00000123600__METTL8
## CA002YF  Urban_AD          2.984508e+00            1.816292e+00
## CA009ST  Urban_AD           0.565690377             2.010920739
## CA010EB  Urban_AD          3.6708983671            3.1615995794
## CA011LQ  Urban_AD          1.408499e+00            2.674512e+00
## CA014LB  Urban_AD          2.5183409223            1.8185148529
## CA015AM  Urban_AD          2.6664729139            3.0494281366
##         ENSG00000154124__OTULIN ENSG00000006607__FARP2 ENSG00000135686__KLHL36
## CA002YF            4.959833e+00           4.125659e+00            6.584945e+00
## CA009ST             4.601555569            4.873391176             6.677867796
## CA010EB            4.6202078459           4.8954199262            6.0682238886
## CA011LQ            3.909325e+00           3.852403e+00            6.041770e+00
## CA014LB            3.8273512783           4.6382172341            6.3209867192
## CA015AM            4.4946295998           4.4197758295            5.9998300955
##         ENSG00000130348__QRSL1 ENSG00000268041__AC010616.1
## CA002YF           4.298858e+00                3.150753e+00
## CA009ST            4.043031406                 3.090960290
## CA010EB           4.4417887589                3.3511595489
## CA011LQ           4.185302e+00                3.049633e+00
## CA014LB           4.3729324835                2.6071577494
## CA015AM           4.6819749406                2.6885696358
##         ENSG00000163812__ZDHHC3 ENSG00000233041__PHGR1
## CA002YF            4.617841e+00          -2.934457e+00
## CA009ST             5.161640677           -2.934457404
## CA010EB            4.9131957592          -2.9344574043
## CA011LQ            5.061451e+00          -2.318961e+00
## CA014LB            5.1488088821          -2.9344574043
## CA015AM            5.0960550151          -2.9344574043
table(UrbanRandomSubset$treatment)
## 
##      Urban_AD Urban_Healthy 
##            31            29
library(dplyr)
### Feature Selection Procedure Basic Usage
X <- UrbanRandomSubset %>%
    dplyr::select(-treatment)  # get the feature matrix
y <- UrbanRandomSubset["treatment"]  # store the data point label in a separate vector
y <- as.factor(y[, 1])
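Expression matrices are often stored with genes as rows and samples as columns, which is the opposite of the layout described above. The sketch below shows one way to bring such a matrix into samples-by-genes form with base R. The toy matrix, the gene and sample names, and the variables X_toy and y_toy are all made up for illustration.

```r
# Toy expression matrix in the common genes-by-samples orientation
# (gene names, sample names, and values are invented for this example).
expr <- matrix(
  c(2.9, 0.6, 3.7,
    1.8, 2.0, 3.2),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("GENE_A", "GENE_B"), c("S1", "S2", "S3"))
)
labels <- c(S1 = "AD", S2 = "AD", S3 = "Healthy")

# Transpose so that rows are samples and columns are genes
X_toy <- as.data.frame(t(expr))

# Align the labels to the row order of X_toy before converting to a factor
y_toy <- factor(labels[rownames(X_toy)])

# The feature matrix and the label vector must describe the same samples
stopifnot(nrow(X_toy) == length(y_toy))
print(X_toy)
print(table(y_toy))
```

Matching the labels by sample name, instead of relying on position, guards against silently misaligned labels.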

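Worked Example

This step takes a long time to run, so plan accordingly. njobs = -1 means that all cores are used; I had 16 cores and it took about an hour. Without adequate resources it is not worth attempting; the run stalled on me several times.

Alternatively, save the result object as soon as it is produced so that it can be reloaded later without waiting again.

Default settings: if none are provided, default feature selection methods and hyperparameter grids are used. By default, there are four feature selection methods:

1. Univariate feature selection;

2. Logistic regression with L1 penalty;

3. Boruta;

4. Random Forest.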

selection_results <- GeneSelectR(X = X, y = y, njobs = -1,  # -1: all cores will be used
                                 calculate_permutation_importance = TRUE,
                                 perform_test_split = TRUE)  # needed for test_metrics
saveRDS(selection_results, file = 'selection_results.rds')  # save as RDS

Once saved, the result object can simply be loaded again:

selection_results <- readRDS("selection_results.rds")  # load the RDS file
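The save-then-reload pattern can be wrapped so that the slow step runs only when no saved result exists yet. This is a minimal sketch in base R: cached() and slow_step() are hypothetical helpers, and slow_step() is only a stand-in for the GeneSelectR() call above.

```r
# Hypothetical helper: compute a result once and reuse the saved copy afterwards.
cached <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))      # reuse the saved object
  }
  result <- compute()          # the slow step runs only when no cache exists
  saveRDS(result, file = path)
  result
}

# Stand-in for the slow call; in the real analysis this would wrap
# the GeneSelectR() call.
slow_step <- function() {
  Sys.sleep(0.1)
  list(value = sum(1:10))
}

cache_file <- file.path(tempdir(), "selection_results_demo.rds")
first  <- cached(cache_file, slow_step)   # computes and saves (if not cached yet)
second <- cached(cache_file, slow_step)   # loads from disk
stopifnot(identical(first, second))
```

Delete the cache file whenever the input data or the parameters change, otherwise the stale result is returned.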

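Interpreting the Results

The Result Object

The result object selection_results contains six components:

The function returns an object of class “PipelineResults”, containing:
‘best_pipeline’: A named list containing parameters of the best performer pipeline.

‘cv_results’: Cross-validation results for each pipeline.

‘inbuilt_feature_importance’: Aggregated inbuilt feature importance scores.

‘test_metrics’: A data frame of test metrics for each pipeline, returned if the perform_test_split argument is set to TRUE.

‘cv_mean_score’: A data frame summarizing the mean cross-validation scores.

‘permutation_importance’: Mean permutation importance scores, returned if permutation importance was calculated.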

Plotting Feature Importance

Finally, you can visualize the most important features found by each method by calling the plotting function:

plot_feature_importance(selection_results, top_n_features = 10)
## $Lasso

## 
## $Univariate

## 
## $RandomForest

## 
## $boruta

Machine Learning Performance Metrics

You can plot the metrics of the feature selection procedure with the following function:

plot_metrics(selection_results)

# or access it as a dataframe
selection_results@test_metrics
## # A tibble: 4 × 9
##   method       f1_mean  f1_sd recall_mean recall_sd precision_mean precision_sd
##   <chr>          <dbl>  <dbl>       <dbl>     <dbl>          <dbl>        <dbl>
## 1 Lasso          0.697 0.127        0.7      0.112           0.725       0.143 
## 2 RandomForest   0.694 0.103        0.683    0.109           0.761       0.0901
## 3 Univariate     0.762 0.146        0.75     0.156           0.811       0.135 
## 4 boruta         0.691 0.0837       0.683    0.0913          0.719       0.0683
## # ℹ 2 more variables: accuracy_mean <dbl>, accuracy_sd <dbl>
selection_results@cv_mean_score
##         method mean_score   sd_score
## 1       boruta  0.6285714 0.06298283
## 2        Lasso  0.6650000 0.05293672
## 3 RandomForest  0.6714286 0.03144074
## 4   Univariate  0.7014286 0.04755502
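To compare the methods at a glance, the score table can be sorted by mean cross-validation score. The sketch below rebuilds the cv_mean_score table from the values printed above so that it runs on its own; with the real object, selection_results@cv_mean_score would be used directly.

```r
# Values copied from the cv_mean_score output above
cv_mean_score <- data.frame(
  method     = c("boruta", "Lasso", "RandomForest", "Univariate"),
  mean_score = c(0.6285714, 0.6650000, 0.6714286, 0.7014286),
  sd_score   = c(0.06298283, 0.05293672, 0.03144074, 0.04755502),
  stringsAsFactors = FALSE
)

# Rank methods from highest to lowest mean cross-validation score
ranked <- cv_mean_score[order(-cv_mean_score$mean_score), ]
print(ranked)

best_method <- ranked$method[1]
cat("Highest mean CV score:", best_method, "\n")
```

The gaps between methods here are about the same size as the standard deviations, so the ranking is better read as a rough guide than as a clear winner.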

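Gene List Overlap Between Methods

You can also check how much the gene lists produced by the different feature selection methods overlap. To do so, use the following command: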

overlap <- calculate_overlap_coefficients(selection_results)
overlap
## $inbuilt_feature_importance_coefficient
## $inbuilt_feature_importance_coefficient$overlap
##              Lasso Univariate RandomForest boruta
## Lasso            1       1.00         1.00   1.00
## Univariate       1       1.00         0.96   0.78
## RandomForest     1       0.96         1.00   1.00
## boruta           1       0.78         1.00   1.00
## 
## $inbuilt_feature_importance_coefficient$jaccard
##              Lasso Univariate RandomForest boruta
## Lasso         1.00       0.52         0.94   0.18
## Univariate    0.52       1.00         0.52   0.25
## RandomForest  0.94       0.52         1.00   0.19
## boruta        0.18       0.25         0.19   1.00
## 
## $inbuilt_feature_importance_coefficient$soerensen
##              Lasso Univariate RandomForest boruta
## Lasso         1.00       0.68         0.97   0.31
## Univariate    0.68       1.00         0.68   0.40
## RandomForest  0.97       0.68         1.00   0.32
## boruta        0.31       0.40         0.32   1.00
## 
## 
## $permutation_importance_coefficients
## $permutation_importance_coefficients$overlap
##              Lasso Univariate RandomForest boruta
## Lasso         1.00       0.37         0.47   0.33
## Univariate    0.37       1.00         0.47   0.33
## RandomForest  0.47       0.47         1.00   0.33
## boruta        0.33       0.33         0.33   1.00
## 
## $permutation_importance_coefficients$jaccard
##              Lasso Univariate RandomForest boruta
## Lasso         1.00       0.17         0.18   0.03
## Univariate    0.17       1.00         0.26   0.05
## RandomForest  0.18       0.26         1.00   0.06
## boruta        0.03       0.05         0.06   1.00
## 
## $permutation_importance_coefficients$soerensen
##              Lasso Univariate RandomForest boruta
## Lasso         1.00       0.29         0.31   0.06
## Univariate    0.29       1.00         0.41   0.09
## RandomForest  0.31       0.41         1.00   0.11
## boruta        0.06       0.09         0.11   1.00

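This returns a nested list (see the printed output above) holding three types of overlap coefficients for the inbuilt feature importance and, if it was calculated, the permutation importance: Sørensen-Dice, overlap, and Jaccard. The coefficients can also be visualized as overlap heatmaps. To do so, run: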

plot_overlap_heatmaps(overlap)
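To make the three coefficients concrete, the sketch below computes them for two toy gene lists using their standard set-based definitions. It is an illustration only, not GeneSelectR's internal implementation; the function overlap_coefficients() and the gene names are made up.

```r
# Standard set-based definitions for two gene lists a and b:
#   overlap   = |a ∩ b| / min(|a|, |b|)
#   jaccard   = |a ∩ b| / |a ∪ b|
#   soerensen = 2 |a ∩ b| / (|a| + |b|)
overlap_coefficients <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  n_common <- length(intersect(a, b))
  c(
    overlap   = n_common / min(length(a), length(b)),
    jaccard   = n_common / length(union(a, b)),
    soerensen = 2 * n_common / (length(a) + length(b))
  )
}

list_a <- c("G1", "G2", "G3", "G4")
list_b <- c("G3", "G4", "G5", "G6", "G7", "G8")

coefs <- overlap_coefficients(list_a, list_b)
print(round(coefs, 2))
# overlap = 0.50, jaccard = 0.25, soerensen = 0.40
```

Because the overlap coefficient divides by the size of the smaller list, it equals 1 whenever the smaller list is fully contained in the larger one. This is how a pair such as Lasso and boruta can show an overlap of 1.00 but a Jaccard value of only 0.18 in the output above.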

In addition, if you have any custom lists (for example, a differential gene expression list), you can pass them as an argument like this:

custom_list <- list(custom_list = c("char1", "char2", "char3", "char4", "char5"),
    custom_list2 = c("char1", "char2", "char3", "char4", "char5"))
overlap1 <- calculate_overlap_coefficients(selection_results, custom_lists = custom_list)
plot_overlap_heatmaps(overlap1)
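In practice, a custom list would typically come from a DGE result table and not from placeholder strings. The sketch below filters a toy table down to significant genes and wraps them in a named list of the same shape as custom_list. The column names (gene, log2FoldChange, padj), the thresholds, and all values are assumptions made for illustration; the identifiers presumably need to be in the same format as the feature names in selection_results for the overlap to be meaningful.

```r
# Toy DGE table. Column names and values are invented for this example;
# they mimic common DGE output and are not required by GeneSelectR.
dge <- data.frame(
  gene = c("ENSG00000174371", "ENSG00000123600",
           "ENSG00000154124", "ENSG00000006607"),
  log2FoldChange = c(1.8, -0.2, -1.5, 0.9),
  padj = c(0.001, 0.40, 0.03, 0.20),
  stringsAsFactors = FALSE
)

# Keep genes passing an adjusted p-value and an absolute fold-change cutoff
keep <- !is.na(dge$padj) & dge$padj < 0.05 & abs(dge$log2FoldChange) >= 1
sig  <- dge[keep, ]

# Named list in the same shape as `custom_list` above
dge_list <- list(DGE = sig$gene)
print(dge_list)
```

A list built this way could then be passed through the custom_lists argument in the same manner as custom_list.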

UpSet Plot

To get the exact number of intersections between the feature lists, use the upset plot function:

plot_upset(selection_results)
## $inbuilt_importance

## 
## $permutation_importance

# plot upset with custom lists
plot_upset(selection_results, custom_lists = custom_list)
## $inbuilt_importance

## 
## $permutation_importance
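The counts behind an UpSet plot can also be inspected directly with base R set operations. The sketch below uses made-up gene lists named after three of the methods; it only illustrates what "intersection" means here and does not read anything from selection_results.

```r
# Toy gene lists (invented), one per feature selection method
gene_lists <- list(
  Lasso        = c("G1", "G2", "G3"),
  Univariate   = c("G2", "G3", "G4", "G5"),
  RandomForest = c("G1", "G2", "G3", "G6")
)

# Genes selected by every method
shared_by_all <- Reduce(intersect, gene_lists)

# For each gene, the number of lists it appears in
membership <- table(unlist(lapply(gene_lists, unique)))

print(shared_by_all)
print(sort(membership, decreasing = TRUE))
```

Genes that appear in several lists are natural candidates for closer inspection.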

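GO Enrichment Analysis

For convenience, the package implements a wrapper function for clusterProfiler GO enrichment, as well as a function for retrieving gene annotations. After running GeneSelectR, fetch the gene annotations as follows: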

proxy <- httr::use_proxy(Sys.getenv("http_proxy"))
httr::set_config(proxy)
AnnotationHub::setAnnotationHubOption("PROXY", proxy)  ## add the three lines above when connecting through a proxy
ah <- AnnotationHub::AnnotationHub()
# Assuming valid proxy connection through ':1' If you experience connection
# issues consider using 'localHub=TRUE'
# |===================================================================| 100%

human_ens <- AnnotationHub::query(ah, c("Homo sapiens", "EnsDb"))
human_ens <- human_ens[["AH98047"]]
# BiocManager::install('ensembldb')
annotations_ahb <- ensembldb::genes(human_ens, return.type = "data.frame") %>%
    dplyr::select(gene_id, gene_name, entrezid, gene_biotype)

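During annotation, it turned out that the feature names in the result variable selection_results have the form ENSG00000196405__EVL, so they need to be reduced to either ENSG00000196405 or EVL. Three gene ID types are supported: "ENTREZ", "ENSEMBL", and "SYMBOL". The code below keeps the first 15 characters of each name, i.e. the Ensembl gene ID.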

selection_results@inbuilt_feature_importance$Lasso$feature = substr(selection_results@inbuilt_feature_importance$Lasso$feature,
    1, 15)
selection_results@inbuilt_feature_importance$Univariate$feature = substr(selection_results@inbuilt_feature_importance$Univariate$feature,
    1, 15)
selection_results@inbuilt_feature_importance$RandomForest$feature = substr(selection_results@inbuilt_feature_importance$RandomForest$feature,
    1, 15)
selection_results@inbuilt_feature_importance$boruta$feature = substr(selection_results@inbuilt_feature_importance$boruta$feature,
    1, 15)

selection_results@permutation_importance$Lasso$feature = substr(selection_results@permutation_importance$Lasso$feature,
    1, 15)
selection_results@permutation_importance$Univariate$feature = substr(selection_results@permutation_importance$Univariate$feature,
    1, 15)
selection_results@permutation_importance$RandomForest$feature = substr(selection_results@permutation_importance$RandomForest$feature,
    1, 15)
selection_results@permutation_importance$boruta$feature = substr(selection_results@permutation_importance$boruta$feature,
    1, 15)
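Taking the first 15 characters works because human Ensembl gene IDs (ENSG followed by 11 digits) are always 15 characters long. A variant that does not depend on that length is to split on the "__" separator, as sketched below with two feature names taken from this post. The variable names are illustrative.

```r
# Feature names in the ENSEMBL__SYMBOL form used by this dataset
features <- c("ENSG00000196405__EVL", "ENSG00000174371__EXO1")

# Split each name at the literal "__" separator
parts <- strsplit(features, "__", fixed = TRUE)

ensembl_id <- vapply(parts, `[`, character(1), 1)  # part before the separator
symbol     <- vapply(parts, `[`, character(1), 2)  # part after the separator

print(data.frame(feature = features, ensembl_id, symbol))
```

The ensembl_id vector matches what substr(x, 1, 15) produces for these names, while symbol would be the choice when annotating with format = "SYMBOL".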

A wrapper function runs the GO enrichment analysis using the clusterProfiler package. To run it with the default settings, simply run:

annotations_df <- annotate_gene_lists(pipeline_results = selection_results, annotations_ahb = annotations_ahb,
    format = "ENSEMBL")

annotated_GO <- GO_enrichment_analysis(annotations_df)
## Visualization of Parent Term Fractions
annot_child_fractions <- compute_GO_child_term_metrics(GO_data = annotated_GO, GO_terms = c("GO:0002376",
    "GO:0044419"), plot = TRUE)

Semantic Similarity Analysis

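The final step of the analysis is clustering and semantic similarity analysis of the GO terms in each list. This is done with the simplifyEnrichment R package. To simplify data input, a wrapper around the simplifyGOFromMultipleLists() function is implemented: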

#install.packages("magick")
pdf("simplify_enrichment.pdf", height = 8, width = 10)
hmap <- run_simplify_enrichment(annotated_GO,
                                method = 'louvain',
                                measure = 'Resnik',
                                padj_cutoff=0.05,
                                ont = "BP")
dev.off()