MachineLearning 31. Machine Learning: Gene Feature Selection from RNA-seq Data (GeneSelectR)



Introduction

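RNA-seq datasets make it considerably challenging to identify biologically relevant features for downstream analysis and data mining. The standard approach is differential gene expression (DGE) analysis, but because it is univariate, its effectiveness can be limited by the data. For complex datasets, an alternative is to use various machine learning (ML) tools, which try to capture non-linear relationships between features and focus on generalizability rather than statistical significance. This approach produces multiple feature lists that may perform similarly on classification metrics. There is therefore a pressing need for a cohesive workflow that seamlessly integrates robust feature selection across different ML methods while also assessing the biological relevance of the resulting feature lists. Taking both sets of criteria into account, such a combined approach makes it possible to prioritize the best-performing lists.

Today we introduce the GeneSelectR package, which combines machine learning with bioinformatics data-mining methods to enhance feature selection. With GeneSelectR, features can be selected from a normalized RNA-seq dataset using a variety of ML methods and user-defined parameters. Biological relevance is then assessed through Gene Ontology (GO) enrichment analysis and semantic similarity analysis of the resulting GO terms. In addition, similarity coefficients and scores for GO terms of interest are calculated.

GeneSelectR thus optimizes ML performance and rigorously assesses the biological relevance of the various lists, offering a way to prioritize feature lists with respect to the biological question. When applied to the TCGA-BRCA dataset, the GeneSelectR workflow generated several feature lists using different ML methods and DGE analysis. Using the functions in GeneSelectR, the lists can be evaluated for both ML performance and biological relevance. This comprehensive evaluation helps select the best-performing lists: those that combine strong ML performance with high relevance to the biological question while keeping the feature set manageable and highly specific.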

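Feature selection procedure

The core function, GeneSelectR(), performs gene selection with several methods and evaluates their performance by cross-validation. It also supports hyperparameter tuning, permutation feature importance calculation, and more.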
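To make the idea of permutation feature importance concrete, here is a minimal base-R sketch. It is not how GeneSelectR computes it internally; the toy data, the feature names signal and noise, the glm classifier, and the helpers accuracy() and permutation_importance() are all assumptions made for illustration. For brevity, accuracy is measured on the training data itself rather than on held-out samples.

```r
# Toy data: one informative feature and one pure-noise feature (hypothetical names).
set.seed(1)
n <- 100
y <- factor(rep(c("AD", "HC"), each = n / 2))
X <- data.frame(
  signal = rnorm(n, mean = ifelse(y == "AD", 2, 0)),
  noise  = rnorm(n)
)

# A simple logistic regression stands in for the classifier (an assumption).
fit <- glm(y ~ ., data = cbind(X, y = y), family = binomial)

# Fraction of samples whose predicted class matches the true class.
accuracy <- function(model, newdata, truth) {
  prob <- predict(model, newdata = newdata, type = "response")
  pred <- levels(truth)[(prob > 0.5) + 1]  # prob is P(second factor level)
  mean(pred == truth)
}

# Importance of a feature = mean drop in accuracy after shuffling its column.
permutation_importance <- function(model, X, truth, n_repeats = 20) {
  baseline <- accuracy(model, X, truth)
  sapply(names(X), function(feature) {
    drops <- replicate(n_repeats, {
      shuffled <- X
      shuffled[[feature]] <- sample(shuffled[[feature]])
      baseline - accuracy(model, shuffled, truth)
    })
    mean(drops)
  })
}

imp <- permutation_importance(fit, X, y)
print(round(imp, 3))
```

The informative feature should receive a clearly positive importance, while the noise feature should stay close to zero.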

Package installation

  1. First, install the GeneSelectR package

# install.packages("devtools")
devtools::install_github("dzhakparov/GeneSelectR")
  2. Install Anaconda3 on Windows 11

Download link: https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2019.10-Windows-x86_64.exe

  3. Configure the GeneSelectR environment

Many Python packages need to be installed, so this takes quite a while. Be patient.

GeneSelectR::configure_environment()
Conda is installed.
The conda environment GeneSelectR_env does not exist. Do you want to create it? 

1: yes
2: no

Selection: 1
Creating conda environment and installing required packages
+ "D:/Program Files/Anaconda3/condabin/conda.bat" "create" "--yes" "--name" "GeneSelectR_env" "python=3.8" "numpy <= 1.19" "scikit-learn <= 0.22.1" "pandas" "boruta_py" "scikit-optimize" "--quiet" "-c" "conda-forge"
WARNING: A space was detected in your requested environment path
'D:\Program Files\Anaconda3\envs\GeneSelectR_env'
Spaces in paths can sometimes be problematic.
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: D:\Program Files\Anaconda3\envs\GeneSelectR_env

  added / updated specs:
    - boruta_py
    - numpy[version='<=1.19']
    - pandas
    - python=3.8
    - scikit-learn[version='<=0.22.1']
    - scikit-optimize

  The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    boruta_py-0.3              |             py_0          51 KB  conda-forge
    brotli-1.1.0               |       hcfcfb64_1          19 KB  conda-forge
    brotli-bin-1.1.0           |       hcfcfb64_1          20 KB  conda-forge
    bzip2-1.0.8                |       hcfcfb64_5         122 KB  conda-forge
    ca-certificates-2024.2.2   |       h56e8100_0         152 KB  conda-forge
    certifi-2024.2.2           |     pyhd8ed1ab_0         157 KB  conda-forge
    cycler-0.12.1              |     pyhd8ed1ab_0          13 KB  conda-forge
    fonttools-4.50.0           |   py38h91455d4_0         1.8 MB  conda-forge
    freetype-2.12.1            |       hdaf720e_2         498 KB  conda-forge
    intel-openmp-2024.0.0      |   h57928b3_49841         2.2 MB  conda-forge
    joblib-1.3.2               |     pyhd8ed1ab_0         216 KB  conda-forge
    kiwisolver-1.4.5           |   py38hb1fd069_1          54 KB  conda-forge
    lcms2-2.16                 |       h67d730c_0         496 KB  conda-forge
    lerc-4.0.0                 |       h63175ca_0         190 KB  conda-forge
    libblas-3.9.0              |     21_win64_mkl         4.8 MB  conda-forge
    libbrotlicommon-1.1.0      |       hcfcfb64_1          69 KB  conda-forge
    libbrotlidec-1.1.0         |       hcfcfb64_1          32 KB  conda-forge
    libbrotlienc-1.1.0         |       hcfcfb64_1         241 KB  conda-forge
    libcblas-3.9.0             |     21_win64_mkl         4.8 MB  conda-forge
    libdeflate-1.20            |       hcfcfb64_0         152 KB  conda-forge
    libffi-3.4.2               |       h8ffe710_5          41 KB  conda-forge
    libhwloc-2.9.3             |default_haede6df_1009         2.5 MB  conda-forge
    libiconv-1.17              |       hcfcfb64_2         621 KB  conda-forge
    libjpeg-turbo-3.0.0        |       hcfcfb64_1         804 KB  conda-forge
    liblapack-3.9.0            |     21_win64_mkl         4.8 MB  conda-forge
    libpng-1.6.43              |       h19919ed_0         339 KB  conda-forge
    libsqlite-3.45.2           |       hcfcfb64_0         849 KB  conda-forge
    libtiff-4.6.0              |       hddb2be6_3         769 KB  conda-forge
    libwebp-base-1.3.2         |       hcfcfb64_0         263 KB  conda-forge
    libxcb-1.15                |       hcd874cb_0         947 KB  conda-forge
    libxml2-2.12.6             |       hc3477c8_1         1.6 MB  conda-forge
    libzlib-1.2.13             |       hcfcfb64_5          54 KB  conda-forge
    m2w64-gcc-libgfortran-5.3.0|                6         342 KB  conda-forge
    m2w64-gcc-libs-5.3.0       |                7         520 KB  conda-forge
    m2w64-gcc-libs-core-5.3.0  |                7         214 KB  conda-forge
    m2w64-gmp-6.1.0            |                2         726 KB  conda-forge
    m2w64-libwinpthread-git-5.0.0.4634.697f757|                2          31 KB  conda-forge
    matplotlib-base-3.5.1      |   py38h1f000d6_0         7.3 MB  conda-forge
    mkl-2024.0.0               |   h66d3029_49657       103.5 MB  conda-forge
    msys2-conda-epoch-20160418 |                1           3 KB  conda-forge
    munkres-1.1.4              |     pyh9f0ad1d_0          12 KB  conda-forge
    numpy-1.19.0               |   py38h72c728b_0         4.9 MB  conda-forge
    openjpeg-2.5.2             |       h3d672ee_0         232 KB  conda-forge
    openssl-3.2.1              |       hcfcfb64_1         7.8 MB  conda-forge
    packaging-24.0             |     pyhd8ed1ab_0          49 KB  conda-forge
    pandas-1.4.1               |   py38h5d928e2_0        11.0 MB  conda-forge
    pillow-10.2.0              |   py38hc375fad_0        39.7 MB  conda-forge
    pip-24.0                   |     pyhd8ed1ab_0         1.3 MB  conda-forge
    pthread-stubs-0.4          |    hcd874cb_1001           6 KB  conda-forge
    pthreads-win32-2.9.1       |       hfa6e2cd_3         141 KB  conda-forge
    pyaml-23.12.0              |     pyhd8ed1ab_0          26 KB  conda-forge
    pyparsing-3.1.2            |     pyhd8ed1ab_0          87 KB  conda-forge
    python-3.8.19              |h4de0772_0_cpython        15.3 MB  conda-forge
    python-dateutil-2.9.0      |     pyhd8ed1ab_0         218 KB  conda-forge
    python_abi-3.8             |           4_cp38           7 KB  conda-forge
    pytz-2024.1                |     pyhd8ed1ab_0         184 KB  conda-forge
    pyyaml-6.0.1               |   py38h91455d4_1         148 KB  conda-forge
    scikit-learn-0.22.1        |   py38h7208079_1         6.2 MB  conda-forge
    scikit-optimize-0.9.0      |     pyhd8ed1ab_1          74 KB  conda-forge
    scipy-1.8.0                |   py38ha1292f7_1        27.2 MB  conda-forge
    setuptools-69.2.0          |     pyhd8ed1ab_0         460 KB  conda-forge
    six-1.16.0                 |     pyh6c4a22f_0          14 KB  conda-forge
    tbb-2021.11.0              |       h91493d7_1         158 KB  conda-forge
    tk-8.6.13                  |       h5226925_1         3.3 MB  conda-forge
    ucrt-10.0.22621.0          |       h57928b3_0         1.2 MB  conda-forge
    unicodedata2-15.1.0        |   py38h91455d4_0         362 KB  conda-forge
    vc-14.3                    |      hcf57466_18          17 KB  conda-forge
    vc14_runtime-14.38.33130   |      h82b7239_18         732 KB  conda-forge
    vs2015_runtime-14.38.33130 |      hcb4865c_18          17 KB  conda-forge
    wheel-0.43.0               |     pyhd8ed1ab_0          57 KB  conda-forge
    xorg-libxau-1.0.11         |       hcd874cb_0          50 KB  conda-forge
    xorg-libxdmcp-1.1.3        |       hcd874cb_0          66 KB  conda-forge
    xz-5.2.6                   |       h8d14728_0         213 KB  conda-forge
    yaml-0.2.5                 |       h8ffe710_2          62 KB  conda-forge
    zstd-1.5.5                 |       h12be248_0         335 KB  conda-forge
    ------------------------------------------------------------
                                           Total:       263.7 MB

The below NEW packages will be INSTALLED:

Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Please restart your R session for the changes to take effect.
  4. Every analysis session must start by setting the correct conda environment before GeneSelectR is loaded.

GeneSelectR::set_reticulate_python()
# Set RETICULATE_PYTHON to D:\Program
# Files\Anaconda3\envs\GeneSelectR_env/python.exe for the current R session.
library(GeneSelectR)
# rest of your code
  5. Loading the GeneSelectR package shows that further dependencies are required, so choose 1: Yes to install them.

The following Bioconductor packages are required for full functionality of GeneSelectR: simplifyEnrichment.
Do you want to install them now? 

1: Yes
2: No
Selection: 1
'getOption("repos")' replaces Bioconductor standard repositories, see 'help("repositories",
package = "BiocManager")' for details.
Replacement repositories:
    CRAN: https://mirrors.tuna.tsinghua.edu.cn/CRAN/
Bioconductor version 3.18 (BiocManager 1.30.22), R 4.3.1 (2023-06-16 ucrt)
Installing package(s) 'simplifyEnrichment'
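also installing the dependencies ‘NLP’, ‘tm’, ‘org.Hs.eg.db’, ‘slam’, ‘proxyC’
  6. Make sure enough computing resources are available

I ran this on a workstation with all cores in use, and it took roughly one hour to get results, even though this is only a test dataset. If you plan to use this package, make sure you have enough resources.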

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:   44.2s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  2.2min finished
Performing Permuation Importance Calculation
Fitting the data split: 5 
Fitting pipeline for Lasso feature selection method
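Before starting a long run, it can help to confirm that the R-side prerequisites are in place and to see how many cores are available. The following is a small base-R sketch; the helper name is_installed() is hypothetical, and the package names checked are simply those that appear in the installation steps above.

```r
# Report which packages are installed, without loading or installing anything.
is_installed <- function(pkgs) {
  vapply(pkgs, function(p) nzchar(system.file(package = p)), logical(1))
}

status <- is_installed(c("GeneSelectR", "reticulate", "simplifyEnrichment"))
print(status)

# Number of logical cores that njobs = -1 would try to use (may be NA on some systems).
cores <- parallel::detectCores()
message("Logical cores detected: ", cores)
```

Any package reported as FALSE should be installed before continuing.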

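Loading the data

The data matrix should be a data frame with samples as rows and genes as columns.

The dataset is a bulk RNA-seq dataset obtained from blood samples of 149 African children, who were grouped into children with atopic dermatitis (AD) and healthy controls (HC). The full dataset also contains a stratification variable for where the children live (urban or rural). This subset contains only the Urban samples. Columns represent genes and rows represent samples.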

load("./GeneSelectR-master/tests/testthat/fixtures/UrbanRandomSubset.rda")
head(UrbanRandomSubset[, 1:10])
##         treatment ENSG00000174371__EXO1 ENSG00000123600__METTL8
## CA002YF  Urban_AD          2.984508e+00            1.816292e+00
## CA009ST  Urban_AD           0.565690377             2.010920739
## CA010EB  Urban_AD          3.6708983671            3.1615995794
## CA011LQ  Urban_AD          1.408499e+00            2.674512e+00
## CA014LB  Urban_AD          2.5183409223            1.8185148529
## CA015AM  Urban_AD          2.6664729139            3.0494281366
##         ENSG00000154124__OTULIN ENSG00000006607__FARP2 ENSG00000135686__KLHL36
## CA002YF            4.959833e+00           4.125659e+00            6.584945e+00
## CA009ST             4.601555569            4.873391176             6.677867796
## CA010EB            4.6202078459           4.8954199262            6.0682238886
## CA011LQ            3.909325e+00           3.852403e+00            6.041770e+00
## CA014LB            3.8273512783           4.6382172341            6.3209867192
## CA015AM            4.4946295998           4.4197758295            5.9998300955
##         ENSG00000130348__QRSL1 ENSG00000268041__AC010616.1
## CA002YF           4.298858e+00                3.150753e+00
## CA009ST            4.043031406                 3.090960290
## CA010EB           4.4417887589                3.3511595489
## CA011LQ           4.185302e+00                3.049633e+00
## CA014LB           4.3729324835                2.6071577494
## CA015AM           4.6819749406                2.6885696358
##         ENSG00000163812__ZDHHC3 ENSG00000233041__PHGR1
## CA002YF            4.617841e+00          -2.934457e+00
## CA009ST             5.161640677           -2.934457404
## CA010EB            4.9131957592          -2.9344574043
## CA011LQ            5.061451e+00          -2.318961e+00
## CA014LB            5.1488088821          -2.9344574043
## CA015AM            5.0960550151          -2.9344574043
table(UrbanRandomSubset$treatment)
## 
##      Urban_AD Urban_Healthy 
##            31            29
library(dplyr)
### Feature Selection Procedure Basic Usage
X <- UrbanRandomSubset %>%
    dplyr::select(-treatment)  # get the feature matrix
y <- UrbanRandomSubset["treatment"]  # store the data point label in a separate vector
y <- as.factor(y[, 1])
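A quick sanity check of the feature matrix and the label vector can catch layout problems before the long run starts. The sketch below uses a made-up toy table in the same layout (samples in rows, genes in columns, plus a treatment column); the helper check_inputs() and all values are assumptions for illustration, not part of GeneSelectR.

```r
# Toy expression table: 7 samples, 2 genes, values are random placeholders.
set.seed(42)
toy <- data.frame(
  treatment = rep(c("Urban_AD", "Urban_Healthy"), times = c(4, 3)),
  ENSG00000174371__EXO1 = rnorm(7),
  ENSG00000123600__METTL8 = rnorm(7),
  row.names = paste0("S", 1:7)
)

# Split into feature matrix and class labels, mirroring the code above.
X_toy <- toy[, setdiff(names(toy), "treatment"), drop = FALSE]
y_toy <- factor(toy$treatment)

# Stop with an error if the inputs do not have the expected shape.
check_inputs <- function(X, y) {
  stopifnot(
    is.data.frame(X),
    all(vapply(X, is.numeric, logical(1))),  # every gene column is numeric
    !anyNA(X),                               # no missing expression values
    is.factor(y),
    length(y) == nrow(X),                    # one label per sample
    nlevels(y) >= 2                          # at least two classes
  )
  table(y)  # class counts, to spot imbalance
}

print(check_inputs(X_toy, y_toy))
```

The same function can be called as check_inputs(X, y) on the real data prepared above.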

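Worked example

This step takes a long time to run, so plan for it. njobs = -1 means that all cores are used; in my case that was 16 cores and the run took about one hour. Do not attempt it without sufficient resources: my run was interrupted several times.

It is also a good idea to save the result object as soon as the run finishes, so that it can be loaded later without waiting again.

Default settings: if none are provided, default feature selection methods and hyperparameter grids are set up. By default, four feature selection methods are available:

1. Univariate feature selection;

2. Logistic regression with L1 penalty;

3. Boruta;

4. Random Forest.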
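As an illustration of the first method in the list, univariate feature selection scores each gene on its own and keeps the best-scoring ones. The base-R sketch below uses a per-gene Welch t-test on simulated data; the test statistic, the gene names, and the planted signal in gene_3 are assumptions for illustration and may differ from what GeneSelectR uses internally.

```r
# Simulated data: 40 samples, 10 genes, only gene_3 differs between groups.
set.seed(7)
n_per_group <- 20
y <- factor(rep(c("AD", "HC"), each = n_per_group))
X <- as.data.frame(matrix(rnorm(2 * n_per_group * 10), ncol = 10))
names(X) <- paste0("gene_", 1:10)
X$gene_3 <- X$gene_3 + ifelse(y == "AD", 3, 0)  # planted group difference

# Score every gene independently with a two-sample t-test.
p_values <- vapply(X, function(g) t.test(g ~ y)$p.value, numeric(1))

# Keep the k genes with the smallest p-values.
top_k <- names(sort(p_values))[1:3]
print(top_k)
```

Because each gene is scored in isolation, this kind of filter cannot account for interactions between genes, which is one reason the other three methods are run alongside it.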

selection_results <- GeneSelectR(X = X, y = y, njobs = -1,  # njobs = -1: use all cores
                                 calculate_permutation_importance = TRUE,
                                 perform_test_split = TRUE)  # needed for test_metrics
saveRDS(selection_results, file = 'selection_results.rds')  # save the result as an RDS file

Once saved, the result object can simply be loaded again:

selection_results <- readRDS("selection_results.rds")  # load the saved RDS file
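The save-then-reload pattern can be wrapped in a small helper so that the expensive step only runs when no saved result exists. This is a generic base-R sketch; the helper cached() and the stand-in function slow_step() are hypothetical names, and slow_step() merely simulates a long computation.

```r
# Return the saved result if the file exists; otherwise compute, save, and return it.
cached <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, file = path)
  result
}

cache_file <- tempfile(fileext = ".rds")
slow_step <- function() {
  Sys.sleep(0.1)  # stands in for the expensive GeneSelectR() call
  list(finished = TRUE)
}

first  <- cached(cache_file, slow_step)  # runs slow_step() and writes the file
second <- cached(cache_file, slow_step)  # reads the file instead of recomputing
```

In a real analysis, compute would be a function wrapping the GeneSelectR() call shown above, and path would point to a permanent file rather than a temporary one.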

Interpreting the results

The result object

The result object selection_results is an object of class “PipelineResults” with six components:

‘best_pipeline’: a named list containing the parameters of the best-performing pipeline.

‘cv_results’: the cross-validation results for each pipeline.

‘inbuilt_feature_importance’: the aggregated inbuilt feature importance scores.

‘test_metrics’: a data frame of test metrics for each pipeline, returned if perform_test_split is set to TRUE.

‘cv_mean_score’: a data frame summarizing the mean cross-validation scores.

‘permutation_importance’: the mean permutation importance, returned if it was calculated.

Feature importance plots

Finally, the most important features according to each feature importance method can be visualized by calling the plotting function:

plot_feature_importance(selection_results, top_n_features = 10)
## $Lasso
## 
## $Univariate
## 
## $RandomForest
## 
## $boruta

Machine learning performance metrics

The metrics of the feature selection procedure can be plotted with the following function:

plot_metrics(selection_results)

# or access it as a dataframe
selection_results@test_metrics
## # A tibble: 4 × 9
##   method       f1_mean  f1_sd recall_mean recall_sd precision_mean precision_sd
##   <chr>          <dbl>  <dbl>       <dbl>     <dbl>          <dbl>        <dbl>
## 1 Lasso          0.697 0.127        0.7      0.112           0.725       0.143 
## 2 RandomForest   0.694 0.103        0.683    0.109           0.761       0.0901
## 3 Univariate     0.762 0.146        0.75     0.156           0.811       0.135 
## 4 boruta         0.691 0.0837       0.683    0.0913          0.719       0.0683
## # ℹ 2 more variables: accuracy_mean <dbl>, accuracy_sd <dbl>
selection_results@cv_mean_score
##         method mean_score   sd_score
## 1       boruta  0.6285714 0.06298283
## 2        Lasso  0.6650000 0.05293672
## 3 RandomForest  0.6714286 0.03144074
## 4   Univariate  0.7014286 0.04755502
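The cross-validation summary can be ranked directly to see which method scored best. The sketch below rebuilds the cv_mean_score table from the values printed above so that it runs without the result object; the comparison of the score gap against the standard deviation is only a crude heuristic added here as an assumption, not something GeneSelectR reports.

```r
# Values copied from the cv_mean_score output shown above.
cv_mean_score <- data.frame(
  method     = c("boruta", "Lasso", "RandomForest", "Univariate"),
  mean_score = c(0.6285714, 0.6650000, 0.6714286, 0.7014286),
  sd_score   = c(0.06298283, 0.05293672, 0.03144074, 0.04755502)
)

# Sort methods from best to worst mean cross-validation score.
ranked <- cv_mean_score[order(-cv_mean_score$mean_score), ]
best <- ranked$method[1]
print(ranked)

# Crude check: is the lead over the runner-up larger than the winner's SD?
gap <- ranked$mean_score[1] - ranked$mean_score[2]
clear_winner <- gap > ranked$sd_score[1]
```

Here Univariate ranks first, but its lead over RandomForest is smaller than its own standard deviation, so the ML metrics alone do not separate the methods clearly. This is where the biological relevance checks in the following sections come in.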

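Overlap between the gene lists of different methods

In addition, you can check how much the gene lists produced by the feature selection methods overlap. To do so, use the following command: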

overlap <- calculate_overlap_coefficients(selection_results)
overlap
## $inbuilt_feature_importance_coefficient
## $inbuilt_feature_importance_coefficient$overlap
##              Lasso Univariate RandomForest boruta
## Lasso            1       1.00         1.00   1.00
## Univariate       1       1.00         0.96   0.78
## RandomForest     1       0.96         1.00   1.00
## boruta           1       0.78         1.00   1.00
## 
## $inbuilt_feature_importance_coefficient$jaccard
##              Lasso Univariate RandomForest boruta
## Lasso         1.00       0.52         0.94   0.18
## Univariate    0.52       1.00         0.52   0.25
## RandomForest  0.94       0.52         1.00   0.19
## boruta        0.18       0.25         0.19   1.00
## 
## $inbuilt_feature_importance_coefficient$soerensen
##              Lasso Univariate RandomForest boruta
## Lasso         1.00       0.68         0.97   0.31
## Univariate    0.68       1.00         0.68   0.40
## RandomForest  0.97       0.68         1.00   0.32
## boruta        0.31       0.40         0.32   1.00
## 
## 
## $permutation_importance_coefficients
## $permutation_importance_coefficients$overlap
##              Lasso Univariate RandomForest boruta
## Lasso         1.00       0.37         0.47   0.33
## Univariate    0.37       1.00         0.47   0.33
## RandomForest  0.47       0.47         1.00   0.33
## boruta        0.33       0.33         0.33   1.00
## 
## $permutation_importance_coefficients$jaccard
##              Lasso Univariate RandomForest boruta
## Lasso         1.00       0.17         0.18   0.03
## Univariate    0.17       1.00         0.26   0.05
## RandomForest  0.18       0.26         1.00   0.06
## boruta        0.03       0.05         0.06   1.00
## 
## $permutation_importance_coefficients$soerensen
##              Lasso Univariate RandomForest boruta
## Lasso         1.00       0.29         0.31   0.06
## Univariate    0.29       1.00         0.41   0.09
## RandomForest  0.31       0.41         1.00   0.11
## boruta        0.06       0.09         0.11   1.00

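This returns a nested list of matrices holding three types of overlap coefficients (Soerensen-Dice, overlap, and Jaccard) for the inbuilt feature importance and, if it was calculated, the permutation importance. The coefficients can also be visualized as overlap heatmaps: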

plot_overlap_heatmaps(overlap)
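For readers unfamiliar with the three coefficients, the base-R sketch below computes them for two small gene lists using their standard set-based definitions. The function name overlap_coefficients() and the gene names are hypothetical; this is not the package's own implementation.

```r
# Standard definitions for two sets A and B:
#   overlap   = |A n B| / min(|A|, |B|)
#   jaccard   = |A n B| / |A u B|
#   soerensen = 2 |A n B| / (|A| + |B|)
overlap_coefficients <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  shared <- length(intersect(a, b))
  c(
    overlap   = shared / min(length(a), length(b)),
    jaccard   = shared / length(union(a, b)),
    soerensen = 2 * shared / (length(a) + length(b))
  )
}

list_a <- c("g1", "g2", "g3", "g4")
list_b <- c("g3", "g4", "g5")
coefs <- overlap_coefficients(list_a, list_b)
print(round(coefs, 2))
```

Note that the overlap coefficient equals 1 whenever the smaller list is entirely contained in the larger one, even if the Jaccard index is low. That is the pattern one would expect behind pairs such as Lasso and boruta in the inbuilt importance output above.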

In addition, any custom lists (for example, a list of differentially expressed genes) can be passed as an argument like this:

custom_list <- list(custom_list = c("char1", "char2", "char3", "char4", "char5"),
    custom_list2 = c("char1", "char2", "char3", "char4", "char5"))
overlap1 <- calculate_overlap_coefficients(selection_results, custom_lists = custom_list)
plot_overlap_heatmaps(overlap1)

Upset plot

To get the exact sizes of the intersections between feature lists, use the upset plot function:

plot_upset(selection_results)
## $inbuilt_importance
## 
## $permutation_importance
# plot upset with custom lists
plot_upset(selection_results, custom_lists = custom_list)
## $inbuilt_importance
## 
## $permutation_importance
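The numbers shown in an upset plot are the sizes of the exclusive intersections, that is, how many genes belong to exactly a given combination of lists. A minimal base-R sketch of that count follows; exclusive_intersections() and the toy gene lists are hypothetical and only illustrate the idea.

```r
# Count, for every combination of lists, the genes found in exactly that combination.
exclusive_intersections <- function(lists) {
  genes <- unique(unlist(lists, use.names = FALSE))
  # Logical matrix: one row per gene, one column per list.
  membership <- do.call(cbind, lapply(lists, function(l) genes %in% l))
  # Label each gene with the names of all lists that contain it.
  combos <- apply(membership, 1, function(in_list) {
    paste(colnames(membership)[in_list], collapse = "&")
  })
  lengths(split(combos, combos))
}

toy_lists <- list(
  Lasso      = c("g1", "g2", "g3"),
  Univariate = c("g2", "g3", "g4"),
  boruta     = c("g3", "g5")
)
counts <- exclusive_intersections(toy_lists)
print(counts)
```

Each gene is counted once, so the counts add up to the number of distinct genes across all lists.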

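GO enrichment analysis

For convenience, the package implements a wrapper function for GO enrichment with clusterProfiler, as well as a function for retrieving gene annotations. After running GeneSelectR, obtain the gene annotations as follows:

proxy <- httr::use_proxy(Sys.getenv("http_proxy"))
httr::set_config(proxy)
AnnotationHub::setAnnotationHubOption("PROXY", proxy)  ## the three lines above were added to configure the proxy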
ah <- AnnotationHub::AnnotationHub()
# Assuming valid proxy connection through ':1' If you experience connection
# issues consider using 'localHub=TRUE'
# |===================================================================| 100%

human_ens <- AnnotationHub::query(ah, c("Homo sapiens", "EnsDb"))
human_ens <- human_ens[["AH98047"]]
# BiocManager::install('ensembldb')
annotations_ahb <- ensembldb::genes(human_ens, return.type = "data.frame") %>%
    dplyr::select(gene_id, gene_name, entrezid, gene_biotype)

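During annotation I found that the features in the result object selection_results are in the format ENSG00000196405__EVL, so they need to be split into either ENSG00000196405 or EVL. Three types of gene ID are supported: "ENTREZ", "ENSEMBL", and "SYMBOL". The code below keeps the first 15 characters, i.e. the Ensembl ID.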

selection_results@inbuilt_feature_importance$Lasso$feature = substr(selection_results@inbuilt_feature_importance$Lasso$feature,
    1, 15)
selection_results@inbuilt_feature_importance$Univariate$feature = substr(selection_results@inbuilt_feature_importance$Univariate$feature,
    1, 15)
selection_results@inbuilt_feature_importance$RandomForest$feature = substr(selection_results@inbuilt_feature_importance$RandomForest$feature,
    1, 15)
selection_results@inbuilt_feature_importance$boruta$feature = substr(selection_results@inbuilt_feature_importance$boruta$feature,
    1, 15)

selection_results@permutation_importance$Lasso$feature = substr(selection_results@permutation_importance$Lasso$feature,
    1, 15)
selection_results@permutation_importance$Univariate$feature = substr(selection_results@permutation_importance$Univariate$feature,
    1, 15)
selection_results@permutation_importance$RandomForest$feature = substr(selection_results@permutation_importance$RandomForest$feature,
    1, 15)
selection_results@permutation_importance$boruta$feature = substr(selection_results@permutation_importance$boruta$feature,
    1, 15)
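Taking the first 15 characters works because every Ensembl gene ID here has that length. A slightly more explicit alternative is to split on the double underscore, which also recovers the gene symbol. The helper split_feature_id() below is a hypothetical base-R sketch, not a GeneSelectR function.

```r
# Split "ENSEMBLID__SYMBOL" feature names into their two parts.
split_feature_id <- function(feature) {
  parts <- strsplit(feature, "__", fixed = TRUE)
  data.frame(
    ensembl = vapply(parts, `[`, character(1), 1),
    symbol  = vapply(parts, `[`, character(1), 2),  # NA if no "__" is present
    stringsAsFactors = FALSE
  )
}

features <- c("ENSG00000196405__EVL", "ENSG00000174371__EXO1")
ids <- split_feature_id(features)
print(ids)
```

The ensembl column could then be assigned to the feature slots in place of the substr() calls above, or the symbol column if format = "SYMBOL" is preferred.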

A wrapper function runs the GO enrichment analysis with the clusterProfiler package. To run it with the default settings, simply run:

annotations_df <- annotate_gene_lists(pipeline_results = selection_results, annotations_ahb = annotations_ahb,
    format = "ENSEMBL")

annotated_GO <- GO_enrichment_analysis(annotations_df)
## Visualization of Parent Term Fractions
annot_child_fractions <- compute_GO_child_term_metrics(GO_data = annotated_GO, GO_terms = c("GO:0002376",
    "GO:0044419"), plot = TRUE)

Semantic Similarity Analysis

The final step of the analysis is clustering and semantic similarity analysis of the GO terms in each list. This is done with the simplifyEnrichment R package. To simplify data input, a wrapper around the simplifyGOFromMultipleLists() function is implemented:

#install.packages("magick")
pdf("simplify_enrichment.pdf", height = 8, width = 10)
hmap <- run_simplify_enrichment(annotated_GO,
                                method = 'louvain',
                                measure = 'Resnik',
                                padj_cutoff=0.05,
                                ont = "BP")
dev.off()

Reference

Damir Zhakparov, Kathleen Moriarty, Damian Roqueiro, Katja Baerenfaller

bioRxiv 2024.01.22.576646; doi: https://doi.org/10.1101/2024.01.22.576646


