metaphlan help: quantile value for the robust average

Article link: https://blog.csdn.net/situttt/article/details/140292970

(metaphlan) [zhangzhd@vm-login02 project]$ metaphlan -h
usage: metaphlan --input_type {fastq,fasta,bowtie2out,sam} [--force] [--bowtie2db METAPHLAN_BOWTIE2_DB] [-x INDEX] [--bt2_ps BowTie2 presets]
[--bowtie2_exe BOWTIE2_EXE] [--bowtie2_build BOWTIE2_BUILD] [--bowtie2out FILE_NAME] [--min_mapq_val MIN_MAPQ_VAL] [--no_map]
[--tmp_dir] [--tax_lev TAXONOMIC_LEVEL] [--min_cu_len] [--min_alignment_len] [--add_viruses] [--ignore_eukaryotes] [--ignore_bacteria]
[--ignore_archaea] [--ignore_ksgbs] [--ignore_usgbs] [--stat_q] [--perc_nonzero] [--ignore_markers IGNORE_MARKERS] [--avoid_disqm]
[--stat] [-t ANALYSIS TYPE] [--nreads NUMBER_OF_READS] [--pres_th PRESENCE_THRESHOLD] [--clade] [--min_ab] [--profile_vsc]
[--vsc_out VSC_OUT] [--vsc_breadth VSC_BREADTH] [-o output file] [--sample_id_key name] [--use_group_representative] [--sample_id value]
[-s sam_output_file] [--legacy-output] [--CAMI_format_output] [--unclassified_estimation] [--mpa3] [--biom biom_output]
[--mdelim mdelim] [--nproc N] [--subsampling SUBSAMPLING] [--subsampling_output SUBSAMPLING_OUTPUT]
[--subsampling_paired SUBSAMPLING_PAIRED] [-1 FORWARD_READS] [-2 REVERSE_READS] [--mapping_subsampling]
[--subsampling_seed SUBSAMPLING_SEED] [--install] [--offline] [--force_download] [--read_min_len READ_MIN_LEN] [-v] [-h]
[INPUT_FILE] [OUTPUT_FILE]

DESCRIPTION
MetaPhlAn version 4.1.1 (11 Mar 2024):
METAgenomic PHyLogenetic ANalysis for metagenomic taxonomic profiling.

AUTHORS: Aitor Blanco-Miguez (aitor.blancomiguez@unitn.it), Francesco Beghini (francesco.beghini@unitn.it), Moreno Zolfo (moreno.zolfo@unitn.it), Nicola Segata (nicola.segata@unitn.it), Duy Tin Truong, Francesco Asnicar (f.asnicar@unitn.it), Claudia Mengoni (claudia.mengoni@unitn.it)

COMMON COMMANDS

We assume here that MetaPhlAn has been installed through one of the available options (pip, conda, PyPI).
BowTie2 should also be on the system path with execute and read permissions, and Perl should be installed.

========== MetaPhlAn clade-abundance estimation =================

The basic usage of MetaPhlAn consists of identifying the clades (from phyla to species)
present in the metagenome obtained from a microbiome sample, together with their
relative abundances. This corresponds to the default analysis type (-t rel_ab).

Profiling a metagenome from raw reads:
$ metaphlan metagenome.fastq --input_type fastq -o profiled_metagenome.txt
You can take advantage of multiple CPUs and save the intermediate BowTie2 output for re-running
MetaPhlAn extremely quickly:
$ metaphlan metagenome.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq -o profiled_metagenome.txt
If you already mapped your metagenome against the marker DB (using a previous MetaPhlAn run), you
can obtain the results in a few seconds by using the previously saved --bowtie2out file and
specifying the input type (--input_type bowtie2out):
$ metaphlan metagenome.bowtie2.bz2 --nproc 5 --input_type bowtie2out -o profiled_metagenome.txt
bowtie2out files generated with MetaPhlAn versions below 3 are not compatible.
Starting from MetaPhlAn 3.0, the BowTie2 output includes the size of the profiled metagenome and the average read length.
If you want to re-run MetaPhlAn using these files, you should provide the metagenome size via --nreads:
$ metaphlan metagenome.bowtie2.bz2 --nproc 5 --input_type bowtie2out --nreads 520000 -o profiled_metagenome.txt
You can also provide an externally BowTie2-mapped SAM file if you specify this format with
--input_type. This takes two steps: first run BowTie2, then feed MetaPhlAn the resulting SAM file:
$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x ${mpa_dir}/metaphlan_databases/mpa_v30_CHOCOPhlAn_201901 -U metagenome.fastq
$ metaphlan metagenome.sam --input_type sam -o profiled_metagenome.txt
We can also natively handle paired-end metagenomes, and, more generally, metagenomes stored in
multiple files (but you need to specify the --bowtie2out parameter):
$ metaphlan metagenome_1.fastq,metagenome_2.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq
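
When these invocations are scripted, it can help to assemble the argument list in code rather than by string concatenation. The sketch below is a hypothetical helper (the name build_metaphlan_cmd and its parameters are illustrative and not part of MetaPhlAn); it only composes flags that appear in the examples above and does not run anything.

```python
def build_metaphlan_cmd(input_path, input_type, output=None,
                        bowtie2out=None, nproc=None, nreads=None):
    """Return an argument list mirroring the command lines shown above.

    Hypothetical helper: only flags that appear in the help text are used.
    """
    allowed = {"fastq", "fasta", "bowtie2out", "sam"}
    if input_type not in allowed:
        raise ValueError(f"unsupported --input_type: {input_type}")

    cmd = ["metaphlan", input_path, "--input_type", input_type]
    if bowtie2out is not None:
        cmd += ["--bowtie2out", bowtie2out]   # keep the intermediate mapping
    if nproc is not None:
        cmd += ["--nproc", str(nproc)]
    if nreads is not None:
        cmd += ["--nreads", str(nreads)]      # metagenome size, when needed
    if output is not None:
        cmd += ["-o", output]
    return cmd


cmd = build_metaphlan_cmd(
    "metagenome.fastq", "fastq",
    output="profiled_metagenome.txt",
    bowtie2out="metagenome.bowtie2.bz2",
    nproc=5,
)
print(" ".join(cmd))
```

On a machine where MetaPhlAn is installed, the resulting list could be handed to subprocess.run(cmd, check=True).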

========== Marker level analysis ============================

MetaPhlAn introduces the capability of characterizing organisms at the strain level using non
aggregated marker information. This capability comes in several slightly different flavours and
is a way to perform strain tracking and comparison across multiple samples.
Usually, MetaPhlAn is first run with the default -t to profile the species present in
the community, and then a strain-level profiling can be performed to zoom in on specific species
of interest. This operation can be performed quickly as it exploits the --bowtie2out intermediate
file saved during the execution of the default analysis type.

The following command will output the abundance of each marker with an RPK (reads per kilobase)
higher than 0.0 (we are assuming that metagenome_outfmt.bz2 has been generated beforehand, as
shown above).
$ metaphlan -t marker_ab_table metagenome_outfmt.bz2 --input_type bowtie2out -o marker_abundance_table.txt

The obtained RPK can optionally be normalized by the total number of reads in the metagenome
to guarantee fair comparisons of abundances across samples. The number of reads in the metagenome
needs to be passed with the '--nreads' argument.
The list of markers present in the sample can be obtained with '-t marker_pres_table':
$ metaphlan -t marker_pres_table metagenome_outfmt.bz2 --input_type bowtie2out -o marker_abundance_table.txt

The --pres_th argument (default 1.0) sets the minimum RPK value for a marker to be considered present.
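
The relationship between read counts, RPK, the optional read-count normalization and the presence threshold can be sketched in a few lines. This is an illustration only, not MetaPhlAn's code: the marker names, read counts and lengths are made up, dividing by the total read count is the plainest reading of the normalization (any further scaling is not shown), and treating the threshold as inclusive is an assumption.

```python
def rpk(read_count, marker_length_bp):
    """Reads per kilobase: reads divided by the marker length in kb."""
    if marker_length_bp <= 0:
        raise ValueError("marker length must be positive")
    return read_count / (marker_length_bp / 1000.0)


# Toy input (hypothetical): marker -> (mapped reads, marker length in bp)
markers = {
    "marker_A": (3, 1500),
    "marker_B": (1, 2000),
    "marker_C": (0, 900),
}

rpk_table = {name: rpk(reads, length) for name, (reads, length) in markers.items()}

# marker_ab_table-style view: keep markers with RPK above 0.0
abundance_table = {name: v for name, v in rpk_table.items() if v > 0.0}

# marker_pres_table-style view: markers reaching the presence threshold
# (inclusive comparison is an assumption)
pres_th = 1.0
presence_list = sorted(name for name, v in rpk_table.items() if v >= pres_th)

# Optional normalization by the total number of reads (toy value)
nreads = 1000
normalized = {name: v / nreads for name, v in abundance_table.items()}

print(rpk_table)
print(presence_list)
```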
The '-t clade_profiles' analysis type reports the same information as '-t marker_ab_table',
but the markers are reported on a clade-by-clade basis.
$ metaphlan -t clade_profiles metagenome_outfmt.bz2 --input_type bowtie2out -o marker_abundance_table.txt
Finally, to obtain all markers present for a specific clade and all its subclades,
'-t clade_specific_strain_tracker' should be used. For example, the following command
reports the presence/absence of the markers for the B. fragilis species and its strains.
The optional argument --min_ab specifies the minimum clade abundance for reporting the markers.
$ metaphlan -t clade_specific_strain_tracker --clade s__Bacteroides_fragilis metagenome_outfmt.bz2 --input_type bowtie2out -o marker_abundance_table.txt
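
Once marker presence lists are available for several samples, one simple way to compare them is set overlap. The sketch below uses the Jaccard index as an example; this is an illustrative choice and not something MetaPhlAn itself computes, and the sample names and marker identifiers are invented.

```python
def jaccard(markers_a, markers_b):
    """Jaccard index of two marker collections (0.0 when both are empty)."""
    a, b = set(markers_a), set(markers_b)
    union = a | b
    if not union:
        return 0.0  # no markers in either sample, nothing to compare
    return len(a & b) / len(union)


# Hypothetical presence lists for one clade in two samples
sample_1 = ["m1", "m2", "m3"]
sample_2 = ["m2", "m3", "m4"]

similarity = jaccard(sample_1, sample_2)
print(similarity)
```

A value near 1 means the two samples share almost all detected markers for the clade, while a low value points to different marker complements.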

positional arguments:
INPUT_FILE the input file can be:
* a fastq file containing metagenomic reads
OR
* a BowTie2 produced SAM file.
OR
* an intermediary mapping file of the metagenome generated by a previous MetaPhlAn run
If the input file is missing, the script assumes that the input is provided using the standard
input, or named pipes.
IMPORTANT: the type of input needs to be specified with --input_type
OUTPUT_FILE the tab-separated output file of the predicted taxon relative abundances
[stdout if not present]

Required arguments:
--input_type {fastq,fasta,bowtie2out,sam}
set whether the input is the FASTA file of metagenomic reads or
the SAM file of the mapping of the reads against the MetaPhlAn db.

Mapping arguments:
--force Force profiling of the input file by removing the bowtie2out file
--bowtie2db METAPHLAN_BOWTIE2_DB
Folder containing the MetaPhlAn database. You can specify the location by exporting the DEFAULT_DB_FOLDER variable in the shell. [default /public/home/TonyWuLab/zhangzhd/anaconda3/envs/metaphlan/lib/python3.12/site-packages/metaphlan/metaphlan_databases]
-x INDEX, --index INDEX
Specify the id of the database version to use. If "latest", MetaPhlAn will get the latest version.
If an index name is provided, MetaPhlAn will try to use it, if available, and skip the online check.
If the database files are not found on the local MetaPhlAn installation they
will be automatically downloaded [default latest]
--bt2_ps BowTie2 presets
Preset options for BowTie2 (applied only when a FASTA file is provided)
The choices enabled in MetaPhlAn are:
* sensitive
* very-sensitive
* sensitive-local
* very-sensitive-local
[default very-sensitive]
--bowtie2_exe BOWTIE2_EXE
Full path and name of the BowTie2 executable. This option allows MetaPhlAn to reach the executable even when it is not in the system PATH or the system PATH is unreachable
--bowtie2_build BOWTIE2_BUILD
Full path to the bowtie2-build command to use; the default assumes that 'bowtie2-build' is present in the system path
--bowtie2out FILE_NAME
The file for saving the output of BowTie2
--min_mapq_val MIN_MAPQ_VAL
Minimum mapping quality value (MAPQ) [default 5]
--no_map Avoid storing the --bowtie2out map file
--tmp_dir The folder used to store temporary files [default is the OS dependent tmp dir]

Post-mapping arguments:
--tax_lev TAXONOMIC_LEVEL
The taxonomic level for the relative abundance output:
'a' : all taxonomic levels
'k' : kingdoms
'p' : phyla only
'c' : classes only
'o' : orders only
'f' : families only
'g' : genera only
's' : species only
't' : SGBs only
[default 'a']
--min_cu_len minimum total nucleotide length for the markers in a clade for
estimating the abundance without considering sub-clade abundances
[default 2000]
--min_alignment_len The sam records for aligned reads with the longest subalignment
length smaller than this threshold will be discarded.
[default None]
--add_viruses Together with --mpa3, allow the profiling of viral organisms
--ignore_eukaryotes Do not profile eukaryotic organisms
--ignore_bacteria Do not profile bacterial organisms
--ignore_archaea Do not profile archaeal organisms
--ignore_ksgbs Do not profile known SGBs (together with --sgb option)
--ignore_usgbs Do not profile unknown SGBs (together with --sgb option)
--stat_q Quantile value for the robust average
[default 0.2]
--perc_nonzero Percentage of markers with a non-zero relative abundance, used to detect a misidentified species
[default 0.33]
--ignore_markers IGNORE_MARKERS
File containing a list of markers to ignore.
--avoid_disqm Deactivate the procedure of disambiguating the quasi-markers based on the
marker abundance pattern found in the sample. It is generally recommended
to keep the disambiguation procedure in order to minimize false positives
--stat Statistical approach for converting marker abundances into clade abundances
'avg_g' : clade global (i.e. normalizing all markers together) average
'avg_l' : average of length-normalized marker counts
'tavg_g' : truncated clade global average at --stat_q quantile
'tavg_l' : truncated average of length-normalized marker counts (at --stat_q)
'wavg_g' : winsorized clade global average (at --stat_q)
'wavg_l' : winsorized average of length-normalized marker counts (at --stat_q)
'med' : median of length-normalized marker counts
[default tavg_g]
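
The names used by --stat correspond to standard robust statistics. The sketch below shows textbook versions of them on toy numbers: a global versus per-marker ("local") average, and truncated and winsorized means at a quantile q such as --stat_q. It is not MetaPhlAn's implementation; the function names and data are illustrative, and trimming both tails by int(n * q) values is an assumption.

```python
from statistics import mean, median


def global_average(markers):
    """'_g' flavour: total reads over total length (all markers together)."""
    return sum(r for r, _ in markers) / sum(length for _, length in markers)


def local_average(markers):
    """'_l' flavour: mean of the per-marker length-normalized counts."""
    return mean([r / length for r, length in markers])


def truncated_mean(values, q):
    """Drop the lowest and highest int(n * q) values, then average."""
    if not 0 <= q < 0.5:
        raise ValueError("q must be in [0, 0.5)")
    xs = sorted(values)
    k = int(len(xs) * q)
    return mean(xs[k:len(xs) - k])


def winsorized_mean(values, q):
    """Clamp the lowest and highest int(n * q) values, then average."""
    if not 0 <= q < 0.5:
        raise ValueError("q must be in [0, 0.5)")
    xs = sorted(values)
    n = len(xs)
    k = int(n * q)
    if k == 0:
        return mean(xs)
    clamped = [xs[k]] * k + xs[k:n - k] + [xs[n - k - 1]] * k
    return mean(clamped)


# Toy data: (reads, marker length in bp) and a list with one outlier
toy_markers = [(10, 1000), (30, 2000)]
toy_values = [0.0, 1.0, 2.0, 3.0, 100.0]

print(global_average(toy_markers), local_average(toy_markers))
print(mean(toy_values), truncated_mean(toy_values, 0.2),
      winsorized_mean(toy_values, 0.2), median(toy_values))
```

With the outlier present, the plain mean is pulled far from the bulk of the values, whereas the truncated mean, winsorized mean and median all stay at 2.0, which is the motivation for a robust default.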

Additional analysis types and arguments:
-t ANALYSIS TYPE Type of analysis to perform:
* rel_ab: profiling a metagenome in terms of relative abundances
* rel_ab_w_read_stats: profiling a metagenome in terms of relative abundances and estimating the number of reads coming from each clade.
* reads_map: mapping from reads to clades (only reads hitting a marker)
* clade_profiles: normalized marker counts for clades with at least a non-null marker
* marker_ab_table: normalized marker counts (only when > 0.0 and normalized by metagenome size if --nreads is specified)
* marker_counts: non-normalized marker counts [use with extreme caution]
* marker_pres_table: list of markers present in the sample (threshold at 1.0 if not differently specified with --pres_th)
* clade_specific_strain_tracker: list of markers present for a specific clade, specified with --clade, and all its subclades
[default 'rel_ab']
--nreads NUMBER_OF_READS
The total number of reads in the original metagenome. It is mandatory when the input type is a SAM file.
--pres_th PRESENCE_THRESHOLD
Threshold for calling a marker present by the -t marker_pres_table option
--clade The clade for clade_specific_strain_tracker analysis
--min_ab The minimum percentage abundance for the clade in the clade_specific_strain_tracker analysis

Viral Sequence Clusters Analysis:
--profile_vsc Add this parameter to profile viruses with the VSCs approach.
--vsc_out VSC_OUT Path to the VSCs breadth-of-coverage output file
--vsc_breadth VSC_BREADTH
Minimum Breadth of Coverage for a Viral Group to be reported.
Default is 0.75 (at least 75 percent breadth to report)
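
Breadth of coverage is the fraction of reference positions covered by at least one read. The sketch below applies that general definition to a toy per-position depth vector and compares the result against the 0.75 default. The function names and depth values are illustrative, and how MetaPhlAn derives the depths internally is not shown.

```python
def breadth_of_coverage(depth_per_position):
    """Fraction of positions with depth > 0."""
    if not depth_per_position:
        raise ValueError("empty reference")
    covered = sum(1 for depth in depth_per_position if depth > 0)
    return covered / len(depth_per_position)


def passes_breadth(depth_per_position, min_breadth=0.75):
    """True when the breadth reaches the minimum ('at least' the threshold)."""
    return breadth_of_coverage(depth_per_position) >= min_breadth


# Toy depth vectors (hypothetical): 6 of 8 and 2 of 4 positions covered
well_covered = [1, 2, 0, 3, 1, 0, 4, 1]
poorly_covered = [1, 0, 0, 1]

print(breadth_of_coverage(well_covered), passes_breadth(well_covered))
print(breadth_of_coverage(poorly_covered), passes_breadth(poorly_covered))
```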

Output arguments:
-o output file, --output_file output file
The output file (if not specified as positional argument)
--sample_id_key name Specify the sample ID key for this analysis. Defaults to 'SampleID'.
--use_group_representative
Use a species as representative for species groups.
--sample_id value Specify the sample ID for this analysis. Defaults to 'Metaphlan_Analysis'.
-s sam_output_file, --samout sam_output_file
The sam output file
--legacy-output Old MetaPhlAn2 two columns output
--CAMI_format_output Report the profiling using the CAMI output format
--unclassified_estimation
Scale relative abundances to the number of reads mapping to identified clades in order to estimate unclassified taxa
--mpa3 Perform the analysis using the MetaPhlAn 3 algorithm
--biom biom_output, --biom_output_file biom_output
If requesting biom file output: The name of the output file in biom format
--mdelim mdelim, --metadata_delimiter_char mdelim
Delimiter for bug metadata; defaults to pipe, e.g. the pipe in k__Bacteria|p__Proteobacteria
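
Because clade names join their ranks with the delimiter and each rank starts with a one-letter code followed by a double underscore, a profile can also be filtered by level after the fact. The sketch below reuses the single-letter codes listed under --tax_lev; the helper names, the rows and the abundance values are made up for illustration.

```python
RANK_CODES = {
    "k": "kingdom", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species", "t": "SGB",
}


def deepest_rank(clade_name, delim="|"):
    """Return the rank code of the last element of a delimited clade name."""
    last = clade_name.split(delim)[-1]
    code, sep, _ = last.partition("__")
    if not sep or code not in RANK_CODES:
        raise ValueError(f"unrecognized clade element: {last!r}")
    return code


def filter_by_level(rows, level, delim="|"):
    """Keep (clade_name, abundance) rows whose deepest rank equals level."""
    if level == "a":
        return list(rows)
    return [(name, ab) for name, ab in rows if deepest_rank(name, delim) == level]


# Toy profile rows (abundances are invented)
rows = [
    ("k__Bacteria", 100.0),
    ("k__Bacteria|p__Proteobacteria", 70.0),
    ("k__Bacteria|p__Bacteroidetes", 30.0),
    ("k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales"
     "|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides_fragilis", 30.0),
]

phyla = filter_by_level(rows, "p")
species = filter_by_level(rows, "s")
print([name for name, _ in phyla])
print([name.split("|")[-1] for name, _ in species])
```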

Other arguments:
--nproc N The number of CPUs to use for parallelizing the mapping [default 4]
--subsampling SUBSAMPLING
Specify the number of reads to be considered from the input metagenomes [default None]
--subsampling_output SUBSAMPLING_OUTPUT
The output file for the subsampled reads. If --subsampling_paired is specified two files are created with suffixes R1 and R2. If not specified the subsampled reads will not be saved.
--subsampling_paired SUBSAMPLING_PAIRED
Specify the number of paired reads to be considered from the input metagenomes [default None]
-1 FORWARD_READS Specify the fastq file with forward reads of the input metagenomes. Reads are assumed to be in the same order in the forward and reverse files! [default None]
-2 REVERSE_READS Specify the fastq file with reverse reads of the input metagenomes. Reads are assumed to be in the same order in the forward and reverse files! [default None]
--mapping_subsampling
If used, the subsampling will be done on the mapping results instead of on the reads.
--subsampling_seed SUBSAMPLING_SEED
Random seed to use in the selection of the subsampled reads. Choose 'random'
for a random behaviour
--install Only checks if the MetaPhlAn DB is installed and installs it if not. All other parameters are ignored.
--offline If used, MetaPhlAn will not check for new database updates.
--force_download Force the re-download of the latest MetaPhlAn database.
--read_min_len READ_MIN_LEN
Specify the minimum length of the reads to be considered when parsing the input file with the 'read_fastx.py' script, default value is 70
-v, --version Print the current MetaPhlAn version and exit
-h, --help show this help message and exit
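
The effect of a minimum read length combined with seeded subsampling can be illustrated as follows. This is a generic sketch and not MetaPhlAn's code: the function names are hypothetical, the reads are synthetic strings, the order of filtering and subsampling is an assumption, and so is the inclusive length comparison.

```python
import random


def filter_by_length(reads, min_len=70):
    """Keep reads whose length is at least min_len (inclusive is assumed)."""
    return [read for read in reads if len(read) >= min_len]


def subsample(reads, n, seed):
    """Pick n reads without replacement; a fixed seed makes it repeatable."""
    if seed == "random":
        rng = random.Random()           # unseeded: differs between runs
    else:
        rng = random.Random(int(seed))  # seeded: same selection every run
    return rng.sample(reads, n)


# Synthetic reads of lengths 60, 70, 80, 90 and 100
reads = ["A" * 60, "C" * 70, "G" * 80, "T" * 90, "A" * 100]

kept = filter_by_length(reads, min_len=70)
picked_a = subsample(kept, 2, seed=42)
picked_b = subsample(kept, 2, seed=42)

print(len(kept), [len(read) for read in picked_a])
```

Running the selection twice with the same seed yields the same reads, which is what makes a subsampled profile reproducible.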