Installing HOMER in Cygwin, with a Complete Usage Guide

Author: 太困了

Last modified: 2023-04-14 10:22:05

First published: 2023-04-14 10:08:33

Article link: https://blog.csdn.net/weixin_45431644/article/details/130146981

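HOMER is a tool for motif discovery and next-generation sequencing data analysis. This post walks through installing HOMER in a Cygwin environment, including the required packages and dependencies, the problems encountered and how they were solved. It also outlines HOMER's main functions: gene/promoter-based motif analysis, genome-based analysis, and scanning for motifs across a whole genome.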

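HOMER basics

HOMER (Hypergeometric Optimization of Motif EnRichment) is a handy tool for motif identification and next-generation sequencing analysis.

*The least convenient part is the installation... It took me two full days to get it deployed, and there are plenty of small pitfalls.

I. Installing HOMER in Cygwin

Plenty of installation tutorials already exist online. This section only records the problems I ran into when installing under Cygwin and how I solved them.

Installing Cygwin

Download and install Cygwin, then add it to the system PATH environment variable: PATH=$PATH:/....(installation path)

Note: choosing the Alibaba Cloud open-source mirror for Cygwin packages during setup makes the package downloads faster. For HOMER to work properly, select the following packages during installation: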
- gcc
- g++
- make
- perl
- zip/unzip
- gzip/gunzip
- wget
  
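  If a package was missed the first time, no problem: rerun the Cygwin setup program and add it. This is not a reinstall!
Install DESeq2 and edgeR in R (most people who have used R for bioinformatics already have them).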
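Before moving on, it can help to confirm that the tools from the package list are actually visible in the Cygwin shell. The following is a minimal sketch, not part of HOMER: `check_tools` is a hypothetical helper name, and it only tests whether each command can be found on PATH.

```shell
# Print every command from the argument list that is not found on PATH.
# Returns non-zero if at least one command is missing.
# check_tools is a hypothetical helper written for this post.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Usage (tool names taken from the package list above):
# check_tools gcc g++ make perl zip unzip gzip gunzip wget
```

Anything reported as missing can be added by rerunning the Cygwin setup program.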
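!!!Problem 1!!!

Later, when using HOMER, I found that no motif logos were being drawn. The log showed that seqlogo was missing. Installing it in Python and in R did not help; it is actually bundled with WebLogo. (In the official HOMER installation instructions this part is greyed out as no longer used, but in practice it is still needed.) So a few more dependencies have to be installed. If the Cygwin setup program does not offer a package, use apt-cyg install yourpackage. As above, all of them must be added to ~/.bash_profile or ~/.bashrc.
- GMT (ships with Ghostscript; if it reports that Ghostscript is not installed, download gs separately)
- ncurses-libs, ncurses-devel, xz-devel, zlib-devel, HTSlib (needed by samtools)
- samtools (from the SAM tools file listing on SourceForge.net)
- WebLogo (version 2.8.2, not 3). After downloading, just unpack it; no separate installation is needed.
- libpng12, apache2, uuid-dev (R), mysqldb (Python), httpd, hdf5, libuuid-devel (R), MySQL (Python). The package names differ between Debian/Ubuntu and CentOS; these are the dependencies of blat. I did not need them, but install them all if you want to be safe.
- blat: I did not manage to install it. It is not needed unless you run certain specialized ChIP-seq analyses.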
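Downloading and installing HOMER
- From the HOMER website, download configureHomer.pl: http://homer.ucsd.edu/homer/configureHomer.pl
- perl /Users/chucknorris/homer/configureHomer.pl -install kept failing because of network problems. Later, for no obvious reason, perl /configureHomer.pl -install homer downloaded and installed it successfully.
Installing HOMER data packages

I work with human data, so that is the example here.
- Human genome: perl /configureHomer.pl -install hg38 # large, 1.4G
- Human promoters: perl configureHomer.pl -install human
If other packages are needed, run perl /configureHomer.pl -list to see what is available to install.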
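Once the install finishes, the HOMER programs need to be reachable from the shell, in the same way the other tools were added to ~/.bash_profile or ~/.bashrc. Below is a small sketch of one way to do that without adding the same line twice. The function name `add_homer_to_path` and the example directory are assumptions for illustration; use the directory where configureHomer.pl actually placed its programs.

```shell
# Append a PATH export for a HOMER program directory to a shell profile,
# but only if that exact line is not already present.
# add_homer_to_path is a hypothetical helper; both arguments are placeholders.
add_homer_to_path() {
    local homer_bin="$1"   # e.g. "$HOME/homer/bin" (assumed location)
    local profile="$2"     # e.g. "$HOME/.bashrc"
    local line="export PATH=\"\$PATH:$homer_bin\""
    touch "$profile"
    if ! grep -qxF -- "$line" "$profile"; then
        printf '%s\n' "$line" >> "$profile"
    fi
}

# Usage:
# add_homer_to_path "$HOME/homer/bin" "$HOME/.bashrc"
```

Open a new terminal (or source the profile) afterwards so the change takes effect.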

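II. HOMER functions and usage

1. Gene/promoter-based motif analysis

Using findMotifs.pl

① Purpose: find motifs enriched in the promoters of target genes, e.g. genes upregulated by a treatment, genes specific to a cell type, and so on.

② Input: a list of the genes you are interested in. Accepted identifier types include: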
- NCBI Entrez Gene IDs
- NCBI Unigene IDs
- NCBI Refseq IDs (mRNA, protein)
- Ensembl Gene IDs
- Gene Symbols (i.e. Official Gene names, like "Nfkb1" )
- popular affymetrix probe IDs (MOE430, U133plus, U95, U75A)
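Gene lists exported from spreadsheets often carry Windows line endings, stray spaces or repeated identifiers. The sketch below tidies such a file before it is handed to findMotifs.pl. It assumes one identifier per line, and `clean_gene_list` is a hypothetical helper name, not a HOMER program.

```shell
# Normalize a gene list: strip carriage returns and surrounding whitespace,
# drop empty lines, and remove duplicate identifiers (first occurrence wins).
# Assumes one identifier per line; clean_gene_list is a hypothetical helper.
clean_gene_list() {
    tr -d '\r' < "$1" \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        | awk 'NF && !seen[$0]++'
}

# Usage:
# clean_gene_list raw_genes.txt > geneslist.txt
```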
③ Command

findMotifs.pl <inputfile.txt> <promoter set> <output directory> [options]

Items in <> are required. <promoter set> is one of the following:
- human (Homo sapiens)
- mouse (Mus musculus)
- rat (Rattus norvegicus)
- fly (Drosophila melanogaster)
- worm (Caenorhabditis elegans)
- zebrafish (Danio rerio)
- yeast (Saccharomyces cerevisiae)
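※ An example:
```
 # motifs enriched in the target sequences on human mRNA
 findMotifs.pl geneslist.txt human-mRNA output/ -rna -len 8
```
④ Other optional parameters

See the "Important motif finding parameters" section of the official documentation.

They cover: using masked sequences, custom promoter regions, removing redundant promoters, motif length, number of motifs, searching only the + strand, finding enriched oligos, converting to human IDs for GO analysis, normalizing CpG%, removing the bias from lower-order oligos, custom background genes, mismatches allowed in global optimization, choosing binomial or hypergeometric scoring, and number of CPU cores.
Returning the position of every motif found

① This feature is not on by default in findMotifs.pl.

② Input: a gene list, in the same formats as above.

③ Usage: add the -find <motif file> option. This file (.motif) is produced by the initial analysis.

※ An example:
```
 # report where the motifs in motifs.motif occur in the promoters of the listed genes
 findMotifs.pl geneslist.txt human output/  -find motifs.motif >  output.txt
```
④ Results/output columns: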
1. Peak/Region ID
2. Offset from the TSS
3. Sequence of the site
4. Name of the Motif
5. Strand
6. Motif Score (log odds score of the motif matrix, higher scores are better matches)
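A quick way to get a feel for a -find result is to count how many sites were reported per motif. The sketch below does that with awk. It assumes the table is tab-separated with the columns in the order listed above and a single header line; `count_sites_per_motif` is a hypothetical helper name.

```shell
# Count reported sites per motif name (column 4) in a -find output table.
# Assumes tab-separated columns in the order listed above and one header line.
# count_sites_per_motif is a hypothetical helper, not a HOMER program.
count_sites_per_motif() {
    awk -F '\t' 'NR > 1 { n[$4]++ } END { for (m in n) print m "\t" n[m] }' "$1" \
        | LC_ALL=C sort
}

# Usage:
# count_sites_per_motif output.txt
```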
Using annotatePeaks.pl

① Purpose: annotate peaks and their nearby genes, run GO analysis, quantify ChIP-Seq tag density, and produce data for plots.

② Input: a HOMER peak file or a BED file. See this description:
HOMER peak files should have at minimum 5 columns (separated by TABs, additional columns will be ignored):
- Column1: Unique Peak ID
- Column2: chromosome
- Column3: starting position
- Column4: ending position
- Column5: Strand (+/- or 0/1, where 0="+", 1="-")
BED files should have at minimum 6 columns (separated by TABs, additional columns will be ignored)
- Column1: chromosome
- Column2: starting position
- Column3: ending position
- Column4: Unique Peak ID
- Column5: not used
- Column6: Strand (+/- or 0/1, where 0="+", 1="-")
In theory, HOMER will accept BED files with only 4 columns (+/- in the 4th column), and files without unique IDs, but this is NOT recommended. For one, if you don't have unique IDs for your regions, it's hard to go back and figure out which region contains which peak.
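To make the difference between the two layouts concrete, here is a small awk sketch that reorders a 6-column BED file into the 5-column HOMER peak layout. HOMER reads BED files directly, so this is only an illustration of the column mapping. `bed_to_homer_peaks` is a hypothetical helper name, and the sketch copies coordinates unchanged without handling any coordinate-convention differences.

```shell
# Reorder a 6-column BED file (chr, start, end, ID, unused, strand)
# into the 5-column HOMER peak layout (ID, chr, start, end, strand).
# Coordinates are copied unchanged; track, browser and comment lines are skipped.
# bed_to_homer_peaks is a hypothetical helper, not a HOMER program.
bed_to_homer_peaks() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ || /^track/ || /^browser/ { next }
         NF >= 6 { print $4, $1, $2, $3, $6 }' "$1"
}

# Usage:
# bed_to_homer_peaks regions.bed > regions.peaks.txt
```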

※ On a Mac, HOMER's changeNewLine.pl <filename> can be used to convert the file's line endings.
③ Usage:

annotatePeaks.pl <peak/BED file> <genome> [options] > <output file>

annotatePeaks.pl tss <promoter set> -m <motif file> <output directory>

※ Examples:
```
 #i. Basic genomic annotation
 annotatePeaks.pl peaks.txt hg38 > output.txt
 #ii. Analyze transcription start sites in TSS mode instead of peak mode
 annotatePeaks.pl tss hg38 -size -300,50 -m motifs.motif > output.txt
 #iii. Look for motifs near peaks
 annotatePeaks.pl peaks.txt mm8 -size 200 -m ms1.motif [cebp.motif] > output.txt
 # for iii, adding -mbed motif.bed also writes a BED file
 
 #iv. For the peaks in file 1, find the nearest peaks in file 2 ??? I am not sure about a concrete usage example ???
 ... -p <peak file 1> [peak file 2] [-pdist] [-pcount] [-size]
```
④ Output

The annotation reports whether a peak falls in a TSS (default -1kb to +100bp), TTS (default -100 bp to +1kb), CDS Exons, 5' UTR Exons, 3' UTR Exons, Introns, Intergenic, *CpG Islands or *Repeats (* only in the Detailed Annotation column). The output columns are:
1. Peak ID
2. Chromosome
3. Peak start position
4. Peak end position
5. Strand
6. Peak Score
7. FDR/Peak Focus Ratio/Region Size
8. Annotation (i.e. Exon, Intron, ...)
9. Detailed Annotation (Exon, Intron etc. + CpG Islands, repeats, etc.)
10. Distance to nearest RefSeq TSS
11. Nearest TSS: Native ID of annotation file
12. Nearest TSS: Entrez Gene ID
13. Nearest TSS: Unigene ID
14. Nearest TSS: RefSeq ID
15. Nearest TSS: Ensembl ID
16. Nearest TSS: Gene Symbol
17. Nearest TSS: Gene Aliases
18. Nearest TSS: Gene description
19. Additional columns depend on options selected when running the program.
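A common first look at annotatePeaks.pl output is a tally of peaks per annotation class. The sketch below counts the values in column 8 after dropping any parenthesized detail. It assumes a tab-separated table with one header line and the column order listed above; `summarize_annotations` is a hypothetical helper name.

```shell
# Tally peaks per annotation class using column 8 (Annotation).
# Anything from the first " (" onward is dropped so that entries
# such as "intron (...)" are grouped under "intron".
# Assumes tab-separated input with one header line;
# summarize_annotations is a hypothetical helper, not a HOMER program.
summarize_annotations() {
    awk -F '\t' 'NR > 1 && $8 != "" {
                     label = $8
                     sub(/ *\(.*/, "", label)   # drop the parenthesized detail
                     n[label]++
                 }
                 END { for (k in n) print k "\t" n[k] }' "$1" | LC_ALL=C sort
}

# Usage:
# summarize_annotations output.txt
```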
⑤ Optional parameters
- annotatePeaks.pl <peak file> <genome> -gene <gene data file> > output.txt
  
  Adds gene-specific information to each peak, based on the annotated TSS nearest to the peak.
- For diagram plots, see the annotatePeaks.pl visualization documentation.
  
  ※ Some examples. Open the output in Excel and plot it (except for heatmaps).
```
 ###scatter
 # distribution of me1 and me3 near peaks in mouse embryonic stem cells *log --> -log2
 annotatePeaks.pl peaks.txt mm8 -size 1000 -d H3K4me1-ChIP-Seq/ H3K4me3-ChIP-Seq/ > output.txt
 
 ###hist
 # distribution of the YY1 motif relative to the TSS
 annotatePeaks.pl tss mm9 -size -500,250 -hist 10 -m yy1.motif > output.txt
 
 ###positions ??? what kind of files are the tags in the directory ??? 5' tags
 #motifs near peaks
 annotatePeaks.pl peaks.txt hg18 -size 6000 -hist 25 -m <are.motif> [fox.motif ap1.motif]
 #genes/tags near peaks
 annotatePeaks.pl <peak file> <genome> -size <#> -hist <#> -d <tag directory 1> [tag directory2] ... -m <motif 1> <motif 2> ... >  <output matrix file>
 annotatePeaks.pl peaks.txt hg18 -size 6000 -hist 25 -d me1/ me2/ Mme3/ > output.txt
 
 ###heatmap --> matrix --> visualized with other software
 annotatePeaks.pl <peak file> <genome> -size <..> -hist <..> -ghist -d <tag directory 1> [tag directory2] ... > <output matrix file>
```
[Figure: example plot of the YY1 motif distribution relative to the TSS]

2. Genome-based analysis

Finding enriched motifs in genomic regions

① Input: a HOMER peak file or a BED file; see the format description quoted earlier.

② Usage:

findMotifsGenome.pl <peak/BED file> <genome> <output directory> -size # [options]

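③ Output:
- homerMotifs.motifs<#> : de novo motifs, separated by motif length; each length is a separate run of the algorithm
- homerMotifs.all.motifs : all of the motifs combined
- motifFindingParameters.txt : the command that was run
- knownResults.txt : statistics for the known motifs, can be opened in Excel
- seq.autonorm.tsv : autonormalization of lower-order oligos
- homerResults.html : formatted output of the de novo motif search
- homerResults/ : the contents behind the page above
- knownResults.html : formatted output for the known motifs
- knownResults/ : the contents behind the page above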
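After a run, a quick check that the expected files actually exist can save time, for example when a job was interrupted. The sketch below looks for the fixed-name files from the list above in a given output directory. `check_homer_output` is a hypothetical helper name, and some files may legitimately be absent depending on the options used for the run.

```shell
# Report which of the expected result files are present in an output directory.
# The file names are the fixed-name entries from the list above.
# check_homer_output is a hypothetical helper, not a HOMER program.
check_homer_output() {
    local dir="$1"
    local status=0
    local f
    for f in homerMotifs.all.motifs motifFindingParameters.txt \
             knownResults.txt seq.autonorm.tsv \
             homerResults.html knownResults.html; do
        if [ -e "$dir/$f" ]; then
            echo "found   $f"
        else
            echo "missing $f"
            status=1
        fi
    done
    return "$status"
}

# Usage:
# check_homer_output MotifOutput/
```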
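④ For optional parameters, see the findMotifsGenome.pl documentation:
- background, motif length, size of the region used for motif finding, skipping the de novo motif search, and so on
- An important one: -rna uses RNA data and outputs mRNA motifs. On the lack of other RNA databases, the HOMER author wrote a rather charming line: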
  
  I guess chuck roundhouse kicked all of the splicing and other RNA motifs into hard to find databases.
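Finding the positions of specific motifs

① This feature is not on by default in findMotifsGenome.pl either.

② Input: once you have picked the motifs you are interested in from a HOMER run, pass them with -find <motif file>

③ Usage:

i. findMotifsGenome.pl <peaks.txt/bed> <genome> <output directory> -find <motif file> > output.txt

ii. annotatePeaks.pl <peaks> <genome> -m motifs.motif > output.txt

④ Output

For i: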
1. Peak/Region ID
2. Offset from the center of the region
3. Sequence of the site
4. Name of the Motif
5. Strand
6. Motif Score (log odds score of the motif matrix, higher scores are better matches)
For ii:
1. Peak/Region ID
2. Chromosome
3. Start
4. End
5. Strand of Peaks
6-18. Annotation information
19. CpG%
20. GC%
21. Motif Instances ...
⑤ Visualization
```
 # motif density: -hist <..>
 
 annotatePeaks.pl peaks.txt hg38 -m m1.motif m2.motif -size 1000 -hist 10 > output.txt
```

3. Finding motifs across the whole genome

Using scanMotifGenomeWide.pl

① Input: a motif file

② Usage:

scanMotifGenomeWide.pl <motif file> <genome> [options]

※ An example:
```
 # write BED output; the default is a text table
 scanMotifGenomeWide.pl pu1.motif mm9 -bed > pu1.sites.mm9.bed
```
③ Output:
Tab delimited text file (default):
1. Site ID (motif name + number)
2. chr
3. start
4. end
5. strand
6. log-odds score
7. sequence
BED (tab) format (use -bed):
1. chr
2. start
3. end
4. motif name
5. log-odds score (will be floored to an integer)
6. strand
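Genome-wide scans can return a very large number of sites, so it is often useful to keep only the stronger matches. The sketch below filters the BED output by the score in column 5. `filter_sites_by_score` is a hypothetical helper name, and the threshold is an arbitrary value chosen by the user, not a HOMER recommendation.

```shell
# Keep BED lines whose log-odds score (column 5) is at least the given threshold.
# filter_sites_by_score is a hypothetical helper, not a HOMER program.
filter_sites_by_score() {
    local bed="$1"
    local min_score="$2"
    awk -F '\t' -v min="$min_score" '($5 + 0) >= (min + 0)' "$bed"
}

# Usage (the threshold 9 is only an example value):
# filter_sites_by_score pu1.sites.mm9.bed 9 > pu1.sites.strong.bed
```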
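④ For optional parameters, see the scanMotifGenomeWide.pl documentation.

4. Finding RNA motifs in mRNA

Basic usage: as already mentioned above, just add -rna to findMotifs.pl and findMotifsGenome.pl.

Finding RNA motifs in lists of co-regulated genes

① Usage

※ An example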
```
 findMotifs.pl mdownregulated.genes.txt human-mRNA Output/ -rna -len 8
```
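In fact, once human-mRNA is specified, -rna no longer needs to be added.

HOMER now also tries to match human miRNA seeds (miRBase).

② Optional parameters

In findMotifs.pl you can add:
- -min <#>: minimum mRNA length to consider (removes very short mRNA sequences)
- -max <#>: maximum mRNA length to consider (removes very long RNAs)
Analyzing strand-specific genomic regions for motifs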

findMotifsGenome.pl fox2.clip.bed hg17 MotifOutput -rna