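More code is available at: https://github.com/xubo245/SparkLearning
Setting up the ADAM environment (including Eclipse configuration on Windows)
Environment:
Cluster: Ubuntu 14.04 + Spark 1.5.2 + Scala 2.10
Local: Windows 7 64-bit + Eclipse 4.3.2 + Scala 2.10.4
1. Installing ADAM: see [1]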
$ git clone https://github.com/bigdatagenomics/adam.git
$ cd adam
$ export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=256m"
$ mvn clean package -DskipTests
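For further configuration options, see [1]; they are not covered in detail here.
2. To set up the Spark environment in Eclipse, see [2].
3. Download the following jars from the target directories of ADAM's adam-core, adam-cli, and adam-apis modules, respectively: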
adam-core_2.10-0.18.3-SNAPSHOT.jar
adam-cli_2.10-0.18.3-SNAPSHOT.jar
adam-apis_2.10-0.18.3-SNAPSHOT.jar
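Then add these jars to the build path of a newly created Scala Project.
4. Example environment:
JDK 1.7 + Scala 2.10.4 + the Spark jar + the three jars from step 3
5. Running on the cluster:
(1)
Run adam-shell to enter the shell.
Code: adapted from [1], with the input path "/data/NA21144.chrom11.ILLUMINA.adam" replaced by "hdfs://Master:9000/xubo/adam/output/small.adam". Replace Master with the IP address of your own cluster.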
import org.bdgenomics.adam.rdd.ADAMContext
import org.bdgenomics.adam.projections.{AlignmentRecordField, Projection}
val ac = new ADAMContext(sc)
// Load alignments from disk
val reads = ac.loadAlignments(
  "hdfs://Master:9000/xubo/adam/output/small.adam", // replaces "/data/NA21144.chrom11.ILLUMINA.adam"; change Master to your own cluster's IP
  projection = Some(
    Projection(
      AlignmentRecordField.sequence,
      AlignmentRecordField.readMapped,
      AlignmentRecordField.mapq
    )
  )
)
// Generate, count and sort 21-mers
val kmers = reads.flatMap(_.getSequence.sliding(21).map(k => (k, 1L))).reduceByKey(_ + _).map(_.swap).sortByKey(ascending = false)
// Print the top 10 most common 21-mers
kmers.take(10).foreach(println)
Output:
scala> kmers.take(10).foreach(println)
(4,TCTTTCTTTCTTTCTTTCTTT)
(4,TTTCTTTCTTTCTTTCTTTCT)
(3,CTTTCTTTCTTTCTTTCTTTC)
(3,TTCTTTCTTTCTTTCTTTCTT)
(2,TCTTTTTCTTTCTTTCTTTCT)
(2,TTCTTTTTCTTTCTTTCTTTC)
(2,TTTCTTTTTCTTTCTTTCTTT)
(1,ATTGGATATCCTCCCAAATTT)
(1,AGGCATGAGGCACCGCGCCTG)
(1,CTACTGCCCAACAAGTCCCTA)
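To make the flatMap / reduceByKey / swap / sort chain in step 5 (1) easier to follow, here is a minimal sketch of the same counting logic on plain Scala collections, with no Spark or ADAM involved. The names KmerSketch and countKmers are hypothetical and exist only for this illustration. Two details are assumptions that are not part of the original pipeline: reads shorter than k are skipped (String.sliding would otherwise return the whole short read as a single window), and ties are broken alphabetically by k-mer, whereas Spark's sortByKey makes no promise about the order of equal keys.

```scala
// Plain-Scala sketch of the k-mer counting logic; not the ADAM or Spark API.
object KmerSketch {
  /** Count every k-mer in the reads and return (count, kmer) pairs, highest count first. */
  def countKmers(reads: Seq[String], k: Int): Seq[(Long, String)] =
    reads
      // Slide a window of width k over each read; drop windows shorter than k
      // (assumption: the original shell code does not do this filtering).
      .flatMap(_.sliding(k).filter(_.length == k))
      // Local stand-in for map(k => (k, 1L)).reduceByKey(_ + _)
      .groupBy(identity)
      .map { case (kmer, occurrences) => (occurrences.size.toLong, kmer) }
      .toSeq
      // Local stand-in for map(_.swap).sortByKey(ascending = false),
      // with an alphabetical tie-break added for deterministic output.
      .sortBy { case (count, kmer) => (-count, kmer) }
}
```

For example, KmerSketch.countKmers(Seq("AAAA", "AAAT"), 3) yields (3,AAA) followed by (1,AAT): the first read contributes AAA twice, and the second contributes AAA and AAT once each.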
How to obtain hdfs://Master:9000/xubo/adam/output/small.adam:
Start from small.sam in the adam-core/src/test/resources folder under the ADAM installation directory. In my case this is /home/hadoop/cloud/adam/adam-core/src/test/resources/small.sam
Upload it to the cluster with:
hadoop fs -put small.sam /xubo/adam/dataAdam/
Then use the transform command of adam-submit to convert it into an ADAM file:
adam-submit transform /xubo/adam/dataAdam/small.sam /xubo/adam/output/small.adam
The resulting small.adam is the small.adam file used in the code.
(2) Verification (incomplete)
Use the count_kmers command of adam-submit to count the 21-mers in small.adam:
adam-submit count_kmers /xubo/adam/output/small.adam /xubo/adam/output/kmerSmallK21.adam 21
In adam-shell:
val kmer21=sc.textFile("/xubo/adam/output/kmerSmallK21.adam")
kmer21.foreach(println)
kmer21.count
Output (long, so most of it is omitted):
......
(GCCTTGCAGGTTGAGTAGGAT,1)
(CATTATAAATATATTTAACAG,1)
(TTTTGAGCATGAAAGTAATAT,1)
(AAGTCAAAAAGAAAAAAAAGG,1)
(ACGGGGTTTCACCATGTTGGC,1)
(TCACAATGCCAACAGCTAAAT,1)
(CAACAGCTAAATGTACCCAAG,1)
(GCCTTGCAAGAATCTCTACTG,1)
(TCTCACTATGTTGCCTAGGCT,1)
(ATAAATGTTGATTGTCCTATT,1)
(ATTCCCAGGTCTTAGGTGCTG,1)
(CAGCCTTATTCCTATTTATAA,1)
(ACAAGATAGTACTTGAGCTAA,1)
(ACTCTCATTGACTGTTCAATG,1)
(TGTAAATTCAAATTGGATATC,1)
(AAAGTTTGGCTTTCAGTTGTA,1)
(ATAAGAGCAGCCTTATTCCTA,1)
(CAAACTCCTGGGCTCAAGTGA,1)
(CAGTGGGAGGTGGTGGCCATG,1)
(TAAGGTTTTTTTTGTTTGTTT,1)
(CATGAGGCACCGCGCCTGGCC,1)
(TCAAACATCACACTCCACGTT,1)
(CCGCCTCGGCCTCCCAAAGTG,1)
scala> kmer21.count
res12: Long = 1087
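The verification above stops at printing the text output, so the two results are never compared directly. One possible way to finish the check is to parse each "(KMER,count)" line back into a pair and sort by count, which gives a list in the same (count, k-mer) form as the result in (1). The sketch below does this on plain strings; KmerOutputCheck, parseLine, and topKmers are hypothetical names, and it assumes every line has exactly the shape shown in the output above.

```scala
// Plain-Scala sketch for parsing count_kmers text output; names are hypothetical.
object KmerOutputCheck {
  // Matches a whole line such as "(GCCTTGCAGGTTGAGTAGGAT,1)"
  private val Line = """\((\w+),(\d+)\)""".r

  /** Parse one "(KMER,count)" line; returns None for a line in any other shape. */
  def parseLine(line: String): Option[(String, Long)] = line.trim match {
    case Line(kmer, count) => Some((kmer, count.toLong))
    case _                 => None
  }

  /** Return the n most frequent k-mers as (count, kmer), ties broken alphabetically. */
  def topKmers(lines: Seq[String], n: Int): Seq[(Long, String)] =
    lines
      .flatMap(line => parseLine(line))
      .map(_.swap)
      .sortBy { case (count, kmer) => (-count, kmer) }
      .take(n)
}
```

In adam-shell, after pasting the object, something like KmerOutputCheck.topKmers(kmer21.collect().toSeq, 10) could then be compared by eye with the top-10 list from (1). Collecting to the driver is only reasonable here because the file holds just 1087 lines.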
6. Running locally:
Code:
package testAdam
import org.apache.spark._
import org.bdgenomics.adam.rdd.ADAMContext
import org.bdgenomics.adam.projections.{AlignmentRecordField, Projection}
object kmer {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("test Adam kmer").setMaster("local")
    val sc = new SparkContext(conf)
    val ac = new ADAMContext(sc)
    // Load alignments from disk
    // val reads = ac.loadAlignments("/data/NA21144.chrom11.ILLUMINA.adam",
    // val reads = ac.loadAlignments("/xubo/adam/output/small.adam",
    val reads = ac.loadAlignments(
      "hdfs://Master:9000/xubo/adam/output/small.adam",
      projection = Some(
        Projection(
          AlignmentRecordField.sequence,
          AlignmentRecordField.readMapped,
          AlignmentRecordField.mapq
        )
      )
    )
    // Generate, count and sort 21-mers
    val kmers = reads.flatMap(_.getSequence.sliding(21).map(k => (k, 1L))).reduceByKey(_ + _).map(_.swap).sortByKey(ascending = false)
    // Print the top 10 most common 21-mers
    kmers.take(10).foreach(println)
  }
}
Output:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/G:/149/jar%e9%87%8d%e8%a6%81/spark-assembly-1.5.2-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/D:/1win7/java/otherJar/adam-cli_2.10-0.18.3-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2016-03-05 20:38:22 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-03-05 20:38:24 WARN MetricsSystem:71 - Using default name DAGScheduler for source because spark.app.id is not set.
2016-03-05 20:38:26 WARN :139 - Your hostname, xubo-PC resolves to a loopback/non-reachable address: fe80:0:0:0:200:5efe:ca26:54fd%30, but we couldn't find any external IP address!
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
(4,TCTTTCTTTCTTTCTTTCTTT)
(4,TTTCTTTCTTTCTTTCTTTCT)
(3,CTTTCTTTCTTTCTTTCTTTC)
(3,TTCTTTCTTTCTTTCTTTCTT)
(2,TCTTTTTCTTTCTTTCTTTCT)
(2,TTCTTTTTCTTTCTTTCTTTC)
(2,TTTCTTTTTCTTTCTTTCTTT)
(1,ATTGGATATCCTCCCAAATTT)
(1,AGGCATGAGGCACCGCGCCTG)
(1,CTACTGCCCAACAAGTCCCTA)
2016-3-5 20:38:46 INFO: org.apache.parquet.hadoop.ParquetInputFormat: Total input paths to process : 1
2016-3-5 20:38:49 WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
2016-3-5 20:38:50 INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 20 records.
2016-3-5 20:38:50 INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
2016-3-5 20:38:50 INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 233 ms. row count = 20
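The local program in step 6 hard-codes the Spark master ("local"), the HDFS input path, and k = 21, so switching to another cluster means editing the source. A small sketch of one way around this is to read those three values from the command-line arguments and fall back to the values used in this post. KmerJobConfig, Config, and fromArgs are hypothetical names introduced for this illustration and are not part of ADAM or Spark; the argument order (master, input, k) is likewise an assumption.

```scala
// Plain-Scala sketch of argument handling for the local k-mer job; names are hypothetical.
object KmerJobConfig {
  final case class Config(master: String, input: String, k: Int)

  // Defaults taken from the values hard-coded in the program above.
  val Default: Config =
    Config("local", "hdfs://Master:9000/xubo/adam/output/small.adam", 21)

  /** Build a Config from positional args (master, input, k); missing args use the defaults. */
  def fromArgs(args: Array[String]): Config =
    Config(
      master = args.lift(0).getOrElse(Default.master),
      input  = args.lift(1).getOrElse(Default.input),
      k      = args.lift(2).map(_.toInt).getOrElse(Default.k)
    )
}
```

Inside main, the resulting values could then be passed to setMaster, loadAlignments, and sliding in place of the literals. A non-numeric third argument makes toInt throw, which this sketch does not handle.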
References:
[1] https://github.com/bigdatagenomics/adam
[2] http://blog.csdn.net/xubo245/article/details/50789983
[3] /adam/docs/source/01_intro.md
[4] https://github.com/ga4gh/gastore