sed & awk之sed实战

sed基础知识见sed & awk值sed学习笔记

任务

数据文件见snp_result.txt
结果文件见snp_result.tab

此文件为从db_snp下载下来的一些snp位点信息,一共包括211个snp位点,每个位点结构大致如下:

140. rs3732219  [Homo sapiens]
SNV:C>T
Chromosome: 2:233718602 (GRCh38)
2:234627248 (GRCh37)
Gene:UGT1A10
UGT1A8
UGT1A7
UGT1A6
UGT1A5
UGT1A9
UGT1A4
Functional Consequence: intron_variant,upstream_transcript_variant
Validated: by frequency,by alfa,by cluster
Global MAF:
T=0.14357/719(1000Genomes)
T=0.086144/332(ALSPAC)
T=0.125/5(GENOME_DK)
T=0.097387/3056(GnomAD)
T=0.064128/64(GoNL)
T=0.149147/437(KOREAN)
T=0.157751/289(Korea1K)
T=0.12/72(NorthernSweden)
T=0.102273/63(PharmGKB)
T=0.125/27(Qatari)
C=0.443878/87(SGDP_PRJ)
C=0.5/8(Siberian)
T=0.102001/12808(TOPMED)
T=0.083064/308(TWINSUK)
T=0.311321/66(Vietnamese)
T=0.087342/207(ALFA)
HGVS: NC_000002.12:g.233718602C>T, NC_000002.11:g.234627248C>T, NG_002601.2:g.133859C>T

其中第一行为序号+rs编号+种属,后面则是snp的注释信息,包括SNV(或者是DEL, DELINS),Chromosome, Gene, Functional Consequence, Clinical significance, Validated, Global MAF, HGVS等信息。需要注意的是,有的信息如Gene可能占多行,有的位点的部分信息可能会缺如。

本实战的目的则是将该文件通过sed处理为一个数据框格式的文件,每一行为一个snp位点,每一列对应其不同的注释信息,列间用\tab分隔,同一种信息如Gene有多条时用逗号分隔。

sed脚本如下,存为script.sed:

#rcognize rs, start a cycle
/^[0-9]*\. rs\([0-9]*\)  \[Homo sapiens]/{s//\1/;N;}


# HGVS is the last item of a document
/HGVS/!{

# pipeline of a document, multiple lines in pattern space
/\n/{
/SNV:\(.\)>\(.*\)$/{s//\1\t\2/;N;}
/DEL:\(.*\)>\(.*\)$/{s//\1\t\2/;N;}
/DELINS:\(.*\)>\(.*\)$/{s//\1\t\2/;N;}
/Chromosome: \([0-9]*\):.*$/{s//\1SEP_MARK/;h;N;g;N;}

# use command d to jump out the pipeline to process multiple gene
# b gene won't work, can be removed
/SEP_MARK\nGene:\(.*\)$/{s//\n\1SEP_MARK/;h;d;b gene}
/SEP_MARK\nGene:\(.*\)$/!{s/SEP_MARK/\n&/;h;b func}

:func
/SEP_MARK\nFunctional Consequence: \(.*\)$/{s//\n\1SEP_MARK/;h;N;b clin}
/SEP_MARK\nFunctional Consequence: \(.*\)$/!{s/SEP_MARK/\n&/;h;b clin}

:clin
/SEP_MARK\nClinical significance: \(.*\)$/{s//\n\1SEP_MARK/;h;N;b val}
/SEP_MARK\nClinical significance: \(.*\)$/!{s/SEP_MARK/\n&/;h;b val}

:val
/SEP_MARK\nValidated: \(.*\)$/{s//\n\1SEP_MARK/;h;N;}

# use command d to jump out the pipeline to process multiple maf
# b maf won't work, can be removed
/SEP_MARK\nGlobal MAF:/{N;s///;s/$/SEP_MARK/;h;d;b maf}
# if no MAF found, process HGVS
/SEP_MARK\nGlobal MAF:$/!{/SEP_MARK\nHGVS/{s//\n\n/;s/\n/\t/g;p;d;}}
}

/\=/!{
# jump back into the pipeline
/^Functional Consequence: \(.*\)$/{H;g;b func}
# process multiple gene
/^Functional Consequence: \(.*\)$/!{
:gene
H;g
s/SEP_MARK\n/,/
s/$/SEP_MARK/
h;d
}
}

# process multiple maf, no need to jump back into the pipeline, just process HGVS
/\=/{
:maf
H;g
s/SEP_MARK\n/,/
s/$/SEP_MARK/
h;d
}
}

# process HGVS, end the cycle
/^HGVS/{H;x;s/SEP_MARK\nHGVS: /\n/;s/\n/\t/g;p;d;}

通过sed运行:

sed -f script.sed snp_result.txt > snp_result.tab
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值