Parsing XML data files with Python SAX, and removing special tags such as <sub> and <sup> from the parsed text

When parsing XML data at scale, inline tags in the text such as <sub>, <b>, <i>, and <sup> can be handled by preprocessing the files first. This article uses Python's SAX approach, which is well suited to batch processing and therefore effective on large volumes of data.

Before parsing, the anomalous markup in the data (text-formatting tags such as <sub>, <b>, <i>, and <sup>) needs to be preprocessed. For example:

<Abstract>
   <AbstractText><b>Background:</b> Lung adenocarcinoma has a strong tendency to develop 
into bone metastases, especially spinal metastases (SM). Long noncoding RNAs (lncRNAs) play 
critical roles in regulating several biological processes in cancer cells. However, the 
mechanisms underlying the roles of lncRNAs in the development of SM have not been 
elucidated to date. <b>Methods:</b> Clinical specimens were collected for analysis of differentially expressed lncRNAs. The Kyoto Encyclopedia of Genes and Genomes (KEGG) was 
used to examine the effects of these genes on pathways. RNA pull-down was utilized to 
identify the targeting protein of lncRNAs. The effects of lncRNA on its target were 
detected in A549 and SPCA-1 cells via perturbation of the lncRNA expression. Oncological 
behavioral changes in transfected cells and phosphorylation of kinases in the relevant 
pathways, with or without inhibitors, were observed. Further, tumorigenicity was found to 
occur in experimental nude mice. <b>Results:</b> LINC00852<sup>s2</sup> and the mitogen-activated 
protein kinase (MAPK) pathway were found to be associated with SM. Moreover, the LINC00852 
target S100A9 had a positive regulatory role in the progression, migration, invasion, and 
metastasis of lung adenocarcinoma cells, both <i>in vitro</i> and <i>in vivo</i>. 
Furthermore, S100A9 strongly activated the P38 and REK<sub>1/2</sub> kinases, and slightly activated 
the phosphorylation of the JNK kinase in the MAPK pathway in A549 and SPCA-1 cells. 
<b>Conclusion:</b> LINC00852 targets S100A9 to promote progression and oncogenic ability in 
lung adenocarcinoma SM through activation of the MAPK pathway. These findings suggest a 
potential novel target for early intervention against SM in lung cancer.
    </AbstractText>
</Abstract>
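
Removing this kind of markup can be as small as one regular expression. The sketch below is an assumption about how the tags might be stripped, not the author's code: `strip_inline_tags` and `INLINE_TAGS` are hypothetical names, and the pattern only covers attribute-free tags like the ones in the sample above.

```python
import re

# Assumption: these are the only inline formatting tags that need removing.
INLINE_TAGS = ("sub", "sup", "b", "i")

# Matches an opening or closing tag such as <sub>, </sub>, <b>, or </i>.
# Longer names like <body> are not matched, because ">" must follow the name.
_INLINE_TAG_RE = re.compile(r"</?(?:%s)\s*>" % "|".join(INLINE_TAGS))


def strip_inline_tags(text):
    """Return text with the inline tags removed and their contents kept."""
    return _INLINE_TAG_RE.sub("", text)


sample = "S100A9 strongly activated the P38 and REK<sub>1/2</sub> kinases, <i>in vitro</i>."
print(strip_inline_tags(sample))
# S100A9 strongly activated the P38 and REK1/2 kinases, in vitro.
```

Only the tags are dropped; the text they wrap stays in place, and structural elements such as `<AbstractText>` are left untouched.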

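The fix is to preprocess the files first and only then parse them. Because I have a large amount of data to process, I collect the file paths in lists and handle them in batch: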


import os
from variation_preprocess.pubmed_test import xml_parser

source_dir = 'G:\\Pubmed_file\\'

List_Fname = []
List_Sname = []
List_csvname = []

def listdir(path, list_Fname, list_Sname, list_csv):
    """Collect the source, edited-XML, and CSV paths for every .xml file in path."""
    for file in os.listdir(path):
        stem, ext = os.path.splitext(file)
        if ext == '.xml':
            # Source file, read from the directory passed in rather than a hard-coded one.
            list_Fname.append(os.path.join(path, file))
            # Preprocessed copy of the source file.
            list_Sname.append('G:/PubMed/' + stem + '_edited.xml')
            # CSV output path for this file.
            list_csv.append('E:/PubMed/' + stem + '.csv')
    return list_Fname, list_Sname, list_csv


# Remove the sub and sup tags
def x
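
A minimal sketch of the two stages, tag removal followed by SAX parsing, under stated assumptions: `clean_file`, `AbstractHandler`, and `parse_abstracts` are hypothetical names rather than the author's functions, the files are assumed to be UTF-8, and only `AbstractText` elements are collected.

```python
import re
import xml.sax

# Assumption: only attribute-free <sub>, <sup>, <b>, and <i> tags occur.
_INLINE_TAG_RE = re.compile(r"</?(?:sub|sup|b|i)\s*>")


def clean_file(src_path, dst_path, encoding="utf-8"):
    """Copy src_path to dst_path line by line, dropping inline formatting tags."""
    with open(src_path, encoding=encoding) as src, \
            open(dst_path, "w", encoding=encoding) as dst:
        for line in src:
            dst.write(_INLINE_TAG_RE.sub("", line))


class AbstractHandler(xml.sax.ContentHandler):
    """Collect the text of every AbstractText element."""

    def __init__(self):
        super().__init__()
        self.abstracts = []
        self._buffer = None  # None means "not inside an AbstractText element"

    def startElement(self, name, attrs):
        if name == "AbstractText":
            self._buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "AbstractText" and self._buffer is not None:
            # Join the chunks and collapse line breaks and repeated spaces.
            self.abstracts.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


def parse_abstracts(xml_path):
    """Parse a cleaned XML file and return its abstract texts as a list."""
    handler = AbstractHandler()
    xml.sax.parse(xml_path, handler)
    return handler.abstracts
```

With the lists built by `listdir`, each source and target pair could be processed as `clean_file(src, dst)` followed by `parse_abstracts(dst)`. Reading line by line keeps memory use flat on large files, at the cost of missing a tag that is split across two lines.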