The screenshot below shows an article page on the Nature website. From it we can clearly extract: title, authors, author affiliations, contribution statements, corresponding author,
issue number, volume number, submission date, acceptance date, publication date,
abstract, and subject terms (not captured in the screenshot below; you can see them on the web page itself).
1、Fetching the HTML source and saving it to nature.txt
import urllib2

def fetch(url):
    http_request = urllib2.Request(url=url)
    http_response = urllib2.urlopen(http_request, data=None, timeout=600)
    # http_response returns a file-like object, so it can be read()
    status_code = http_response.getcode()
    if status_code != 200:
        print "Error:", status_code
    else:
        print "Successful!"
        print "----- start downloading data -----"
        html_doc = http_response.read()
        fn = open('nature.txt', 'w')
        fn.write(html_doc)
        print "----- finish downloading data -----"
        fn.close()
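The function above is written for Python 2 (urllib2, print statements). For readers on Python 3, where urllib2 was split into urllib.request, a rough equivalent sketch looks like this (the URL and output path are placeholders, not part of the original code):

```python
import urllib.request

def fetch3(url, out_path="nature.txt", timeout=600):
    """Download url and save the response body to out_path.

    Returns the decoded HTML on success, None otherwise.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        if resp.getcode() != 200:
            print("Error:", resp.getcode())
            return None
        html_doc = resp.read()
    # Write raw bytes; decode separately so encoding issues don't corrupt the file.
    with open(out_path, "wb") as fn:
        fn.write(html_doc)
    return html_doc.decode("utf-8", errors="replace")
```

Note that in Python 3, urlopen raises an HTTPError for most non-200 responses, so the explicit status check mostly mirrors the structure of the original rather than adding protection.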
2、Parsing the HTML file and extracting the information
import codecs
from string import punctuation
from lxml import etree

def parse(file_path):
    # html source: one article from the journal Nature
    # fields extracted: title, authors, journal name, volume,
    # page (xxx - xxx), dates, doi, subject terms
    article_info = {}  # to store these features
    f = codecs.open(file_path, 'r', 'utf-8')
    content = f.read()
    f.close()
    tree = etree.HTML(content)
    title = tree.xpath(u"//h1[@class='article-heading']")
    article_info['title'] = title[0].text
    authors = []
    authors_nodes = tree.xpath(u"//a[@class='name']/span[@class='fn']")
    for node in authors_nodes:
        authors.append(node.text)
    article_info['authors'] = authors
    journal_name = tree.xpath(u"//dl[@class='citation']/dd[@class='journal-title']")
    article_info['journal_name'] = journal_name[0].text
    volume = tree.xpath(u"//dl[@class='citation']/dd[@class='volume']")
    article_info['volume'] = volume[0].text.strip()
    page = tree.xpath(u"//dl[@class='citation']/dd[@class='page']")
    article_info['page'] = page[0].text
    datePublished = tree.xpath(u"//dt[@class='published']/following-sibling::dd[1]/time")
    article_info['datePublished'] = datePublished[0].text.strip(punctuation)
    dateReceived = tree.xpath(u"//dt[@class='received first']/following-sibling::dd[1]/time")
    article_info['dateReceived'] = dateReceived[0].text.strip()
    dateAccepted = tree.xpath(u"//dt[@class='accepted']/following-sibling::dd[1]/time")
    article_info['dateAccepted'] = dateAccepted[0].text.strip()
    doi = tree.xpath(u"//dd[@class='doi']")
    article_info['doi'] = doi[0].text
    # /p/text() only returns the direct text children of <p>; see the note below
    abstract = tree.xpath(u"//h1[text()='Abstract']/following-sibling::div[1]/p/text()")
    print abstract
    subject_terms = tree.xpath(u"//h2[text()='Subject terms:']/following-sibling::ul[1]/*/a/text()")
    article_info['subjectTerms'] = subject_terms
    return article_info
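The class names in those XPath expressions come from the Nature page markup and may change over time. To illustrate the attribute-based lookups themselves without downloading anything, here is a minimal self-contained sketch on a made-up snippet, using the standard library's xml.etree (which supports a small subset of XPath, enough for these queries):

```python
import xml.etree.ElementTree as ET

# A toy snippet mimicking the structure parse() relies on; the real page
# is far larger, but the attribute-based lookups work the same way.
snippet = """
<html><body>
  <h1 class="article-heading">A toy title</h1>
  <dl class="citation">
    <dd class="journal-title">Nature</dd>
    <dd class="volume">123 </dd>
    <dd class="doi">10.1038/xxxxx</dd>
  </dl>
</body></html>
"""

tree = ET.fromstring(snippet)
info = {
    "title": tree.find(".//h1[@class='article-heading']").text,
    "journal_name": tree.find(".//dd[@class='journal-title']").text,
    "volume": tree.find(".//dd[@class='volume']").text.strip(),
    "doi": tree.find(".//dd[@class='doi']").text,
}
print(info)
```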
The common Python libraries for parsing XML or HTML are BeautifulSoup, lxml, pyquery, and Scrapy, plus the ever-versatile regular expressions.
lxml is the fastest.
Regex extraction speed depends on how well the expression is written; two regexes extracting the same content can differ in speed by a factor of ten.
pyquery is built on top of lxml.
BeautifulSoup4 is the slowest.
Scrapy is quite powerful, but I haven't gotten around to using it yet.
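As a taste of the regex route, here is a hypothetical pattern (not used anywhere above) that pulls the title out of raw HTML; note the non-greedy `.*?`, since a greedy `.*` against a large page is exactly the kind of quality difference that makes two regexes for the same content differ wildly in speed:

```python
import re

# A made-up one-line page fragment for illustration.
html = '<h1 class="article-heading">A toy title</h1><p>body...</p>'

# Anchoring on the class attribute and matching non-greedily keeps the scan cheap.
m = re.search(r'<h1 class="article-heading">(.*?)</h1>', html)
title = m.group(1) if m else None
print(title)
```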
[Note] The code above has a problem extracting the abstract. Given markup like:
<p>
  'p part one'
  <i>
    'i part one'
  </i>
  'p part two'
  <i>
    'i part two'
  </i>
  'p part three'
</p>
we need to extract all the text under the <p> tag, including the content inside <i>, but I haven't found a way to do this with the XPath used above, since /p/text() only returns the direct text children.
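One possible workaround: in lxml, `//p//text()` (descendant axis instead of child) matches text nodes at any depth under <p>, and both lxml and the standard library expose an itertext() method that walks text and tails in document order. A minimal sketch of the latter, on the structure described above:

```python
import xml.etree.ElementTree as ET

# The problematic structure: text interleaved with <i> children.
p = ET.fromstring(
    "<p>p part one<i>i part one</i>p part two<i>i part two</i>p part three</p>"
)
# itertext() yields the element's text and every child's text/tail in order.
abstract = "".join(p.itertext())
print(abstract)  # -> p part onei part onep part twoi part twop part three
```

With lxml specifically, `"".join(tree.xpath("//p//text()"))` gives the same result without leaving XPath.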