python + lxml + xpath: extracting basic article information from the Nature website [beginner version]


The figure below shows one article on the site; from it we can clearly extract the title, authors, author affiliations, contribution statement, and corresponding author, as well as the issue number, volume number, submission date, acceptance date, and publication date, plus the abstract and subject terms (the last two are not shown in the screenshot; you can view them on the web page itself).


1. Fetching the HTML source and saving it to nature.txt

import urllib2

def fetch(url):
	http_request = urllib2.Request(url = url)
	http_response = urllib2.urlopen(http_request, data = None, timeout = 600)
	# http_response returns a file-like object, so it can be read() directly

	status_code = http_response.getcode()

	if status_code != 200:
		print "Error : " , status_code
	else:
		print "Successful!"

	print "-----  start downloading data  -----"
	html_doc = http_response.read()
	fn = open('nature.txt','w')
	fn.write(html_doc)
	print "-----  finish downloading data  -----"
	fn.close()
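
For reference, a minimal usage sketch; the article URL below is only a placeholder and should be replaced with a real Nature article page:

# hypothetical URL, for illustration only
fetch('http://www.nature.com/articles/xxxxx')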

2. Parsing the HTML file and extracting the information

import codecs
from string import punctuation
from lxml import etree

def parse(file_path):
	# html source : nature journal, one article
	# fields to extract:
	# title
	# authors
	# journal name
	# volume
	# page (xxx - xxx)
	# datetime
	# doi
	article_info = {}  # to store these features

	f = codecs.open(file_path , 'r' , 'utf-8')
	content = f.read()
	f.close()

	tree = etree.HTML(content)

	# nodes = tree.xpath(u"//a[@class = 'name']/span[@class = 'fn']")
	title = tree.xpath(u"//h1[@class='article-heading']")
	article_info['title'] = title[0].text
	
	authors = []
	authors_nodes = tree.xpath(u"//a[@class = 'name']/span[@class = 'fn']")
	for node in authors_nodes:
		authors.append(node.text)
	article_info['authors']  = authors

	journal_name = tree.xpath(u"//dl[@class = 'citation']/dd[@class = 'journal-title']")
	article_info['journal_name'] = journal_name[0].text

	volume = tree.xpath(u"//dl[@class = 'citation']/dd[@class = 'volume']")
	article_info['volume'] = volume[0].text.strip()

	page = tree.xpath(u"//dl[@class = 'citation']/dd[@class = 'page']")
	article_info['page'] = page[0].text

	datePublished = tree.xpath(u"//dt[@class = 'published']/following-sibling::dd[1]/time")
	article_info['datePublished'] = datePublished[0].text.strip(punctuation)

	dateReceived = tree.xpath(u"//dt[@class = 'received first']/following-sibling::dd[1]/time")
	article_info['dateReceived'] = dateReceived[0].text.strip()

	dateAccepted = tree.xpath(u"//dt[@class = 'accepted']/following-sibling::dd[1]/time")
	article_info['dateAccepted'] = dateAccepted[0].text.strip()

	doi = tree.xpath(u"//dd[@class = 'doi']")
	article_info['doi'] = doi[0].text

	abstract = tree.xpath(u"//h1[text() = 'Abstract']/following-sibling::div[1]/p/text()")
	# article_info['abstract'] = abstract
	# print etree.tostring(abstract , pretty_print = True)
	print abstract

	subject_terms = tree.xpath(u"//h2[text() = 'Subject terms:']/following-sibling::ul[1]/*/a/text()")
	article_info['subjectTerms'] = subject_terms

	return article_info
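
A minimal sketch of calling parse() on the file saved by fetch() and printing the results (assuming both functions from the snippets above are defined in the same script):

if __name__ == '__main__':
	article_info = parse('nature.txt')
	for key , value in article_info.items():
		print key , ':' , value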

Common Python libraries for parsing XML or HTML include BeautifulSoup, lxml, pyquery, and Scrapy, plus the all-purpose regular expressions.

lxml is the fastest.

Regex extraction speed depends on how well the expression is written; extracting the same content with different regular expressions can differ in speed by a factor of 10.

pyquery is built on top of lxml.

BeautifulSoup4 is the slowest.

Scrapy is quite powerful, but I have not used it yet.


[Note] The abstract extraction in the code above has a problem. The markup looks roughly like this:

<p>
    'p part 1'
    <i>
         'i part 1'
    </i>
    'p part 2'
    <i>
         'i part 2'
    </i>
    'p part 3'
</p>

All of the text under the <p> tag needs to be extracted (including the content inside <i>), but so far I have not found a solution for this in XPath.
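
For reference, one direction that might work is XPath's string() function (or collecting all descendant text() nodes), both of which lxml supports; this is only a sketch and has not been verified against the actual Nature page:

# a sketch, assuming `tree` is the etree.HTML tree built inside parse()
p_nodes = tree.xpath(u"//h1[text() = 'Abstract']/following-sibling::div[1]/p")
if p_nodes:
	# string(.) concatenates every descendant text node, including text inside <i>
	abstract_text = p_nodes[0].xpath('string(.)')
	# alternative: join all descendant text nodes manually
	# abstract_text = ''.join(p_nodes[0].xpath('.//text()'))
	print abstract_text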

