The screenshot below shows an article page on the Nature website. From it we can clearly extract: title, authors, author affiliations, contribution statements, corresponding author,
issue number, volume number, submission date, acceptance date, publication date,
abstract, and subject terms (not captured in the screenshot below; you can see them on the web page itself).
1、Fetching the HTML source and saving it to nature.txt
import urllib2

def fetch(url):
    http_request = urllib2.Request(url=url)
    http_response = urllib2.urlopen(http_request, data=None, timeout=600)
    # http_response returns a file-like object, so it can be read()
    status_code = http_response.getcode()
    if status_code != 200:
        print "Error:", status_code
    else:
        print "Successful!"
        print "----- start downloading data -----"
        html_doc = http_response.read()
        fn = open('nature.txt', 'w')
        fn.write(html_doc)
        print "----- finish downloading data -----"
        fn.close()
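The function above is written for Python 2 (urllib2, print statements). For readers on Python 3, where urllib2 was split into urllib.request, a rough equivalent sketch looks like this (the URL and output path are placeholders, not part of the original code):

```python
import urllib.request

def fetch3(url, out_path="nature.txt", timeout=600):
    """Download url and save the response body to out_path.

    Returns the decoded HTML on success, None otherwise.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        if resp.getcode() != 200:
            print("Error:", resp.getcode())
            return None
        html_doc = resp.read()
    # Write raw bytes; decode separately so encoding issues don't corrupt the file.
    with open(out_path, "wb") as fn:
        fn.write(html_doc)
    return html_doc.decode("utf-8", errors="replace")
```

Note that in Python 3, urlopen raises an HTTPError for most non-200 responses, so the explicit status check mostly mirrors the structure of the original rather than adding protection.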
2、Parsing the HTML file and extracting the information
import codecs
from string import punctuation
from lxml import etree

def parse(file_path):
    # html source: one article from the journal Nature
    # fields extracted: title, authors, journal name, volume,
    # page (xxx - xxx), dates, doi, subject terms
    article_info = {}  # to store these features
    f = codecs.open(file_path, 'r', 'utf-8')
    content = f.read()
    f.close()
    tree = etree.HTML(content)
    title = tree.xpath(u"//h1[@class='article-heading']")
    article_info['title'] = title[0].text
    authors = []
    authors_nodes = tree.xpath(u"//a[@class='name']/span[@class='fn']")
    for node in authors_nodes:
        authors.append(node.text)
    article_info['authors'] = authors
    journal_name = tree.xpath(u"//dl[@class='citation']/dd[@class='journal-title']")
    article_info['journal_name'] = journal_name[0].text
    volume = tree.xpath(u"//dl[@class='citation']/dd[@class='volume']")
    article_info['volume'] = volume[0].text.strip()
    page = tree.xpath(u"//dl[@class='citation']/dd[@class='page']")
    article_info['page'] = page[0].text
    datePublished = tree.xpath(u"//dt[@class='published']/following-sibling::dd[1]/time")
    article_info['datePublished'] = datePublished[0].text.strip(punctuation)
    dateReceived = tree.xpath(u"//dt[@class='received first']/following-sibling::dd[1]/time")
    article_info['dateReceived'] = dateReceived[0].text.strip()
    dateAccepted = tree.xpath(u"//dt[@class='accepted']/following-sibling::dd[1]/time")
    article_info['dateAccepted'] = dateAccepted[0].text.strip()
    doi = tree.xpath(u"//dd[@class='doi']")
    article_info['doi'] = doi[0].text
    # /p/text() only returns the direct text children of <p>; see the note below
    abstract = tree.xpath(u"//h1[text()='Abstract']/following-sibling::div[1]/p/text()")
    print abstract
    subject_terms = tree.xpath(u"//h2[text()='Subject terms:']/following-sibling::ul[1]/*/a/text()")
    article_info['subjectTerms'] = subject_terms
    return article_info
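The class names in those XPath expressions come from the Nature page markup and may change over time. To illustrate the attribute-based lookups themselves without downloading anything, here is a minimal self-contained sketch on a made-up snippet, using the standard library's xml.etree (which supports a small subset of XPath, enough for these queries):

```python
import xml.etree.ElementTree as ET

# A toy snippet mimicking the structure parse() relies on; the real page
# is far larger, but the attribute-based lookups work the same way.
snippet = """
<html><body>
  <h1 class="article-heading">A toy title</h1>
  <dl class="citation">
    <dd class="journal-title">Nature</dd>
    <dd class="volume">123 </dd>
    <dd class="doi">10.1038/xxxxx</dd>
  </dl>
</body></html>
"""

tree = ET.fromstring(snippet)
info = {
    "title": tree.find(".//h1[@class='article-heading']").text,
    "journal_name": tree.find(".//dd[@class='journal-title']").text,
    "volume": tree.find(".//dd[@class='volume']").text.strip(),
    "doi": tree.find(".//dd[@class='doi']").text,
}
print(info)
```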
The common Python libraries for parsing XML or HTML are BeautifulSoup, lxml, pyquery, and Scrapy, plus the ever-versatile regular expressions.
lxml is the fastest.
Regex extraction speed depends on how well the expression is written; two regexes extracting the same content can differ in speed by a factor of ten.
pyquery is built on top of lxml.
BeautifulSoup4 is the slowest.
Scrapy is quite powerful, but I haven't gotten around to using it yet.
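As a taste of the regex route, here is a hypothetical pattern (not used anywhere above) that pulls the title out of raw HTML; note the non-greedy `.*?`, since a greedy `.*` against a large page is exactly the kind of quality difference that makes two regexes for the same content differ wildly in speed:

```python
import re

# A made-up one-line page fragment for illustration.
html = '<h1 class="article-heading">A toy title</h1><p>body...</p>'

# Anchoring on the class attribute and matching non-greedily keeps the scan cheap.
m = re.search(r'<h1 class="article-heading">(.*?)</h1>', html)
title = m.group(1) if m else None
print(title)
```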
[Note] The code above has a problem extracting the abstract. Given markup like:
<p>
  'p part one'
  <i>
    'i part one'
  </i>
  'p part two'
  <i>
    'i part two'
  </i>
  'p part three'
</p>
we need to extract all the text under the <p> tag, including the content inside <i>, but I haven't found a way to do this with the XPath used above, since /p/text() only returns the direct text children.
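One possible workaround: in lxml, `//p//text()` (descendant axis instead of child) matches text nodes at any depth under <p>, and both lxml and the standard library expose an itertext() method that walks text and tails in document order. A minimal sketch of the latter, on the structure described above:

```python
import xml.etree.ElementTree as ET

# The problematic structure: text interleaved with <i> children.
p = ET.fromstring(
    "<p>p part one<i>i part one</i>p part two<i>i part two</i>p part three</p>"
)
# itertext() yields the element's text and every child's text/tail in order.
abstract = "".join(p.itertext())
print(abstract)  # -> p part onei part onep part twoi part twop part three
```

With lxml specifically, `"".join(tree.xpath("//p//text()"))` gives the same result without leaving XPath.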