Scraping with Python and XPath: handling quoted data / ways to get tag text

yida7942

Tags: python xpath excel

Article link: https://blog.csdn.net/yida7942/article/details/107819708

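This article describes how to scrape web page data with XPath in Python, in particular data that a plain query cannot retrieve: `descendant-or-self::text()` returns the text of the matched node and of its descendants. When using Selenium, if the usual approach does not work, `get_attribute('textContent')` can be used to get a tag's text content.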

Summary generated by CSDN using AI technology.

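For data that a plain query fails to retrieve, use `descendant-or-self::text()` in the XPath expression. It returns the text of the matched node itself and of all its descendants.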
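The difference is easiest to see on a small fragment. The sketch below is only an illustration: the HTML in `SAMPLE_HTML` is made up for this example and is not taken from the target site. It compares `text()`, which only sees text that is a direct child of the cell, with `descendant-or-self::text()`, which also reaches text inside nested tags.

```python
from lxml import etree

# Hypothetical HTML fragment (an assumption for illustration only):
# the second cell wraps its text in nested <a> and <span> tags.
SAMPLE_HTML = """
<table>
  <tr>
    <td>Alpha</td>
    <td><a href="#"><span>Beta</span> Ltd</a></td>
  </tr>
</table>
"""

tree = etree.HTML(SAMPLE_HTML)

# text() only returns text nodes that are direct children of each <td>.
direct_text = tree.xpath('//td/text()')

# descendant-or-self::text() also walks into the nested elements.
all_text = tree.xpath('//td/descendant-or-self::text()')

# Strip surrounding whitespace and drop whitespace-only nodes.
direct_clean = [t.strip() for t in direct_text if t.strip()]
all_clean = [t.strip() for t in all_text if t.strip()]

print(direct_clean)
print(all_clean)
```

With this fragment, the first query misses everything in the second cell, while the second query returns each piece of nested text as a separate list item. The shorter form `//td//text()` selects the same text nodes.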

The XPath scraper template I usually use:

import requests
from lxml import etree
from fake_useragent import UserAgent
import urllib
from xlrd import open_workbook
from xlutils.copy import copy

# Set the request headers
ua = UserAgent(verify_ssl=False)
headers = {
    "User-Agent": ua.random,
    }

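# Fetch the URL and parse the response into an lxml element tree
def getxml(url):
    # Pass headers by keyword: the second positional argument of requests.get() is params
    res = requests.get(url, headers=headers, timeout=30)
    # Let requests guess the encoding from the response body
    res.encoding = res.apparent_encoding
    text = res.text
    xml = etree.HTML(text)
    return xml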

# Fetch the detail data
urllink = 'https://www.tianyancha.com/elibs_quoted/p'
for i in range(1,270):
    url = urllink + str(i)
    print(url)
    xml = getxml(url)
    eles = xml.xpath('//div[@class="elib-table"]//tbody/tr/td/descendant-or-self::text()')
    # Write to Excel
    rexcel = open_workbook("D:/名单.xls")
    excel = copy(rexcel)
    table = excel.get_shee