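Crawler steps, parsing content, and an introduction to XPath: your crawler connects, but you don't know how to parse what comes back? Four approaches to help you out...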

Latest recommended article published on 2023-04-23 09:14:11

TechTitan


Article tags: crawler steps, parsing content, XPath introduction

Article link: https://blog.csdn.net/weixin_29169899/article/details/112527398


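Search for "crawler" on CSDN or any similar site and a wall of code gets thrown at you right away. If you have never written one and are a beginner who has only just picked up Python, it leaves you dazed, like being handed a set of college entrance exam papers without ever having attended high school, and it is easy to give up. Having been through this myself, I spent a great deal of time reading other people's code; once I understood it, I borrowed from it, experimented, and fell into countless pitfalls, going from all kinds of errors at the start to now knowing a little about why they happen. Enough preamble; let's get to today's topic.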

The first approach, and the simplest and most convenient: fetch the site's data in JSON format.

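import json

import pandas as pd
import requests

url = "https://club.jd.com/comment/productCommentSummaries.action?referenceIds=xxxxxx"
headers = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Mobile Safari/537.36",
}

# headers must be passed by keyword; the second positional argument of requests.get is params.
req = requests.get(url, headers=headers)
the_page = req.text
# The response is not strict JSON, so slice off the wrapper around the JSON part.
the_page = the_page[14:-2]
# data = json.loads(the_page)
# Inspect data to see what it contains.
data = json.loads(the_page).get("CommentsCount")
df = pd.DataFrame(data)
df.to_csv("comment_32.csv", index=False)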

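There are three main parts:

Get the URL. The page you are scraping shows its request headers; copy them and use them.
Use the requests module to send a GET. Note that some pages use POST instead, as in one of my earlier posts that scraped World of Warcraft Classic.
Paste the response into the online JSON editor at http://www.bejson.com/jsoneditoronline/ to check whether it is strict JSON. If it is not, as is the case here, slice the JSON part out of the string and parse it with json.loads.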
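The fixed slice `[14:-2]` only works while the wrapper around the JSON has exactly that length. As a rough sketch, assuming the response is wrapped in a JSONP-style call of the form `name(...);`, the JSON part can instead be located by its outermost parentheses. The helper name `strip_jsonp` and the sample payload are made up for illustration; they are not a real JD response.

```python
import json


def strip_jsonp(text):
    """Return the JSON text inside a JSONP-style wrapper such as cb({...});"""
    start = text.find("(")
    end = text.rfind(")")
    if start == -1 or end == -1 or end < start:
        # No wrapper found: assume the text is already plain JSON.
        return text.strip()
    return text[start + 1:end]


# Hypothetical payload shaped like a wrapped response (invented values).
sample = 'jQuery1234567({"CommentsCount": [{"SkuId": 1, "GoodCount": 20}]});'
data = json.loads(strip_jsonp(sample))
comments = data.get("CommentsCount")
print(comments)
```

In this sample the wrapper happens to be 14 characters at the front and 2 at the back, so the fixed slice would give the same result; the search for parentheses simply keeps working if the callback name changes length.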

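Parsed JSON can be loaded straight into pandas, which is extremely convenient. Many sites never mention this trick; I stumbled on it by accident. Pretty neat, right? Ha ha...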
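To show why this is convenient: once the JSON is parsed, a list of dicts maps directly onto rows and columns. A small sketch with invented records; the keys mirror the ones used in this article, but the values are made up.

```python
import json

import pandas as pd

# Made-up JSON text; keys mirror those in this article, values are invented.
raw = '{"CommentsCount": [{"SkuId": 1, "GoodCount": 20}, {"SkuId": 2, "GoodCount": 35}]}'

records = json.loads(raw)["CommentsCount"]  # a list of dicts
df = pd.DataFrame(records)                  # one row per dict, one column per key
print(df)
```

From here, `df.to_csv(...)` writes the table out exactly as in the code above.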

The second approach: use XPath, a page parser.

import csv

import requests
from lxml import etree
from urllib.request import urlopen, Request  # used by the urllib-based example further down

url = 'https://search.jd.com/Search?keyword=%E6%89%8B%E7%8E%AF&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&stock=1&page=1&s=1&click=0&scrolling=y'
head = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
}

r = requests.get(url, headers=head)
# Set the encoding explicitly, otherwise the text comes out garbled.
r.encoding = 'utf-8'
html1 = etree.HTML(r.text)
# Locate the <li> tag of every product.
datas = html1.xpath('//li[contains(@class,"gl-item")]')
with open('JD_shouhuan_200.csv', 'a', newline='', encoding='utf-8') as f:
    write = csv.writer(f)
    for data in datas:
        # Each xpath() call returns a list of matches, relative to this <li>.
        p_name = data.xpath('div/div[@class="p-name p-name-type-2"]/a/em/text()')
        p_price = data.xpath('div/div[@class="p-price"]/strong/i/text()')
        p_shop = data.xpath('div/div[@class="p-shop"]/span/a/text()')
        p_url = data.xpath('div/div[@class="p-img"]/a/@href')
        write.writerow([p_name, p_price, p_shop, p_url])
# The with block closes the file, so no explicit f.close() is needed.

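The code above is a crawler I wrote that scrapes smart-band products on JD: name, price, and shop information. It also has three parts:

Get the URL. The page you are scraping shows its request headers; copy them and use them.
Use the requests module to send a GET. Once the response arrives, pass it to etree.HTML(), which parses it and fills in any missing structure so the result is a complete HTML tree.
The last and most important step: use XPath, either with contains() or by matching the class name directly, to locate the element. Then walk down the tags one step at a time, separated by /, until you reach the content inside the tag.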
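The steps above can be tried offline on a small page. The sketch below uses a hand-written HTML snippet standing in for a search result (it is not real JD markup, and the product names and prices are invented) to show `contains()` matching and the step-by-step relative paths.

```python
from lxml import etree

# A tiny hand-written page standing in for a search result (not real JD markup).
html = """
<ul>
  <li class="gl-item first">
    <div><div class="p-name"><a href="/item/1"><em>Band A</em></a></div>
         <div class="p-price"><strong><i>99.00</i></strong></div></div>
  </li>
  <li class="gl-item">
    <div><div class="p-name"><a href="/item/2"><em>Band B</em></a></div>
         <div class="p-price"><strong><i>149.00</i></strong></div></div>
  </li>
</ul>
"""

tree = etree.HTML(html)
rows = []
# contains() matches even when the class attribute holds several class names.
for li in tree.xpath('//li[contains(@class, "gl-item")]'):
    # Relative paths (no leading //) search inside this <li> only.
    name = li.xpath('div/div[@class="p-name"]/a/em/text()')
    price = li.xpath('div/div[@class="p-price"]/strong/i/text()')
    link = li.xpath('div/div[@class="p-name"]/a/@href')
    # xpath() returns a list; take the first match or fall back to "".
    rows.append([name[0] if name else "",
                 price[0] if price else "",
                 link[0] if link else ""])

print(rows)
```

Taking the first match before writing a row keeps each CSV cell a plain string rather than a Python list.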

The third approach: use the BeautifulSoup package.

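from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# url and head are the same as in the XPath example.
request = Request(url, headers=head)
response = urlopen(request)
code = response.read().decode()
bs = BeautifulSoup(code, 'lxml')
# Select every <em> inside an element with class p-name.
name_list = bs.select('.p-name em')
for name in name_list:
    name = name.get_text()
    print(name)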

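As with the previous two methods, find the URL and look up the headers. BeautifulSoup's select method takes a CSS selector and returns every element matching the given classes and tags in one go; you can then loop over the results with for to pull out the data.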
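Here is a small offline sketch of `select` with a CSS selector, again on a hand-written snippet rather than a real page. It uses Python's built-in `html.parser` so that it runs without lxml installed; the article's own code uses the `lxml` parser.

```python
from bs4 import BeautifulSoup

# Hand-written snippet for illustration (not real JD markup).
html = """
<ul>
  <li class="gl-item"><div class="p-name"><a><em>Band A</em></a></div></li>
  <li class="gl-item"><div class="p-name"><a><em> Band B </em></a></div></li>
</ul>
"""

# "html.parser" ships with Python; the article's code uses "lxml" instead.
soup = BeautifulSoup(html, "html.parser")
# ".p-name em" means: every <em> anywhere inside an element with class p-name.
names = [em.get_text(strip=True) for em in soup.select(".p-name em")]
print(names)
```

Unlike the XPath version, the selector does not spell out every intermediate tag, which is why it tolerates small changes in page structure.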

The fourth approach: use regular expressions with re.

import re

import pandas as pd
import requests

# url and headers are the same as in the JSON example.
req = requests.get(url, headers=headers)
the_page = req.text
the_page = the_page[14:-2]

# \d+ matches the digits; the lookbehind (?<=...) anchors them right after each key.
p = re.compile(r'(?<="SkuId":)\d+')
q = re.compile(r'(?<="ProductId":)\d+')
s = re.compile(r'(?<="ShowCount":)\d+')
o = re.compile(r'(?<="CommentCount":)\d+')
# AverageScore needs its own name; reusing s would overwrite the ShowCount pattern.
a = re.compile(r'(?<="AverageScore":)\d+')
t = re.compile(r'(?<="GoodCount":)\d+')
u = re.compile(r'(?<="AfterCount":)\d+')
r = re.compile(r'(?<="PoorCount":)\d+')

Sku_Id = re.findall(p, the_page)
ProductId = re.findall(q, the_page)
ShowCount = re.findall(s, the_page)
CommentCount = re.findall(o, the_page)
AverageScore = re.findall(a, the_page)
GoodCount = re.findall(t, the_page)
AfterCount = re.findall(u, the_page)
PoorCount = re.findall(r, the_page)
# Each list becomes one row, so transpose to get one column per field.
pd_all = pd.DataFrame([Sku_Id, ProductId, ShowCount, CommentCount, AverageScore, GoodCount, AfterCount, PoorCount]).T
pd_all.columns = ["Sku_Id", "ProductId", "ShowCount", "CommentCount", "AverageScore", "GoodCount", "AfterCount", "PoorCount"]
pd_all.head()

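The start is the same as before: send the request and get the data, then match it with regular expressions. Because we want the values that come right after each key, a lookbehind assertion, written (?<=...), is used here. This CSDN article explains it well: https://blog.csdn.net/icewfz/article/details/79900993.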
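As a minimal illustration of the lookbehind, the sketch below runs on a made-up string shaped like the JSON in this article (the values are invented). The helper `field_values` is a hypothetical name introduced here; it builds the same kind of pattern for any key.

```python
import re

# Made-up text shaped like the JSON in this article (invented values).
text = '{"SkuId":100,"ProductId":100,"GoodCount":20,"PoorCount":3}'


def field_values(key, source):
    """Find the digits that directly follow "key": using a lookbehind."""
    # (?<=...) asserts what precedes the match without including it in the result.
    pattern = re.compile(r'(?<="' + re.escape(key) + r'":)\d+')
    return pattern.findall(source)


good = field_values("GoodCount", text)
poor = field_values("PoorCount", text)
print(good, poor)
```

Note that findall returns strings, so the values still need converting with int() before doing arithmetic on them. Python's re module also requires the lookbehind to have a fixed width, which holds here because each key is a literal string.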

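Summary: of all these methods, my favorite is still JSON. After all, time is money, my friend! With XPath, wherever the page structure differs from one part to another, you will fail to get the value.

In one sentence: if you can use JSON, use JSON; if you cannot, I recommend BeautifulSoup, because it does more of the work for you and you only need to know the overall structure of the page.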
