Preface
If you're curious about nonebot2 bots but don't know how to write a plugin, or you're simply interested in scraping Tieba guides, read on. The plugin download link is in the Summary section.
I. URL Analysis
We'll use a simple browser game as the example.
Open Baidu Tieba and go to the forum of your choice; sharing the page link gives: https://tieba.baidu.com/f?ie=utf-8&kw=%E5%A5%A5%E5%A5%87%E4%BC%A0%E8%AF%B4&fr=search
The code below turns a forum name into that kw= value:
import urllib.parse
key="奥奇传说"
key=urllib.parse.quote(key)
print(key)
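As a quick sanity check, urllib.parse.unquote reverses the encoding, so you can confirm the kw= value round-trips back to the forum name:

```python
from urllib.parse import quote, unquote

# Encode the forum name exactly as in the snippet above
key = quote("奥奇传说")
print(key)           # %E5%A5%A5%E5%A5%87%E4%BC%A0%E8%AF%B4
print(unquote(key))  # 奥奇传说 — decoding restores the original name
```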
The next step is to search for guides inside the forum.
The forum page itself isn't what we want yet. Clicking 主题帖 (threads only) yields a new URL:
https://tieba.baidu.com/f/search/res?ie=utf-8&kw=%E5%A5%A5%E5%A5%87%E4%BC%A0%E8%AF%B4&qw=%E5%90%AF%E5%85%83%E8%AF%BA%E4%BA%9A&un=&rn=10&pn=0&sd=&ed=&sm=1&only_thread=1
Here qw= (%E5%90%AF%E5%85%83%E8%AF%BA%E4%BA%9A) is the encoded search term, produced with the same quote() call as in the previous step.
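Rather than pasting encoded fragments together by hand, the whole query string can be built with urllib.parse.urlencode. This is only a sketch; the build_search_url helper name is my own, not part of the original code, but the parameter names mirror the URL above:

```python
from urllib.parse import urlencode

# Hypothetical helper: assemble the Tieba search URL from a forum
# name and a search query; urlencode percent-encodes the values.
def build_search_url(forum: str, query: str, page: int = 0) -> str:
    params = {
        "ie": "utf-8",
        "kw": forum,        # forum (bar) name
        "qw": query,        # search keywords
        "un": "", "rn": 10, "pn": page,
        "sd": "", "ed": "", "sm": 1,
        "only_thread": 1,   # threads only, like clicking 主题帖
    }
    return "https://tieba.baidu.com/f/search/res?" + urlencode(params)

url = build_search_url("奥奇传说", "启元诺亚")
print(url)
```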
II. Fetching
1. Importing the libraries
import requests
import urllib.parse
import re
from lxml import etree  # etree provides XPath support
2. Page analysis
# msg is the guide keyword to search for
key = urllib.parse.quote(msg)
theme = "&un=&rn=10&pn=0&sd=&ed=&sm=1&only_thread=1"
ur = "https://tieba.baidu.com/f/search/res?ie=utf-8&kw=%E5%A5%A5%E5%A5%87%E4%BC%A0%E8%AF%B4&qw="
# print(key)
url = ur + key + theme
# print(url)
r = requests.get(url).text
# print(r)
# extract the guide thread links
matches = re.findall('<div class="s_post"><span class="p_title"><a data-tid=(.*?) data-fid=(.*?) class="bluelink" href="(.*?)" class="bluelink" target="_blank" >',r)
ur = []
for i in matches:
    ur.append('https://tieba.baidu.com/' + i[2])
print(ur)
# ur now holds the guide links
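To see what the findall pattern captures without hitting the network, you can run it against a one-line fixture. The fragment below is made up for illustration (the tid, fid, and href values are invented); real pages contain many such blocks plus surrounding markup:

```python
import re

# Made-up fragment mimicking one search result entry
sample = ('<div class="s_post"><span class="p_title">'
          '<a data-tid=7412345 data-fid=96 class="bluelink" '
          'href="p/7412345?pid=1" class="bluelink" target="_blank" >')

pattern = ('<div class="s_post"><span class="p_title">'
           '<a data-tid=(.*?) data-fid=(.*?) class="bluelink" '
           'href="(.*?)" class="bluelink" target="_blank" >')

# findall returns one (tid, fid, href) tuple per result block
matches = re.findall(pattern, sample)
links = ['https://tieba.baidu.com/' + m[2] for m in matches]
print(links)
```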
Links alone aren't enough, though; we also want each thread's title and post time.
# build an HTML tree; r is the page text fetched from the URL above
selector = etree.HTML(r, parser=None, base_url=None)
detail_text =selector.xpath("/html/body/div[@class='wrap1']/div[@class='wrap2']/div[@class='s_container clearfix']/div[@class='s_main']/div[@class='s_post_list']/div[@class='s_post']")
inf = []
for j in detail_text:
    z = j.xpath('string(.)').strip()    # concatenate all text in the block
    x = z.split(" ")
    mytest = [i for i in x if i != '']  # drop empty fragments
    inf.append(mytest)
print(inf)
Now we have each thread's title and post time (output not shown here).
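The string(.) + split approach can also be checked offline on a small fixture. The fragment below only imitates the s_post wrapper (the inner markup and class names are invented; real result blocks carry more structure). Note a caveat: any spaces inside a title are split too, so the first element is strictly the first space-free token:

```python
from lxml import etree

# Minimal made-up fragment with the same wrapper class as the real page
html = ('<div class="s_post">某攻略标题 '
        '<font class="p_green">2021-05-01 12:00</font></div>')

selector = etree.HTML(html)
inf = []
for node in selector.xpath("//div[@class='s_post']"):
    text = node.xpath('string(.)').strip()  # all descendant text, flattened
    inf.append([p for p in text.split(' ') if p != ''])
print(inf)
```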
If someone searches for something the game doesn't have, inf comes back as [] (empty), so the assembly step needs both branches:
# inf holds the guide titles and times
infor = ''
if inf != []:
    for n in range(len(ur)):
        x = inf[n]
        infor = infor + x[0] + ' ' + '时间: ' + x[-1] + '\n' + ur[n] + '\n'
else:
    for n in range(len(ur)):
        infor = infor + ur[n] + '\n'
print(infor)
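The same assembly reads a little more clearly with zip, which also stops at the shorter of the two lists if they ever disagree in length. The sample data here is invented, shaped like ur and inf above:

```python
# Invented sample data in the same shape as ur and inf above
ur = ['https://tieba.baidu.com/p/1', 'https://tieba.baidu.com/p/2']
inf = [['攻略一', '2021-05-01', '12:00'], ['攻略二', '2021-06-02', '09:30']]

if inf:
    # pair each (title, ..., time) row with its link
    lines = [f"{x[0]} 时间: {x[-1]}\n{link}" for x, link in zip(inf, ur)]
else:
    lines = list(ur)  # no titles parsed: fall back to bare links
infor = '\n'.join(lines) + '\n'
print(infor)
```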
That completes the scraping. The full code:
import requests
import urllib.parse
import re
from lxml import etree  # etree provides XPath support

msg = '启元诺亚'
key = urllib.parse.quote(msg)
theme = "&un=&rn=10&pn=0&sd=&ed=&sm=1&only_thread=1"
ur = "https://tieba.baidu.com/f/search/res?ie=utf-8&kw=%E5%A5%A5%E5%A5%87%E4%BC%A0%E8%AF%B4&qw="
url = ur + key + theme
r = requests.get(url).text
# extract the guide thread links
matches = re.findall(
    '<div class="s_post"><span class="p_title"><a data-tid=(.*?) data-fid=(.*?) class="bluelink" href="(.*?)" class="bluelink" target="_blank" >',
    r)
ur = []
for i in matches:
    ur.append('https://tieba.baidu.com/' + i[2])
# build an HTML tree from the page text
selector = etree.HTML(r, parser=None, base_url=None)
detail_text = selector.xpath(
    "/html/body/div[@class='wrap1']/div[@class='wrap2']/div[@class='s_container clearfix']/div[@class='s_main']/div[@class='s_post_list']/div[@class='s_post']")
inf = []
for j in detail_text:
    z = j.xpath('string(.)').strip()    # concatenate all text in the block
    x = z.split(" ")
    mytest = [i for i in x if i != '']  # drop empty fragments
    inf.append(mytest)
print(inf)
# inf holds the guide titles and times
infor = ''
if inf != []:
    for n in range(len(ur)):
        x = inf[n]
        infor = infor + x[0] + ' ' + '时间: ' + x[-1] + '\n' + ur[n] + '\n'
else:
    for n in range(len(ur)):
        infor = infor + ur[n] + '\n'
print(infor)
Summary
Once the guide scraping works, you only need to wrap it in the bot-plugin format to use it; the full plugin can be downloaded at https://github.com/PnengChen/nonebot2
If you're a nonebot2 enthusiast, feel free to join QQ group 970353786.