I recently scraped the infoboxes for a batch of keywords. The code below uses no IP pool, so after roughly 30 keywords the site's anti-crawling measures kick in and the IP gets banned for ten-odd minutes; in my experience, rotating the User-Agent and sleeping for random intervals do nothing against this.
Here is the infobox for the keyword "北京" (Beijing):
The scraping code follows. Fill in the target URL and it is ready to run; the inline comments explain the details.
import requests
import unicodedata
import re
from bs4 import BeautifulSoup
import bs4
from fake_useragent import UserAgent
import time
import random
# Pick a random User-Agent -- which, in my experience, does nothing against the anti-crawling measures ^_^
# header = UserAgent().random
# agent_list = [
# "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
# "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
# "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
# "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
# "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
# "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
# "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
# "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
# "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
# "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
# "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
# "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
# "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
# "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10"
# ]
# header = {"User-Agent": random.choice(agent_list)}
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"}
def search(content):
    # content = input("Enter a keyword to look up: ")
    # Uncomment the line above if you would rather type keywords in after the script starts.
    url0 = "爬取的网址"  # fill in the base URL of the site you are scraping
    url = url0 + content
    # Appending the keyword to the base URL gives the entry's page
    response = requests.get(url, headers=header)
    soup = BeautifulSoup(response.content, "html.parser")
    list0 = []
    list1 = []
    # Grab the left column of the infobox. The infobox has two symmetric columns;
    # in the page source they differ only in class name, everything else is identical.
    info_box_left = soup.find("dl", class_="basicInfo-block")
    for dt in info_box_left.find_all("dt", class_="basicInfo-item name"):
        name = dt.get_text().strip()
        value = dt.find_next("dd", class_="basicInfo-item value").get_text().strip()
        value = value.replace("\n", "")
        value = re.sub(r"\[.*?\]", "", value)
        # Raw infobox values are littered with citation markers like [1] and with
        # newline characters "\n" -- delete them all
        list0.append(f"{content}%{name}%{value}")
    # Grab the right column of the infobox
    info_box_right = soup.find("dl", class_="basicInfo-block basicInfo-right")
    for dt in info_box_right.find_all("dt", class_="basicInfo-item name"):
        name = dt.get_text().strip()
        value = dt.find_next("dd", class_="basicInfo-item value").get_text().strip()
        value = value.replace("\n", "")
        value = re.sub(r"\[.*?\]", "", value)
        list0.append(f"{content}%{name}%{value}")
    for i in list0:
        list1.append(unicodedata.normalize('NFKC', i))
    print(list1)
    with open("三元组扩充.txt", "a", errors="ignore") as file:
        for item in list1:
            file.write(str(item) + "\n")
    # Append each scraped triple to 三元组扩充.txt, one per line

# search("北京")
# To query a single keyword, comment out the loop below and call search() directly;
# the triples for that keyword are printed just as in the screenshots below.
with open("关键词列表.txt", encoding='gbk') as f:
    for line in f:
        try:
            time.sleep(random.uniform(20, 30))
            # Sleep a random 20-30 s between requests -- which, it turns out,
            # also did nothing at all against the anti-crawling measures...
            search(line.strip())
        except Exception as e:
            print(f"An error occurred with argument {line.strip()}: {e}")
            # An exception means either the IP got banned or the encyclopedia has
            # no entry for this keyword. Either way, no harm done: continue moves
            # on to the next keyword.
            continue
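The cleanup step inside search() can be checked in isolation: newlines are dropped first, then the non-greedy pattern \[.*?\] removes every bracketed citation marker (the sample value below is illustrative, not taken from a real scrape):

```python
import re

value = "2184.3万人\n[1][2-3]"        # an illustrative raw infobox value
value = value.replace("\n", "")       # drop newline characters
value = re.sub(r"\[.*?\]", "", value) # strip citation markers like [1]
print(value)  # -> 2184.3万人
```

The non-greedy `.*?` matters: a greedy `\[.*\]` would swallow everything from the first `[` to the last `]` in one match.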
# I first scraped the hyperlinked keywords from a page (hyperlinked entries are the ones that actually have their own information) and saved them into 关键词列表.txt. The script then takes one keyword at a time, looks it up, and automatically saves its infobox in triple format.
About unicodedata.normalize('NFKC', i): if i were appended to the list untouched, the scraped infobox would look like the screenshot below, full of non-breaking spaces:
After applying unicodedata.normalize('NFKC', i), those spaces are normalized away and clean triples are extracted:
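What NFKC does here can be checked on a one-line example: the stray blanks are actually the non-breaking space U+00A0, which NFKC folds into an ordinary space (the sample value is illustrative):

```python
import unicodedata

# "\xa0" is the non-breaking space that litters the scraped values
raw = "北京%气候类型%温带\xa0季风气候"
clean = unicodedata.normalize("NFKC", raw)

print(clean == "北京%气候类型%温带 季风气候")  # True: \xa0 became a plain space
```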
At this point, the infobox information from the page has been saved as triples.
Within each triple, "%" is the field separator; you can swap in any delimiter you like.
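Since "%" is the field separator, each saved line can be split back into a (head, relation, tail) triple. A minimal sketch (parse_triple is a hypothetical helper, not part of the scraper above):

```python
def parse_triple(line, sep="%"):
    """Split one line of the output file into a (head, relation, tail) triple."""
    # maxsplit=2 keeps any stray separator characters inside the tail value
    head, relation, tail = line.strip().split(sep, 2)
    return head, relation, tail

print(parse_triple("北京%面积%16410.54平方千米"))  # ('北京', '面积', '16410.54平方千米')
```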
Adding proxy IPs to get around the ban will be covered in a later post. This is just a record of my learning; corrections are very welcome, and I'm happy to discuss and improve together.
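As a rough preview of that proxy approach, and purely as a sketch (the pool below is a made-up placeholder; real proxies and rotation strategy are for the later post), requests accepts a per-request proxies dict:

```python
import random
import requests

# Hypothetical proxy pool -- replace these placeholders with proxies you actually have
PROXY_POOL = [
    "http://127.0.0.1:8001",
    "http://127.0.0.1:8002",
]

def get_with_proxy(url, headers=None):
    """Send one request through a proxy chosen at random from the pool."""
    proxy = random.choice(PROXY_POOL)
    proxies = {"http": proxy, "https": proxy}
    return requests.get(url, headers=headers, proxies=proxies, timeout=10)
```

Swapping `requests.get(url, headers=header)` inside search() for a call like this is what spreads requests across IPs so no single one trips the ban.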