2024.4.14 Entry Data Crawling and Format Conversion
This week I wrote a crawler for the Baidu Baike entries analyzed earlier and used it to collect a large number of entries and their description documents, along with some relations between entries.
Writing the Crawler
Website Analysis
https://baike.baidu.com/item/%E8%AE%A1%E7%AE%97%E6%9C%BA%E7%BD%91%E7%BB%9C/18763
Baidu Baike is an encyclopedia site similar to Wikipedia. Its database stores a large number of concept entries and their descriptions, and the description document of each concept contains hyperlinks to related concepts.
The site can be browsed as a guest without logging in, and a browser can make rapid, repeated visits without restriction even when not logged in.
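For reference, the entry links inside a page point at paths of the form /item/<entry name>/<entry id>, the same pattern as the URL above, and the crawler later extracts them with a regular expression. A minimal sketch (the sample anchor tag below is a simplified assumption about the real markup):

import re

# Simplified example of an in-page entry link; the real markup carries more attributes.
sample = '<a href="/item/%E8%AE%A1%E7%AE%97%E6%9C%BA%E7%BD%91%E7%BB%9C/18763" target="_blank">计算机网络</a>'
name, entry_id, mention = re.findall(r'href="/item/([^"]+)/(\d+).*?>(.*?)</a>', sample)[0]
print(name, entry_id, mention)  # URL-encoded entry name, numeric entry id, anchor text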
Technologies and Tools
The crawler is written in Python. It uses the requests module to send HTTP requests, the time module to insert random waits between requests so the anti-crawling mechanism is less likely to flag it, the re module to extract useful information from the crawled HTML with regular expressions, and the json module to format the data and write it to disk.
In addition, the crawler uses a commercially provided IP pool, which lets the program issue fast, multi-threaded concurrent requests, improving crawling efficiency and supporting large-scale collection.
Code Implementation
The crawler first takes a specified seed entry, then follows the hyperlinks and related-entry titles found in each document, appending them to a pending queue and crawling recursively: 3 levels deep for the larger datasets and 4 levels deep for the smaller ones.
import json
import re
import time
import traceback
import urllib.parse

import requests


def query(url):
    """Fetch one page through a fresh proxy from the IP pool; keep retrying until a response comes back."""
    while True:
        try:
            # Ask the proxy provider for one proxy IP to use for this request.
            r = requests.get(url='http://api.proxy.ipidea.io/getProxyIp?num=1&tag=static_hk_balance&return_type=json&lb=1&sb=0&flow=1&protocol=http')
            r = r.json()
            print(r)
            proxy = {
                "http": 'http://' + r['data'][0]['ip'] + ':' + str(r['data'][0]['port']),
                "https": 'http://' + r['data'][0]['ip'] + ':' + str(r['data'][0]['port'])
            }
            # Mimic a Chrome browser request.
            headers = {
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
                # 'Cookie': '...',  # logged-in session cookie (value omitted here)
                'Connection': 'close'
            }
            r = requests.get(url=url, headers=headers, proxies=proxy)
            r.encoding = "utf-8"
            print(r.status_code)
            return r.text
        except Exception:
            # Some proxy IPs time out or are rejected; log the error and retry with a new IP.
            traceback.print_exc()
if __name__ == '__main__':
    # BFS queue of (entry name, entry id, depth) tuples and a visited set of entry names.
    que = []
    vis = set()
    que.append(('面向对象', 2262089, 1))
    vis.add('面向对象')
    # que.append(('网络环境', 4422188, 1))
    # vis.add('网络环境')
    cur = 0
    docs = []
    mention_list = []
    doc_f = open('./docs.json', 'w', encoding='utf-8')
    mention_f = open('./mentions.json', 'w', encoding='utf-8')
    while cur < len(que):
        print(f"cur/sum: {cur}/{len(que)}")
        entity_name, entity_id, depth = que[cur]
        cur += 1
        print("request:", 'https://baike.baidu.com/item/' + entity_name + '/' + str(entity_id))
        url = 'https://baike.baidu.com/item/' + urllib.parse.quote(entity_name) + '/' + str(entity_id)
        text = query(url)
        doc = {
            'title': entity_name,
            'id': entity_id,
            'content': ''
        }
        # The entry body is made up of <span ... data-text="true"> fragments; fragments that
        # contain an /item/ link are both document text and a mention of another entry.
        span_pattern = r'<span class=".*?" data-text="true">(.*?)</span>'
        link_pattern = r'href="/item/([^"]+)/(\d+).*?>(.*?)</a>'
        for match in re.findall(span_pattern, text):
            link = re.search(link_pattern, match)
            if link:
                match_entity_name, match_entity_id, match_mention = link.groups()
                match_entity_id = int(match_entity_id)
                # Enqueue the linked entry if it is new and within the depth limit.
                if depth < 4 and match_entity_name not in vis:
                    que.append((match_entity_name, match_entity_id, depth + 1))
                    vis.add(match_entity_name)
                # Record where the mention lands inside the document content.
                start_pos = len(doc['content'])
                doc['content'] += match_mention
                end_pos = len(doc['content'])
                mention = {
                    'doc_id': entity_id,
                    'entity_id': match_entity_id,
                    'entity_name': match_entity_name,
                    'mention': match_mention,
                    'start_pos': start_pos,
                    'end_pos': end_pos
                }
                mention_list.append(mention)
                mention_f.write(json.dumps(mention, ensure_ascii=False) + '\n')
            else:
                doc['content'] += match
        docs.append(doc)
        doc_f.write(json.dumps(doc, ensure_ascii=False) + '\n')
        time.sleep(1)
    doc_f.close()
    mention_f.close()
At first I issued HTTP requests directly with requests at a 1-second interval; after roughly 30 pages, Baidu's anti-crawling mechanism detected the crawler and started serving a CAPTCHA page.
I then retried with a random interval of 20-30 seconds, but the CAPTCHA still appeared after about half an hour of crawling. The crawler survived longer, but because the interval was also much longer, the total number of pages collected barely improved.
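The longer random wait was just a sleep between successive requests, roughly as below (assuming the standard random module; the exact call is not shown in the code above):

import random
import time

# Wait a random 20-30 seconds so requests do not arrive at a fixed rate.
time.sleep(random.uniform(20, 30))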
Since browser requests were not restricted, I next tried attaching a User-Agent header to imitate a Chrome browser request, together with the cookie of a logged-in user.
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    # 'Cookie': '...',  # logged-in session cookie (value omitted here)
    'Connection': 'close'
}
After about half a day and roughly 1,300 pages, the CAPTCHA still appeared. Access recovered after roughly half an hour to an hour, but crawling was blocked in the meantime, which was far too inefficient.
I then configured an IP pool and sent each request through a different proxy IP, so the anti-crawling mechanism could not attribute the traffic to a single IP.
With the IP pool the crawler reliably avoided detection, but some proxy IPs time out, so each proxied request has to be attempted several times.
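A minimal sketch of such a retry wrapper, assuming a bounded number of attempts, a per-request timeout, and a get_proxy() helper that returns a fresh proxies dict from the pool (none of these exact values appear in the original code):

import requests

def fetch_with_retries(url, headers, get_proxy, max_attempts=5):
    """Try several proxies from the pool; give up after max_attempts failures."""
    for attempt in range(max_attempts):
        try:
            # get_proxy() is assumed to return a requests-style dict such as
            # {"http": "http://1.2.3.4:8080", "https": "http://1.2.3.4:8080"}.
            r = requests.get(url, headers=headers, proxies=get_proxy(), timeout=10)
            r.encoding = "utf-8"
            return r.text
        except requests.RequestException:
            print(f"proxy attempt {attempt + 1}/{max_attempts} failed, retrying with a new IP")
    raise RuntimeError(f"all {max_attempts} proxy attempts failed for {url}")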
Because every request now goes out from a different IP, requests can be issued concurrently and the wait between them can be shortened, so the interval was set to a random 1-5 seconds.
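A rough sketch of one way to batch such requests through a thread pool, reusing the query() function above (the worker count and batch size are illustrative assumptions; the threaded variant is not shown in the code above):

import random
import time
import urllib.parse
from concurrent.futures import ThreadPoolExecutor

def fetch_entry(task):
    entity_name, entity_id, depth = task
    # Each worker still waits a random 1-5 seconds so the traffic is not perfectly regular.
    time.sleep(random.uniform(1, 5))
    url = 'https://baike.baidu.com/item/' + urllib.parse.quote(entity_name) + '/' + str(entity_id)
    return query(url)

# e.g. fetch the next 32 queued (name, id, depth) tasks with 8 workers
with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(fetch_entry, que[cur:cur + 32]))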
Crawling Results
We crawled entries related to computer networks, object-oriented programming, software engineering, and computer organization and architecture, obtaining about 1.5 GB of text data in total, split roughly 1:1 between entry descriptions and inter-entry relations.
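Both output files are JSON Lines: docs.json holds one {title, id, content} object per entry, and mentions.json one {doc_id, entity_id, entity_name, mention, start_pos, end_pos} object per link between entries, matching the crawler code above. A small sketch of reading them back and checking a mention offset:

import json

with open('./docs.json', encoding='utf-8') as f:
    docs = [json.loads(line) for line in f]
with open('./mentions.json', encoding='utf-8') as f:
    mentions = [json.loads(line) for line in f]

# Each mention's start_pos/end_pos index into the content of its source document.
doc_by_id = {d['id']: d for d in docs}
m = mentions[0]
assert doc_by_id[m['doc_id']]['content'][m['start_pos']:m['end_pos']] == m['mention']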