Baidu Tieba Crawler

Latest recommended article published on 2024-04-08 08:34:21

weixin_30446613

Tags: crawler, python

Original link: http://www.cnblogs.com/NanaseHaruka/p/11174930.html

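Scraping the 王者荣耀 (Honor of Kings) forum with a Python script

This post walks through how I scraped threads from the 王者荣耀 forum (吧) on Baidu Tieba for a project.

I. Install the third-party libraries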

pip install requests
pip install beautifulsoup4
pip install lxml
pip install html5lib

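II. Analyzing the page source

1. Find the pattern in the request URL

Press F12 to open the developer tools, open the 王者荣耀 forum on Baidu Tieba, and find the corresponding request in the Network tab. The request URL follows this pattern:

https://tieba.baidu.com/f?kw=%E7%8E%8B%E8%80%85%E8%8D%A3%E8%80%80&ie=utf-8&tab=corearea&pn=450
kw: forum name (王者荣耀, percent-encoded as UTF-8)
ie: encoding
tab: front-page tab
pn: paging offset; the script below advances it by 50 per page, so pn=450 is page 10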
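
To double-check the parameter breakdown above, the query string can be rebuilt from its parts with the standard library. This is only a minimal sketch: the helper name build_list_url is made up for illustration and does not appear in the original script.

```python
from urllib.parse import urlencode

BASE_URL = 'https://tieba.baidu.com/f'


def build_list_url(kw, tab, pn, ie='utf-8'):
    """Assemble a thread-list URL (hypothetical helper, for illustration only)."""
    # urlencode percent-encodes the UTF-8 bytes of non-ASCII values such as the forum name
    query = urlencode({'kw': kw, 'ie': ie, 'tab': tab, 'pn': pn})
    return BASE_URL + '?' + query


url = build_list_url('王者荣耀', 'corearea', 450)
print(url)
```

The printed URL reproduces the one captured in the Network tab, which confirms that kw is simply the forum name in percent-encoded UTF-8.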

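The tab field corresponds to the tabs on the forum's front page. Click through them one by one and watch the URL to see which value each tab uses; for example, corearea is the core area (核心区) and main is the thread view (看帖).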

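2. Analyze the response

Find the response to that request. The summary of every thread is already in it; the page does not load them via AJAX, so there is no need to analyze XHR requests.

Note, however, that the thread information we want is not inside the <html>...</html> element. It comes after it, wrapped in separate <code>......</code> tags, and inside those tags it is an HTML comment. If BeautifulSoup parses the page without any preprocessing, it will not find the content we want.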
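
The layout described above can be illustrated on a toy page. The sketch below swaps in the standard library's html.parser in place of BeautifulSoup so that it runs without extra dependencies; the sample markup, the id value, and the class name CodeCommentCollector are all invented for this example.

```python
from html.parser import HTMLParser

# A tiny stand-in for the real page: the payload sits after </html>,
# inside a <code> element, as an HTML comment. The markup is invented.
SAMPLE = (
    '<html><body><p>visible</p></body></html>'
    '<code id="thread_list"><!--<ul><li>post 1</li><li>post 2</li></ul>--></code>'
)


class CodeCommentCollector(HTMLParser):
    """Collect the text of HTML comments that appear inside <code> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of <code> elements currently open
        self.comments = []  # comment bodies found inside <code>

    def handle_starttag(self, tag, attrs):
        if tag == 'code':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'code' and self.depth:
            self.depth -= 1

    def handle_comment(self, data):
        if self.depth:
            self.comments.append(data)


collector = CodeCommentCollector()
collector.feed(SAMPLE)
collector.close()
hidden_html = collector.comments[0]
print(hidden_html)
```

The recovered comment text is itself HTML, which is why the script later parses it a second time.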

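3. Start scraping

Requirements: scrape threads from the 王者荣耀 forum and save them locally, with options to choose the number of pages, the tab, a cutoff date (threads before a given date), and keywords that a thread must contain. Each thread record contains the title, link, posting date, full content, and all replies.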

(1) Send the request and get the response

import requests


def get_html(post_name, tab, pn):
    """
    Fetch one page of the thread list.
    :param post_name: forum name
    :param tab: tab name
    :param pn: paging offset (50 threads per page)
    :return: the preprocessed HTML, or 'ERROR' if the request failed
    """
    url = 'https://tieba.baidu.com/f'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    }
    # tab values:
    # core area (核心区): corearea; thread view (看帖): main
    data = {
        'kw': post_name,
        'tab': tab,
        'pn': pn,
    }
    try:
        response = requests.get(url, params=data, headers=headers, timeout=30)
    except requests.RequestException:
        # requests reports network failures as RequestException subclasses
        return 'ERROR'
    # The closing tags must be moved to the very end of the page. Otherwise the soup
    # stops at the original </html>, and the <code> elements after it are discarded.
    html = response.text.replace('</body>', '')
    html = html.replace('</html>', '')
    return html + '</body></html>'

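Note the comment in the code. As the earlier analysis showed, the content we want is not inside <html><body>...</body></html>, so the soup cannot reach it later unless we modify the source we received.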
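
The effect of that preprocessing step is easiest to see on a small string. This is a sketch on an invented sample page; the function name move_closing_tags_to_end is hypothetical and only isolates the two replace calls from get_html.

```python
def move_closing_tags_to_end(page):
    """Strip </body> and </html> wherever they occur and re-append them at the end."""
    stripped = page.replace('</body>', '').replace('</html>', '')
    return stripped + '</body></html>'


# Invented sample: a <code> element trails the original closing tags
raw = '<html><body><p>visible</p></body></html><code id="x"><!--hidden--></code>'
fixed = move_closing_tags_to_end(raw)
print(fixed)
```

After the rewrite the <code> element sits inside <body>, so a parser that stops at </html> still sees it.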

(2) Parse the response

import time

from bs4 import BeautifulSoup


def get_post_info(html, m, pn):
    """
    Extract the title, link, date, and abstract of each thread, then keep the recent ones.
    :param html: the preprocessed HTML page
    :param m: how many months back to keep
    :param pn: paging offset
    :return: list of thread info dicts
    """
    url = 'https://tieba.baidu.com'
    soup = BeautifulSoup(html, 'lxml')
    # Find the target <code> tag; find_all returns a list of tags
    code = soup.find_all('code', attrs={'id': 'pagelet_html_frs-list/pagelet/thread_list'})
    if not code or not code[0].contents:
        # Also covers the 'ERROR' value returned by get_html on a failed request
        print('Error: thread list not found on page %s' % (pn // 50 + 1))
        return []
    # The content of the <code> tag is a comment node
    comment = code[0].contents
    # Parse the comment text as HTML in its own right
    soup = BeautifulSoup(comment[0], 'lxml')

    info = []

    # # Sticky threads first
    # litags_top = soup.find_all('li', attrs={'class': 'j_thread_list thread_top j_thread_list clearfix'})
    # for li in litags_top:
    #     info_top = dict()
    #     try:
    #         info_top['title'] = li.find('a', attrs={'class': 'j_th_tit'}).text.strip()
    #         info_top['link'] = ''.join([url, li.find('a', attrs={'class': 'j_th_tit'})['href']])
    #         info_top['time'] = li.find('span', attrs={'class': 'pull-right is_show_create_time'}).text.strip()
    #         info.append(info_top)
    #     except Exception:
    #         print('Error: failed to read a sticky thread title!')

    # Regular threads: extract title, link, posting date, and abstract
    litags = soup.find_all('li', attrs={'class': 'j_thread_list clearfix'})
    for li in litags:
        try:
            info_norm = dict()
            info_norm['title'] = li.find('a', attrs={'class': 'j_th_tit'}).text.strip()
            info_norm['link'] = ''.join([url, li.find('a', attrs={'class': 'j_th_tit'})['href']])
            info_norm['date'] = li.find('span', attrs={'class': 'pull-right is_show_create_time'}).text.strip()
            info_norm['abstract'] = li.find(
                'div', attrs={'class': 'threadlist_abs threadlist_abs_onlyline'}).text.strip()
            info.append(info_norm)
        except AttributeError as e:
            print('Error: %s, probably because an expected tag was not found' % e)
        except Exception:
            print('Error: failed to read the title and abstract of a regular thread!')

    print('Page %s scraped, processing...' % (pn // 50 + 1))
    # Keep threads posted within the last m months. The keyword filter on
    # ['发热', '卡', '掉帧'] is currently disabled (see the commented block below).
    # Today's month and day
    today = time.strftime('%m-%d', time.localtime(time.time()))
    month = int(today.split('-')[0])
    day = int(today.split('-')[1])

    info_new = []
    for post in info:
        if ':' in post['date']:
            # A time of day instead of a date: posted today
            info_new.append(post)
            continue
        try:
            post_month, post_day = (int(x) for x in post['date'].split('-'))
        except ValueError:
            continue
        if not 1 <= post_month <= 12:
            continue  # not an MM-DD date
        months_ago = (month - post_month) % 12
        if months_ago == 0 and post_day <= day:
            info_new.append(post)
        elif 0 < months_ago < m:
            # Whole months strictly between the cutoff month and the current month
            info_new.append(post)
        elif months_ago == m and post_day >= day:
            info_new.append(post)

    # # Store matches separately per keyword
    # keywords = ['发热', '卡顿', '掉帧', '卡死']
    # num = len(keywords)
    # info_has_kw = [[] for i in range(num)]
    # for post in info_new:
    #     for i in range(num):
    #         if keywords[i] in post['abstract']:
    #             info_has_kw[i].append(post)
    #             break

    print('Page %s processed, moving on to the next page...' % (pn // 50 + 1))
    # return info_has_kw
    return info_new
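
The filtering step can also be written with real dates, which handles the year boundary and makes the disabled keyword filter easy to switch on. This is a sketch under stated assumptions, not the script's own method: it assumes list-page labels are either HH:MM (posted today) or MM-DD without a year, as the checks in get_post_info suggest, and the names months_back, parse_label, is_recent, and has_keyword are hypothetical.

```python
import calendar
from datetime import date


def months_back(today, months):
    """Return the date `months` calendar months before `today`, clamping the day."""
    index = today.year * 12 + (today.month - 1) - months
    year, month0 = divmod(index, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(today.day, last_day))


def parse_label(label, today):
    """Turn an 'MM-DD' label into the most recent matching date, or None."""
    try:
        month, day = (int(part) for part in label.split('-'))
    except ValueError:
        return None
    # Assumption: a label later than today belongs to the previous year
    for year in (today.year, today.year - 1):
        try:
            candidate = date(year, month, day)
        except ValueError:
            continue
        if candidate <= today:
            return candidate
    return None


def is_recent(label, today, months):
    """True if the label is a time of day (today) or a date inside the window."""
    if ':' in label:
        return True
    posted = parse_label(label, today)
    return posted is not None and posted >= months_back(today, months)


def has_keyword(post, keywords):
    """True if any keyword occurs in the title or the abstract."""
    text = post.get('title', '') + post.get('abstract', '')
    return any(word in text for word in keywords)


today = date(2019, 7, 12)  # fixed date so the example is reproducible
posts = [  # invented sample records
    {'title': '手机发热严重', 'abstract': '', 'date': '10:31'},
    {'title': '新皮肤', 'abstract': '团战掉帧', 'date': '05-20'},
    {'title': '组队', 'abstract': '来人', 'date': '06-01'},
    {'title': '卡顿', 'abstract': '', 'date': '03-30'},
]
# overheating, stutter, frame drops, freeze
keywords = ['发热', '卡顿', '掉帧', '卡死']
selected = [p for p in posts
            if is_recent(p['date'], today, 3) and has_keyword(p, keywords)]
print([p['title'] for p in selected])
```

Passing today in as a parameter, instead of reading the clock inside the function, keeps the filter testable with fixed dates.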

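(3) Save to a local file

import os


def save2file(info, savepath=os.path.join(os.path.dirname(os.path.realpath(__file__)), 'post.txt')):
    """
    Append the scraped threads to a local txt file.
    :param info: list of thread info dicts
    :param savepath: output file path; defaults to post.txt next to this script
    :return:
    """
    # Variant for the per-keyword lists:
    # for i in range(len(info)):
    #     with open(savepath, 'a+', encoding='utf-8') as f:
    #         for post in info[i]:
    #             f.write('Title: {} \t Link: {}\n'.format(post['title'], post['link']))
    # utf-8 keeps Chinese titles intact regardless of the platform's default encoding
    with open(savepath, 'a+', encoding='utf-8') as f:
        for post in info:
            # One thread per line
            f.write('Title: {} \t Link: {}\n'.format(post['title'], post['link']))
    print('Current page saved to disk!')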

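(4) Main program

if __name__ == '__main__':
    post_name = '王者荣耀'
    tab = 'main'
    # The loop controls how many pages are scraped (50 threads per page)
    for pn in range(10):
        html = get_html(post_name, tab, pn * 50)
        if html == 'ERROR':
            print('Request for page %s failed, skipping' % (pn + 1))
            continue
        info = get_post_info(html, 3, pn * 50)
        # print(info)
        save2file(info)
    print('-------All threads downloaded-------')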
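
The main loop relies on the relation between a page number and the pn offset. A small sketch of that mapping follows; PAGE_SIZE and both function names are hypothetical, and the value 50 is taken from the pn*50 arithmetic in the loop.

```python
PAGE_SIZE = 50  # threads per list page, as implied by pn * 50 in the main loop


def page_to_offset(page):
    """Convert a 1-based page number into the pn query parameter."""
    return (page - 1) * PAGE_SIZE


def offset_to_page(pn):
    """Convert a pn offset back into a 1-based page number."""
    return pn // PAGE_SIZE + 1


offsets = [page_to_offset(page) for page in range(1, 11)]
print(offsets)
```

Integer division matters here: in Python 3, pn / 50 + 1 yields a float such as 10.0, while pn // 50 + 1 yields 10.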

Reposted from: https://www.cnblogs.com/NanaseHaruka/p/11174930.html
