Scraping Baidu Tieba posts with a Python crawler

weixin_43904840

Tags: python, study log

Article link: https://blog.csdn.net/weixin_43904840/article/details/89792553

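The tools are Scrapy and BeautifulSoup.
Open cmd in the directory where the spider should be created and run scrapy genspider spider_name spider_url to create a new spider.
Open the crawler's root directory in PyCharm, find the generated spider file (spider_name.py) in the spiders folder, and write the spider code there.
I had been playing a lot of CS:GO recently, so I chose to scrape posts from the csgo tieba, 10 pages at most.
The spider's main function is parse.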

import scrapy
from bs4 import BeautifulSoup
from a.items import URLItem
import requests

class UrlSpiderSpider(scrapy.Spider):
    name = 'url_spider'
    allowed_domains = ['tieba.baidu.com']  # domain names only, not full URLs
    start_urls = ['https://tieba.baidu.com/f?ie=utf-8&kw=csgo&fr=search']  # URL of the csgo tieba

    def parse(self, response):
        post_get = BeautifulSoup(response.body, 'lxml')  # parse the page with BeautifulSoup, using the lxml parser
        post_find = post_get.find_all('li', class_='j_thread_list clearfix')  # first pass: find every node that wraps a single post
        for post in post_find:  # for each node, pull out the post's URL, title, and last reply time
            url_item = URLItem()
            a = post.find('a', class_='j_th_tit')  # node holding the post title and URL
            a_text = a['href']  # post URL
            a_title = a.text  # post title
            a_par3 = a.parent.parent.parent  # climb up to the larger node that also holds the reply time and other details
            a_time = a_par3.find('span', class_='threadlist_reply_date pull_right j_reply_data')  # search that node for the span with the reply time
            a_time = a_time.text  # last reply time
            url_item['post_url'] = a_text
            url_item['post_name'] = a_title
            url_item['post_reply_time'] = a_time
            #print(a_text)
            #print(a_title)
            #print("Last reply at:", end=a_time)  # print each post's URL, title, and last reply time
            #print("================================================================")
            yield url_item
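        other_find = post_get.find_all('a', class_='pagination-item')  # links at the bottom of the first page that jump to the other pages
        for page in other_find:  # fetch and parse each of those pages with BeautifulSoup
            page_url = page['href']
            page_url = 'http://' + page_url[2:]  # the href starts with '//', so strip it and add the scheme
            url_obj = requests.get(page_url)
            url_go = BeautifulSoup(url_obj.content, 'lxml')
            o_post_find = url_go.find_all('li', class_='j_thread_list clearfix')  # first pass: find every node that wraps a single post
            for o_post in o_post_find:  # same extraction as the first for loop
                url_item = URLItem()  # create a fresh item for each post instead of reusing one per page
                b = o_post.find('a', class_='j_th_tit')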
                b_text = b['href']
                b_title = b.text
                b_par3 = b.parent.parent.parent
                b_time = b_par3.find('span', class_='threadlist_reply_date pull_right j_reply_data')
                b_time = b_time.text
                url_item['post_url'] = b_text
                url_item['post_name'] = b_title
                url_item['post_reply_time'] = b_time
                #print(b_text)
                #print(b_title)
                #print("Last reply at:", end=b_time)
                #print("================================================================")
                yield url_item

This collects the title, link, and last reply time of every post on the first 10 pages.
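
The spider builds each page URL by slicing the leading `//` off the `href` and prepending `http://`, which only works if every pagination link is protocol-relative. A slightly more defensive sketch, using only the standard library, resolves the links with `urllib.parse.urljoin` and skips duplicates. The function name `resolve_page_urls`, the `limit` parameter, and the sample hrefs are assumptions made for illustration and are not part of the original spider.

```python
from urllib.parse import urljoin


def resolve_page_urls(base_url, hrefs, limit=10):
    """Turn raw pagination hrefs into absolute URLs, without duplicates.

    base_url: URL of the page the hrefs were found on.
    hrefs:    href values as they appear in the HTML (absolute,
              protocol-relative, or path-relative).
    limit:    maximum number of URLs to return (hypothetical cap).
    """
    seen = set()
    resolved = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # handles '//host/path' and '/path'
        if absolute in seen:
            continue
        seen.add(absolute)
        resolved.append(absolute)
        if len(resolved) >= limit:
            break
    return resolved


# Made-up hrefs in the protocol-relative form the spider expects.
base = 'https://tieba.baidu.com/f?ie=utf-8&kw=csgo&fr=search'
sample = [
    '//tieba.baidu.com/f?kw=csgo&ie=utf-8&pn=50',
    '//tieba.baidu.com/f?kw=csgo&ie=utf-8&pn=100',
    '//tieba.baidu.com/f?kw=csgo&ie=utf-8&pn=50',  # duplicate, skipped
]
for url in resolve_page_urls(base, sample):
    print(url)
```

Inside `parse`, the list of hrefs would come from the `pagination-item` links and `base_url` from `response.url`.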
The scraped data is structured with an item and saved through a pipeline.
The code in items.py:

import scrapy


class URLItem(scrapy.Item):
    post_name = scrapy.Field()
    post_url = scrapy.Field()
    post_reply_time = scrapy.Field()

This defines the URLItem class with three fields: the post title, the link, and the last reply time. Every new field is declared the same way:

parameter_name = scrapy.Field()

A new pipeline has to be registered in settings.py. Find the following setting:

ITEM_PIPELINES = {

}

Uncomment it and add the new pipeline's path to it, in this format:

ITEM_PIPELINES = {
    'a.pipelines.URL_INFPipeline': 300,
#    'a.pipelines.APipeline': 300,

}

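The commented-out entry is the pipeline that exists by default. Pipeline classes are defined in pipelines.py. The data is to be saved as a CSV file, so the CsvItemExporter class has to be imported from scrapy.exporters first. The pipeline code is: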

from scrapy.exporters import CsvItemExporter


class URL_INFPipeline(object):
    def open_spider(self, spider):  # optional; the name and parameters are fixed by Scrapy
        self.file = open('url_inf.csv', 'wb')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def process_item(self, item, spider):  # required; the name and parameters are fixed by Scrapy
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):  # optional; the name and parameters are fixed by Scrapy
        self.exporter.finish_exporting()
        self.file.close()

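A pipeline class can be named anything, but it must define a process_item function. According to the official Scrapy documentation, this function must either return the item or raise DropItem, optionally with a custom message that is printed in the log; see Docs » Item Pipeline for details. The same site also covers what exporters do and how to use them.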
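Once the spider is written, open cmd in the root directory and run scrapy crawl url_spider to start it. When the run finishes, a new CSV file appears in the root directory containing everything that was scraped.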
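
A quick way to check the result is to read the file back with the standard `csv` module. The sketch below assumes the exporter's default settings, under which the first row holds the field names and the file is UTF-8 encoded; the helper name `load_posts` is made up for this example.

```python
import csv


def load_posts(path='url_inf.csv'):
    """Read the exported CSV into a list of dicts keyed by field name."""
    with open(path, newline='', encoding='utf-8') as f:
        return list(csv.DictReader(f))
```

Calling `load_posts()` from the project root after a crawl should return one dict per post, with the keys post_name, post_url, and post_reply_time.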
