Python Web Scraping in Practice: Douyu Live-Stream Scraper

Latest recommended article published on 2024-07-06 11:39:08

雾里看花_lhh


Article link: https://blog.csdn.net/m0_37903789/article/details/84330435


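Preface:
Let's briefly recap what we have covered so far. In the earlier hands-on posts we learned how to fetch a page's source with requests and extract the data we need from it. In this post we go a step further and learn to use selenium to load a page, analyze it, and extract data.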
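Target site analysis
Target URL: https://www.douyu.com/directory/all
Target content:
In this exercise we scrape the information for every room listed on Douyu and extract the target fields: room name, room link, streamer, room category, viewer count, and so on (the fields boxed in red in the original screenshot).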
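Pagination strategy:
As covered earlier, there are three ways to page through a site:
1. Find the relationship between page URLs directly, e.g. /pnXX/: incrementing XX is enough to visit every page.
2. For pages loaded via AJAX, find the underlying API endpoint and request it directly to get the page data, e.g. Sina Weibo.
3. Locate the "next page" button and use selenium to click it.
Here we use the third approach to get each following page.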
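For contrast, the first approach can be sketched in a few lines. This is only an illustration under assumptions: the helper name build_page_urls and the example.com base URL are hypothetical, and Douyu itself is not paged this way in this post.

```python
def build_page_urls(base_url, first_page, last_page):
    """Return the URLs for pages first_page..last_page (inclusive)."""
    return [f"{base_url}/pn{page}/" for page in range(first_page, last_page + 1)]


# Hypothetical listing whose pages live at /pn1/, /pn2/, /pn3/
urls = build_page_urls("https://example.com/list", 1, 3)
for url in urls:
    print(url)
```

Each generated URL could then be fetched with requests, as in the earlier posts.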

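        # Requires: from selenium.webdriver.common.by import By (Selenium 4 locator API)
        # Assumes `times` (page turns so far) was initialized earlier in this method.
        # 3. Click "next page"
        print('Next Page!')
        temp_list = self.driver.find_elements(By.CLASS_NAME, "shark-pager-next")
        # 4. Loop to collect the room info page by page (capped at 3 page turns)
        while len(temp_list) > 0 and times < 3:
            times += 1
            temp_list[0].click()
            time.sleep(3)  # crude fixed wait for the next page to render
            item_list = self.get_room_info()
            self.save_item_list(item_list)
            print('Next Page!')
            # Look the button up again; it is absent on the last page
            temp_list = self.driver.find_elements(By.CLASS_NAME, "shark-pager-next")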
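The control flow of this loop can be tried out without a browser. The sketch below is not the author's code: crawl_with_next_button and FakeSite are hypothetical names, and the fake site merely stands in for the selenium driver so the stopping conditions (no next button, or the click cap reached) can be checked.

```python
import time


def crawl_with_next_button(get_items, find_next_buttons, save_items,
                           max_clicks=3, delay=3.0):
    """Scrape the current page, then click "next" at most max_clicks times."""
    save_items(get_items())          # first page
    clicks = 0
    buttons = find_next_buttons()
    while buttons and clicks < max_clicks:
        buttons[0].click()
        time.sleep(delay)            # fixed wait, as in the loop above
        save_items(get_items())
        clicks += 1
        buttons = find_next_buttons()
    return clicks


class FakeSite:
    """Stand-in for a paged site: clicking moves to the next page."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def click(self):
        self.index += 1

    def get_items(self):
        return self.pages[self.index]

    def find_next_buttons(self):
        # Like find_elements: an empty list when there is no next page
        return [self] if self.index < len(self.pages) - 1 else []


site = FakeSite([["a"], ["b"], ["c"]])
saved = []
clicks = crawl_with_next_button(site.get_items, site.find_next_buttons,
                                saved.extend, max_clicks=3, delay=0)
print(clicks, saved)
```

With a real driver, the three callables would presumably be the spider's own get_room_info, a lookup of the "shark-pager-next" elements, and save_item_list.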

Extracting the data

    def get_room_info(self):
        """Collect one dict per room from the currently loaded listing page."""
        # Requires: from selenium.webdriver.common.by import By (Selenium 4 locator API)
        li_list = self.driver.find_elements(By.XPATH, "//ul[@id='live-list-contentbox']/li")
        item_list = []
        for li in li_list:
            link = li.find_element(By.XPATH, "./a")  # the anchor carries both title and href
            room_name = link.get_attribute("title")
            room_link = link.get_attribute("href")
            room_img = li.find_element(By.XPATH, "./a/span/img").get_attribute("src")
            room_category = li.find_element(By.XPATH, ".//span[@class='tag ellipsis']").text
            room_author = li.find_element(By.XPATH, ".//span[@class='dy-name ellipsis fl']").text
            watch_number = li.find_element(By.XPATH, ".//span[@class='dy-num fr']").text
            item = dict(
                room_name=room_name,
                room_link=room_link,
                room_img=room_img,
                room_category=room_category,
                room_author=room_author,
                watch_number=watch_number,
            )
            item_list.append(item)
        return item_list
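The watch_number extracted here is raw text. If it needs to be sorted or stored as a number, it can be normalized first. A minimal sketch, assuming the page shows counts either as plain digits or with the 万 (ten thousand) suffix, e.g. "1.5万"; that display format and the helper name parse_watch_number are assumptions, not part of the original code.

```python
from decimal import Decimal


def parse_watch_number(text):
    """Convert viewer-count text such as '3280' or '1.5万' to an int."""
    text = text.strip()
    if text.endswith("万"):
        # 万 means ten thousand; Decimal avoids float rounding surprises
        return int(Decimal(text[:-1]) * 10000)
    return int(text)


for sample in ["3280", "1.5万", " 12.3万 "]:
    print(sample, parse_watch_number(sample))
```

Any other format would raise an exception, which is usually preferable to silently storing a wrong number.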

Results
[Image: screenshot of the scraped results]
A small exercise to finish: extend the code further and store the extracted data in a database.
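One possible starting point for the exercise, using the sqlite3 module from the standard library. This is a sketch under assumptions, not the author's solution: the rooms table, the choice of room_link as primary key, and the sample rows are all made up for illustration. The column names follow the dict keys built in get_room_info.

```python
import sqlite3


def save_item_list(conn, item_list):
    """Insert room dicts into a 'rooms' table, replacing rows with the same link."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rooms ("
        "room_link TEXT PRIMARY KEY, room_name TEXT, room_img TEXT, "
        "room_category TEXT, room_author TEXT, watch_number TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO rooms "
        "(room_name, room_link, room_img, room_category, room_author, watch_number) "
        "VALUES (:room_name, :room_link, :room_img, :room_category, "
        ":room_author, :watch_number)",
        item_list,
    )
    conn.commit()


# Made-up sample rows shaped like the dicts from get_room_info
sample_items = [
    dict(room_name="Room A", room_link="https://example.com/1",
         room_img="https://example.com/1.jpg", room_category="Games",
         room_author="alice", watch_number="3280"),
    dict(room_name="Room B", room_link="https://example.com/2",
         room_img="https://example.com/2.jpg", room_category="Music",
         room_author="bob", watch_number="1.5万"),
]

conn = sqlite3.connect(":memory:")  # use a file path to persist the data
save_item_list(conn, sample_items)
row_count = conn.execute("SELECT COUNT(*) FROM rooms").fetchone()[0]
print(row_count)
```

Using the room link as the key means that re-scraping the same room updates its row instead of adding a duplicate.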
Source code: https://github.com/NO1117/Douyu_Spider
Python discussion group: 942913325. Everyone is welcome to join and learn together.