Automated Scraping of Douyu Room Information

Latest recommended article published on 2024-11-05 15:28:12

ＪＩＮＣＨＥＮＧ０４０８

Article tags: python selenium xpath

Article link: https://blog.csdn.net/weixin_43297167/article/details/104552553

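The target here is Douyu's PUBG (PlayerUnknown's Battlegrounds) category page, scraped with the browser automation tool Selenium. Why Selenium rather than requests? Because the URL does not change when you page through the listing, so there is no next_url to follow, which makes requests awkward to use. It is best to write the XPath expressions by hand, since that is more flexible: pressing F12, selecting an element, and choosing Copy XPath only gives you a path to that single element.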
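
To illustrate why a hand-written XPath is more flexible than a copied one, here is a minimal sketch that runs without a browser. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, rather than Selenium, and the `SAMPLE` markup and the `parse_rooms` name are made up for the example; the real Douyu page is far larger and may be structured differently. The class name and relative paths are the ones used in the script further down.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup standing in for two room cards.
SAMPLE = """
<ul>
  <li>
    <div class="DyListCover-content">
      <div><h3 title="Room one">Room one</h3></div>
      <div><span>1200</span><h2>anchor_a</h2></div>
    </div>
  </li>
  <li>
    <div class="DyListCover-content">
      <div><h3 title="Room two">Room two</h3></div>
      <div><span>85</span><h2>anchor_b</h2></div>
    </div>
  </li>
</ul>
"""


def parse_rooms(markup):
    """Return one dict per room card found in the markup."""
    root = ET.fromstring(markup)
    rooms = []
    # One hand-written expression matches every card, not just a single element.
    for card in root.findall(".//div[@class='DyListCover-content']"):
        rooms.append({
            # Paths relative to the card, as in the Selenium script.
            "anchor": card.find("./div[2]/h2").text,
            "people_num": card.find("./div[2]/span").text,
            "title": card.find("./div[1]/h3").get("title"),
        })
    return rooms


if __name__ == "__main__":
    for room in parse_rooms(SAMPLE):
        print(room)
```

A path produced by Copy XPath typically pins an element by position from the document root, so it would match only one of these cards; the attribute-based expression matches all of them, and the relative paths then pull the fields out of each card.
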
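I ran into two main pitfalls. First, you need to time.sleep() for a few seconds when the page first loads and after each page turn, so the content has time to render. Second, when writing the XPath for the "next page" element, the class attribute value contains a leading space. Even if you copy the value, the space silently disappears, so you have to type it in by hand.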
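
Both pitfalls can be softened with small helpers. The sketch below is not part of the original script, and `wait_until` and `class_token_xpath` are hypothetical names. The first polls a condition instead of sleeping for a fixed time; the second builds an XPath that matches a class name regardless of surrounding whitespace, using the standard XPath 1.0 functions `contains`, `concat`, and `normalize-space`.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


def class_token_xpath(tag, token):
    """Build an XPath matching <tag> elements whose class list contains token.

    Padding the normalized class value with spaces makes leading, trailing,
    or repeated whitespace in the attribute irrelevant.
    """
    return (
        f"//{tag}[contains(concat(' ', normalize-space(@class), ' '), "
        f"' {token} ')]"
    )


if __name__ == "__main__":
    print(class_token_xpath("li", "dy-Pagination-next"))
```

With Selenium, the condition could be something like `lambda: driver.find_elements(By.XPATH, class_token_xpath("li", "dy-Pagination-next"))`, which returns as soon as the element exists. Selenium also ships its own `WebDriverWait` class in `selenium.webdriver.support.ui` for the same purpose. One caveat: the exact-match expression in the script stops matching as soon as the element carries any additional class, and that may be what ends the pagination loop on the last page. A token match would keep matching in that case, so it would need a separate last-page check.
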

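import time
from pprint import pprint

from selenium import webdriver
from selenium.webdriver.common.by import By


class douyu(object):
    def __init__(self):
        # Douyu PUBG category page
        self.start_url = 'https://www.douyu.com/g_jdqs'
        self.driver = webdriver.Chrome()

    def get_content_list(self):
        # Give the page time to finish rendering before querying it
        time.sleep(10)
        # find_elements(By.XPATH, ...) replaces find_elements_by_xpath,
        # which newer Selenium 4 releases no longer provide
        div_list = self.driver.find_elements(By.XPATH, '//div[@class="DyListCover-content"]')
        content_list = []  # avoid shadowing the built-in names list and dict
        for div in div_list:
            item = {}
            item['anchor'] = div.find_element(By.XPATH, './div[2]/h2').text
            item['people_num'] = div.find_element(By.XPATH, './div[2]/span').text
            item['title'] = div.find_element(By.XPATH, './div[1]/h3').get_attribute('title')
            content_list.append(item)
        pprint(content_list)
        return content_list

    def next_page(self):
        # Big pitfall here: the class attribute value starts with a space
        self.to_next = self.driver.find_elements(By.XPATH, "//li[@class=' dy-Pagination-next']")
        return self.to_next

    def save_content(self):
        # Save the data however you like. It is already printed above, so saving is optional.
        pass

    def run(self):
        self.driver.maximize_window()
        self.driver.get(self.start_url)
        self.get_content_list()
        # An empty result list means there is no "next page" button left to click
        while self.next_page():
            self.to_next[0].click()
            time.sleep(5)
            self.get_content_list()
        # quit() closes the browser and also ends the driver session
        self.driver.quit()


if __name__ == '__main__':
    dy = douyu()
    dy.run()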
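
The script leaves `save_content` empty. One possible way to fill it in, as a sketch only, is to append each page's records to a CSV file. The `save_rows` name, the file path, and the choice of CSV are assumptions for illustration; the field names match the keys built in `get_content_list`.

```python
import csv
from pathlib import Path


def save_rows(rows, path, fieldnames=("anchor", "people_num", "title")):
    """Append room records to a CSV file, writing the header only once."""
    path = Path(path)
    # A missing or empty file still needs its header row.
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    # Hypothetical record in the same shape the scraper produces.
    save_rows([{"anchor": "anchor_a", "people_num": "1200", "title": "Room one"}],
              "douyu_rooms.csv")
```

Appending page by page means an interrupted run still keeps whatever was scraped before the failure.
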