Fetching Data from a Video Site

凯迪松鼠

Categories: Visualization, Web Scraping. Tags: python

Article link: https://blog.csdn.net/m0_48358490/article/details/119052273

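This program analyzes data from the TV-series channel of Tencent Video. It is for learning purposes only.

Overview: the program uses several techniques to deal with anti-scraping measures and pseudo-elements.

Finished result (screenshot):

Part 1:

The URL of the TV-series library shows that passing in a year and a page offset is enough to crawl every video. Hence: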

import random
import time

import pandas as pd
import requests
from lxml import etree

# UserAgent_List (a list of User-Agent strings) and the helper functions
# txSP_, sukan, gfzh, czh and f_info are defined elsewhere in the program.

MAX_RETRIES = 3  # give up on an item after this many failed attempts


def txSPList(offset, year):
    # URL of the TV-series library
    url_tx = 'https://v.qq.com/x/bu/pagesheet/list?_all=1&append=1&channel=tv&listpage=2&offset={offset}&pagesize=30&sort=18&year={year}'.format(
        offset=offset, year=year)
    headers = {
        "User-Agent": random.choice(UserAgent_List),
        "Connection": "close"
    }
    time.sleep(1)
    requests.packages.urllib3.disable_warnings()
    res = requests.get(url=url_tx, headers=headers)

    res.encoding = 'utf-8'  # decode the response as UTF-8
    html = etree.HTML(res.text)
    list_url = html.xpath("//div[contains(@class,'list_item')]")  # one block per series
    list_info_all = []
    for i in list_url:
        # Retry a failed item a bounded number of times instead of looping forever
        for attempt in range(MAX_RETRIES):
            try:
                url = i.xpath("./a[@class='figure']/@href")  # series URL
                title = i.xpath("./a[@class='figure']/@title")  # series title
                jishu = i.xpath("./a[@class='figure']/div/text()")  # episode-count text
                if len(jishu) <= 0:
                    jishu = [' ']

                vip = i.xpath("./a[@class='figure']/img[contains(@class,'mark_v')]/@alt")  # VIP badge
                if len(vip) <= 0:
                    vip = ['']
                jianjie = i.xpath("./div/div/@title")  # short description
                if len(jianjie) <= 0:
                    jianjie = [' ']

                # Status taken from the episode text: "全" = complete, "更新至" = updated up to
                if "全" in jishu[0]:
                    zhuangtai = '全'
                elif "更新至" in jishu[0]:
                    zhuangtai = '更新至'
                else:
                    zhuangtai = ' '

                l_d, l_1 = txSP_(url[0])
                s_url = sukan(title[0])
                g_code, g_jianjie = gfzh(title[0])
                if len(g_code) > 0:
                    g_url = 'https://v.qq.com/biu/creator/home?vcuid={}'.format(g_code[0])  # creator-account link
                else:
                    g_url = ' '
                if not g_jianjie:
                    g_jianjie = ' '

                if s_url != '':
                    c_code = czh(title[0])
                    if len(c_code) > 0:
                        c_url = 'https://v.qq.com/biu/creator/home?vcuid={}'.format(c_code[0])  # creator-account link
                    else:
                        c_url = ' '

                    f_l = f_info(s_url)  # publisher info list
                    if f_l[0] == 'undefined':
                        f_l = [' ', ' ']
                else:
                    c_url = ' '
                    f_l = [' ', ' ']

                list_info = title + jishu + jianjie + vip + url + f_l
                list_info.append(c_url)
                list_info.insert(1, zhuangtai)
                list_info.append(l_d)
                list_info.append(l_1)
                list_info.append(s_url)
                list_info.append(g_url)
                list_info.append(g_jianjie)

                print(len(list_info), list_info)
                list_info_all.append(list_info)
                break
            except Exception as e:
                print('-----------------------------')
                print(e, 'pausing for 10 seconds')
                time.sleep(10)
    df = pd.DataFrame(list_info_all)
    e_name = str(year) + '腾讯视频信息' + '.csv'  # "<year> Tencent Video info.csv"
    df.to_csv(e_name, mode='a', index=False, header=False, encoding='UTF-8-sig')

This retrieves the relevant information directly.
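
The paging idea (a year plus an offset selects one page) can be sketched as follows. This is a minimal sketch, assuming that `offset` advances in steps of `pagesize` (30 in the URL used by `txSPList`). `build_list_url` and `page_offsets` are hypothetical helper names, the year 2021 is an arbitrary example, and the number of pages is something you would have to determine yourself.

```python
from urllib.parse import urlencode

BASE_URL = "https://v.qq.com/x/bu/pagesheet/list"
PAGE_SIZE = 30  # matches pagesize=30 in the URL used by txSPList


def build_list_url(offset, year):
    """Build the list URL for one page (hypothetical helper)."""
    params = {
        "_all": 1,
        "append": 1,
        "channel": "tv",
        "listpage": 2,
        "offset": offset,
        "pagesize": PAGE_SIZE,
        "sort": 18,
        "year": year,
    }
    return BASE_URL + "?" + urlencode(params)


def page_offsets(pages, page_size=PAGE_SIZE):
    """Yield the offset of each page, assuming offsets step by page_size."""
    for page in range(pages):
        yield page * page_size


# Example: URLs of the first three pages for an arbitrary year
urls = [build_list_url(offset, 2021) for offset in page_offsets(3)]
```

Each offset produced this way could then be passed to `txSPList(offset, year)`.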

Later, since requesting data comes up so often, I simply wrapped it in two classes:

Request class

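class TxRequests:

    def __init__(self, url, headers=None):
        self.url = url
        # Pick the User-Agent per instance; a class-level random.choice
        # would be evaluated only once, when the class is defined.
        self.headers = headers or {
            "User-Agent": random.choice(UserAgent_List),
            "Connection": "close"
        }
        self.html = None

    # Send the request and parse the page source
    def tx_request(self):
        time.sleep(2)
        res = requests.get(url=self.url, headers=self.headers)
        res.encoding = 'utf-8'  # decode the response as UTF-8

        self.html = etree.HTML(res.text)
        return self.html

    # Run an XPath query; call tx_request() first
    def tx_xpath(self, xp_str):
        self.data = self.html.xpath(xp_str)  # e.g. xp_str = "//div[contains(@class,'list_item')]"
        return self.data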
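
A User-Agent pool only helps if a fresh value is drawn for each request; a dictionary built once with `random.choice` keeps the same value for its whole lifetime. Below is a minimal sketch of drawing one per call. `make_headers` is a hypothetical helper, and the sample strings are placeholders because the article's `UserAgent_List` is not shown.

```python
import random

# Placeholder pool: the article's UserAgent_List is not shown, so these are assumptions.
SAMPLE_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36",
]


def make_headers(user_agents, rng=random):
    """Return a new headers dict with a randomly chosen User-Agent."""
    if not user_agents:
        raise ValueError("user_agents must not be empty")
    return {"User-Agent": rng.choice(user_agents), "Connection": "close"}


# Build the headers right before each request
headers = make_headers(SAMPLE_USER_AGENTS)
```

The `rng` parameter accepts a seeded `random.Random` instance, which makes the choice reproducible when testing.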

Browser automation class

from selenium import webdriver


class Txselenium:

    def __init__(self, url, headers=None):
        self.url = url
        self.headers = headers  # kept for parity with TxRequests; not used by the browser
        self.html = None

    # Load the page in headless Chrome and parse the rendered source
    def tx_request(self):
        opt = webdriver.ChromeOptions()
        opt.add_argument(
            '--user-agent=Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36')
        opt.add_argument('--headless')  # run without a visible window
        opt.add_argument('--incognito')  # incognito mode
        # do not load images
        opt.add_argument('--blink-settings=imagesEnabled=false')
        opt.add_argument("--start-maximized")
        #opt.add_experimental_option("excludeSwitches", ["enable-logging"])
        # Pass the options object when creating the browser
        # (options= replaces the deprecated chrome_options= keyword)
        browser = webdriver.Chrome(options=opt)
        try:
            browser.get(self.url)
            time.sleep(3)
            self.res = browser.page_source
            self.html = etree.HTML(self.res)
            browser.delete_all_cookies()
        finally:
            browser.quit()  # closes every window and ends the driver process
        return self.html

    # Run an XPath query; returns None if no page has been loaded yet
    def tx_xpath(self, xp_str):
        if self.html is not None:
            self.data = self.html.xpath(xp_str)  # e.g. xp_str = "//div[contains(@class,'list_item')]"
            return self.data

For the GUI, importing a file:


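from tkinter import filedialog


def w1():
    # Let the user pick a file and keep its path in a module-level variable
    global file_path1
    file_path1 = filedialog.askopenfilename()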

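Start / pause:

I use a flag: when it is true the crawler runs, otherwise it pauses. This approach reacts with some delay, but it guarantees that the current loop iteration runs to completion before the while loop checks the flag again.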

def shop():
    # Toggle the run flag and show the new state in the text box.
    # fl, text2 and tk are defined elsewhere in the program.
    global fl
    fl = not fl
    text2.delete('1.0', "end")
    text2.insert(tk.INSERT, ('Running' if fl else 'Paused') + '\n')
    text2.update()
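
The flag-based pause can be sketched without tkinter so that it runs anywhere. This is only an illustration under stated assumptions: `RunFlag`, `crawl`, and the injectable `sleep` parameter are hypothetical names, and the article's actual crawl loop is not shown. The point is that the flag is checked only between items, so an item that has started always finishes before a pause takes effect.

```python
import time


class RunFlag:
    """Boolean run/pause flag, toggled by the start/pause button."""

    def __init__(self, running=False):
        self.running = running

    def toggle(self):
        self.running = not self.running
        return self.running


def crawl(items, flag, handle, poll=0.1, sleep=time.sleep):
    """Process items one by one, waiting between items while paused."""
    for item in items:
        # The flag is only checked here, never in the middle of an item.
        while not flag.running:
            sleep(poll)
        handle(item)


# Example: the flag starts paused, and the fake sleep resumes it
flag = RunFlag(running=False)
done = []
crawl([1, 2, 3], flag, done.append, sleep=lambda _seconds: flag.toggle())
```

In a real tkinter program the loop would typically need to run in a worker thread, or call `update()` regularly, so that the button stays responsive while the crawl is waiting.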