Slow Road to Web Scraping, Part 3: Scraping Video Info from Bilibili

Goal: search for "爬虫" (web crawler) on Bilibili and scrape the URL of every video in the search results.

1. Determine the search page URL

Search for "爬虫" on the Bilibili homepage to open the search results page.
(screenshot: Bilibili search results page)
This gives the URL:

https://search.bilibili.com/all?keyword=%E7%88%AC%E8%99%AB&from_source=webtop_search&spm_id_from=333.851

Trimming the link down gives:

https://search.bilibili.com/all?keyword=%E7%88%AC%E8%99%AB

This shorter link still opens the same page. Browsing further, the search results span 50 pages, so scraping every link requires paging through them; the URL of page 2 is:

https://search.bilibili.com/all?keyword=%E7%88%AC%E8%99%AB&page=2

We can therefore page through the results with a loop:

for i in range(1, 51):
    url = f"https://search.bilibili.com/all?keyword=%E7%88%AC%E8%99%AB&page={i}"
    driver.get(url)
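
Incidentally, the %E7%88%AC%E8%99%AB in the URL is just the percent-encoding of the keyword 爬虫. If you want to reuse the script for other keywords, here is a minimal sketch using the standard library's urllib.parse.quote to build the URL (the helper name build_search_url is mine, not from the original post):

from urllib.parse import quote

def build_search_url(keyword, page=1):
    # quote() percent-encodes non-ASCII characters, e.g. quote("爬虫") -> "%E7%88%AC%E8%99%AB"
    return f"https://search.bilibili.com/all?keyword={quote(keyword)}&page={page}"

print(build_search_url("爬虫", 2))
# https://search.bilibili.com/all?keyword=%E7%88%AC%E8%99%AB&page=2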

2. Locate each video's information

Right-click the search page, choose Inspect, and select one of the videos.
(screenshot: developer tools showing a video's li element)
You can see that under each video's li tag, the href attribute of the a tag holds that video's URL:

videos = driver.find_elements_by_xpath("//li[@class='video-item matrix']")
for video in videos:
    # video URL
    video_url = video.find_element_by_css_selector('a[class="img-anchor"]').get_attribute('href')
    # publish date
    publish_time = video.find_element_by_css_selector("span[class='so-icon time']").text
    # uploader's username
    up_name = video.find_element_by_css_selector("a[class='up-name']").text
    # video title
    title = video.find_element_by_css_selector("a[class='title']").text
    # view count
    view_number = video.find_element_by_css_selector("span[class='so-icon watch-num']").text

Note that each page contains 20 videos, i.e. 20 li tags, so they must be located with find_elements_by_xpath (plural). Within each li, locate the child elements with find_element_by_css_selector; a CSS selector called on an element is evaluated relative to that element. You cannot simply reuse find_element_by_xpath here, because an XPath expression beginning with // searches from the document root even when called on an element, so every loop iteration would return the same first match. A relative XPath does work if you anchor it with a leading dot, as shown below.
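
For illustration, a minimal sketch of the same extraction using relative XPath (note the leading "." that restricts the search to the current li element):

videos = driver.find_elements_by_xpath("//li[@class='video-item matrix']")
for video in videos:
    # ".//" searches only the descendants of this li;
    # a bare "//" would search the whole document and always hit the first video.
    video_url = video.find_element_by_xpath(".//a[@class='img-anchor']").get_attribute('href')
    title = video.find_element_by_xpath(".//a[@class='title']").text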

3. Save the video information to a CSV file

Full code:

# -*- coding: utf-8 -*-
import csv
import random
import sys
import time
from selenium import webdriver

chrome_path = r'D:\python\anaconda\envs\py36\Scripts\chromedriver.exe'

# set up the browser
chrome_options = webdriver.ChromeOptions()
# optionally disable image loading to speed up page loads
prefs = {'profile.managed_default_content_settings.images': 2}
# chrome_options.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome(executable_path=chrome_path, options=chrome_options)

# mask the navigator.webdriver flag that Selenium sets, to reduce bot detection
driver.execute_cdp_cmd(
    'Page.addScriptToEvaluateOnNewDocument',
    {
        'source': 'Object.defineProperty(navigator,"webdriver",{get:()=>undefined})'
    }
)
driver.maximize_window()


def view_bar(num, total):
    # simple text progress bar: [=====   ]X%,num
    rate = float(num) / float(total)
    rate_num = int(rate * 100)
    bar = '\r[%s%s]%d%%,%d' % ("=" * rate_num, " " * (100 - rate_num), rate_num, num)
    sys.stdout.write(bar)
    sys.stdout.flush()


def random_sleep(mu, sigma):
    # sleep for a normally distributed random duration to look less robotic
    secs = random.normalvariate(mu, sigma)
    if secs <= 0:
        secs = mu
    time.sleep(secs)


def get_info(writer2):
    # extract title, uploader, publish date, view count and URL for each video
    videos = driver.find_elements_by_xpath("//li[@class='video-item matrix']")
    for video in videos:
        video_url = video.find_element_by_css_selector('a[class="img-anchor"]').get_attribute('href')
        publish_time = video.find_element_by_css_selector("span[class='so-icon time']").text
        up_name = video.find_element_by_css_selector("a[class='up-name']").text
        title = video.find_element_by_css_selector("a[class='title']").text
        view_number = video.find_element_by_css_selector("span[class='so-icon watch-num']").text
        writer2.writerow([title, up_name, publish_time, view_number, video_url])


def main(writer1):
    # wait for the page to load
    time.sleep(3)
    get_info(writer1)
    random_sleep(3, 0.1)


if __name__ == '__main__':
    file_name = "bilibili_info.csv"
    with open(file_name, "a+", errors="ignore", newline='', encoding='utf-8') as fp:
        writer = csv.writer(fp, delimiter=';')
        writer.writerow(["title", "uploader", "publish time", "views", "video url"])
        for i in range(1, 51):
            url = f"https://search.bilibili.com/all?keyword=%E7%88%AC%E8%99%AB&page={i}"
            driver.get(url)
            main(writer)
            view_bar(i, 50)
    driver.quit()
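
One possible refinement (not in the original script): instead of a fixed time.sleep(3), Selenium's WebDriverWait can block until the result list has actually rendered, which is more robust on slow connections. A sketch, assuming the same li selector as above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def wait_for_results(driver, timeout=10):
    # block until at least one result li is present, or raise TimeoutException
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.XPATH, "//li[@class='video-item matrix']"))
    )

Calling wait_for_results(driver) at the top of main() would replace the fixed sleep.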
