Goal: search Bilibili for "爬虫" (web crawler) and scrape the URL of every video in the search results.
1. Determine the URL of the search results page
Search for "爬虫" on the Bilibili homepage to reach the results page.
The resulting URL is:
https://search.bilibili.com/all?keyword=%E7%88%AC%E8%99%AB&from_source=webtop_search&spm_id_from=333.851
Trimming the tracking parameters gives:
https://search.bilibili.com/all?keyword=%E7%88%AC%E8%99%AB
This shorter link opens the same page. Inspecting the page further shows that there are 50 pages of results, so scraping every link requires paging. The URL of page 2 is:
https://search.bilibili.com/all?keyword=%E7%88%AC%E8%99%AB&page=2
We can therefore page through the results with a loop:
for i in range(1, 51):
    url = f"https://search.bilibili.com/all?keyword=%E7%88%AC%E8%99%AB&page={i}"
    driver.get(url)
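The %E7%88%AC%E8%99%AB in the URL above is simply the percent-encoded UTF-8 form of 爬虫. Rather than hard-coding the encoded string, it can be generated with the standard library, which also makes it easy to swap in a different keyword; a small sketch:

```python
from urllib.parse import quote

keyword = "爬虫"  # the search term
encoded = quote(keyword)  # percent-encode the UTF-8 bytes
url = f"https://search.bilibili.com/all?keyword={encoded}&page=1"
print(url)  # https://search.bilibili.com/all?keyword=%E7%88%AC%E8%99%AB&page=1
```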
2. Locate the information for each video
Right-click the results page, choose Inspect, and select a video.
Under each video's li tag, the href attribute of the a tag contains that video's URL.
videos = driver.find_elements_by_xpath("//li[@class='video-item matrix']")
for video in videos:
    # video URL
    video_url = video.find_element_by_css_selector('a[class="img-anchor"]').get_attribute('href')
    # publication date
    publish_time = video.find_element_by_css_selector("span[class='so-icon time']").text
    # uploader's username
    up_name = video.find_element_by_css_selector("a[class='up-name']").text
    # video title
    title = video.find_element_by_css_selector("a[class='title']").text
    # view count
    view_number = video.find_element_by_css_selector("span[class='so-icon watch-num']").text
Note that each page has 20 videos, i.e. 20 li tags, so we locate them with find_elements_by_xpath (plural), then locate the fields inside each li with find_element_by_css_selector, which searches relative to that li element. Be careful with find_element_by_xpath here: an XPath that starts with // is evaluated from the document root even when called on an element, so every loop iteration would return the first match on the page; to search within the element, the XPath must start with .// instead.
3. Save the video information to a CSV file
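The saving step reduces to opening a file and feeding rows to csv.writer. A minimal, self-contained sketch (the row below is hypothetical sample data, not real scrape output):

```python
import csv

rows = [
    # (title, uploader, publish time, views, video url) -- hypothetical sample row
    ("demo video", "some_up", "2021-06-01", "1.2万", "https://www.bilibili.com/video/xxxx"),
]

with open("bilibili_info.csv", "w", newline="", encoding="utf-8") as fp:
    # ';' as delimiter avoids clashes with commas inside video titles
    writer = csv.writer(fp, delimiter=";")
    writer.writerow(["title", "uploader", "publish_time", "views", "video_url"])
    writer.writerows(rows)
```

Passing newline='' to open() is required on Windows so csv.writer does not emit blank lines between rows.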
Full code:
# -*- coding: utf-8 -*-
import csv
import random
import sys
import time

from selenium import webdriver

chrome_path = r'D:\python\anaconda\envs\py36\Scripts\chromedriver.exe'
# set up the browser
chrome_options = webdriver.ChromeOptions()
prefs = {'profile.managed_default_content_settings.images': 2}
# chrome_options.add_experimental_option('prefs', prefs)  # optionally disable images to speed up loading
driver = webdriver.Chrome(executable_path=chrome_path, options=chrome_options)
# mask the navigator.webdriver flag that Selenium sets
driver.execute_cdp_cmd(
    'Page.addScriptToEvaluateOnNewDocument',
    {
        'source': 'Object.defineProperty(navigator,"webdriver",{get:()=>undefined})'
    }
)
driver.maximize_window()


def view_bar(num, total):
    # simple text progress bar
    rate = float(num) / float(total)
    rate_num = int(rate * 100)
    bar = '\r[%s%s]%d%%,%d' % ("=" * rate_num, " " * (100 - rate_num), rate_num, num)
    sys.stdout.write(bar)
    sys.stdout.flush()


def random_sleep(mu, sigma):
    # sleep for a normally distributed time; fall back to mu if the draw is non-positive
    secs = random.normalvariate(mu, sigma)
    if secs <= 0:
        secs = mu
    time.sleep(secs)


def get_info(writer2):
    # locate the 20 <li> tags on the current page, then the fields inside each one
    videos = driver.find_elements_by_xpath("//li[@class='video-item matrix']")
    for video in videos:
        video_url = video.find_element_by_css_selector('a[class="img-anchor"]').get_attribute('href')
        publish_time = video.find_element_by_css_selector("span[class='so-icon time']").text
        up_name = video.find_element_by_css_selector("a[class='up-name']").text
        title = video.find_element_by_css_selector("a[class='title']").text
        view_number = video.find_element_by_css_selector("span[class='so-icon watch-num']").text
        writer2.writerow([title, up_name, publish_time, view_number, video_url])


def main(writer1):
    # wait for the page to load
    time.sleep(3)
    get_info(writer1)
    random_sleep(3, 0.1)


if __name__ == '__main__':
    file_name = "bilibili_info.csv"
    with open(file_name, "a+", errors="ignore", newline='', encoding='utf-8') as fp:
        writer = csv.writer(fp, delimiter=';')
        writer.writerow(["title", "uploader", "publish time", "views", "video url"])
        for i in range(1, 51):
            url = f"https://search.bilibili.com/all?keyword=%E7%88%AC%E8%99%AB&page={i}"
            driver.get(url)
            main(writer)
            view_bar(i, 50)
    driver.quit()