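Scraping Python video data from the Bilibili web site with requests + bs4
Goal: learn the usual pattern for scraping data with bs4
Fields to scrape:
Video thumbnail
Play count
Upload time
Author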
import os

import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent shared by every request
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}


def get_html():
    url = "https://www.bilibili.com/"
    html = requests.get(url, headers=HEADERS).text
    soup = BeautifulSoup(html, 'lxml')
    # Video cards on the home page, matched by their exact class string
    groom_module = soup.find_all(attrs={'class': 'groom-module home-card'})
    os.makedirs("pictures", exist_ok=True)  # make sure the output folder exists
    for i in groom_module:
        upload_time = get_time(i.a['href'])
        image = i.a.img['src']  # scheme-less URL, so "https:" is prepended below
        pic = requests.get("https:" + image, timeout=10)
        title = i.find(attrs={'class': 'title'}).text
        # Use the last 20 characters of the image URL as the file name
        with open(os.path.join("pictures", image[-20:]), 'wb') as fp:
            fp.write(pic.content)
        author = i.find(attrs={'class': 'author'}).text
        play = i.find(attrs={'class': 'play'}).text
        print(upload_time, image, title, author, play)


def get_time(url):
    # Fetch the video page and read the upload time from its <time> tag
    url = "https://www.bilibili.com" + url
    html = requests.get(url, headers=HEADERS).text
    soup = BeautifulSoup(html, 'lxml')
    return soup.find("time").text


get_html()
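The parsing half of the script can be tried without touching the network. What follows is only a minimal sketch: it runs the same find_all / find pattern against a small inline HTML string. The markup in SAMPLE_HTML and the helper parse_cards are made up for illustration and merely mimic the class names used above; they are not the real Bilibili page. It also uses the built-in "html.parser" instead of lxml so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the class names used in the script above
SAMPLE_HTML = """
<div class="groom-module home-card">
  <a href="/video/av1"><img src="//example.com/cover1.jpg">
    <p class="title">Python basics</p>
    <p class="author">alice</p>
    <p class="play">1200</p>
  </a>
</div>
<div class="groom-module home-card">
  <a href="/video/av2"><img src="//example.com/cover2.jpg">
    <p class="title">Scraping with bs4</p>
    <p class="author">bob</p>
    <p class="play">345</p>
  </a>
</div>
"""


def parse_cards(html):
    """Return one dict per card (hypothetical helper, not part of the original script)."""
    soup = BeautifulSoup(html, "html.parser")  # built-in parser, no lxml needed
    cards = []
    for card in soup.find_all(attrs={"class": "groom-module home-card"}):
        cards.append({
            "href": card.a["href"],
            "image": "https:" + card.a.img["src"],  # add the missing scheme
            "title": card.find(attrs={"class": "title"}).text,
            "author": card.find(attrs={"class": "author"}).text,
            "play": card.find(attrs={"class": "play"}).text,
        })
    return cards


cards = parse_cards(SAMPLE_HTML)
for card in cards:
    print(card)
```

Keeping the parsing in a function that takes an HTML string makes it easy to test against saved pages before pointing it at the live site.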
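I took this opportunity to review the basics of Beautiful Soup:
Get a tag: soup.title (a tag is a node in the tree)
Get an attribute: soup.img['src'] reads an attribute of that node
Get a node's name: soup.title.name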
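As a quick illustration of these three accessors, here is a small sketch on a made-up document (the HTML string is an assumption for the demo, not taken from any real page). Attribute-style access such as soup.img always refers to the first matching tag.

```python
from bs4 import BeautifulSoup

# Hypothetical document used only for this demonstration
html = (
    '<html><head><title>Demo page</title></head>'
    '<body><img src="/a.png" alt="first"><img src="/b.png"></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title           # first <title> tag in the document
print(title_tag.name)            # the tag's name
print(title_tag.string)          # the text inside the tag
print(soup.img["src"])           # attribute of the first <img>
print(soup.img.attrs)            # all attributes of that tag as a dict
print(soup.img.get("width"))     # .get() returns None for a missing attribute
```

Indexing with soup.img['width'] would raise KeyError here, so .get() is the safer choice when an attribute may be absent.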
Standard selector: find_all( name , attrs , recursive , text , **kwargs )
name: soup.find_all('ul')
attrs: soup.find_all(attrs={'id': 'list-1'})
    soup.find_all(attrs={'name': 'elements'})
    soup.find_all(id='list-1')
    soup.find_all(class_='element')
text: soup.find_all(text='Foo')
find( name , attrs , recursive , text , **kwargs ): same arguments, but returns only the first match
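The calls above can be exercised on a small made-up document, sketched below. The HTML is an assumption for the demo. One deliberate difference: the sketch passes string= rather than text=, because string is the name this argument has had since Beautiful Soup 4.4.0, with text being the older spelling.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this demonstration
html = """
<ul id="list-1" name="elements">
  <li class="element">Foo</li>
  <li class="element">Bar</li>
</ul>
<ul id="list-2">
  <li class="element">Foo</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

all_lists = soup.find_all("ul")                      # by tag name
by_id = soup.find_all(attrs={"id": "list-1"})        # by attribute dict
by_name = soup.find_all(attrs={"name": "elements"})  # 'name' has to go through attrs
by_class = soup.find_all(class_="element")           # class_ avoids the Python keyword
foo_strings = soup.find_all(string="Foo")            # returns text nodes, not tags
first_li = soup.find("li")                           # first match only
missing = soup.find("table")                         # None when nothing matches

print(len(all_lists), len(by_class), foo_strings, first_li.text, missing)
```

Since find returns None when nothing matches, chained calls such as soup.find("time").text are worth guarding when the page layout may change.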
CSS selectors:
soup.select('.panel .panel-heading')  # . selects by class
soup.select('ul li')  # tag names
soup.select('#list-2 .element')  # id first, then class name
soup.select('ul')[0]
for ul in soup.select('ul'):
    print(ul.select('li'))
Get attributes:
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
Get text content:
for li in soup.select('li'):
    print(li.get_text())
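To see the selector, attribute, and text snippets working together, here is a small self-contained sketch. The HTML is invented for the demo and only reuses the class and id names that appear in the notes above; select_one is included as the CSS counterpart of find.

```python
from bs4 import BeautifulSoup

# Hypothetical markup reusing the class/id names from the notes above
html = """
<div class="panel">
  <div class="panel-heading"><h4>Hello</h4></div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
    <ul class="list" id="list-2">
      <li class="element">Baz</li>
    </ul>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

headings = soup.select(".panel .panel-heading")      # descendant selected by class
items = soup.select("ul li")                         # every <li> inside a <ul>
list2_items = soup.select("#list-2 .element")        # id first, then class
ids = [ul["id"] for ul in soup.select("ul")]         # read an attribute per match
texts = [li.get_text() for li in soup.select("li")]  # read the text per match
first = soup.select_one("ul li")                     # first match only

print(ids)
print(texts)
```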
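The code is fairly rough and its functionality is simple, but since there was no real requirement behind it, getting it to work is good enough.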