Site to scrape: https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=%7B%E5%81%8F%E7%A7%BB%E9%87%8F%7D&type=T
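The `start` parameter in this URL holds the URL-encoded placeholder `{偏移量}` ("offset") rather than a number, and the tag segment decodes to `编程` ("programming"). Here is a minimal sketch of filling in the offset. It assumes each listing page shows 20 books, so offsets run 0, 20, 40 and so on. The page size and the helper name `build_page_url` are assumptions for illustration, not something the original code defines.

```python
from urllib.parse import quote

BASE = 'https://book.douban.com/tag/'


def build_page_url(tag, offset):
    """Return the listing URL for `tag`, starting at item number `offset`."""
    # quote() percent-encodes the UTF-8 bytes of the tag: 编程 -> %E7%BC%96%E7%A8%8B
    return f'{BASE}{quote(tag)}?start={offset}&type=T'


# Assumed page size of 20: the first three pages start at 0, 20 and 40.
urls = [build_page_url('编程', offset) for offset in range(0, 60, 20)]
for url in urls:
    print(url)
```

Each generated URL can be passed to `get_one_page()` in place of the hard-coded one.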
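[Figure: screenshot of the page being scraped]
[Figure: scraping result, simple version]
[Figure: scraping result, advanced version]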
Code:
Simple version:
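import requests
from bs4 import BeautifulSoup


def get_one_page(url):
    """Return the page HTML, or None if the status code is not 200."""
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    return None


def main():
    url = 'https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=%7B%E5%81%8F%E7%A7%BB%E9%87%8F%7D&type=T'
    html = get_one_page(url)
    if html is None:  # the request failed, so there is nothing to parse
        return
    soup = BeautifulSoup(html, 'lxml')
    # soup.find_all(class_='subject-item') would select the same elements
    for book in soup.select('.subject-item'):
        # bookimg = book.find('img')
        # strip=True removes newlines and surrounding spaces from each piece of text
        bookbrief = book.get_text(strip=True)
        print(bookbrief)


main()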
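The simple version prints each entry as one run-together string, because `get_text(strip=True)` trims every text fragment and then joins the pieces with no separator. The sketch below shows that on a small made-up fragment (not real Douban markup) and how passing a separator keeps the fields apart. It uses the built-in `'html.parser'` so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up fragment shaped like one listing entry (an assumption, not real page markup).
html = '''
<li class="subject-item">
  <a title="Book A">Book A</a>
  <div class="pub"> Author / Press / 2020 </div>
</li>
'''

item = BeautifulSoup(html, 'html.parser').select('.subject-item')[0]

# strip=True trims every text fragment, drops the empty ones,
# and joins what is left with an empty separator.
print(item.get_text(strip=True))       # Book AAuthor / Press / 2020

# The first argument of get_text() is the separator placed between fragments.
print(item.get_text('|', strip=True))  # Book A|Author / Press / 2020
```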
Advanced version:
import requests
from bs4 import BeautifulSoup


def get_one_page(url):
    """Return the page HTML, or None if the status code is not 200."""
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    return None


def main():
    url = 'https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=%7B%E5%81%8F%E7%A7%BB%E9%87%8F%7D&type=T'
    html = get_one_page(url)
    if html is None:  # the request failed, so there is nothing to parse
        return
    soup = BeautifulSoup(html, 'lxml')
    for book in soup.select('.subject-item'):
        # Title: among the entry's <a> tags, keep the one that has a title attribute
        for link in book.find_all('a'):
            if link.get('title') is not None:
                print("《" + link.get_text(strip=True) + "》")
        # Link: href of the first <a> in the entry
        bookurl = book.find('a').get('href')
        print(bookurl)
        # Publication info: text of the first element with class "pub"
        bookpub = book.select('.pub')[0].text.lstrip('\n ').rstrip('\n ')
        print(bookpub)
        # User feedback: text of the first element with class "pl"
        bookfeedback = book.select('.pl')[0].text.lstrip('\n ').rstrip('\n ')
        print(bookfeedback)


main()
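The advanced version prints each field as soon as it finds it. One possible variation, sketched here as an assumption rather than as the original design, is to collect the fields into dictionaries so they can be reused later. The function name `parse_books` and the sample markup are hypothetical. The sketch also swaps in `select_one()` and the built-in `'html.parser'` so that it runs offline without lxml.

```python
from bs4 import BeautifulSoup


def parse_books(html):
    """Return one dict per '.subject-item' element found in `html`."""
    soup = BeautifulSoup(html, 'html.parser')
    books = []
    for item in soup.select('.subject-item'):
        title_link = item.find('a', title=True)  # first <a> that has a title attribute
        first_link = item.find('a')               # first <a> of any kind
        pub = item.select_one('.pub')             # first match, or None
        feedback = item.select_one('.pl')
        books.append({
            'title': title_link.get_text(strip=True) if title_link else None,
            'url': first_link.get('href') if first_link else None,
            'pub': pub.get_text(strip=True) if pub else None,
            'feedback': feedback.get_text(strip=True) if feedback else None,
        })
    return books


# Made-up sample in the shape the selectors expect (not real Douban markup).
SAMPLE = '''
<ul>
  <li class="subject-item">
    <a href="https://example.com/book/1"><img src="cover.jpg"></a>
    <a href="https://example.com/book/1" title="Book A"> Book A </a>
    <div class="pub"> Author / Press / 2020 </div>
    <span class="pl"> (123 ratings) </span>
  </li>
</ul>
'''
print(parse_books(SAMPLE))
```

Returning `None` for a missing field avoids the `IndexError` that `select(...)[0]` raises when an entry lacks that element.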
Explanation:
1. Narrow down the scope
# Find every tag with class="subject-item"; each one is a single book entry
for book in soup.select('.subject-item'):
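For comparison, the same elements can be reached either with a CSS selector or with a keyword filter. The sketch below runs both on a made-up list (the markup is an assumption). Note that an element carrying several classes still matches as long as `subject-item` is one of them.

```python
from bs4 import BeautifulSoup

html = '''
<ul>
  <li class="subject-item">first</li>
  <li class="subject-item highlighted">second</li>
  <li class="other">third</li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

by_css = soup.select('.subject-item')             # CSS class selector
by_filter = soup.find_all(class_='subject-item')  # keyword filter on the class attribute

print([tag.get_text() for tag in by_css])     # ['first', 'second']
print([tag.get_text() for tag in by_filter])  # ['first', 'second']
```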
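2. Get the book title
for link in book.find_all('a'):
    if link.get('title') is not None:
        print("《" + link.get_text(strip=True) + "》")
# Inside the tag with class="subject-item", find all <a> tags, then filter them with the if.
# Filter condition: title is not None.
Why is the filter condition "title is not None"?
Each entry contains more than one <a> tag, so we need a feature that singles out the <a> holding the book title. That feature is its title attribute: link.get('title') returns None for an <a> without a title attribute, so only the title link passes the test.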
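The same filter can be written more compactly. The sketch below compares the loop-and-test approach with two shorter forms on a made-up entry (the markup is an assumption): `title=True` asks `find_all()` for tags that have a `title` attribute at all, and the CSS selector `a[title]` states the same condition.

```python
from bs4 import BeautifulSoup

html = '''
<li class="subject-item">
  <a href="/book/1"><img src="cover.jpg"></a>
  <a href="/book/1" title="Book A">Book A</a>
</li>
'''
book = BeautifulSoup(html, 'html.parser').select('.subject-item')[0]

# Loop over every <a> and keep those with a title attribute.
looped = [a for a in book.find_all('a') if a.get('title') is not None]
# Let find_all() do the filtering: True matches any value of the attribute.
filtered = book.find_all('a', title=True)
# CSS attribute selector: <a> tags that carry a title attribute.
selected = book.select('a[title]')

for a in filtered:
    print('《' + a.get_text(strip=True) + '》')  # 《Book A》
```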
3. Get the book's link
# find('a') returns the first <a> in the entry; get('href') reads its link target
bookurl = book.find('a').get('href')
4. Get the book's publication info
bookpub = book.select('.pub')[0].text.lstrip('\n ').rstrip('\n ')
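The publication string comes back as one line of text. Assuming its fields are separated by `/` (the example string and the helper name `split_pub` are hypothetical, and real entries may have more or fewer fields), it could be broken up like this:

```python
def split_pub(pub):
    """Split a publication string on '/' and trim each field."""
    return [field.strip() for field in pub.split('/')]


# Hypothetical example string, including the surrounding whitespace found in page text.
fields = split_pub('\n   Author A / Some Press / 2020-1 / 59.00元\n  ')
print(fields)  # ['Author A', 'Some Press', '2020-1', '59.00元']
```

This simple split would also cut a field that itself contains a `/`, so treat the result as a rough breakdown.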
5. Get the book's user feedback
bookfeedback = book.select('.pl')[0].text.lstrip('\n ').rstrip('\n ')
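If the feedback text contains a number of ratings, a regular expression can pull it out as an integer. This is a sketch under the assumption that the text looks like `(1234人评价)`. Both sample strings and the helper name `rating_count` are hypothetical.

```python
import re


def rating_count(feedback):
    """Return the first run of digits in `feedback` as an int, or None if there is none."""
    match = re.search(r'\d+', feedback)
    return int(match.group()) if match else None


print(rating_count('(1234人评价)'))    # 1234
print(rating_count('(目前无人评价)'))  # None
```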
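Additional notes:
(1) How to get the text inside a tag
bookfeedback = book.select('.pl')[0].text.lstrip('\n ').rstrip('\n ')
[0].text takes the first matching tag and returns the text inside it as a string.
lstrip('\n ').rstrip('\n ') removes the extra spaces and newlines at the start and end of that string.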
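A quick way to see what the chained `lstrip`/`rstrip` call does is to run it on a string with padding on both sides. The sample string below is made up. The sketch also shows that the argument is a set of characters to remove from the ends, not a prefix or suffix, and that whitespace in the middle is left alone.

```python
raw = '\n\n      (1234人评价)\n    '

chained = raw.lstrip('\n ').rstrip('\n ')
single = raw.strip('\n ')  # removes the same characters from both ends in one call
default = raw.strip()      # no argument: removes all leading and trailing whitespace

print(repr(chained))                  # '(1234人评价)'
print(chained == single == default)   # True

# Only the ends are trimmed; inner spaces and newlines stay.
print(repr('  \n x \n y \n '.strip('\n ')))  # 'x \n y'
```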
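(2) How to use find_all() and select()
Both find_all() and select() return a list of tags, so pick a tag by looping over the list or by indexing it, as in [0].
On any of those tags, the text can be read with the get_text() method or with the .text property. .text is a property, not a method, so it is written without parentheses: [0].text, not [0].text().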
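To check that the way a tag was found does not affect how its text is read, the sketch below fetches the same element three ways from a made-up snippet (the markup is an assumption). `select_one()` is included as a shortcut that returns the first match directly, or `None` when nothing matches.

```python
from bs4 import BeautifulSoup

html = '<div><span class="pl"> (123 ratings) </span></div>'
soup = BeautifulSoup(html, 'html.parser')

from_find_all = soup.find_all('span', class_='pl')[0]
from_select = soup.select('.pl')[0]
from_select_one = soup.select_one('.pl')  # first match or None, no indexing needed

# All three names refer to the same Tag, so .text and get_text() work on each.
print(from_find_all.text == from_select.get_text())    # True
print(repr(from_select_one.get_text(strip=True)))      # '(123 ratings)'
```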