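I recently stumbled on this site, which offers a lot of Kindle e-books for download. After looking it over, scraping it turned out to be quite similar to the Toutiao scraper I wrote a while ago.
The site's home page is https://bookset.me/, and this time the target is the ranking page at https://bookset.me/?rating=douban. Inspecting it shows that the real pagination rule for the ranking is https://bookset.me/page/num?rating=douban, where num is the page number.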
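As a quick illustration of that pagination rule, here is a minimal sketch that only builds the page URLs and makes no network requests. The names `BASE_URL` and `build_ranking_url` are made up for this example and are not part of the scraper itself; it also assumes page numbers start at 1.

```python
# Hypothetical helper names, used only to illustrate the URL pattern.
BASE_URL = 'https://bookset.me'


def build_ranking_url(page_num, rating='douban'):
    """Return the ranking URL for a given page number (assumed to start at 1)."""
    return '{}/page/{}?rating={}'.format(BASE_URL, page_num, rating)


# Build the URLs of the first three ranking pages.
urls = [build_ranking_url(n) for n in range(1, 4)]
for url in urls:
    print(url)
```

Keeping the URL construction in one small function makes it easy to change the rating type or the page range later without touching the request code.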
The complete code is as follows:
#-*- coding: utf-8 -*-
import re
from requests.exceptions import RequestException
from bs4 import BeautifulSoup
import requests
from multiprocessing import Pool
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Connection': 'Keep-Alive',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
}
# Fetch the HTML of one ranking page; returns None on a non-200 response
def get_page_index(page_num):
    url = 'https://bookset.me/page/' + str(page_num) + '?rating=douban'
    print(url)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestExcepti