This is the first little project I put together back when I was learning web scraping. Simple as it is, I ran into a few small problems while writing it, so I'm collecting them here.
Target site: https://www.meituri.com (model photo galleries; see the URLs in the code below)
Scraper approach:
- Request the index page of the given model, parse it, and collect the links to all of her albums in a list.
- Grab the model's name and total album count, and create a local folder named after both (model name + album count).
- Take each album link from the list in turn, parse the album page, and create a folder named after the album.
- Download the album's images into the corresponding folder (the resulting layout is sketched just below).
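
To make steps 2–4 concrete, this is the directory layout the script is meant to end up with. The model name, album count, album titles, and file names below are placeholders of mine, not values taken from the site:

```
D://爬虫/                 <- os_path, the local download root
└── SomeModel12/          <- model name + total album count
    ├── First Album/      <- one folder per album
    │   ├── 0.jpg
    │   └── 1.jpg
    └── Second Album/
        └── 0.jpg
```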
Complete code:
import requests
from bs4 import BeautifulSoup
import get_one_album as goa  # helper module that downloads a single album (not included in this snippet)
import os
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    'Referer': 'https://www.meituri.com/t/4074/',  # the site appears to reject image requests without a Referer
}
# Test models — swap in whichever pair you want to scrape:
# url = 'https://www.meituri.com/t/4074/'
# url_list = ['https://www.meituri.com/t/4074/']
url = 'https://www.meituri.com/t/2441/'
url_list = ['https://www.meituri.com/t/2441/']
# url = 'https://www.meituri.com/t/646/'
# url_list = ['https://www.meituri.com/t/646/']
# url = 'https://www.meituri.com/t/296/'  # this model's albums span a second page
# url_list = ['https://www.meituri.com/t/296/']
os_path = 'D://爬虫/'  # local root folder for all downloads
album_url_list = []  # links to the individual albums
def parse_url(url):
    # Fetch a page and return its soup plus the <h1> text (the model's name on an index page)
    html = requests.get(url, headers=headers).content
    bsObj = BeautifulSoup(html, 'lxml')
    name = bsObj.find('h1').text
    return bsObj, name
def next_page(bsobj):
    # Work out how many pages of albums there are and return the count;
    # a model with more than 40 albums spans multiple pages
    num = 1
    nextPage = bsobj.find('div', {'id': 'pages'})
    if nextPage:
        pagenum = nextPage.findAll('a')
        num = len(pagenum) - 1  # the last <a> seems to be the "next page" link, so it doesn't count
        print(pagenum)
        print('Found {} pages'.format(num))
    else:
        print('Only 1 page')
    return num
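# A minimal usage sketch of the two helpers, assuming they are wired together like
# this (the variable names here are illustrative, not from the original script):
#
#     bsObj, name = parse_url(url)  # soup of the model's index page plus her name
#     pages = next_page(bsObj)      # e.g. 2 for /t/296/, which has a second page of albums
#
# The len(pagenum) - 1 above counts one <a> per page in the '#pages' div and assumes
# the final <a> is the "next page" link; if that link is ever absent, the count will
# be off by one.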