A couple of days ago I wrote this crawler for Huaban (huaban.com). The site uses a waterfall-flow layout, and scrolling down prompts you to log in, so I expected to have to simulate a login with requests.post, which would have been handy practice. It turns out the images can be scraped without logging in at all. I originally planned to use selenium, but it proved too slow in practice, so I switched to the requests library.
import os
import re
from hashlib import md5
from multiprocessing.dummy import Pool as ThreadPool

import requests
from requests.exceptions import RequestException


def get_pictures(word, i):
    url = 'http://huaban.com/search/?q=' + word + '&page=' + str(i + 1) + '&per_page=20&wfl=1'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/68.0.3440.84 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        html = response.text
        # Extract each image's key from the JSON inlined in the page
        re_key = re.compile(r'file":.*?key":"(.*?)"', re.S)
        key_list = re.findall(re_key, html)[1:21]
        for img in key_list:
            download_img(img)


def download_img(img):
    url = 'http://img.hb.aicdn.com/' + img
    print('Downloading:', url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            save_image(response.content)
    except RequestException:
        print('Downloading Error!')


def save_image(content):
    # Name the file after the MD5 of its bytes, so duplicates are skipped
    file_path = '{0}/{1}.jpg'.format('D:\\Flower', md5(content).hexdigest())
    if not os.path.exists(file_path):
        with open(file_path, 'wb') as f:
            f.write(content)


def main(index):
    keyword = '宝儿姐'
    get_pictures(keyword, index)


if __name__ == '__main__':
    # log_in()
    page = 5
    group = [x for x in range(0, page)]
    pool = ThreadPool(8)
    pool.map(main, group)
    pool.close()
    pool.join()
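The heart of the crawler is the regex that pulls each image's key out of the JSON Huaban inlines into its search page. You can try it on a made-up fragment (the field layout follows what the regex targets; the key values here are invented, not real image keys):

```python
import re

# Hypothetical fragment of the inlined page JSON with two fabricated keys
html = (
    '"file":{"bucket":"hbimg","key":"abc123","type":"image/jpeg"},'
    '"file":{"bucket":"hbimg","key":"def456","type":"image/jpeg"}'
)

# Same pattern as the crawler: non-greedily skip from "file" to the
# nearest "key" field and capture its value
re_key = re.compile(r'file":.*?key":"(.*?)"', re.S)
keys = re.findall(re_key, html)
print(keys)  # → ['abc123', 'def456']
```

Each captured key is then appended to http://img.hb.aicdn.com/ to form a direct image URL.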
The code is short and the logic is simple: the main work is analyzing Huaban's pages, extracting each image's key, and processing it accordingly. Create the target folder (D:\Flower here) before running the crawler. Nothing here is especially tricky, so I won't analyze it further. For a detailed look at the underlying approach, see this post:
https://blog.csdn.net/coder_pig/article/details/79209418
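One detail worth noting is how save_image names files: the file name is the MD5 of the image bytes, so the same picture downloaded twice maps to the same path and the existence check skips the rewrite. A minimal, directory-agnostic sketch of that idea (using a temporary folder instead of the hard-coded D:\Flower):

```python
import os
import tempfile
from hashlib import md5


def save_image(content, folder):
    # File name = MD5 of the bytes: identical content always yields the
    # same path, so re-downloaded duplicates are written only once
    file_path = os.path.join(folder, md5(content).hexdigest() + '.jpg')
    if not os.path.exists(file_path):
        with open(file_path, 'wb') as f:
            f.write(content)
    return file_path


folder = tempfile.mkdtemp()
p1 = save_image(b'fake-jpeg-bytes', folder)
p2 = save_image(b'fake-jpeg-bytes', folder)  # duplicate content, same path
print(p1 == p2)  # → True
```

This also makes the crawler safe to re-run: pages already fetched cost only the download, never a duplicate file on disk.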