Website Image Scraping Crawler Code - Python - Developed by 立哥

Latest recommended article published on 2020-11-20 20:23:08

上海交大果粒人工智能学者全栈工程师

Category: 项目实例 (Project Examples). Tags: python, computer vision, queue, url, dell

Article link: https://blog.csdn.net/weixin_45806384/article/details/106438538

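Preface:
My first encounter with web crawlers was back in university, when I was working on the Challenge Cup (挑战杯) competition; I was a single young pup at the time. Back then I just put my head down, searched for material, and wrote code. I tried many times, but the result was never quite polished, and the JS I wrote was rather complicated. Still, plenty of senior and junior female classmates admired me (grin).
I am married now. Although I still keep learning and improving myself every day, years of weathering life in the real world mean I am no longer the very naive boy I once was. Thank you so much, my dear, for your company. As your husband, I will always stay by your side and protect you. May these years of technology bear witness to our love.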

import requests
import os
import time
import threading
from bs4 import BeautifulSoup

def download_page(url):
    """Fetch a page and return its HTML text, decoded as gb2312."""
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0"}
    r = requests.get(url, headers=headers)
    r.encoding = 'gb2312'  # decode the response body as gb2312
    return r.text

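def get_pic_list(html):
    """Find every album entry on a listing page and download its images."""
    soup = BeautifulSoup(html, 'html.parser')
    pic_list = soup.find_all('li', class_='wp-item')
    for i in pic_list:
        a_tag = i.find('h3', class_='tit').find('a')
        link = a_tag.get('href')    # URL of the album page
        text = a_tag.get_text()     # album title, used as the folder name
        get_pic(link, text)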

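def get_pic(link, text):
    """Download every image on the album page `link` into pic/<text>/."""
    html = download_page(link)  # fetch the album page
    soup = BeautifulSoup(html, 'html.parser')
    pic_list = soup.find('div', id="picture").find_all('img')
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0"}
    create_dir('pic/{}'.format(text))
    for i in pic_list:
        pic_link = i.get('src')  # URL of the image itself
        r = requests.get(pic_link, headers=headers)  # download the image, then save it to a file
        # Name the file after the image URL; naming it after `link` would make
        # every image in the album overwrite the same file.
        with open('pic/{}/{}'.format(text, pic_link.split('/')[-1]), 'wb') as f:
            f.write(r.content)
        time.sleep(1)  # pause between image downloads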

def create_dir(name):
    # exist_ok avoids an error if another thread creates the directory first
    os.makedirs(name, exist_ok=True)

def execute(url):
    page_html = download_page(url)
    get_pic_list(page_html)

def main():
    create_dir('pic')
    queue = [i for i in range(1, 70)]  # listing pages 1 through 69
    threads = []
    while len(queue) > 0:
        # Drop finished threads; rebuild the list instead of removing while iterating
        threads = [t for t in threads if t.is_alive()]
        while len(threads) < 5 and len(queue) > 0:
            cur_page = queue.pop(0)
            url = 'http://meizitu.com/a/more_{}.html'.format(cur_page)
            thread = threading.Thread(target=execute, args=(url,))
            thread.daemon = True  # setDaemon() is deprecated
            thread.start()
            print('{} is downloading page {}'.format(thread.name, cur_page))
            threads.append(thread)
    # Wait for the last workers; daemon threads are killed when the main thread exits
    for thread in threads:
        thread.join()

if __name__ == '__main__':
    main()
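
The manual thread bookkeeping in main() (a page queue plus a list capped at 5 live threads) could also be written with the standard library's concurrent.futures.ThreadPoolExecutor. What follows is only a minimal sketch of that alternative, not the approach used above. The helper names build_page_urls and run_in_pool are hypothetical and do not appear in the original script, and the demo worker is a stand-in that never touches the network.

```python
from concurrent.futures import ThreadPoolExecutor


def build_page_urls(first=1, last=69,
                    template='http://meizitu.com/a/more_{}.html'):
    """Return the listing-page URLs for pages first..last (inclusive).

    Hypothetical helper; the defaults mirror range(1, 70) in main().
    """
    return [template.format(n) for n in range(first, last + 1)]


def run_in_pool(worker, urls, max_workers=5):
    """Call worker(url) for every URL using at most max_workers threads.

    Hypothetical helper. Results come back in the same order as `urls`.
    Leaving the `with` block waits for all workers to finish, so no
    explicit join() is needed.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, urls))


if __name__ == '__main__':
    # Stand-in worker for demonstration only: it extracts the file name
    # from each URL. For a real run, pass execute from the script above.
    demo_urls = build_page_urls(1, 3)
    print(run_in_pool(lambda u: u.rsplit('/', 1)[-1], demo_urls))
```

With this shape, the pool decides when to start the next page, and an exception raised inside a worker surfaces when its result is read from pool.map instead of disappearing silently inside a daemon thread.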