Python Web Crawler (4): Scraping the Girl Pictures from jandan.net

Latest recommended article published on 2021-01-17 14:58:42

One-Shell

Category: Python crawlers. Tags: python, web crawler

Article link: https://blog.csdn.net/GenteelDevil/article/details/54232661

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup
import urllib.request
import re
import os

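def get_html(url):
    """Fetch url and return it parsed as a BeautifulSoup object, or None on failure."""
    try:
        html = urlopen(url)
    except HTTPError as e:
        # The server answered, but with an error status.
        print(e)
        return None
    except URLError as e:
        # The server could not be reached at all.
        print(e)
        return None
    try:
        bsObj = BeautifulSoup(html, "html.parser")
        return bsObj
    except AttributeError as e:
        print(e)
        return None
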
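def get_img(bsObj):
    """Return the <img> tags whose src points at a sinaimg.cn mw600 JPEG."""
    if bsObj is None:
        # get_html failed for this page, so there is nothing to search.
        return []
    # Raw string, so the backslashes reach the regex engine unchanged.
    img_addrs = bsObj.find_all("img", {"src": re.compile(r"//ww[1-9]\.sinaimg\.cn/mw600/[0-9a-zA-Z]{32}\.jpg")})
    return img_addrs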

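def save_img(path, img_addrs, i):
    """Download each image into path as i.jpg, i+1.jpg, ...; return the next index."""
    for each in img_addrs:
        # os.path.join adds the separator even when path has no trailing slash.
        filename = os.path.join(path, str(i) + '.jpg')
        try:
            urllib.request.urlretrieve('http:' + each["src"], filename)
            print("%d.jpg download success!" % i)
        except URLError as e:
            # URLError also covers HTTPError, which is a subclass of it.
            print(e)
        i = i + 1
    return i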

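if __name__ == "__main__":
    # Keep asking until a directory that does not exist yet can be created.
    while True:
        path = input("Please input the path:")
        try:
            os.makedirs(path)
            break
        except FileExistsError as e:
            print(e)
            continue
    page = int(input("Please input the pages:"))
    i = 0
    for n in range(1, page + 1):
        # 2308 is hard-coded, so the loop walks backwards starting at page 2307.
        url = "http://jandan.net/ooxx/page-" + str(2308 - n) + "#comments"
        print(url)
        html = get_html(url)
        if html is None:
            # Skip pages that could not be fetched or parsed.
            continue
        img_addrs = get_img(html)
        i = save_img(path, img_addrs, i)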
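
The pattern in get_img is the part of this script that is easiest to get subtly wrong, so it may be worth checking it on its own before crawling. The sketch below is only an illustration: the helper name matches_image_src and the sample src values are made up, and the pattern is written with the letter range as A-Z, because a range spelled A-z (lowercase z) also accepts the six ASCII punctuation characters that sit between 'Z' and 'a'.

```python
import re

# Same shape as the pattern in get_img, as a raw string with the range A-Z.
IMG_SRC = re.compile(r"//ww[1-9]\.sinaimg\.cn/mw600/[0-9a-zA-Z]{32}\.jpg")

# A-z spans ASCII 65..122, which includes '[', '\\', ']', '^', '_' and '`'.
LOOSE_CLASS = re.compile(r"[0-9a-zA-z]")
STRICT_CLASS = re.compile(r"[0-9a-zA-Z]")


def matches_image_src(src):
    """Return True if src contains something shaped like the expected image URL."""
    return IMG_SRC.search(src) is not None


if __name__ == "__main__":
    # Made-up sample values, not real image addresses.
    good = "//ww2.sinaimg.cn/mw600/" + "a1B2" * 8 + ".jpg"
    other_size = "//ww2.sinaimg.cn/thumb180/" + "a1B2" * 8 + ".jpg"
    print(matches_image_src(good))
    print(matches_image_src(other_size))
    print(bool(LOOSE_CLASS.fullmatch("_")), bool(STRICT_CLASS.fullmatch("_")))
```

search is used rather than match because the src value may carry a scheme or other prefix before the host name.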

This is a simple crawler: it does nothing about IP restrictions and has no special handling for POST or GET requests.
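
For readers who want to go one step further, here is a minimal sketch of how the standard library could send a custom User-Agent and, optionally, route requests through a proxy. It is not part of the script above. The function name make_opener, the agent string, and the proxy address are placeholders chosen for illustration, and reading "IP handling" as "use a proxy" is an assumption.

```python
import urllib.request


def make_opener(user_agent, proxy=None):
    """Build an opener that sends user_agent and optionally uses an HTTP proxy."""
    handlers = []
    if proxy is not None:
        # ProxyHandler takes a mapping from URL scheme to proxy address.
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    opener = urllib.request.build_opener(*handlers)
    # addheaders is a list of (name, value) pairs sent with every request.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


if __name__ == "__main__":
    # Placeholder values; no request is sent here.
    opener = make_opener("example-crawler/0.1", proxy="http://127.0.0.1:8080")
    print(opener.addheaders)
```

With an opener like this, opener.open(url) could stand in for the urlopen(url) call in get_html. Whether the site needs either measure is not something this post establishes.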

When I wrote this I had not worked out how the page is structured, so I spent a whole morning failing to grab the right images!!!
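
One way to shorten that kind of debugging is to dump every img src on a saved copy of the page and look at what is actually there before writing a filter. The sketch below uses only the standard library's html.parser instead of BeautifulSoup; the class name ImgSrcCollector, the function list_img_sources, and the sample HTML are all invented for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def list_img_sources(html_text):
    """Return all img src values found in html_text."""
    collector = ImgSrcCollector()
    collector.feed(html_text)
    return collector.sources


if __name__ == "__main__":
    # Invented markup standing in for a saved page.
    sample = (
        '<div><img src="//example.invalid/avatar.png">'
        '<img src="//example.invalid/photo.jpg" /></div>'
    )
    for src in list_img_sources(sample):
        print(src)
```

Seeing the full list makes it easier to tell avatars and thumbnails apart from the images the crawler is meant to save.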
