[Notes] A First Look at Web Scraping with Python

Latest recommended article published 2021-01-11 21:48:57

moshlwx

Category: Python. Tags: python, web scraping

Article link: https://blog.csdn.net/moshlwx/article/details/53455647

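With a free weekend, I tried downloading the images from a website with Python, and it turned out to be fun. Here is a summary of the problems I ran into.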

Workflow

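As I understand it, scraping a web page breaks down into these steps:

Open the page in a browser and press F12 to inspect the source
Fetch the page source with urllib.request
Parse the HTML with HTMLParser
Act on the extracted information (in this exercise: collect the image URLs and download the images locally)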
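
The fetch, parse, and extract steps above can be sketched as follows. This is a minimal illustration rather than the script from this post: the names ImageCollector, fetch_html and extract_image_urls, the utf-8 default, and the example.com URLs are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib import request


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img/> also end up here, because the
        # default handle_startendtag() calls handle_starttag().
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.image_urls.append(src)


def fetch_html(url, encoding='utf-8'):
    """Download a page and decode it (the encoding default is an assumption)."""
    with request.urlopen(url, timeout=10) as response:
        return response.read().decode(encoding, errors='replace')


def extract_image_urls(html):
    """Parse an HTML string and return the image URLs found in it."""
    parser = ImageCollector()
    parser.feed(html)
    return parser.image_urls


# Offline demonstration on a hand-written snippet (hypothetical URLs).
sample = ('<div id="picture">'
          '<img alt="a" src="http://example.com/1.jpg"/>'
          '<img src="http://example.com/2.jpg">'
          '</div>')
print(extract_image_urls(sample))
```

Keeping the collected URLs on the parser instance, instead of in module-level globals, lets the same class be reused for several pages without resetting shared state.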

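In this exercise, the concrete steps were:

Open the URL with request.urlopen(url)
Decode the page content into a variable with response.read().decode('gbk') (the codec must match the encoding of the page being fetched)
Override the handle_xxx() methods of HTMLParser and parse the main page to get the URLs of the second-level pages
Loop over the second-level pages, opening each one with request.urlopen()
Override HTMLParser again to get the image URLs from each page
Download the images, using os.makedirs() to store them in separate folders.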
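
Because the right codec depends on the page, hard-coding 'gbk' only works for sites that serve GBK. One possible way to make the decoding step less brittle is sketched below. The helper names decode_body and fetch_decoded and the fallback order are assumptions for illustration, not part of the original script.

```python
from urllib import request


def decode_body(raw, declared=None, fallbacks=('utf-8', 'gbk')):
    """Decode raw bytes, trying the declared charset first, then the fallbacks."""
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong codec or unknown codec name: try the next candidate.
            continue
    # Last resort: never fail, but mark undecodable bytes.
    return raw.decode('utf-8', errors='replace')


def fetch_decoded(url):
    """Fetch a page and decode it with the charset from the Content-Type header."""
    with request.urlopen(url, timeout=10) as response:
        declared = response.headers.get_content_charset()  # None if absent
        return decode_body(response.read(), declared)


# Offline check: GBK bytes are not valid UTF-8, so the gbk fallback is used.
print(decode_body('妹子图'.encode('gbk')))
```

The declared charset is tried first on purpose: some byte sequences decode without error under more than one codec, so guessing is only a fallback for pages that do not declare an encoding.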

Problems encountered

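Downloading the images on the second-level pages with the simplest possible fetch returned HTTP 403 Forbidden
- Add headers to the request to imitate a browser
An image that no longer existed returned HTTP 404
- Originally the exception ended the script, so nothing after it ran. Wrapping the download in try...except skips the failed image and carries on
Storing the images in separate folders
- Create the folders with os.makedirs()
The two parsing passes need different HTMLParser behavior
- Write a separate HTMLParser subclass for each pass and override the methods it needs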
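
The first three fixes can be combined in one small download helper, sketched below. The function name save_image, the shortened BROWSER_HEADERS dictionary, and the injectable opener parameter (there only so the function can be exercised without a network) are assumptions for this example.

```python
import os
from urllib import error, request

# A shortened browser-like header set (assumption: a User-Agent alone is
# enough; some servers may check further headers).
BROWSER_HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'}


def save_image(url, folder, name, opener=request.urlopen):
    """Download url into folder/name. Return True on success, False if skipped."""
    # exist_ok=True makes the call safe when the folder already exists.
    os.makedirs(folder, exist_ok=True)
    req = request.Request(url, headers=BROWSER_HEADERS)
    try:
        with opener(req, timeout=30) as response:
            data = response.read()
    except error.URLError as exc:
        # HTTPError (403, 404, ...) is a subclass of URLError, so a missing
        # image is reported and skipped instead of stopping the whole run.
        print('skipped %s: %s' % (url, exc))
        return False
    with open(os.path.join(folder, name), 'wb') as image_file:
        image_file.write(data)
    return True
```

A call such as save_image(image_url, 'album-name', '001.jpg') would create the folder on first use. Catching only URLError, rather than using a bare except, keeps programming errors visible while still skipping dead links.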

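Follow-up work

I picked up scraping on a whim over two days, so I am not sure my workflow is right; I should study well-written code from others
The script is not general: every site has to be analyzed by hand first, and I do not know whether a general-purpose approach exists
I need to learn how to handle the anti-scraping measures some sites use
Large volumes of data call for multithreading and distributed processing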
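
For the multithreading item, one common starting point in the standard library is concurrent.futures.ThreadPoolExecutor. The sketch below is only an assumption about how that could look here: run_in_threads and its worker argument are hypothetical names, and the worker is expected to handle its own errors, since an exception raised inside it would propagate when the results are collected.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_threads(urls, worker, max_workers=4):
    """Call worker(url) for every URL on a thread pool; return {url: result}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        results = list(pool.map(worker, urls))
    return dict(zip(urls, results))


# Offline demonstration with a stand-in worker instead of a real download.
print(run_in_threads(['a', 'bb', 'ccc'], len))
```

Threads suit this kind of job because the time is spent waiting on the network rather than on computation. Distributed crawling is a separate topic that this sketch does not cover.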

Code

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
Download the images from meizitu.com
__author__ = 'lwx'
"""

from urllib import request, error
from html.parser import HTMLParser
import os

is_picture = False
is_title = False
pict_dict = {}
pict_dir = []
html_src = []


hdr = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                  'AppleWebKit/537.11 (KHTML, like Gecko) '
                  'Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'}


class MainHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Check that attrs is non-empty before indexing it: a bare <div>
        # would otherwise raise IndexError.
        if tag == 'div' and attrs and attrs[0][1] == 'picture':
            global is_picture
            is_picture = True
        if is_picture and tag == 'a':
            # print(tag, attrs)
            html_src.append(attrs[0][1])

    def handle_endtag(self, tag):
        global is_picture, is_title
        is_picture = False


class PageHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # print('tag: <%s>, attrs: %s' % (tag, attrs))
        # pass
        # Check that attrs is non-empty before indexing it.
        if tag == 'div' and attrs and attrs[0][1] == 'picture':
            global is_picture
            is_picture = True
        if tag == 'title':
            global is_title
            is_title = True

    def handle_endtag(self, tag):
        global is_picture, is_title
        is_picture = False
        is_title = False

    def handle_startendtag(self, tag, attrs):
        # print('<%s>, %s' % (tag, attrs))
        if is_picture and tag == 'img':
            global pict_dict
            pict_dict[attrs[0][1]] = attrs[1][1]
            # global is_picture
            # is_picture = 0
            # print(dict)

    def handle_data(self, data):
        # print(data)
        if is_title:
            global pict_dir
            pict_dir = data.split()[0]


def get_html(url):
    # Fetch a page and decode it. The codec must match the page's encoding
    # (GBK for this site when the script was written).
    response = request.urlopen(url, timeout=10)
    # If the page itself answers HTTP 403, send browser-like headers instead:
    # req = request.Request(url, data=None, headers=hdr)
    # response = request.urlopen(req, timeout=10)
    html = response.read().decode('gbk')

    return html


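def download_pict(pict_dir, pict_name, pict_url):
    # Send browser-like headers; a plain request was answered with HTTP 403.
    req = request.Request(pict_url, data=None, headers=hdr)
    # req.add_header('Referer', pict_url)
    print(pict_url)
    try:
        response = request.urlopen(req, timeout=50)
        # os.path.join replaces the invalid '\%' escape and works on any OS.
        with open(os.path.join(pict_dir, '%s.jpg' % pict_name), 'wb') as pict:
            pict.write(response.read())
    except (error.URLError, OSError) as e:
        # Skip an image that fails (e.g. HTTP 404) so the rest still download.
        print('skipped %s: %s' % (pict_url, e))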


def download_son_html(url):
    # url = 'http://www.meizitu.com/a/5478.html'

    html = get_html(url)
    parser = PageHTMLParser()
    parser.feed(html)
    # print(pict_dict)
    global pict_dict

    if not os.path.exists(pict_dir):
        os.makedirs(pict_dir)
    for pict_name in pict_dict:
        download_pict(pict_dir, pict_name, pict_dict[pict_name])
    pict_dict = {}


def find_html():
    url = 'http://www.meizitu.com'
    html = get_html(url)
    parser = MainHTMLParser()
    parser.feed(html)


if __name__ == '__main__':
    find_html()
    # print(html_src)
    for src in html_src:
        print(src)
        download_son_html(src)