Writing a Web Crawler in Python, from Scratch (1): Writing Your First Web Crawler

weixin_34405925

Original link: http://www.cnblogs.com/mrruning/p/7638377.html

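This article starts from the simplest possible crawler and improves it step by step by adding download error detection, a user agent, and proxy support.
First, a note on how to use the code: it targets Python 2.7, and you can run it either from the command line or from PyCharm. Each step defines a function (download1 through download5), and you fetch a page by calling that function with a URL.
For example:

        download1("http://www.baidu.com")

        download2("http://www.baidu.com")

        download3("http://www.baidu.com")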




1. The first and simplest web crawler, in three lines of code

import urllib2
import urlparse


def download1(url):
    """Simple downloader"""
    return urllib2.urlopen(url).read()
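
urllib2 exists only in Python 2; in Python 3 its contents moved to urllib.request and urllib.error. As a minimal sketch, assuming Python 3 (which this article does not use), the same downloader might look like the following. In Python 3, read() returns bytes, so the helper download1_text and its UTF-8 default are hypothetical additions for illustration, not part of the original code.

```python
from urllib.request import urlopen


def download1(url):
    """Simple downloader: return the raw response body as bytes."""
    # The with-block closes the connection once the body has been read.
    with urlopen(url) as response:
        return response.read()


def download1_text(url, encoding="utf-8"):
    """Hypothetical helper: decode the body, assuming the page is UTF-8."""
    # Bytes that are not valid in the assumed encoding are replaced, not raised.
    return download1(url).decode(encoding, errors="replace")
```

Calling download1("http://www.baidu.com") would then return the page as bytes, and download1_text would return it as a string.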

2. An upgrade: a crawler that handles download errors

def download2(url):
    """Download function that catches errors"""
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
    return html
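
Catching URLError covers two kinds of failure: connection problems (URLError itself) and HTTP status errors (HTTPError, which is a subclass of URLError and additionally carries a code attribute). The sketch below illustrates the difference, assuming Python 3, where both classes live in urllib.error. The helper describe_error is hypothetical, and the two errors are built by hand so that the example runs without network access.

```python
from urllib.error import HTTPError, URLError


def describe_error(error):
    """Hypothetical helper: summarize a download error in one line."""
    # Check HTTPError first, because it is a subclass of URLError.
    if isinstance(error, HTTPError):
        return "HTTP error %d: %s" % (error.code, error.reason)
    if isinstance(error, URLError):
        return "Connection error: %s" % (error.reason,)
    return "Unexpected error: %r" % (error,)


# Hand-built errors, so the example needs no network access.
not_found = HTTPError("http://example.com/missing", 404, "Not Found", None, None)
unreachable = URLError("name resolution failed")

print(describe_error(not_found))
print(describe_error(unreachable))
```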
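3. 5xx errors generally originate on the server side. Add a check to the crawler so that when the error code is at least 500 and below 600, the download is retried up to 2 more times: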

def download3(url, num_retries=2):
    """Download function that also retries 5XX errors"""
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5XX HTTP errors
                html = download3(url, num_retries-1)
    return html
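
To see the retry behaviour without a real failing server, the logic can be exercised against a stand-in opener. This is only a sketch, assuming Python 3; the opener parameter and the FlakyOpener class are hypothetical test hooks that the original code does not have.

```python
import io
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def download3(url, num_retries=2, opener=urlopen):
    """Download a URL, retrying on 5XX errors.

    `opener` is a hypothetical hook so the retry logic can run offline.
    """
    print("Downloading:", url)
    try:
        with opener(url) as response:
            return response.read()
    except URLError as e:
        print("Download error:", e.reason)
        # Only HTTPError has a status code; retry server-side failures only.
        if num_retries > 0 and 500 <= getattr(e, "code", 0) < 600:
            return download3(url, num_retries - 1, opener)
        return None


class FlakyOpener:
    """Hypothetical test double: raises an HTTP error N times, then succeeds."""

    def __init__(self, failures, code=503):
        self.failures = failures
        self.code = code
        self.calls = 0

    def __call__(self, url):
        self.calls += 1
        if self.calls <= self.failures:
            raise HTTPError(url, self.code, "Simulated failure", None, None)
        return io.BytesIO(b"<html>ok</html>")


flaky = FlakyOpener(failures=2)
html = download3("http://example.com/", num_retries=2, opener=flaky)
print(flaky.calls, html)
```

With two simulated 503 responses and num_retries=2, the third attempt succeeds. A 4xx error such as 404 is not retried, because repeating the same request is unlikely to change the result.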

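4. Setting a user agent
By default, urllib2 identifies itself with the user agent Python-urllib/2.7, and some websites block that default. Here we set a user agent named "wswp" instead.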

def download4(url, user_agent='wswp', num_retries=2):
    """Download function that includes user agent support"""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5XX HTTP errors
                html = download4(url, user_agent, num_retries-1)
    return html
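
The effect of the headers argument can be checked without sending anything over the network. The sketch below assumes Python 3 (urllib.request instead of urllib2), and the helper name make_request is hypothetical. It prints the default user agent that the opener would send and the custom one attached to the request.

```python
from urllib.request import Request, build_opener


def make_request(url, user_agent="wswp"):
    """Build a request that carries a custom User-agent header."""
    return Request(url, headers={"User-agent": user_agent})


# The default opener announces itself as "Python-urllib/<version>".
default_agent = dict(build_opener().addheaders)["User-agent"]
print("Default user agent:", default_agent)

# A header set on the request takes precedence over the opener's default.
request = make_request("http://example.com/")
print("Custom user agent:", request.get_header("User-agent"))
```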

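5. Proxy support
Sometimes we need to access a website through a proxy. For example, Netflix has blocked access from most countries outside the United States. We implement proxy support with urllib2's ProxyHandler.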

import urllib2
import urlparse

def download5(url, user_agent='wswp', proxy=None, num_retries=2):
    """Download function with support for proxies"""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    opener = urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        html = opener.open(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5XX HTTP errors
                html = download5(url, user_agent, proxy, num_retries-1)
    return html
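
The proxy setup comes down to two steps: map the URL's scheme to the proxy address, then hand that mapping to a ProxyHandler. Here is a minimal sketch of just that part, assuming Python 3 (urllib.parse and urllib.request instead of urlparse and urllib2). The helper names and the proxy address 127.0.0.1:8080 are hypothetical and used only for illustration.

```python
from urllib.parse import urlparse
from urllib.request import ProxyHandler, build_opener


def proxy_params_for(url, proxy):
    """Map the URL's scheme (for example "http") to the proxy address."""
    return {urlparse(url).scheme: proxy}


def make_opener(url, proxy=None):
    """Return an opener that routes requests through `proxy` when given."""
    if proxy:
        # Passing the handler to build_opener replaces the default ProxyHandler.
        return build_opener(ProxyHandler(proxy_params_for(url, proxy)))
    return build_opener()


# Hypothetical proxy address, for illustration only.
params = proxy_params_for("http://example.com/page", "http://127.0.0.1:8080")
print(params)
opener = make_opener("http://example.com/page", "http://127.0.0.1:8080")
```

Because the mapping is keyed by scheme, a proxy configured for an http URL is not applied to https requests made through the same opener.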

Reposted from: https://www.cnblogs.com/mrruning/p/7638377.html
