Python 3 Web Crawler Study Notes (1)

coura

Categories: crawler, python. Tags: python, crawler, url

Article link: https://blog.csdn.net/coura/article/details/60466614

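I have recently been learning to write web crawlers in Python.

The book I bought, like most material online, still targets Python 2.

Some libraries changed in Python 3, so I am noting the differences in this blog.

Crawler, version 1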

import urllib.request
def download(url):
    return urllib.request.urlopen(url).read()

txt = download('https://www.baidu.com')
print(txt.decode()) #default parameter is 'utf-8'

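This is the first version, so it does very little.

It fetches the page directly with urlopen() from urllib.request; here the target is Baidu.

The returned bytes are then decoded into text.

Two things to note:

1. Python 2's urllib2 became urllib.request in Python 3 (with the exception classes in urllib.error); there is no urllib2.request.

2. The code above uses the default decoding, which only works for pages encoded in UTF-8. A page in a national-standard encoding such as GB2312 will likely fail with a UnicodeDecodeError.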
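To make the second point concrete, here is a small offline sketch. The sample string and the helper name decode_page_bytes are made up for illustration; the bytes simply stand in for the body of a GB2312 page.

```python
# Illustrative only: the sample text stands in for a downloaded page body.
sample_text = '中文网页'
gb_bytes = sample_text.encode('gb2312')   # what a GB2312 page would send


def decode_page_bytes(raw, encoding='utf-8'):
    """Decode raw bytes, returning None if the encoding does not match."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


print(decode_page_bytes(gb_bytes))             # default utf-8 cannot decode it
print(decode_page_bytes(gb_bytes, 'gb2312'))   # the explicit encoding can
```

Passing the encoding explicitly works when it is known in advance; detecting it automatically is a separate problem.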

Version 2

import urllib.error
import urllib.request

def download(url, num_retries=2):
    """Download a URL and return the raw bytes, or None on failure."""
    print('downloading:', url)
    try:
        html = urllib.request.urlopen(url).read()
    except urllib.error.URLError as e:
        print('download error:', e.reason)
        html = None
        # Retry only on 5xx errors; HTTPError (a URLError subclass) carries .code
        if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            return download(url, num_retries - 1)
    return html

page = download('http://httpstat.us/500')
if page is not None:
    print(page.decode())
else:
    print('Receive None')

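Version 2 adds exception handling to version 1.

The most common error when fetching pages is 404, which means the page does not currently exist.

4xx errors occur when something is wrong with the request.

5xx errors occur when something is wrong on the server side.

So for 5xx errors, retrying the download is a reasonable response.

With num_retries=2, the code above retries a 5xx error up to two times (three attempts in total) and then gives up.

The example URL is a test endpoint (hosted in the US) that is meant to answer with status 500, so this request will basically always fail.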
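The retry rule can also be written as a loop and checked without touching the network. This is only a sketch: is_retryable, download_with_retries, and the fetch parameter are hypothetical names, and fetch stands in for urllib.request.urlopen(url).read().

```python
import urllib.error


def is_retryable(error):
    """Return True only for HTTP 5xx errors (server-side problems)."""
    code = getattr(error, 'code', None)
    return code is not None and 500 <= code < 600


def download_with_retries(url, fetch, num_retries=2):
    """Call fetch(url) up to num_retries + 1 times; return bytes or None.

    `fetch` is a hypothetical stand-in for urllib.request.urlopen(url).read().
    """
    for _ in range(num_retries + 1):
        try:
            return fetch(url)
        except urllib.error.URLError as e:
            if not is_retryable(e):
                break   # 4xx or network error: retrying will not help
    return None


# Fake fetcher that always answers 500, so the demo needs no network access.
attempts = []


def always_500(url):
    attempts.append(url)
    raise urllib.error.HTTPError(url, 500, 'Internal Server Error', None, None)


result = download_with_retries('http://example.com/', always_500)
print(result, len(attempts))
```

A loop avoids growing the call stack with each retry; the recursive form in the post behaves the same for small retry counts.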

Version 3

# Automatically detect the page encoding and decode with it
import chardet
import urllib.error
import urllib.request

def download(url, num_retries=2):
    print('downloading:', url)
    try:
        html = urllib.request.urlopen(url).read()
    except urllib.error.URLError as e:
        print('download error:', e.reason)
        html = None
        # Retry only on 5xx (server-side) errors
        if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            return download(url, num_retries - 1)
    return html

page = download('http://www.sdu.edu.cn')
if page is not None:
    charset_info = chardet.detect(page)     # detect the text encoding
    print(charset_info)
    # chardet may report None for the encoding; fall back to utf-8 in that case
    print(page.decode(charset_info['encoding'] or 'utf-8', 'ignore'))
else:
    print('Receive None')

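This version adds automatic detection of the page encoding.

It uses the detect() function provided by the chardet module to work out the page's encoding,

and then decodes with that encoding.

Note that detection can take a while when a page is large.

In that case it is enough to run detection on only part of the content: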

chardet.detect(page[:500])
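A page often declares its own encoding, either in the Content-Type header or in a meta tag near the top of the HTML, and that can be checked cheaply before any statistical detection. The sketch below is an alternative to chardet rather than part of the original crawler; declared_charset is a hypothetical helper, and in a real download the header value would come from something like response.headers.get('Content-Type').

```python
import re
from email.message import Message


def declared_charset(content_type, body, default='utf-8'):
    """Return the charset a page declares about itself, or `default`."""
    # 1. charset parameter of the Content-Type header, if present.
    if content_type:
        msg = Message()
        msg['Content-Type'] = content_type
        charset = msg.get_content_charset()
        if charset:
            return charset
    # 2. charset=... inside a <meta> tag within the first 1024 bytes.
    match = re.search(rb'charset=["\']?([\w-]+)', body[:1024], re.IGNORECASE)
    if match:
        return match.group(1).decode('ascii').lower()
    return default


# Offline demo: a tiny GB2312 page that declares its encoding in a meta tag.
html = '<meta charset="gb2312"><p>中文</p>'.encode('gb2312')
enc = declared_charset('text/html', html)
print(enc, html.decode(enc, 'ignore'))
```

Declared encodings are sometimes wrong or missing, so byte-level detection is still useful as a fallback.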

Version 4

# Send a custom user agent with the request
# Automatically detect the page encoding and decode with it
import chardet
import urllib.error
import urllib.request

def download(url, user_agent='wswp', num_retries=2):
    print('downloading:', url)
    headers = {'User-agent': user_agent}
    request = urllib.request.Request(url, headers=headers)
    try:
        response = urllib.request.urlopen(request)
        html = response.read()
    except urllib.error.URLError as e:
        print('download error:', e.reason)
        html = None
        if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            # Pass user_agent through; passing num_retries - 1 positionally
            # would land in the user_agent slot and never count down
            return download(url, user_agent, num_retries - 1)
    return html

def decode_page(page):
    if page is not None:
        charset_info = chardet.detect(page[:500])  # detect the text encoding
        # chardet may report None for the encoding; fall back to utf-8
        charset = charset_info['encoding'] or 'utf-8'
        return page.decode(charset, 'ignore')
    else:
        return 'None Page'

page = download('http://www.nju.edu.cn')
txt = decode_page(page)

print(txt)

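This revision adds a user agent. Many sites restrict crawlers, so a crawler often has to disguise itself as a browser.

A Mozilla-style user agent string is a common disguise. This version, of course, is only a demonstration.

This time the target is the Nanjing University website.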
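For reference, without a custom header urllib identifies itself with a string of the form Python-urllib/3.x, which is easy for a site to filter. The sketch below builds a request with a browser-like user agent and inspects it without sending anything. BROWSER_UA is an assumed example value and build_request is a hypothetical helper name.

```python
import urllib.request

# Assumed example value: any browser-like string would do here.
BROWSER_UA = ('Mozilla/5.0 (X11; Linux x86_64) '
              'AppleWebKit/537.36 (KHTML, like Gecko)')


def build_request(url, user_agent=BROWSER_UA):
    """Build a Request carrying a custom User-Agent header (no network I/O)."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


request = build_request('http://www.nju.edu.cn')
# urllib stores header names capitalized, so the key becomes 'User-agent'.
print(request.get_header('User-agent'))
```

The resulting object can be passed to urllib.request.urlopen() exactly as in the download function above.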

The later versions are in the next blog post.
