# Python Web Scraping Quick Start

全栈程序员

Article link: https://blog.csdn.net/qq_37248504/article/details/106892494

Web crawlers

Concept

Tools

Crawlers can be written in many languages, including Java, Python, and C++. This article uses Python.

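The urllib library

Introduction to urllib

Python's built-in urllib library can imitate a browser: it sends requests to a server and retrieves the content of web pages.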

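Main modules of the library

urllib.error: the exception classes raised by urllib.request
urllib.parse: parses URLs
urllib.request: an extensible library for opening URLs over various protocols
urllib.response: the response classes used by urllib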
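
To make the role of urllib.parse more concrete, here is a minimal sketch that splits a URL into its components and decodes its query string. The URL and its query parameters are made up for illustration and are not taken from the original article.

```python
from urllib.parse import urlparse, parse_qs, urlencode

# Hypothetical URL used only for illustration
url = "http://www.baidu.com/s?wd=python&pn=10"

parts = urlparse(url)
print(parts.scheme)   # http
print(parts.netloc)   # www.baidu.com
print(parts.path)     # /s
print(parts.query)    # wd=python&pn=10

# parse_qs turns the query string into a dict of lists
params = parse_qs(parts.query)
print(params)         # {'wd': ['python'], 'pn': ['10']}

# urlencode goes the other way and percent-encodes non-ASCII values as UTF-8
query = urlencode({"wd": "爬虫"})
print(query)          # wd=%E7%88%AC%E8%99%AB
```

Building query strings with urlencode rather than by hand avoids broken URLs when a search term contains Chinese characters or spaces.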

Using urllib.request

Using the request.urlopen() method

# Import urllib.request, which sends the request to the server
import urllib.request
response = urllib.request.urlopen("http://www.baidu.com")
result = response.read()
print(result)

result holds the content of the page at the requested address, as bytes.

The signature of urlopen() in the source code is as follows

def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
            *, cafile=None, capath=None, cadefault=False, context=None):

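url: the address to request (a string or a Request object)

data: extra data to send to the server, such as login form fields. It must be bytes; when data is given, the request is sent as POST instead of GET.

timeout: the timeout in seconds

cafile: a file of CA certificates used for HTTPS requests (rarely needed; deprecated since Python 3.6 and removed in Python 3.13 in favor of context)

capath: a directory of CA certificates used for HTTPS requests (rarely needed; deprecated and removed together with cafile)

context: an ssl.SSLContext that configures the SSL/TLS connection (rarely needed; the default is usually fine)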

response.info(): returns the HTTP headers of the response
response.getcode(): returns the HTTP status code of the response
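
The two response methods above can be tried out with the sketch below. So that it runs without Internet access, it starts a throwaway local HTTP server from the standard library; the handler class, the page body, and the choice of port are assumptions made for this example and are not part of the original article.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that always answers with a tiny HTML page."""
    def do_GET(self):
        page = b"<html>hello</html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(page)))
        self.end_headers()
        self.wfile.write(page)

    def log_message(self, *args):
        pass  # keep the example output quiet

# Port 0 asks the operating system for any free port
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

# timeout is in seconds; the with block closes the response when done
with urllib.request.urlopen(url, timeout=5) as response:
    status = response.getcode()                        # HTTP status code
    content_type = response.info().get("Content-Type") # one response header
    body = response.read()                             # page content as bytes

server.shutdown()
server.server_close()
print(status, content_type, body)
```

Against a real site only the urlopen block is needed; the server part exists purely to make the example self-contained.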

The Request class

The initializer in the source code

def __init__(self, url, data=None, headers={},
                 origin_req_host=None, unverifiable=False,
                 method=None):

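url and data have the same meaning as in urlopen above.
headers holds the HTTP request headers, such as User-Agent. Setting it lets the crawler present itself as a browser, which gets past simple anti-crawler checks that reject urllib's default User-Agent.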
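
As a rough illustration of how data and headers end up on a Request object, the sketch below builds a request and inspects it without sending anything. The endpoint and the form fields are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and form fields, for illustration only
form = {"username": "alice", "password": "secret"}
data = urlencode(form).encode("utf-8")   # data must be bytes, not str

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:57.0) "
                  "Gecko/20100101 Firefox/57.0"
}
req = Request("http://example.com/login", data=data, headers=headers)

print(req.get_method())               # POST, because data is not None
print(req.data)                       # b'username=alice&password=secret'
print(req.get_header("User-agent"))   # Request stores header names capitalized
```

Note that the header name is spelled User-Agent with a hyphen; a key such as User_Agent is sent as a different, meaningless header and the server still sees urllib's default User-Agent.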

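Using Request

The result is the same as with urlopen earlier. As mentioned above, urlopen accepts either a URL string or a Request object. Put your own browser's User-Agent string between the quotes ''.

import urllib.request
headers = {'User-Agent': ''}  # fill in your browser's User-Agent string
request = urllib.request.Request('http://www.baidu.com', headers=headers)
html = urllib.request.urlopen(request)
result = html.read()
print(result)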

Using urllib.error

URLError
HTTPError
The following code uses a try...except structure to fetch a page. If a URLError occurs, it prints the reason; if an HTTPError occurs, it prints the status code. HTTPError is a subclass of URLError, so its except clause must come first.

import urllib.request
import urllib.error
try:
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0'}
    request = urllib.request.Request('http://blog.csdn.net/qq_37248504/article/details/106891181', headers=headers)
    html = urllib.request.urlopen(request)
    result = html.read()
    print(result)
except urllib.error.HTTPError as e:
    # The server answered, but with an error status such as 404 or 500
    print('HTTP status code: ' + str(e.code))
except urllib.error.URLError as e:
    # The server could not be reached at all (DNS failure, refused connection, ...)
    print('Reason for failure: ' + str(e.reason))
else:
    print('Request succeeded!')
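
The relationship between the two exception classes can be checked without any network access. In the sketch below the exceptions are constructed by hand, and describe_error is a hypothetical helper written for this example, not a urllib function.

```python
import urllib.error

def describe_error(exc):
    """Return a short message for an exception raised by urlopen."""
    # Test for HTTPError first: it is a subclass of URLError
    if isinstance(exc, urllib.error.HTTPError):
        return "HTTP status code: %d" % exc.code
    if isinstance(exc, urllib.error.URLError):
        return "Reason: %s" % exc.reason
    return "Unexpected error: %r" % exc

# Exceptions built by hand so that the example runs offline
http_error = urllib.error.HTTPError(
    "http://example.com/missing", 404, "Not Found", {}, None)
url_error = urllib.error.URLError("connection refused")

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True
print(describe_error(http_error))  # HTTP status code: 404
print(describe_error(url_error))   # Reason: connection refused
```

If the URLError check came first, it would also catch every HTTPError and the status code branch would never run.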

Regular expressions

Concept

A regular expression is used to search for and replace text that matches a given pattern (rule).

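Main idea for crawling

First define a rule using regular expression syntax.
Then compare the rule against the page and extract the content that matches.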

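Regular expression examples

For example, suppose a string contains the email address lidonglidong@qq.com.

pattern: \w+@\w+\.com
pattern that also allows an optional subdomain: \w+@(\w+\.)?\w+\.com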
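
The two patterns can be compared with a short sketch. The second address, user@mail.example.com, is a made-up example that was not in the original text.

```python
import re

simple = r"\w+@\w+\.com"
with_subdomain = r"\w+@(\w+\.)?\w+\.com"

text = "Contact: lidonglidong@qq.com"
print(re.search(simple, text).group())   # lidonglidong@qq.com

# Hypothetical address with an extra subdomain level
address = "user@mail.example.com"
print(re.fullmatch(simple, address))                  # None
print(re.fullmatch(with_subdomain, address).group())  # user@mail.example.com
```

Raw strings (the r prefix) keep backslashes such as \w and \. from being interpreted by Python before the re module sees them.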

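Core functions of the re module

The compile() function

Purpose: compiles a pattern and returns a regular expression object

The match() function

Purpose: matches the pattern against the beginning of the string; returns a match object on success, otherwise None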
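
A small sketch of compile() and match(), with search() shown alongside for contrast. The sample strings are invented for the example.

```python
import re

# Compile once, reuse many times
digits = re.compile(r"\d+")

# match() only succeeds if the pattern matches at the very start
print(digits.match("123abc").group())   # 123
print(digits.match("abc123"))           # None

# search() scans the whole string for the first match
print(digits.search("abc123").group())  # 123
```

When scraping, search() or findall() is usually what is wanted, because the interesting text rarely sits at the very beginning of the page.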

A simple example: downloading images

strip(): removes whitespace from both ends of a string

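The `urlretrieve` method

Downloads the content of a remote file. Its signature in the source code is:

def urlretrieve(url, filename=None, reporthook=None, data=None):
# url: address of the remote file
# filename: local path to save the file to
# reporthook: callback that can be used to display the download progress
# data: data to POST to the server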
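
The reporthook callback can be tried without a network connection by retrieving a local file through a file:// URL. This is only a sketch: the temporary file, its size, and the report function are assumptions made for the example, and a real crawler would pass an http(s) URL instead.

```python
import tempfile
import urllib.request
from pathlib import Path

progress = []  # percentages reported by the callback

def report(block_num, block_size, total_size):
    """reporthook: called once before the transfer and once after each block."""
    if total_size > 0:  # total_size is -1 when the size is unknown
        percent = min(100.0, 100.0 * block_num * block_size / total_size)
        progress.append(percent)
        print("Downloaded %.2f%%" % percent)

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.bin"
    target = Path(tmp) / "copy.bin"
    source.write_bytes(b"x" * 20000)

    # as_uri() builds a file:// URL that stands in for a remote address
    filename, headers = urllib.request.urlretrieve(
        source.as_uri(), str(target), report)
    saved_to_target = (filename == str(target))
    copied = target.read_bytes()

print(len(copied))
```

The last block is usually only partly filled, so block_num * block_size can exceed the file size; capping the percentage at 100 keeps the output sensible.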

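The `re.findall` method

Returns all non-overlapping substrings of string that match pattern, as a list. If the pattern contains one capturing group, the list holds the text of that group instead of the whole match.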

def findall(pattern, string, flags=0):
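
As a sketch of how findall extracts image addresses, the example below runs an image pattern over a hand-written HTML fragment. The fragment and its URLs are invented for illustration.

```python
import re

# Hypothetical HTML fragment standing in for a downloaded page
html = (
    '<img src="http://example.com/img/1.jpg" alt="first">'
    '<img src="http://example.com/img/2.jpg" alt="second">'
    '<img src="http://example.com/img/logo.png" alt="logo">'
)

# The parentheses form a group, so findall returns only the URL part
pattern = r'src="(http:.+?\.jpg)" alt='
images = re.findall(pattern, html)
print(images)
# ['http://example.com/img/1.jpg', 'http://example.com/img/2.jpg']
```

The lazy quantifier .+? stops at the first .jpg it can, which prevents one match from swallowing several img tags at once.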

Decoding the page

import chardet
result = getHtml("http://pic.yxdown.com/list/0_74_1.html#")
encode_type = chardet.detect(result)
result = result.decode(encode_type['encoding'])  # decode with the detected encoding and rebind the same name
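
If chardet is not installed, one possible alternative is to try a short list of candidate encodings in order. This is a different technique from the detection used above, and both the decode_bytes helper and the candidate list (utf-8, then gbk, on the assumption that the page is a Chinese site) are hypothetical choices for this sketch.

```python
def decode_bytes(raw, candidates=("utf-8", "gbk")):
    """Decode raw bytes with the first candidate encoding that works.

    Returns a (text, encoding) tuple.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: never raises, but replaces undecodable bytes with U+FFFD
    return raw.decode("utf-8", errors="replace"), "utf-8"

gbk_page = "中文网页".encode("gbk")
utf8_page = "中文网页".encode("utf-8")

print(decode_bytes(gbk_page))    # ('中文网页', 'gbk')
print(decode_bytes(utf8_page))   # ('中文网页', 'utf-8')
```

Trying utf-8 first matters: its byte rules are strict, so text in another encoding almost always fails to decode, whereas gbk accepts many byte sequences that are not really gbk.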

Complete demo

import os
import re
import urllib.error
import urllib.request

import chardet  # third-party package used to detect the page encoding

URL = "http://pic.yxdown.com/list/0_74_1.html#"


# Fetch the page content
def getHtml(url):
    """Return the raw bytes of the page at url, or None if the request fails."""
    print("Requesting page: " + url)
    try:
        html = urllib.request.urlopen(url)
        result = html.read()
    except urllib.error.HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first
        print('HTTP status code: ' + str(e.code))
        return None
    except urllib.error.URLError as e:
        print('Reason for failure: ' + str(e.reason))
        return None
    print('Request succeeded!')
    return result


# Show the download progress
def cbk(a, b, c):
    """Callback for urlretrieve.
    a: number of blocks downloaded so far
    b: size of one block in bytes
    c: total size of the remote file (-1 if unknown)
    """
    if c <= 0:
        return
    per = min(100.0 * a * b / c, 100)
    print("Download progress: %.2f%%" % per)


# Download the images referenced in the page
def getImage(html):
    print("Downloading images...")
    cp = re.compile(r'src="(http:.+?\.jpg)" alt=')
    images = cp.findall(html)
    os.makedirs('./images', exist_ok=True)  # urlretrieve does not create directories
    for x, img in enumerate(images):
        print(x)
        urllib.request.urlretrieve(img, './images/%s.jpg' % x, cbk)
    print("Finished downloading images")
    return images


# Save the page to one.txt, then write a copy without empty lines to two.txt
def clearBlank(result_content, encoding='utf-8'):
    print("Processing text...")
    with open('./one.txt', 'wb') as f:
        f.write(result_content)
    print("Saved ./one.txt")
    # The first file is closed (and flushed) before it is read back
    with open('./one.txt', 'r', encoding=encoding, errors='replace') as file1, \
            open('./two.txt', 'w', encoding=encoding) as file2:
        for line in file1:
            if line != '\n':  # skip empty lines
                file2.write(line)
    print("Finished processing text")


if __name__ == '__main__':
    raw = getHtml(URL)
    if raw is not None:
        # Decode the bytes first; matching a str pattern against bytes raises TypeError
        encode_type = chardet.detect(raw)
        encoding = encode_type['encoding'] or 'utf-8'
        getImage(raw.decode(encoding, errors='replace'))
        clearBlank(raw, encoding)