Web Crawler Study Notes 01

妖妖琳CL

Last modified: 2022-04-27 18:04:53

Category: python爬虫 (Python crawlers). Tags: python

First published: 2022-04-12 16:36:26

Article link: https://blog.csdn.net/CL5221/article/details/124126900

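This article introduces the basic concepts, value, and types of Python web crawlers, including search-engine crawlers, focused crawlers, incremental crawlers, and deep crawlers. It explains how a crawler works, implements GET and POST requests with Python's urllib.request module, and shows how to handle URL parameters. It also covers parsing and saving data and handling errors, with a worked example of fetching a URL that contains Chinese characters.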


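I. HTTP request methods

GET: retrieve a resource
POST: submit data to the server
PUT: create or replace a resource
DELETE: delete a resource
HEAD: like GET, but the server returns only the response headers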
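
As a rough illustration of the methods listed above, `urllib.request.Request` accepts a `method` argument that selects the HTTP method. The sketch below only builds the request objects and sends nothing over the network; the URL `http://example.com/items/1` and the helper `build_request` are placeholders assumed for this example.

```python
import urllib.request

# Hypothetical endpoint used only for illustration; no request is sent.
EXAMPLE_URL = "http://example.com/items/1"


def build_request(method, data=None):
    """Build (but do not send) a Request that uses the given HTTP method."""
    return urllib.request.Request(EXAMPLE_URL, data=data, method=method)


requests_by_method = {
    m: build_request(m) for m in ("GET", "POST", "PUT", "DELETE", "HEAD")
}

for name, req in requests_by_method.items():
    # get_method() reports the HTTP method the request would use
    print(name, req.get_method(), req.full_url)
```

Passing one of these objects to `urllib.request.urlopen` would actually send it.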

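II. Getting started with crawlers

1. Concept: a crawler uses code to simulate a user, sending network requests in bulk and collecting the returned data in bulk.

2. The value of crawlers

(1) Buying and selling data

(2) Data analysis: producing analysis reports

(3) Traffic

(4) Indexes: Alibaba Index (阿里指数), Baidu Index (百度指数)

3. Types of crawlers

(1) General-purpose crawlers used by search engines: Baidu, Google, 360, Yahoo, Sogou

(2) Focused crawlers

(3) Incremental crawlers

(4) Deep crawlers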

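4. How a crawler works

(1) Identify the URL of the target you want to fetch

(2) Send the request with Python code and get the data

(3) Parse the returned data to extract exactly the data you need

If parsing finds new targets (URLs), go back to step (1); this is what automates the crawl

(4) Persist the data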
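
The four steps above can be sketched as a small loop. This is a minimal illustration, not a complete crawler: the function `crawl` and its `fetch`, `parse`, and `save` parameters are hypothetical names introduced here, and the in-memory `FAKE_SITE` stands in for real pages so the example runs without network access.

```python
from collections import deque


def crawl(start_url, fetch, parse, save, max_pages=10):
    """Run the fetch -> parse -> follow-links -> save loop.

    fetch(url) -> str, parse(html) -> (record, new_urls) and
    save(url, record) are caller-supplied callables (hypothetical
    names for this sketch).
    """
    queue = deque([start_url])   # step 1: URLs waiting to be fetched
    seen = {start_url}           # avoid fetching the same URL twice
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        html = fetch(url)                 # step 2: send the request
        record, new_urls = parse(html)    # step 3: extract data and links
        for new_url in new_urls:          # new targets go back to step 1
            if new_url not in seen:
                seen.add(new_url)
                queue.append(new_url)
        save(url, record)                 # step 4: persist the data
        pages += 1
    return pages


# Tiny in-memory "site" so the sketch runs without network access.
FAKE_SITE = {
    "a": ("page A", ["b", "c"]),
    "b": ("page B", ["a"]),
    "c": ("page C", []),
}
stored = {}
count = crawl(
    "a",
    fetch=lambda url: url,               # pretend the HTML is just the key
    parse=lambda html: FAKE_SITE[html],
    save=stored.__setitem__,
)
print(count, stored)
```

In a real crawler, `fetch` would wrap `urllib.request.urlopen` and `save` would write to a file or database; the `seen` set is what keeps the loop from revisiting pages that link to each other.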

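Python 3 ships a module for this in its standard library: urllib.request

urlopen: 1) returns a response object 2) response.read() returns bytes 3) bytes.decode("utf-8") converts them to str
GET: passing parameters
POST
Custom handlers
URLError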
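
The POST and URLError items in the list above can be sketched roughly as follows. This is an illustration under stated assumptions: the helpers `build_post_request` and `safe_fetch`, the endpoint `http://example.com/login`, and the form field `user` are all made up for this example. The POST request is built but not sent, and the error path is triggered with a local file URL that does not exist, so no network access is needed.

```python
import urllib.error
import urllib.parse
import urllib.request


def build_post_request(url, form):
    """Build a POST Request; the body must be bytes, not str."""
    body = urllib.parse.urlencode(form).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def safe_fetch(url):
    """Return the decoded body, or None if the URL cannot be opened."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.read().decode("utf-8")
    except urllib.error.URLError as err:
        # HTTPError is a subclass of URLError, so it is caught here too
        print("request failed:", err.reason)
        return None


# Hypothetical endpoint and form field; the request is built but not sent.
post_req = build_post_request("http://example.com/login", {"user": "demo"})
print(post_req.get_method(), post_req.data)

# A file URL that does not exist triggers URLError without using the network.
missing = safe_fetch("file:///no/such/path/for-this-sketch.html")
print(missing)
```

Catching `URLError` around `urlopen` lets a crawler skip an unreachable page instead of stopping the whole run.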

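Example

Fetching the Baidu home page

(1) Send a request to Baidu and get the response object

(2) Read the content of the response object (the page source)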

# Fetch the Baidu home page

# Import the request module (part of the Python standard library)
import urllib.request


def load_data():
    # 1. Send a request to Baidu and get the response object
    url = "http://www.baidu.com/"
    response = urllib.request.urlopen(url)
    print(response)

    # 2. Read the content of the response object (the page source)
    # response.read() returns bytes
    data = response.read()
    print(data)
    # Decode the bytes into a str
    str_data = data.decode("utf-8")
    print(str_data)

    # Save locally: write the data to a file
    with open("01-baidu.html", "w", encoding="utf-8") as f:
        f.write(str_data)

    # Convert a str back to bytes
    str_name = "01-baidu"
    bytes_name = str_name.encode("utf-8")
    print(bytes_name)

    # Data fetched in Python is either str or bytes:
    #   raw = response.read()                    -> bytes
    #   text = response.read().decode("utf-8")   -> str
    # Got bytes but need to write str: use decode("utf-8")
    # Got str but need to write bytes: use encode("utf-8")


load_data()
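
The str/bytes conversion described in the comments above can be tried on its own, without any network request. The sample string `百度一下` is just an illustrative value chosen for this sketch.

```python
text = "百度一下"                   # str: a sequence of Unicode characters
raw = text.encode("utf-8")         # bytes: the form that travels over the network
print(type(raw).__name__, len(text), len(raw))

restored = raw.decode("utf-8")     # back to str, ready for a text-mode file
print(restored == text)

# Decoding with the wrong codec raises UnicodeDecodeError by default;
# errors="replace" substitutes U+FFFD for each undecodable byte instead.
lossy = raw.decode("ascii", errors="replace")
print(lossy)
```

The character count and the byte count differ because each of these Chinese characters takes three bytes in UTF-8, which is why the same codec has to be used in both directions.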

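Fetching with parameters

# Fetching with URL parameters
import urllib.request
import urllib.parse
import string


def get_method_params():
    url = "http://www.baidu.com/s?wd="

    # Append the search keyword (Chinese characters) to the URL.
    # The form that can actually be sent is percent-encoded:
    # https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3
    name = "美女"
    final_url = url + name
    # print(final_url)

    # The URL contains Chinese characters, so it must be percent-encoded
    # before it is sent. safe=string.printable leaves printable ASCII
    # characters (such as :, /, ? and =) unchanged.
    encode_new_url = urllib.parse.quote(final_url, safe=string.printable)
    print(encode_new_url)

    # Send the network request
    response = urllib.request.urlopen(encode_new_url)
    print(response)

    # Read the content (decode() defaults to UTF-8)
    data = response.read().decode()
    print(data)

    # Save locally
    with open("02-encode.html", "w", encoding="utf-8") as f:
        f.write(data)


# Passing final_url to urlopen without encoding it raises:
# UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-11: ordinal not in range(128)
# The HTTP request line is encoded as ASCII (code points 0-127), so a URL
# may not contain raw Chinese characters; they must be percent-encoded first.


get_method_params()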
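
As an alternative to quoting the whole URL, the query string can be built with `urllib.parse.urlencode`, which percent-encodes only the parameter values. The sketch below compares the two approaches offline; `BASE_URL` and the variable names are chosen for this example, and nothing is sent over the network.

```python
import string
import urllib.parse

BASE_URL = "http://www.baidu.com/s"
keyword = "美女"

# Option 1 (as in the code above): quote the whole URL, keeping ASCII intact.
quoted_url = urllib.parse.quote(BASE_URL + "?wd=" + keyword,
                                safe=string.printable)

# Option 2: encode only the query parameters with urlencode.
query = urllib.parse.urlencode({"wd": keyword})
built_url = BASE_URL + "?" + query

print(quoted_url)
print(built_url)

# unquote reverses the percent-encoding.
print(urllib.parse.unquote(query))
```

Both options produce the same URL here. `urlencode` tends to be the safer choice when a value may itself contain characters such as `&` or `=`, because it escapes them instead of leaving them to be read as separators.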