Python Data Analysis: Getting Started with Web Crawlers (0)

MapleSilent

Article link: https://blog.csdn.net/huolingyizhixing/article/details/100657609

1 Basics

1.1 Crawlers

A crawler works on web pages. Put simply, it is a program that pretends to be a browser and visits websites on your behalf.
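As a minimal sketch of what "pretending to be a browser" can look like with the standard library, the snippet below attaches a browser-like User-Agent header to a request. The header string and the helper names `build_request` and `fetch` are illustrative assumptions, not something from the original article; the URL is the one used later in this post.

```python
from urllib.request import Request, urlopen

# Hypothetical browser-like User-Agent string, chosen only for illustration.
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_request(url: str) -> Request:
    """Return a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": BROWSER_UA})


def fetch(url: str) -> bytes:
    """Download the raw bytes of a page (needs network access)."""
    with urlopen(build_request(url), timeout=10) as response:
        return response.read()


req = build_request("http://www.hao123.com/")
# urllib stores header names in capitalized form, hence "User-agent".
print(req.get_header("User-agent"))
```

Calling `fetch("http://www.hao123.com/")` would perform the actual download; it is left out of the runnable part so the sketch works offline.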

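1.2 Encoding and decoding

Crawling involves more than one data type, so here is a short explanation.

First, in Python 3, text data comes in two types: byte strings (bytes, raw binary) and strings (str).

Second, a[index] gives the character you expect only when a is a string, not a byte string. Indexing a byte string returns a single byte value, and a character may span several bytes, so the indexes will not line up.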
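The indexing difference is easy to check directly. A small sketch, assuming UTF-8 as the encoding (the variable names are made up for the example):

```python
text = "你好"                # str: a sequence of characters
raw = text.encode("utf-8")   # bytes: each of these characters takes 3 bytes

print(text[0])               # the first character
print(raw[0])                # an integer byte value, not a character
print(raw[0:3])              # the three bytes that together encode the first character
print(len(text), len(raw))   # number of characters vs number of bytes
```

Indexing `raw` never yields a character: position 1 of the string corresponds to bytes 3 through 5 of the byte string.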

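Third, the conversion works like this: bytes --decode--> str, and str --encode--> bytes.

Fourth, in Python a string (str) corresponds to Unicode, while a byte string (bytes) corresponds to a concrete encoding such as utf8 or gbk.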
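To make the str/bytes relationship concrete, the following sketch encodes one string with two different codecs and decodes it back. The sample text and variable names are assumptions for illustration.

```python
text = "中文"

utf8_bytes = text.encode("utf-8")  # str --encode--> bytes (UTF-8)
gbk_bytes = text.encode("gbk")     # same str, different byte representation

print(utf8_bytes, len(utf8_bytes))
print(gbk_bytes, len(gbk_bytes))

# bytes --decode--> str: decoding with the matching codec recovers the text.
roundtrip_utf8 = utf8_bytes.decode("utf-8")
roundtrip_gbk = gbk_bytes.decode("gbk")

# Decoding with the wrong codec fails (or, in other cases, yields garbage).
mismatch_failed = False
try:
    gbk_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    mismatch_failed = True
    print("wrong codec:", exc.reason)
```

The same two characters become 6 bytes in UTF-8 and 4 bytes in GBK, which is why the codec used for decoding has to match the one used for encoding.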

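(Note: in UTF-16, each character in the Basic Multilingual Plane takes 2 bytes, so for such text the byte count divided by 2 gives the string length. Characters outside that range, such as most emoji, take 4 bytes. On a str, len() returns the character count directly.)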
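The "divide the byte count by 2" rule can be checked with the UTF-16 codec. This is a sketch under the assumption that UTF-16 is the 2-byte encoding the note has in mind; the `utf-16-le` variant is used so no byte-order mark is prepended.

```python
bmp_text = "你好abc"                       # every character is in the Basic Multilingual Plane
utf16_bytes = bmp_text.encode("utf-16-le")  # 2 bytes per character, no BOM

print(len(utf16_bytes) // 2, len(bmp_text))  # the two numbers agree

emoji = "\U0001F600"                         # a character outside the BMP
emoji_bytes = emoji.encode("utf-16-le")      # stored as a 4-byte surrogate pair

print(len(emoji_bytes), len(emoji))          # the rule no longer holds here
```

In practice `len()` on the str is the reliable way to count characters; the byte arithmetic only works for text restricted to the Basic Multilingual Plane.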

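Fifth, a string (Unicode) cannot be sent over the network or written to disk as-is: sockets and files carry bytes, and a fixed-width Unicode representation would waste space. A string therefore has to be encoded first, usually to UTF-8.

The following code illustrates these points: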

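from urllib.request import urlopen

# urlopen returns an object of type <class 'http.client.HTTPResponse'>
response = urlopen("http://www.hao123.com/")

# Per points 4 and 5, data cannot travel over the network as Unicode (str),
# so what arrives is a byte string: b'<!DOCTYPE htm... '
content = response.read()

# Per point 3, bytes --decode--> str: '<!DOCTYPE html>...'
# decode() defaults to UTF-8; it is spelled out here for clarity.
string = content.decode("utf-8")

# Per point 5, Unicode (str) cannot be stored directly,
# so encoding="utf-8" converts it to UTF-8 bytes on write.
with open("a.html", "w", encoding="utf-8") as f:
    f.write(string)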
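The decode step above assumes the page is UTF-8, which is not true of every site. Below is a hedged sketch of a more tolerant version: try the charset the server declares, then fall back through a list of candidates. The helper `decode_body` and its fallback order are assumptions for illustration; with a real response, the declared charset could be read with `response.headers.get_content_charset()`.

```python
from typing import Optional, Sequence


def decode_body(raw: bytes, declared: Optional[str] = None,
                fallbacks: Sequence[str] = ("utf-8", "gbk")) -> str:
    """Decode raw page bytes, trying the declared charset first."""
    candidates = ([declared] if declared else []) + list(fallbacks)
    for codec in candidates:
        try:
            return raw.decode(codec)
        except (UnicodeDecodeError, LookupError):
            # Wrong codec for these bytes, or an unknown codec name: try the next one.
            continue
    # Last resort: keep going, replacing undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


# Simulated page body encoded as GBK, so no network access is needed.
gbk_page = "<title>中文</title>".encode("gbk")

print(decode_body(gbk_page))                  # UTF-8 fails, GBK succeeds
print(decode_body(gbk_page, declared="gbk"))  # declared charset is tried first
```

Guessing by trial is a heuristic: some byte sequences decode without error under the wrong codec, so a charset declared by the server or the page itself is preferable when available.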