Python Web Scraping with urllib (Part 1)

Latest recommended article published on 2024-03-23 16:13:03

ydw_ydw

Categories: Python, Web Scraping. Tags: Python, Web Scraping

Article link: https://blog.csdn.net/ydw_ydw/article/details/81950403

Topics covered in this section

Modules included

Handling page encoding problems

The object returned by urlopen (rsp in the examples)

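Modules included

urllib.request: opens and reads URLs
urllib.error: contains the common errors raised by urllib.request; catch them with try/except
urllib.parse: contains functions for parsing URLs
urllib.robotparser: parses robots.txt files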
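
As a rough illustration of how urllib.parse and urllib.error fit around urllib.request, here is a minimal sketch. The helper names split_url and fetch, the example.com URL, and the 10-second timeout are assumptions made for this example, not part of the original article.

```python
from urllib import error, parse, request


def split_url(url):
    """Break a URL into host, path, and a dict of query parameters."""
    parts = parse.urlparse(url)
    query = parse.parse_qs(parts.query)
    return parts.netloc, parts.path, query


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with request.urlopen(url, timeout=timeout) as rsp:
            return rsp.read()
    except error.HTTPError as e:
        # The server answered, but with an error status such as 404.
        print("HTTP error:", e.code)
    except error.URLError as e:
        # No usable answer at all (DNS failure, refused connection, ...).
        print("URL error:", e.reason)
    return None


if __name__ == '__main__':
    # Hypothetical URL, used only to show the parsing step.
    print(split_url("http://example.com/search?q=python&page=2"))
```

HTTPError is a subclass of URLError, so it is caught first; otherwise the more general handler would hide the status code.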
Example 1

'''
Use urllib.request to request a web page and print its content.
'''
from urllib import request


if __name__ == '__main__':

    url = "http://jobs.zhaopin.com/195435110251173.htm?ssidkey=y&ss=409&ff=03&sg=2644e782b8b143419956320b22910c91&so=1"
    # Open the URL; the returned object represents the response
    rsp = request.urlopen(url)

    # Read the response body
    # The content that is read has type bytes
    html = rsp.read()
    print(type(html))

    # To turn the bytes into a string, decode them
    # (this assumes the page is encoded as UTF-8)
    html = html.decode("utf-8")

    print(html)

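Handling page encoding problems

chardet can detect the encoding of a page automatically, but its guess may be wrong.
chardet is a third-party package and must be installed first. To install it: the Scripts folder inside the Python installation directory contains pip.exe (pip ships with Python by default from versions 2.7.9 and 3.4 onward). In a command prompt, change to the folder that contains pip.exe and run pip.exe install chardet
Related reading: "Two ways to get a web page's encoding in Python: requests.get and chardet"
Example 2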

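'''
Download a page with urllib.request
and detect the page encoding automatically.
'''

# "import urllib" alone does not load the request submodule, so import it explicitly
from urllib import request

import chardet

if __name__ == '__main__':
    url = 'http://stock.eastmoney.com/news/1407,20170807763593890.html'

    rsp = request.urlopen(url)

    html = rsp.read()

    # 1. Detect the encoding automatically with chardet
    #    detect() returns a dict that includes the keys 'encoding' and 'confidence'
    cs = chardet.detect(html)
    print(type(cs))
    print(cs)

    # 2. Fall back to utf-8 when no encoding was detected.
    #    'encoding' can be present with the value None, in which case
    #    cs.get("encoding", "utf-8") would not apply the default, so use "or".
    html = html.decode(cs.get("encoding") or "utf-8")
    print(html)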
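
Because the detected encoding can be missing or wrong, the decoding step can be made more defensive. The sketch below assumes a dict shaped like the one chardet.detect returns (keys 'encoding' and 'confidence'). The function name decode_with_guess and the 0.5 confidence threshold are hypothetical choices for illustration. It deliberately does not import chardet, so it runs with the standard library alone.

```python
def decode_with_guess(raw, guess, fallback="utf-8", min_confidence=0.5):
    """Decode bytes using a chardet-style guess, falling back when it is weak."""
    encoding = guess.get("encoding")
    confidence = guess.get("confidence") or 0.0
    if not encoding or confidence < min_confidence:
        # No guess, or a guess too weak to trust: use the fallback encoding.
        encoding = fallback
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Wrong guess or unknown codec name: decode leniently instead of crashing.
        return raw.decode(fallback, errors="replace")


if __name__ == '__main__':
    raw = "编码测试".encode("gbk")
    # Hand-written guesses standing in for the output of chardet.detect(raw).
    print(decode_with_guess(raw, {"encoding": "GB2312", "confidence": 0.99}))
    print(decode_with_guess(b"hello", {"encoding": None, "confidence": 0.0}))
```

With errors="replace", bytes that cannot be decoded become the U+FFFD replacement character, so a bad guess yields visibly damaged text rather than an exception.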

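The object returned by urlopen (rsp in the examples)

Methods available on the returned object
- geturl: returns the URL of the resource that was retrieved
- info: returns the meta information of the response, such as its headers
- getcode: returns the HTTP status code of the response

Example 3 (alternatively, set a breakpoint on the print(type(rsp)) line and inspect rsp in the Console panel at the bottom of the editor to see the URL and other information)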

# "import urllib" alone does not load the request submodule, so import it explicitly
from urllib import request

if __name__ == '__main__':
    url = 'http://stock.eastmoney.com/news/1407,20170807763593890.html'

    rsp = request.urlopen(url)

    print(type(rsp))
    print(rsp)

    print("URL: {0}".format(rsp.geturl()))
    print("Info: {0}".format(rsp.info()))
    print("Code: {0}".format(rsp.getcode()))

    html = rsp.read()

    # decode() with no argument uses utf-8
    html = html.decode()
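
The object returned by info() also exposes the declared content type and charset, which ties back to the encoding problem above. The following is a minimal sketch: the helper name summarize is hypothetical, and a data: URL is used only so the demo runs without network access. For data: URLs getcode() returns None, since no HTTP exchange takes place. Recent Python versions (3.9 and later) document geturl, info, and getcode as deprecated in favor of the url, headers, and status attributes.

```python
from urllib import request


def summarize(rsp):
    """Collect the basic facts about a urlopen() response in one dict."""
    headers = rsp.info()
    return {
        "url": rsp.geturl(),
        "code": rsp.getcode(),
        "content_type": headers.get_content_type(),
        "charset": headers.get_content_charset(),
    }


if __name__ == '__main__':
    # A data: URL keeps the demo offline; urlopen() accepts it like other URLs.
    with request.urlopen("data:text/plain;charset=utf-8,hello") as rsp:
        print(summarize(rsp))
        print(rsp.read().decode("utf-8"))
```

When the server declares a charset in its Content-Type header, get_content_charset() returns it in lowercase; it returns None when no charset is declared, in which case a detector such as chardet is one option.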