Getting Started with Python Web Scraping

Hana I

Article link: https://blog.csdn.net/weixin_43854835/article/details/89519234

Module overview

I am using Python 3.

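# Commonly used scraping modules are urllib (standard library) and requests (third-party)
"""
urllib
	- Submodules
		- urllib.request: open and read URLs
		- urllib.error: the common exceptions raised by urllib.request; catch them with try/except
		- urllib.parse: functions for parsing URLs
		- urllib.robotparser: parse robots.txt
	- Methods on the object returned by urlopen
		- geturl: returns the URL that was actually retrieved
		- info: returns the meta-information (headers) of the response
		- getcode: returns the HTTP status code
"""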
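The list above mentions catching urllib.error exceptions with try and parsing robots.txt, so here is a minimal sketch of both. The helper names `fetch` and `allowed_by_robots`, the sample rules, and the example.com URLs are made up for illustration. The demo uses a `data:` URL and a missing local file instead of a real site so that it runs offline.

```python
from urllib import error, request, robotparser


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with request.urlopen(url, timeout=timeout) as rsp:
            return rsp.read()
    except error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("HTTP error: {0} {1}".format(e.code, e.reason))
    except error.URLError as e:
        # No usable answer at all: DNS failure, refused connection, missing file...
        print("URL error: {0}".format(e.reason))
    return None


def allowed_by_robots(robots_lines, user_agent, url):
    """Check a URL against robots.txt rules that are passed in as a list of lines."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return rp.can_fetch(user_agent, url)


if __name__ == '__main__':
    # Offline stand-ins (assumptions for this demo, not real pages).
    print(fetch("data:text/plain;charset=utf-8,hello"))
    print(fetch("file:///no/such/file/for-this-demo"))

    # Hypothetical robots.txt content.
    rules = ["User-agent: *", "Disallow: /private/"]
    print(allowed_by_robots(rules, "*", "https://example.com/private/a.html"))
    print(allowed_by_robots(rules, "*", "https://example.com/jobs/1.htm"))
```

For a real site you would normally let `RobotFileParser` download the rules itself with `set_url()` and `read()`. Passing the lines in directly just keeps the sketch self-contained.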

A first crawl

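from urllib import request # import the request submodule

if __name__ == '__main__': # run only when executed as a script, not when imported
    url = "https://jobs.zhaopin.com/CC305333513J00235638508.htm" # page to fetch

    rsp = request.urlopen(url) # open the URL; the response object is returned

    # Read the body of the response.
    # The content comes back as bytes.
    html = rsp.read()
    print(type(html))

    html = html.decode("utf-8") # decode bytes to str; with no argument, decode() defaults to utf-8

    print(html)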
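Since read() returns bytes, decode() raises UnicodeDecodeError when the page is not actually UTF-8. Below is a small sketch of one way to cope with that by trying a few encodings in turn. The helper name `decode_html` and the choice of gbk as the fallback are my assumptions (gbk is common on older Chinese sites), not something the original code does.

```python
def decode_html(raw, encodings=("utf-8", "gbk")):
    """Try each encoding in turn and return the first successful decode."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding and replace the
    # undecodable bytes instead of raising.
    return raw.decode(encodings[0], errors="replace")


if __name__ == '__main__':
    # Stand-ins for rsp.read(): the same text encoded two different ways.
    utf8_bytes = "你好".encode("utf-8")
    gbk_bytes = "你好".encode("gbk")

    print(type(utf8_bytes))
    print(decode_html(utf8_bytes))
    print(decode_html(gbk_bytes))
```

Guessing like this can still pick the wrong encoding for some byte sequences, so treat it as a fallback rather than a replacement for knowing what the page declares.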

Response details

from urllib import request

if __name__ == '__main__':
    url = "https://jobs.zhaopin.com/CC305333513J00235638508.htm"

    rsp = request.urlopen(url)
    print(type(rsp))
    print(rsp)

    print("URL: {0}".format(rsp.geturl()))
    print("Info: {0}".format(rsp.info()))
    print("Code: {0}".format(rsp.getcode()))

    html = rsp.read()

    # Decode the bytes into a str (utf-8 by default)
    html = html.decode()
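The object returned by info() holds the response headers, and those often name the charset of the page. Here is a rough sketch of using that instead of assuming UTF-8. The helper name `read_text` is hypothetical, and the `data:` URL is only a stand-in for a real page so the example runs without a network connection.

```python
from urllib import request


def read_text(rsp, default="utf-8"):
    """Decode a urlopen() response using the charset named in its headers."""
    # rsp.info() returns an email.message.Message holding the headers;
    # get_content_charset() returns None when no charset is declared.
    charset = rsp.info().get_content_charset() or default
    return rsp.read().decode(charset, errors="replace")


if __name__ == '__main__':
    # Stand-in for a real page: %C4%E3%BA%C3 is "你好" encoded as GBK.
    url = "data:text/plain;charset=gbk,%C4%E3%BA%C3"
    with request.urlopen(url) as rsp:
        print("Content-Type: {0}".format(rsp.info().get_content_type()))
        print("Charset: {0}".format(rsp.info().get_content_charset()))
        print(read_text(rsp))
```

Some servers omit the charset from the Content-Type header, in which case this falls back to `default`.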

Search

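from urllib import request, parse
# parse is new here:
# urllib.parse contains the URL-parsing functions
# described in the module overview

if __name__ == '__main__':
    url = "https://www.baidu.com/s?"
    wd = input("Enter a search term: ")

    # urlencode expects the query parameters as a dict
    qs = {
        "wd": wd
    }
    # URL-encode the parameters
    qs = parse.urlencode(qs)

    fullurl = url + qs
    print(fullurl)

    # Request the full URL (base URL plus query string), not just the base URL
    rsp = request.urlopen(fullurl)
    print(type(rsp))
    print(rsp)

    html = rsp.read()

    # Decode the bytes into a str (utf-8 by default)
    html = html.decode()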
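To see what urlencode actually produces, here is a small offline sketch that builds the search URL, decodes it back with parse_qs, and wraps it in a Request object. The helper names, the sample keyword, and the User-Agent value are my own assumptions. Whether a site needs a browser-like User-Agent is also an assumption, so nothing is sent over the network here.

```python
from urllib import parse, request

BASE_URL = "https://www.baidu.com/s?"


def build_search_url(keyword):
    """Append the URL-encoded query string to the base URL."""
    return BASE_URL + parse.urlencode({"wd": keyword})


def build_request(keyword, user_agent="Mozilla/5.0 (illustrative value)"):
    """Wrap the search URL in a Request that carries a custom User-Agent."""
    return request.Request(build_search_url(keyword),
                           headers={"User-Agent": user_agent})


if __name__ == '__main__':
    fullurl = build_search_url("python 爬虫")
    # Spaces become "+" and non-ASCII characters become %XX UTF-8 escapes.
    print(fullurl)

    # parse_qs reverses the encoding and returns a dict of lists.
    print(parse.parse_qs(parse.urlsplit(fullurl).query))

    req = build_request("python 爬虫")
    print(req.full_url)
    print(req.get_method())
    # request.urlopen(req) would send it; left out so the sketch stays offline.
```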
