Learning Python Web Crawling
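- URL (a kind of URI): Uniform Resource Locator
  Components: protocol, host, path
- urllib package: fetches web pages and handles URLs. It contains these modules:
  request: opens and reads URLs
  error: the exceptions raised by request (can be caught with try)
  parse: parses URLs
  robotparser: tests whether a page may be downloaded by a crawler
- A simple web page fetch with urllib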
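# -*- coding: UTF-8 -*-
from urllib import request

import chardet  # third-party package: pip install chardet

if __name__ == "__main__":
    # Fetch the page as raw bytes
    response = request.urlopen("http://fanyi.baidu.com")
    html = response.read()
    # Guess the encoding from the bytes; fall back to UTF-8 if detection fails
    charset = chardet.detect(html)
    html = html.decode(charset["encoding"] or "utf-8")
    print(html)
    # Specify UTF-8 explicitly so the write does not depend on the platform default
    with open("out.txt", "w", encoding="utf-8") as f:
        f.write(html)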
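This reads and prints the Baidu Translate page. The chardet package detects the encoding automatically (you can also look it up by hand with the browser's Inspect Element).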
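The fetch above assumes the request succeeds. As a minimal sketch of catching the `urllib.error` exceptions with `try`, the helper below returns `None` instead of raising. The name `fetch_text`, the 10-second timeout, and the UTF-8 fallback are assumptions made for this example, not part of urllib or of the original script.

```python
from urllib import error, request


def fetch_text(url, timeout=10):
    """Return the decoded body of `url`, or None if urllib reports an error.

    Hypothetical helper for illustration; it is not part of urllib.
    """
    try:
        with request.urlopen(url, timeout=timeout) as response:
            raw = response.read()
            # Use the charset declared in the response headers, if any
            # (assumption: fall back to UTF-8 when none is declared).
            charset = response.headers.get_content_charset() or "utf-8"
    except error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("HTTP error:", exc.code, exc.reason)
        return None
    except error.URLError as exc:
        # The request could not be completed: DNS failure, refused connection, etc.
        print("URL error:", exc.reason)
        return None
    return raw.decode(charset, errors="replace")
```

Calling `fetch_text("http://fanyi.baidu.com")` would then give either the page text or `None`, so the caller can decide whether to retry or skip the page.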
urlopen accepts either a URL string or a Request object:
obj = request.Request("http://fanyi.baidu.com/")
response = request.urlopen(obj)
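The URL handed to `Request` above is made of the protocol, host and path parts listed at the top of these notes. A small sketch of how `urllib.parse.urlparse` separates them follows; the longer URL with a path and query string is invented for illustration.

```python
from urllib.parse import urlparse

# Hypothetical URL for illustration: the path and query are invented.
url = "http://fanyi.baidu.com/translate?query=hello"

parts = urlparse(url)
print(parts.scheme)  # protocol: http
print(parts.netloc)  # host: fanyi.baidu.com
print(parts.path)    # path: /translate
print(parts.query)   # query string: query=hello
```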
Other methods of the Request object:
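As an illustration, here are a few methods and attributes that `urllib.request.Request` provides. This is a minimal sketch rather than a complete list, and the User-Agent value is an assumption chosen for the example.

```python
from urllib import request

obj = request.Request("http://fanyi.baidu.com/")

# Add a request header (the User-Agent string here is only an example value).
obj.add_header("User-Agent", "Mozilla/5.0")

print(obj.get_full_url())            # http://fanyi.baidu.com/
print(obj.get_method())              # GET, because no data was supplied
print(obj.host)                      # fanyi.baidu.com
print(obj.type)                      # http
# Header names are stored capitalized, so "User-Agent" becomes "User-agent".
print(obj.has_header("User-agent"))  # True
print(obj.get_header("User-agent"))  # Mozilla/5.0
```

None of these calls touch the network; the request is only sent when the object is passed to `urlopen`.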
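Note: when writing to the txt file, specify UTF-8 explicitly (the default on this Windows system is gbk).
Error message when the encoding is not specified: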
Traceback (most recent call last):
  File "C:/Users/MACHENIKE/PycharmProjects/untitled/crawler_demo1.py", line 12, in <module>
    f.write(html)
UnicodeEncodeError: 'gbk' codec can't encode character '\u0e02' in position 58895: illegal multibyte sequence
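The error above can be reproduced without a network request. A minimal sketch, assuming a temporary file in place of out.txt: the character `'\u0e02'` named in the traceback cannot be encoded as gbk, so writing it with `encoding="gbk"` fails, while `encoding="utf-8"` succeeds.

```python
import os
import tempfile

text = "\u0e02"  # the character named in the traceback (a Thai letter)

# Assumption: a temporary file stands in for out.txt.
path = os.path.join(tempfile.mkdtemp(), "out.txt")

# Forcing gbk mimics the default encoding on the machine in the traceback.
gbk_failed = False
try:
    with open(path, "w", encoding="gbk") as f:
        f.write(text)
except UnicodeEncodeError as exc:
    gbk_failed = True
    print("gbk failed:", exc)

# UTF-8 can encode this character, so the same write succeeds.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, encoding="utf-8") as f:
    print(f.read() == text)  # True
```

Passing `encoding` explicitly keeps the script's behavior the same on every platform, instead of depending on the locale default that `open` would otherwise use.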