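import re
import urllib.request

url = "http://news.xinhuanet.com/politics/2015-12/20/c_128549058.htm"
data = urllib.request.urlopen(url)
# urlopen returns a response object with various methods (read, getcode, ...)
txt = data.read().decode("utf-8", "ignore")
# "ignore" keeps decode() from crashing when the page is not UTF-8
# So far I can only handle UTF-8; Sina News and NetEase News pages come out wrong
# Mainly because I don't know how to convert encodings yet ←_←
title = re.compile(r"""<title>(.*?)</title>""", re.DOTALL)
# re.DOTALL lets '.' match newlines too
# "titles.txt" is an assumed output name; the original never opened a file
with open("titles.txt", "w", encoding="utf-8") as file:
    for ch in title.finditer(txt):  # search the decoded text, not the response object
        file.write(ch.group(1) + '\n')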
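
On the encoding problem mentioned in the comments: one possible way around it is to stop hard-coding UTF-8 and try whatever charset the page itself declares first. The sketch below is only a rough idea, not something tested against Sina or NetEase. The names `decode_html`, `fetch_text` and `META_CHARSET` are made up for illustration, and it assumes the page declares its charset either in the HTTP `Content-Type` header or in a `<meta>` tag near the top of the document.

```python
import re
import urllib.request

# Hypothetical helper names, not from the original snippet.
# Matches <meta charset="gbk"> as well as
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
META_CHARSET = re.compile(rb"""<meta[^>]+charset=["']?([\w-]+)""", re.IGNORECASE)


def decode_html(raw, header_charset=None):
    """Decode page bytes, trying declared charsets before falling back to UTF-8."""
    candidates = []
    if header_charset:
        candidates.append(header_charset)
    # Assumption: the <meta> declaration sits within the first 4 KB of the page.
    m = META_CHARSET.search(raw[:4096])
    if m:
        candidates.append(m.group(1).decode("ascii"))
    candidates.append("utf-8")
    for enc in candidates:
        try:
            return raw.decode(enc)
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or wrong guess: try the next candidate.
            continue
    # Last resort: same behaviour as the original snippet.
    return raw.decode("utf-8", "ignore")


def fetch_text(url):
    """Download a page and decode it with the charset it declares, if any."""
    with urllib.request.urlopen(url) as resp:
        header_charset = resp.headers.get_content_charset()
        raw = resp.read()
    return decode_html(raw, header_charset)
```

With this in place, `txt = fetch_text(url)` could stand in for the `urlopen` / `read` / `decode` lines above. Pages that declare a wrong charset, or none at all, would still fall through to the lossy `"ignore"` decode.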
The articles below are by an expert; saving them here to read through later.
https://jecvay.com/2014/09/python3-web-bug-series1.html
https://jecvay.com/2014/09/python3-web-bug-series2.html
https://jecvay.com/2014/09/python3-web-bug-series3.html
https://jecvay.com/2014/10/python3-web-bug-series4.html
https://jecvay.com/2015/02/python3-web-bug-series5.html
Getting Started with Web Crawlers