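When scraping a web page with Python, the request came back with a 404 Not Found error even though the URL was correct.
After searching online, the fix turned out to be setting a User-Agent header: by default urllib identifies itself as "Python-urllib/x.y", and some servers refuse requests that do not look like they come from a browser.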
# -*- coding:utf-8 -*-
import urllib.request
import re
# Target URL
# url ="http://www.cnblogs.com/GuoYaxiang/p/6232831.html"
url = "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2016/11.html"
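# Send a browser-like User-Agent instead of urllib's default "Python-urllib/x.y"
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()
# Decode as GBK, then strip newlines and tabs so the regex sees one continuous string
html = html.decode("gbk").replace('\n', '').replace('\t', '')
# print(html)
# Capture the text between each "citytr" marker and the next "html"
matches = re.findall('citytr(.*?)html', html)
print(matches)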
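
The same idea can be wrapped in a small helper that reports the HTTP status when a request fails. This is only a sketch: the names `build_request` and `fetch`, the 10-second timeout, and the `errors="replace"` decoding are my own assumptions and are not part of the script above.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Hypothetical helper: attach a browser-like User-Agent to the request."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, encoding="gbk", timeout=10):
    """Hypothetical helper: return the decoded page, or None on an HTTP error."""
    req = build_request(url)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            # errors="replace" is an assumption: it keeps going if a few bytes are not valid GBK
            return resp.read().decode(encoding, errors="replace")
    except urllib.error.HTTPError as err:
        # A 404 or 403 lands here; print the status code to make the failure visible
        print(f"HTTP {err.code} for {url}")
        return None
```

Calling `fetch(url)` with the `url` from the script should then return the page text, or `None` if the server still answers with an error status.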