With nothing better to do, I scraped some jokes from Qiushibaike to read.
In Python 3, opening a Qiushibaike link with urllib.request.urlopen() raises the following error:
http.client.RemoteDisconnected: Remote end closed connection without response
Other links open fine, which was puzzling. As alternatives you can switch to the third-party requests module, or to urllib3; another third-party module worth knowing is bs4 (beautifulsoup4).
After some persistent digging I finally found the cause: no headers were being sent. Adding headers makes the site believe the request comes from a browser, and the error goes away.
```python
import urllib.request

url = 'http://www.qiushibaike.com/8hr/page/5/'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-agent': user_agent}

request = urllib.request.Request(url, headers=headers)
html = urllib.request.urlopen(request)
print(html.read().decode())
```
I won't cover installing and using the requests module here.
Official docs: http://docs.python-requests.org/en/master/
Chinese docs: http://cn.python-requests.org/zh_CN/latest/
```python
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
u'{"type":"User"...'
>>> r.json()
{u'private_gists': 419, u'total_private_repos': 77, ...}
```
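The session above doesn't show the headers fix from earlier. As a minimal sketch (the User-Agent string and URL are only illustrative, not a recommendation), here is how browser-style headers attach to a requests call; the request is prepared but not sent, so you can inspect exactly what would go on the wire:

```python
import requests

# Browser-style headers, per the fix described above; the UA string is illustrative.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# Build the request without sending it, to show what would go on the wire.
req = requests.Request('GET', 'http://www.qiushibaike.com/8hr/page/5/', headers=headers)
prepared = req.prepare()
print(prepared.headers['User-Agent'])

# To actually fetch the page:
# resp = requests.Session().send(prepared)
# print(resp.text)
```

In everyday use you would simply pass the same dict to `requests.get(url, headers=headers)`.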
Likewise, I won't cover installing and using the urllib3 module here.
Official docs: https://urllib3.readthedocs.io/en/latest/
```python
>>> import urllib3
>>> http = urllib3.PoolManager()
>>> r = http.request('GET', 'http://httpbin.org/robots.txt')
>>> r.status
200
>>> r.data
'User-agent: *\nDisallow: /deny\n'
```
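urllib3 accepts a plain dict of headers on each request, and its `make_headers` helper builds common ones. A small sketch (URL and User-Agent string are illustrative; the live call is commented out to avoid a network dependency):

```python
import urllib3
from urllib3.util import make_headers

# make_headers builds a header dict; here it only sets the User-Agent.
headers = make_headers(user_agent='Mozilla/5.0 (Windows NT 10.0)')
print(headers)

http = urllib3.PoolManager()
# Pass the dict per request:
# r = http.request('GET', 'http://www.qiushibaike.com/8hr/page/5/', headers=headers)
# print(r.status)
```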
The same goes for installing and using the bs4 module.
Official docs: https://www.crummy.com/software/BeautifulSoup/
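Since the article doesn't show bs4 in action, here is a minimal sketch parsing a hand-made HTML snippet shaped like the markup the regexes below target (the snippet itself is invented for illustration):

```python
from bs4 import BeautifulSoup

# A tiny invented snippet mimicking the structure the scraper targets.
html = '''
<div class="content"><span>joke one</span></div>
<div class="content"><span>joke two</span></div>
<img src="http://example.com/1.jpeg" alt="pic">
'''

soup = BeautifulSoup(html, 'html.parser')
jokes = [span.get_text() for span in soup.select('div.content span')]
imgs = [img['src'] for img in soup.find_all('img')]
print(jokes)   # ['joke one', 'joke two']
print(imgs)    # ['http://example.com/1.jpeg']
```

A parser like this is generally more robust than regexes when the page markup shifts slightly.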
That covers the three modules; anyone interested can study them further on their own. Below is the code:
Scrape Qiushibaike's jokes and images
```python
import requests
import urllib.request
import re

def get_html(url):
    # Fetch the page and return its text.
    page = requests.get(url)
    return page.text

def get_text(html, file):
    # Pull every joke out of the page source with a regex and write them
    # to `file`, numbered and separated by blank lines.
    textre = re.compile(r'content">\n*<span>(.*)</span>')
    textlist = re.findall(textre, html)
    num = 0
    txt = []
    for i in textlist:
        num += 1
        txt.append(str(num) + '.' + i + '\n' * 2)
    with open(file, 'w', encoding='utf-8') as f:
        f.writelines(txt)

def get_img(html):
    # Find every .jpeg image URL (case-insensitive) and download
    # each one as 1.jpg, 2.jpg, ...
    imgre = re.compile(r'<img src="(.*\.JPEG)" alt=', re.IGNORECASE)
    imglist = re.findall(imgre, html)
    x = 0
    for imgurl in imglist:
        x += 1
        urllib.request.urlretrieve(imgurl, '%s.jpg' % x)

html = get_html("http://www.qiushibaike.com/8hr/page/2/")
get_text(html, 'a.txt')
get_img(html)
```
This article was reposted from baby神's 51CTO blog. Original link: http://blog.51cto.com/babyshen/1889553. Please contact the original author for reprint permission.