Python 3 Web Scraping: Crawling Qiushibaike

Latest recommended article published on 2024-11-13 10:45:03

weixin_33860553

Tags: crawler, python, json

Original link: https://yq.aliyun.com/articles/530668

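With some free time on my hands, I decided to scrape a few jokes from Qiushibaike.

In Python 3, opening a Qiushibaike link with urllib.request.urlopen() fails with the following error: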

http.client.RemoteDisconnected: Remote end closed connection without response

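Other links opened fine, which was puzzling at first. As a workaround I switched to the third-party requests module; urllib3 is another option, and bs4 (beautifulsoup4) is one more third-party module worth knowing.

After some digging, the cause turned out to be missing headers. The request needs headers that make the site believe it comes from a browser; once they are added, the error goes away.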

 
    import urllib.request

    url = 'http://www.qiushibaike.com/8hr/page/5/'
    # Pretend to be a browser so the site does not drop the connection
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-agent': user_agent}
    # Attach the headers to the request before opening it
    request = urllib.request.Request(url, headers=headers)
    html = urllib.request.urlopen(request)
    print(html.read().decode())
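As a hedged sketch building on the snippet above, the header setup can be wrapped in a small helper and the request guarded against the RemoteDisconnected error. The names `build_request`, `fetch`, and `DEFAULT_USER_AGENT` are hypothetical, and the sketch assumes that any browser-like User-Agent string is enough for the site.

```python
import http.client
import urllib.error
import urllib.request

# Assumption: a browser-like User-Agent is all the site checks for.
DEFAULT_USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Return a Request object that carries a browser-like User-Agent."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def fetch(url, timeout=10):
    """Return the decoded page text, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.read().decode('utf-8')
    except (http.client.RemoteDisconnected, urllib.error.URLError) as exc:
        # RemoteDisconnected is the error seen when the header is missing
        print('request failed:', exc)
        return None


# Building the request does not touch the network, so the header can be
# inspected first. urllib stores header names capitalized as 'User-agent'.
req = build_request('http://www.qiushibaike.com/8hr/page/5/')
print(req.get_header('User-agent'))
```

Calling `fetch('http://www.qiushibaike.com/8hr/page/5/')` would then perform the actual download; whether the site still responds at that address is not something this sketch can guarantee.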

Installing and using the requests module is not covered here.

Official documentation: http://docs.python-requests.org/en/master/

Chinese documentation: http://cn.python-requests.org/zh_CN/latest/

 
    >>> import requests
    >>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
    >>> r.status_code
    200
    >>> r.headers['content-type']
    'application/json; charset=utf8'
    >>> r.encoding
    'utf-8'
    >>> r.text
    '{"type":"User"...'
    >>> r.json()
    {'private_gists': 419, 'total_private_repos': 77, ...}

Installing and using the urllib3 module is not covered here either.

Official documentation: https://urllib3.readthedocs.io/en/latest/

 
    >>> import urllib3
    >>> http = urllib3.PoolManager()
    >>> r = http.request('GET', 'http://httpbin.org/robots.txt')
    >>> r.status
    200
    >>> r.data
    b'User-agent: *\nDisallow: /deny\n'

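Installing and using the bs4 module

Official site: https://www.crummy.com/software/BeautifulSoup/

That covers the three modules; if you are interested, you can study them on your own. The code follows.

Scraping jokes and images from Qiushibaike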

 
    import requests
    import urllib.request
    import re

    def get_html(url):
        # Download the page and return its HTML as text
        page = requests.get(url)
        return page.text

    def get_text(html, file):
        # Capture the joke text inside <span> under each "content" block
        textre = re.compile(r'content">\n*<span>(.*)</span>')
        textlist = re.findall(textre, html)
        num = 0
        txt = []
        for i in textlist:
            num += 1
            # Number each joke and separate entries with a blank line
            txt.append(str(num) + '.' + i + '\n' * 2)
        with open(file, 'w', encoding='utf-8') as f:
            f.writelines(txt)

    def get_img(html):
        # Capture image URLs ending in .jpeg (case-insensitive)
        imgre = re.compile(r'<img src="(.*\.JPEG)" alt=', re.IGNORECASE)
        imglist = re.findall(imgre, html)
        x = 0
        for imgurl in imglist:
            x += 1
            # Save each image as 1.jpg, 2.jpg, ... in the current directory
            urllib.request.urlretrieve(imgurl, '%s.jpg' % x)

    html = get_html("http://www.qiushibaike.com/8hr/page/2/")
    get_text(html, 'a.txt')
    get_img(html)
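The patterns in the script use a greedy `.*`, which can swallow more than intended when two matches sit on the same line. The following is a minimal sketch of a non-greedy variant, run against a made-up HTML fragment so it works offline. `SAMPLE_HTML`, `extract_jokes`, `extract_image_urls`, and `number_jokes` are hypothetical names, the fragment only imitates the markup the original patterns expect, and accepting `.jpg` as well as `.jpeg` is an assumption that goes beyond the original pattern.

```python
import re

# Hypothetical fragment shaped like what the patterns expect;
# the real page markup may differ.
SAMPLE_HTML = (
    '<div class="content">\n<span>first joke</span>\n</div>\n'
    '<div class="content">\n<span>second joke</span>\n</div>\n'
    '<img src="http://example.com/a.jpg" alt="a">'
    '<img src="http://example.com/b.JPEG" alt="b">\n'
)

# (.*?) stops at the first closing </span> instead of the last one on the line
TEXT_RE = re.compile(r'content">\n*<span>(.*?)</span>')
# [^"]* keeps the match inside a single src attribute;
# jpe?g accepts both .jpg and .jpeg (an assumption, see above)
IMG_RE = re.compile(r'<img src="([^"]*\.jpe?g)" alt=', re.IGNORECASE)


def extract_jokes(html):
    """Return the text of every joke found in html."""
    return TEXT_RE.findall(html)


def extract_image_urls(html):
    """Return every matching image URL found in html."""
    return IMG_RE.findall(html)


def number_jokes(jokes):
    """Prefix each joke with its 1-based index, as the script does."""
    return ['%d.%s\n\n' % (n, joke) for n, joke in enumerate(jokes, start=1)]


print(extract_jokes(SAMPLE_HTML))
print(extract_image_urls(SAMPLE_HTML))
```

On this sample, the greedy `(.*\.JPEG)` pattern would return a single string running from the first `src` through the second image, which is why the sketch restricts the capture to characters other than a double quote.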

 
This article is reposted from baby神's 51CTO blog. Original link: http://blog.51cto.com/babyshen/1889553. To repost it, please contact the original author.
