Python Web Crawler: A Technical Summary

Latest recommended article published on 2024-06-20 07:25:25

南歌子


Category: Technical Summaries. Tags: Python, Python crawler, multithreading

Article link: https://blog.csdn.net/jnnock/article/details/9233891


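I have spent some time recently experimenting with web crawlers in Python and have enjoyed it a lot, so here is what I came up with.

As before, the example site is 伯乐在线 (blog.jobbole.com). The script scrapes the articles listed in the site's latest-posts section and follows the pagination to collect more of them, appending everything to a single text file for offline reading.

Here is the code: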

import urllib2
import threading
import re
import sys
# Python 2 only: make UTF-8 the default codec for implicit str/unicode conversions.
reload(sys)
sys.setdefaultencoding('utf-8')

num=1
class Get(threading.Thread):
    def __init__(self,lock,link,list):
        threading.Thread.__init__(self)
        self.lock=lock
        self.link=link
        self.list=list
    def filter_tags(self,htmlstr):
        # Strip CDATA sections, <script>/<style> blocks, tags and comments, leaving plain text.
        re_cdata=re.compile(r'//<!\[CDATA\[[^>]*//\]\]>',re.I)
        re_script=re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>',re.I)
        re_style=re.compile(r'<\s*style[^>]*>[^<]*<\s*/\s*style\s*>',re.I)
        re_p=re.compile(r'<p\s*?/?>',re.I)    # bare paragraph tags become line breaks
        re_h=re.compile(r'</?\w+[^>]*>')      # any remaining opening or closing tag
        re_comment=re.compile(r'<!--[^>]*-->')
        s=re_cdata.sub('',htmlstr)
        s=re_script.sub('',s)
        s=re_style.sub('',s)
        s=re_p.sub('\n',s)
        s=re_h.sub('',s)
        s=re_comment.sub('',s)
        blank_line=re.compile(r'\n+')         # collapse runs of blank lines
        s=blank_line.sub('\n',s)
        return s
    def run(self):
        # Download and parse outside the lock so that pages are fetched in parallel.
        content=urllib2.urlopen(self.link).read()
        title=r'<h1>(.+)</h1>'
        result=re.findall(title,content)
        print "%s \n" %",".join(result)
        cn=r'<div class="entry">(.*)<!-- END .entry -->'
        tem=re.findall(cn,content,re.S)

        # Hold the lock only while appending to the output file; the with
        # statement releases it even if the write raises.
        with self.lock:
            with open('/home/tron/Python/code/'+'News'+'.txt','a+') as f:
                for j,cc in zip(result,tem):
                    f.write(j+"\n"+self.filter_tags(cc))
        # The thread that was handed index 0 moves on to the next listing page.
        if self.list<1:
            global num
            num+=1
            if num<4:
                main("http://blog.jobbole.com/all-posts/page/"+str(num)+"/")
            else:
                print "No More"

def main(info):
    # Match only the main article list so that sidebar links are ignored.
    square=r'<div class="grid-8" id="archive">(.*)<!-- END .grid-8 -->'
    content=urllib2.urlopen(info).read()
    div=re.search(square,content,re.S)
    if div is None:
        print "Article list not found: %s" % info
        return
    link=r'http://blog.jobbole.com/\d{5}/'
    result=set(re.findall(link,div.group()))  # de-duplicate article URLs
    remaining=len(result)
    lock=threading.Lock()
    for i in result:
        # Each thread gets a countdown index; the one that receives 0 loads the next page.
        remaining-=1
        Get(lock,i,remaining).start()
if __name__=="__main__":
    main("http://blog.jobbole.com/all-posts/page/1/")

As in my earlier posts, I use multiple threads so that page content is fetched more efficiently.
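For readers on Python 3 (the listing above targets Python 2), the same idea can be sketched with the standard library's `concurrent.futures.ThreadPoolExecutor` rather than a `threading.Thread` subclass. This is only a minimal sketch: the names `fetch` and `fetch_all`, the timeout, and the `max_workers` value are illustrative assumptions and do not come from the original script.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url):
    """Download one page and return its raw bytes (hypothetical helper)."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def fetch_all(urls, fetcher=fetch, max_workers=4):
    """Run fetcher over urls on a small thread pool.

    pool.map returns results in the same order as the input URLs,
    regardless of which download finishes first.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetcher, urls))
```

Passing the download function in as `fetcher` makes it possible to try the pool with a stub before pointing it at a real site.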

The filter_tags() function removes tags such as <p/> and <div/> from the page. Without it, the saved text would be hard to read.
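Regular expressions are what the listing above uses. As an alternative, here is a minimal Python 3 sketch that does the same job with the standard library's `html.parser.HTMLParser`. It is a different technique from the original, and the names `TextExtractor` and `strip_tags` are assumptions made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag in ("p", "br"):
            # Paragraphs and line breaks become newlines in the output.
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def strip_tags(html):
    """Return the visible text of an HTML fragment (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()
```

A parser copes with attributes on tags and with nested markup, both of which are easy to get wrong in hand-written patterns.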

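With this code I fetched the first three pages of the listing, keeping only the published articles and leaving out the "popular articles" block in the right-hand sidebar. The code still contains some redundancy.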
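The two steps described here, building the listing-page URLs and keeping only links from the main article block, can be sketched in Python 3 as follows. The regular expressions and the URL pattern are taken from the listing above; the function names `page_urls` and `article_links` are assumptions for illustration, and the sketch makes no claim about the site's current markup.

```python
import re

# Patterns reused from the original script.
ARCHIVE_RE = re.compile(
    r'<div class="grid-8" id="archive">(.*)<!-- END .grid-8 -->', re.S
)
ARTICLE_RE = re.compile(r'http://blog\.jobbole\.com/\d{5}/')


def page_urls(first=1, last=3):
    """Return listing-page URLs for pages first..last inclusive."""
    base = "http://blog.jobbole.com/all-posts/page/%d/"
    return [base % n for n in range(first, last + 1)]


def article_links(listing_html):
    """Return the unique article URLs inside the archive block, sorted.

    Links outside the block (for example in a sidebar) are ignored.
    """
    block = ARCHIVE_RE.search(listing_html)
    if block is None:
        return []
    return sorted(set(ARTICLE_RE.findall(block.group(1))))
```

Computing the page URLs up front avoids the shared `num` counter, so no thread has to decide when the next page should be loaded.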
