A spider found "love" in 3 minutes

Latest recommended article published on 2024-07-23 14:36:35

scscsoce

Tags: url, python, domain, multithreaded server, shell

Article link: https://blog.csdn.net/scscsoce/article/details/3410800

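I wanted to know: for a spider running on a single PC, with sina.com as the start URL, how quickly can it find the word "love"?
Below is a simple breadth-first-search spider written in Python.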

 
# -*- coding: UTF-8 -*-
'''Given a key and a start_url, the spider is supposed to find the key in HTML as fast as possible.
Of course this program is not fast; it is just a baseline test.
'''
import queue
import re
from urllib.parse import urljoin

import spider_lib as slib  # the author's helper module, not shown in this post

if __name__ == "__main__":
    start_url = "http://www.sina.com"
    key = r'\blove\b'  # match "love" as a whole word
    urls = queue.Queue()  # FIFO order makes the traversal breadth-first
    url_hash = {}  # URLs that have already been processed
    urls.put(start_url)
    while not urls.empty():
        url = urls.get()
        if url in url_hash:
            continue
        url_hash[url] = 1
        try:
            print('process ' + url)
            lines = slib.fetch_page_as_lines(url)
        except Exception as e:
            # skip pages that fail to download instead of reusing stale lines
            print(e)
            continue
        match_line = slib.extract_lines(lines, key)
        if match_line != []:
            print('got %s in %s at %s' % (key, match_line[0][0], url))
            break
        for line in lines:
            for new_url in re.findall(r'<a[^>]*href="([^"]*)"', line):
                if new_url.lower().startswith('http'):
                    urls.put(new_url)
                elif not re.match('[^/]+://', new_url):
                    # resolve relative links against the current page
                    urls.put(urljoin(url, new_url))
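
The spider_lib module is not shown in this post, so the listing cannot be run as it stands. The sketch below is only a guess at what its two helpers might look like, written for Python 3. The timeout parameter, the UTF-8 decoding, and the (line text, line number) shape of the results are all assumptions, not the author's actual code.

```python
import re
import urllib.request


def fetch_page_as_lines(url, timeout=10):
    """Download url and return its body as a list of text lines (assumed behaviour)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        raw = resp.read()
    # The encoding is a guess; undecodable bytes are replaced rather than raising.
    return raw.decode('utf-8', errors='replace').splitlines()


def extract_lines(lines, pattern):
    """Return (line_text, line_number) pairs for the lines matching pattern (assumed shape)."""
    regex = re.compile(pattern)
    return [(line, number)
            for number, line in enumerate(lines, 1)
            if regex.search(line)]
```

With helpers of this shape, match_line[0][0] in the spider would be the text of the first matching line, and a whole-word pattern such as r'\blove\b' would skip words like "gloves" or "clover".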
                        
        
 

Then run it in a shell: >> time python ./search.py
The result is close to 3 minutes:
    real    2m57.054s
    user    0m0.332s
    sys     0m0.048s
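
In the timing above, real is far larger than user plus sys, which suggests the spider spends most of its time waiting rather than computing. The same comparison can be made from inside Python. This is a minimal sketch in which timed and fake_download are hypothetical names, and a sleep stands in for a network download.

```python
import time


def timed(func, *args):
    """Run func(*args) and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # elapsed time, comparable to "real"
    cpu_start = time.process_time()    # CPU time of this process, comparable to user + sys
    result = func(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


def fake_download(delay):
    """Stand-in for a network fetch: it sleeps instead of downloading."""
    time.sleep(delay)
    return 'done'
```

Calling timed(fake_download, 0.2) should report a wall time of about 0.2 seconds and a much smaller CPU time, the same pattern as in the shell measurement.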

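From a programming standpoint this is quite slow, but it is only a "baseline" program, and there are many ways to optimize it:
Use multiple threads
Rewrite it in C
Get hired by sina and run it on their servers
...

Want to give it a try?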
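
The multithreading idea from the list above could be sketched as follows, in Python 3. Because the measured user time is well under a second, most of the elapsed time is presumably network wait, which is where threads tend to help. The function threaded_search, its max_workers and max_pages parameters, and the fetch callable are all illustrative assumptions. The search still proceeds breadth-first, but each level of links is downloaded in parallel.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin

LINK_RE = re.compile(r'<a[^>]*href="([^"]*)"')


def threaded_search(start_url, key, fetch, max_workers=8, max_pages=1000):
    """Breadth-first search for key, fetching each level of links in parallel.

    fetch is any callable that maps a URL to a list of lines; it may raise.
    Returns the first URL (in level order) whose page matches, or None.
    """
    pattern = re.compile(key)
    seen = {start_url}
    frontier = [start_url]

    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            return []  # treat a failed download as an empty page

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # max_pages is a rough cap on the number of discovered URLs.
        while frontier and len(seen) <= max_pages:
            # Download the whole level concurrently; map() keeps the input order.
            pages = list(pool.map(safe_fetch, frontier))
            next_frontier = []
            for url, lines in zip(frontier, pages):
                if any(pattern.search(line) for line in lines):
                    return url
                for line in lines:
                    for href in LINK_RE.findall(line):
                        link = urljoin(url, href)  # resolve relative links
                        if link.startswith('http') and link not in seen:
                            seen.add(link)
                            next_frontier.append(link)
            frontier = next_frontier
    return None
```

Passing the download function in as fetch keeps the sketch testable without a network: a dictionary lookup can stand in for real pages. One design trade-off is that a slow page holds up its whole level, so a shared work queue would likely do better on a real crawl.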
