A Web Spider (Web Crawler) Written in Python

vince_zw

Categories: Search, Python. Tags: python, Web spider, web crawler

Article link: https://blog.csdn.net/zhaowen25/article/details/47132445

A Web spider written in Python:

<span style="font-size:14px;"># web spider
# author vince 2015/7/29
import re
import urllib.request

# capture the href value of <a> tags; only links starting with "h" (http/https)
pattern = re.compile(r'<a\s+(?:[^>]*?\s+)?href="(h[^"]*)"')
t = set()          # URLs waiting to be fetched
visited = set()    # URLs already fetched, so no page is requested twice

def fetch(url):
    http_request = urllib.request.Request(url)
    http_request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36')
    http_response = urllib.request.urlopen(http_request)
    print(http_response.status)
    if http_response.status == 200:
        for i in range(0, 2000):     # read at most 2000 lines
            html = http_response.readline()
            if html == b'':          # end of the response body
                break
            # a single line may hold several links, so collect every match
            for href in pattern.findall(html.decode('utf-8', 'ignore')):
                if href not in visited:
                    print(href)
                    t.add(href)


# main start
if __name__ == '__main__':
    url = 'http://blog.csdn.net/'     # target site
    t.add(url)
    while len(t) != 0:
        uu = t.pop()
        visited.add(uu)
        print(uu)
        try:
            fetch(uu)
        except Exception as e:        # urlopen raises on 403/404 and network errors
            print(e)
</span>
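
The crawl loop in the script can be tried without touching the network if the page download is passed in as a function. Below is a minimal Python 3 sketch under that assumption. The names `extract_links`, `crawl`, `get_page`, and `max_pages` are hypothetical and do not appear in the original script, and the regular expression is a simplified variant that only matches double-quoted `href` values beginning with `http://` or `https://`.

```python
import re
from collections import deque

# Simplified pattern (an assumption, not the script's own): double-quoted
# href values of <a> tags that start with http:// or https://.
LINK_RE = re.compile(r'<a\s+(?:[^>]*?\s+)?href="(https?://[^"]*)"', re.IGNORECASE)


def extract_links(html):
    """Return every absolute link found in an HTML string, in document order."""
    return LINK_RE.findall(html)


def crawl(start_url, get_page, max_pages=10):
    """Breadth-first crawl; get_page(url) is a caller-supplied function
    that returns the page's HTML as a string."""
    queue = deque([start_url])   # URLs waiting to be fetched
    seen = {start_url}           # every URL ever queued, to avoid repeats
    order = []                   # URLs in the order they were fetched
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in extract_links(get_page(url)):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Usage with an in-memory "site" (hypothetical data) instead of real HTTP.
pages = {
    'http://a.example/': '<a href="http://b.example/">b</a> '
                         '<a class="x" href="http://c.example/">c</a>',
    'http://b.example/': '<a href="http://a.example/">back</a>',
    'http://c.example/': '<p>no links</p>',
}
visited_order = crawl('http://a.example/', lambda u: pages.get(u, ''))
print(visited_order)
```

Keeping a separate `seen` set is what stops the crawler from queuing the same page again when two pages link to each other, and `max_pages` gives the loop a guaranteed stopping point on a real site.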

If the User-Agent header is not set, some sites refuse the request and respond with 403 (Forbidden).
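
As a hedged illustration of this point, the sketch below builds a request with an explicit User-Agent and returns a 403 as an ordinary status code instead of letting the exception end the program. It assumes Python 3, where `urllib` otherwise identifies itself with a `Python-urllib/x.y` User-Agent. The names `build_request`, `fetch_status`, and `DEFAULT_UA` are hypothetical and chosen only for this example.

```python
import urllib.error
import urllib.request

# Browser-like value taken from the script; any realistic string can be used.
DEFAULT_UA = ('Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36')


def build_request(url, user_agent=DEFAULT_UA):
    """Create a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, including error codes such as 403."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses; err.code is the status.
        return err.code


req = build_request('http://blog.csdn.net/')
# Request stores header names in capitalized form, hence 'User-agent'.
print(req.get_header('User-agent'))
```

Calling `fetch_status(url)` performs a real network request, so whether a particular site answers 200 or 403 depends on that site's own rules.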