A simple spider written in Python

Latest recommended article published on 2023-06-30 23:18:49

maoxh


Category: Python. Tags: python, attributes, search, html

Article link: https://blog.csdn.net/maoxh/article/details/4450341

1. HTML parser: subclass SGMLParser and extract the body text (tag <p>) and the anchors (tag <a>) from an HTML page.

    # HTML parser (Python 2: sgmllib was removed in Python 3)
    import sgmllib

    class MyParser(sgmllib.SGMLParser):
        def __init__(self):
            sgmllib.SGMLParser.__init__(self)
            self._tagType = ''       # tag currently being read: 'p', 'a' or ''
            self._paragraph = []     # text found inside <p>
            self._anchorlist = {}    # pairs {address: data}
            self._current_href = ''  # href value while inside <a>

        def start_p(self, attributes):
            self._tagType = 'p'

        def end_p(self):
            self._tagType = ''

        def start_a(self, attributes):
            self._tagType = 'a'
            for name, value in attributes:
                if name == 'href':
                    self._current_href = value

        def end_a(self):
            self._tagType = ''
            self._current_href = ''

        def handle_data(self, data):
            if self._tagType == 'p':
                self._paragraph.append(data)
            if self._tagType == 'a':
                # several anchors may share one href: join their text with newlines
                if self._current_href in self._anchorlist:
                    self._anchorlist[self._current_href] += '\n' + data
                else:
                    self._anchorlist[self._current_href] = data
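Since sgmllib only exists in Python 2, here is a rough sketch of how the same parser might look in Python 3 on top of the standard library's html.parser.HTMLParser. The class name PageParser and its attribute names are hypothetical; the sketch is only meant to mirror the behavior of MyParser, not to replace it.

```python
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Hypothetical Python 3 counterpart of MyParser (an assumption, not the original code)."""

    def __init__(self):
        super().__init__()
        self.tag_type = ''      # tag currently being read: 'p', 'a' or ''
        self.paragraphs = []    # text chunks found inside <p>
        self.anchors = {}       # {href: anchor text}
        self.current_href = ''  # href of the <a> that is currently open

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.tag_type = 'p'
        elif tag == 'a':
            self.tag_type = 'a'
            for name, value in attrs:
                if name == 'href':
                    self.current_href = value or ''

    def handle_endtag(self, tag):
        if tag == 'p':
            self.tag_type = ''
        elif tag == 'a':
            self.tag_type = ''
            self.current_href = ''

    def handle_data(self, data):
        if self.tag_type == 'p':
            self.paragraphs.append(data)
        elif self.tag_type == 'a':
            # several anchors may share one href: join their text with newlines
            if self.current_href in self.anchors:
                self.anchors[self.current_href] += '\n' + data
            else:
                self.anchors[self.current_href] = data


parser = PageParser()
parser.feed('<p>hello spider</p><a href="/a">first</a><a href="/a">second</a>')
parser.close()
print(parser.paragraphs)
print(parser.anchors)
```

Like the original, this sketch tracks only one open tag at a time, so text that follows an <a> nested inside a <p> is not collected.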

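2. Spider: open the HTML page with urllib and extract the page content (body text and anchors) through MyParser. The results are then filtered by keyword: only entries that contain at least one keyword are kept, and everything else is discarded. This is handy for picking out just the information you want from a web page. If the keyword set is empty, everything is kept by default.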

    # spider (Python 2)
    import urllib

    class MySpider:
        # count how many keywords occur in text
        def search_keyword(self, text, keyword_set):
            match = 0
            for keyword in keyword_set:
                if text.find(keyword) != -1:
                    match += 1
            return match

        # fetch a page and return the paragraphs and anchors that match
        def search_page(self, page, keyword_set):
            p = MyParser()
            try:
                f = urllib.urlopen(page)
            except IOError:
                print "can't open the page. Pls check the address is correct"
                return ["can't open the page. Pls check the address is correct"]

            # feed the page to the parser in chunks
            BUFSIZE = 8192
            while True:
                data = f.read(BUFSIZE)
                if not data:
                    break
                p.feed(data)
            p.close()
            f.close()

            # an empty keyword set keeps everything
            result = []
            for text in p._paragraph:
                if len(keyword_set) == 0 or self.search_keyword(text, keyword_set):
                    result.append(text)
            for href in p._anchorlist:
                if len(keyword_set) == 0 or self.search_keyword(p._anchorlist[href], keyword_set):
                    result.append(href + '\n' + p._anchorlist[href])
            return result
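The keyword filter is independent of the download step, so it can be tried out on its own. Below is a minimal Python 3 sketch of that logic. The function names count_matches, filter_results and fetch are hypothetical, and the utf-8 default in fetch is an assumption (real pages declare their own encodings).

```python
import urllib.request


def count_matches(text, keywords):
    """Return how many of the keywords occur in text."""
    return sum(1 for keyword in keywords if keyword in text)


def filter_results(paragraphs, anchors, keywords):
    """Keep entries containing at least one keyword; keep everything if keywords is empty."""
    result = []
    for text in paragraphs:
        if not keywords or count_matches(text, keywords):
            result.append(text)
    for href, text in anchors.items():
        if not keywords or count_matches(text, keywords):
            result.append(href + '\n' + text)
    return result


def fetch(url, encoding='utf-8'):
    """Download a page as text (the utf-8 default is an assumption)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode(encoding, errors='replace')


# Sample data standing in for what the parser would extract from a page.
paragraphs = ['python spider notes', 'unrelated text']
anchors = {'/py': 'python docs', '/misc': 'other'}
print(filter_results(paragraphs, anchors, ['python']))
print(len(filter_results(paragraphs, anchors, [])))
```

Keeping the filter separate from fetch means it can be checked against hand-written data without any network access.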

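Limitations (to be improved):

1) Recursive crawling is not supported

2) Only the body text and anchor text of a page are extracted

3) The keyword search needs to be strengthened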
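As one possible direction for the first limitation, here is a small Python 3 sketch of a breadth-first crawl with a depth limit. It is not part of the original spider: the names LinkCollector and crawl, the max_depth parameter and the example.test URLs are all hypothetical. The fetch function is passed in as an argument so the sketch can run against an in-memory "site" instead of the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href values of <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_depth=1):
    """Breadth-first crawl; fetch(url) must return the page HTML as a string.

    Returns the list of successfully fetched URLs in visiting order.
    """
    visited = []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be fetched
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        collector = LinkCollector()
        collector.feed(html)
        collector.close()
        for link in collector.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return visited


# In-memory "site" so the sketch runs without network access.
pages = {
    'http://example.test/': '<a href="/a">A</a><a href="/b">B</a>',
    'http://example.test/a': '<a href="/">home</a><a href="/c">C</a>',
    'http://example.test/b': '<p>leaf</p>',
    'http://example.test/c': '<p>deep</p>',
}
print(crawl('http://example.test/', pages.__getitem__, max_depth=1))
```

The seen set keeps the crawl from looping when pages link back to each other, and max_depth bounds how far it wanders from the start page.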