Several Techniques for Parsing Web Pages in a Crawler

binling

Article link: https://blog.csdn.net/binling/article/details/49279235

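Usually we only want specific data from a page. For example, when crawling all of someone's blog posts, we only care about the links and titles in the article-list section of the list pages.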

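Several techniques can be used to parse a web page:

1) Regular expression matching

2) Plain substring search, e.g. content.index(pattern, startIndex). The search normally takes a startIndex so that each match resumes where the previous one ended instead of starting from the beginning every time

3) SAX-based events

4) DOM + XPath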
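Technique 2 in the list above can be sketched roughly as follows. The helper names find_between and find_all_between are hypothetical and only illustrate the idea of carrying a start index forward between searches.

```python
def find_between(content, start, end, pos=0):
    """Return (text, next_pos) for the first span enclosed by the start and
    end markers at or after pos, or None if no such span exists."""
    begin = content.find(start, pos)
    if begin == -1:
        return None
    begin += len(start)
    stop = content.find(end, begin)
    if stop == -1:
        return None
    return content[begin:stop], stop + len(end)


def find_all_between(content, start, end):
    """Collect every span enclosed by the markers in a single left-to-right pass."""
    results, pos = [], 0
    while True:
        hit = find_between(content, start, end, pos)
        if hit is None:
            return results
        text, pos = hit  # resume after the previous match, never from 0
        results.append(text)


print(find_all_between('<b>one</b> x <b>two</b>', '<b>', '</b>'))
```

Because pos only moves forward, the page is scanned once no matter how many items it contains.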

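The data to be extracted falls into two kinds:

1) Data identified by its own pattern, such as links or email addresses. Regular expressions suit this case.

2) Data identified by its position. The data itself has nothing distinctive; what matters is where it appears. The other three techniques suit this case.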

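SAX events are the best of these: processing is streamed, so the whole page never has to be held in memory. The drawback is that some pages are not well formed, and SAX requires valid, well-formed XML.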
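A minimal sketch of the SAX approach, using Python's standard xml.sax module. The sample markup, the handler class LinkTitleHandler and the function extract_articles_sax are assumptions made for illustration; the real list page may be structured differently.

```python
import xml.sax

# Hypothetical, well-formed sample of an article-list fragment.
SAMPLE = b'''<div>
  <span class="link_title"><a href="/binling/article/details/1">First post</a></span>
  <span class="other"><a href="/ignored">Not an article</a></span>
  <span class="link_title"><a href="/binling/article/details/2">Second post</a></span>
</div>'''


class LinkTitleHandler(xml.sax.ContentHandler):
    """Collect (title, href) for every <a> inside <span class="link_title">."""

    def __init__(self):
        super().__init__()
        self.in_title_span = False
        self.current_href = None
        self.text = []
        self.articles = []

    def startElement(self, name, attrs):
        if name == 'span' and attrs.get('class') == 'link_title':
            self.in_title_span = True
        elif name == 'a' and self.in_title_span:
            self.current_href = attrs.get('href')
            self.text = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.current_href is not None:
            self.text.append(content)

    def endElement(self, name):
        if name == 'a' and self.current_href is not None:
            self.articles.append((''.join(self.text).strip(), self.current_href))
            self.current_href = None
        elif name == 'span':
            self.in_title_span = False


def extract_articles_sax(xml_bytes):
    handler = LinkTitleHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.articles


print(extract_articles_sax(SAMPLE))
```

For real streaming, xml.sax.parse accepts a file-like object instead of a byte string. Input that is not well formed, such as an unclosed <br>, raises xml.sax.SAXParseException, which is the drawback described above.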

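Substring search and regular expressions generally require reading the whole page into a string first; substring search is the simpler and more lightweight of the two.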
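For pattern-based data such as links and email addresses, a regular expression over the page string might look like the sketch below. Both patterns are deliberately simplified assumptions (the email pattern in particular is far from a full address grammar), and the sample string is made up.

```python
import re

# Simplified patterns for illustration only.
HREF_RE = re.compile(r'href="([^"]+)"')
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')


def find_links(html):
    """Return the value of every double-quoted href attribute."""
    return HREF_RE.findall(html)


def find_emails(html):
    """Return every substring that looks like an email address."""
    return EMAIL_RE.findall(html)


sample = ('<a href="/binling/article/details/1">Post one</a> '
          '<a href="/binling/article/details/2">Post two</a> '
          'Contact: someone@example.com.')
print(find_links(sample))
print(find_emails(sample))
```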

DOM + XPath is overkill for this job.
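For comparison, a tree-plus-XPath version can be sketched with the standard library's xml.etree.ElementTree. Note the assumptions: ElementTree is a tree API rather than a W3C DOM, it supports only a subset of XPath, and the sample markup and the function name extract_articles_xpath are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample of an article-list fragment.
SAMPLE = '''<div>
  <span class="link_title"><a href="/binling/article/details/1">First post</a></span>
  <span class="other"><a href="/ignored">Not an article</a></span>
  <span class="link_title"><a href="/binling/article/details/2">Second post</a></span>
</div>'''


def extract_articles_xpath(markup):
    # The whole document is parsed into a tree before any query runs.
    root = ET.fromstring(markup)
    anchors = root.findall(".//span[@class='link_title']/a")
    return [((a.text or '').strip(), a.get('href')) for a in anchors]


print(extract_articles_xpath(SAMPLE))
```

The query itself is a single line, but the entire page has to be parsed and held in memory as a tree first, and the input must again be well formed.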

Example: download all of my own blog posts from CSDN:

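# Ported to Python 3 (urllib2 was split into urllib.request and urllib.error).
from urllib.request import Request, urlopen
from urllib.error import URLError

headers = {'User-agent': 'Mozilla 5.10'}

# Phase 1: walk the list pages and collect (title, link) pairs.
page, articleList, visited, startOver = 1, [], set(), False
while not startOver:
    req = Request('http://blog.csdn.net/binling/article/list/' + str(page), headers=headers)
    try:
        content = urlopen(req).read().decode('utf-8')
    except URLError:
        break
    pos, found = 0, False
    while True:
        try:
            # Each search resumes at pos, so the page is scanned only once.
            pos = content.index('link_title', pos)
            pos = content.index('href', pos)
            pos = content.index('"', pos)
            end = content.index('"', pos + 1)
            link = content[pos + 1:end].strip()
            if link in visited:
                # A link seen before: treat it as having run past the last list page.
                startOver = True
                break
            pos = content.index('>', end)
            end = content.index('</a>', pos)
            title = content[pos + 1:end].strip()
            articleList.append((title, link))
            visited.add(link)
            found = True
        except ValueError:
            break  # no more 'link_title' markers on this page
    if not found:
        break  # nothing new on this page: stop instead of requesting pages forever
    page += 1

# Phase 2: download each article and save it under a file-name-safe title.
home = 'C:\\Personal\\CSDN'
for title, link in articleList:
    for c in '/\\*:<>?"|':  # characters not allowed in Windows file names
        title = title.replace(c, ' ')
    content = urlopen(Request('http://blog.csdn.net' + link, headers=headers)).read()
    with open(home + '\\' + title + '.html', 'wb') as f:  # write the raw bytes unchanged
        f.write(content)
    print(title)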