Python Distributed Crawlers: Building a Search Engine with Scrapy, the Must-Learn Framework - 3: Crawler Basics

zb0567

Article link: https://blog.csdn.net/zb0567/article/details/104151215

scrapy vs requests+beautifulsoup

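1. requests and BeautifulSoup are libraries, whereas Scrapy is a framework. As a loose analogy, the first two are like jQuery, and Scrapy is like a full front-end framework.

2. requests and BeautifulSoup can both be used inside a Scrapy project.

3. Scrapy is built on Twisted, an asynchronous networking library, and performance is its biggest advantage.

In this series we will use requests but not BeautifulSoup.

4. Scrapy is easy to extend and ships with many built-in features.

5. Scrapy's built-in CSS and XPath selectors are fast and convenient, whereas BeautifulSoup is slow.

BeautifulSoup is written in pure Python, while Scrapy's selectors are built on lxml, which is implemented in C. The latter can therefore be much faster, by up to about a hundred times according to the original estimate.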

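Types of web pages

1. Static pages

2. Dynamic pages

3. Web services (REST APIs)

What crawlers can do

1. Search engines

2. Recommendation engines

3. Data samples for machine learning

4. Data analysis (such as financial data analysis), public opinion analysis, and so on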

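Regular expressions (pattern matching)

1. Why they are necessary

XPath and CSS selectors can only return a node's whole text, such as "1天前" ("1 day ago"). Extracting just the number from that text requires a regular expression.

2. Simple applications of regular expressions, with Python examples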
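To make the first point concrete, here is a minimal sketch. It assumes a selector has already returned the text "1天前", and the helper name `extract_days` is hypothetical, introduced only for illustration.

```python
import re


def extract_days(text):
    """Return the number in text such as "1天前" ("1 day ago"), or None."""
    # \d+ captures the digits; \s* tolerates stray whitespace around them.
    match = re.match(r"\s*(\d+)\s*天前", text)
    return int(match.group(1)) if match else None


print(extract_days("1天前"))
print(extract_days("刚刚"))  # "just now" contains no number
```

The selector locates the node, and the regular expression then isolates the part of the text that is actually needed.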

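Regular expressions

1. Special characters, used to check whether a string matches a pattern and to extract the important parts of a string

1) ^ $ * ? + {2} {2,} {2,5} |

2) [] [^] [a-z] .

3) \s \S \w \W

4) [\u4E00-\u9FA5] () \d

2. Simple applications of regular expressions, with Python examples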

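import re

# Sample strings used by the patterns below
line8 = "xxx出生于2001年6月1日"
line7 = "xxx出生于2001/6/1日"
line6 = "xxx出生于2001-6-1日"
line5 = "xxx出生于2001-06-01日"
line4 = "xxx出生于2001-06"
line3 = "study in 南京大学2009"
line2 = "你好"
line = "bobby123 "  # the trailing space is what \s matches in regex_str13
# Without a regex this would be an exact comparison: line == "bobby123"
# Patterns are raw strings so that \s, \d and \w reach the re module unchanged.
regex_str = r"^b.*3$"
# ^ anchors the start, . matches any character, * repeats the previous item
# zero or more times, and 3$ requires the string to end with 3
regex_str2 = r".*(b.*b).*"
regex_str3 = r".*?(b.*b).*"
regex_str4 = r".*?(b.*?b).*"
# ? after a quantifier makes it lazy: it matches as few characters as possible
regex_str5 = r".*(b.+b).*"
regex_str6 = r".*(b.{1}b).*"
regex_str7 = r".*(b.{2,}b).*"
regex_str8 = r".*(b.{2,5}b).*"
regex_str9 = r"(bobby|bobby123)"
regex_str10 = r"((bobby|boobby)123)"
regex_str11 = r"([abcd]obby123)"
regex_str12 = r"([a-z]{5}[123]{2}[^1])"
regex_str13 = r"(.*\s)"
regex_str14 = r"(\S+)"
# r"(\S)" would capture a single character
regex_str15 = r"(\w)"
# For ASCII text \w is equivalent to [A-Za-z0-9_]; \W matches whatever \w does not
regex_str16 = r"([\u4E00-\u9FA5]+)"
regex_str17 = r".*?([\u4E00-\u9FA5]+大学)"
regex_str18 = r".*(\d{4})"
regex_str19 = r".*出生于(\d{4}[年/-]\d{1,2}([月/-]\d{1,2}|$))"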

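# The trailing space in line would defeat 3$, so strip it before matching.
if re.match(regex_str, line.strip()):
    print("yes")
# Greedy prefix: the leading .* consumes as much as it can, so the group is
# the last possible "b...b". Result: bb
match_obj2 = re.match(regex_str2, line)
if match_obj2:
    print('2' + match_obj2.group(1))
# Lazy prefix: .*? stops at the first b, while the greedy .* inside the
# group runs to the last b. Result: bobb
match_obj3 = re.match(regex_str3, line)
if match_obj3:
    print('3' + match_obj3.group(1))
# Lazy prefix and lazy group: starts at the first b and ends at the nearest
# following b. Result: bob
match_obj4 = re.match(regex_str4, line)
if match_obj4:
    print('4' + match_obj4.group(1))
# + means one or more repetitions of any character. Result: bobb
match_obj5 = re.match(regex_str5, line)
if match_obj5:
    print('5' + match_obj5.group(1))
# {1} means exactly one repetition. Result: bob
match_obj6 = re.match(regex_str6, line)
if match_obj6:
    print('6' + match_obj6.group(1))
# {2,} means two or more repetitions. Result: bobb
match_obj7 = re.match(regex_str7, line)
if match_obj7:
    print('7' + match_obj7.group(1))
# {2,5} means between two and five repetitions. Result: bobb
match_obj8 = re.match(regex_str8, line)
if match_obj8:
    print('8' + match_obj8.group(1))
# | means "or"; the first alternative that matches wins. Result: bobby
match_obj9 = re.match(regex_str9, line)
if match_obj9:
    print('9' + match_obj9.group(1))
# Nested groups are numbered by their opening parenthesis: group(1) is
# bobby123 and group(2) is bobby
match_obj10 = re.match(regex_str10, line)
if match_obj10:
    print('10' + match_obj10.group(1))
# [abcd] matches any one of the listed characters. Result: bobby123
match_obj11 = re.match(regex_str11, line)
if match_obj11:
    print('11' + match_obj11.group(1))
# [a-z] and [0-9] are ranges and [^1] means "any character except 1";
# inside brackets, . and * are literal characters. Result: bobby123
match_obj12 = re.match(regex_str12, line)
if match_obj12:
    print('12' + match_obj12.group(1))
# \s matches whitespace. Result: "bobby123 " including the trailing space
match_obj13 = re.match(regex_str13, line)
if match_obj13:
    print('13' + match_obj13.group(1))
# \S matches any non-whitespace character; \S+ gives bobby123, whereas a
# single \S would give only b
match_obj14 = re.match(regex_str14, line)
if match_obj14:
    print('14' + match_obj14.group(1))
# \w matches one word character ([A-Za-z0-9_] for ASCII text); \W is its
# opposite. Result: b
match_obj15 = re.match(regex_str15, line)
if match_obj15:
    print('15' + match_obj15.group(1))
# Captures a run of consecutive Chinese characters and stops at the first
# character that is not Chinese. re.match anchors at the start of the string,
# so prefix the pattern with .* if the text does not begin with Chinese.
match_obj16 = re.match(regex_str16, line2)
if match_obj16:
    print('16' + match_obj16.group(1))
# The lazy prefix .*? keeps the Chinese run intact: 南京大学 (a greedy .*
# would leave only 京大学)
match_obj17 = re.match(regex_str17, line3)
if match_obj17:
    print('17' + match_obj17.group(1))
# Extracts a four-digit number. Result: 2009
match_obj18 = re.match(regex_str18, line3)
if match_obj18:
    print('18' + match_obj18.group(1))
# Combined pattern that handles the date formats in line4 to line8
match_obj19 = re.match(regex_str19, line8)
if match_obj19:
    print('19' + match_obj19.group(1))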
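
The listing above runs the last pattern against line8 only. As a small sketch, the same pattern can be checked against all five date formats from line4 to line8. The helper name `extract_birth_date` is hypothetical, and the pattern is repeated here so the example is self-contained.

```python
import re

# Same pattern as regex_str19, written as a raw string.
DATE_PATTERN = r".*出生于(\d{4}[年/-]\d{1,2}([月/-]\d{1,2}|$))"

samples = [
    "xxx出生于2001年6月1日",
    "xxx出生于2001/6/1日",
    "xxx出生于2001-6-1日",
    "xxx出生于2001-06-01日",
    "xxx出生于2001-06",
]


def extract_birth_date(text):
    """Return the date portion captured by group 1, or None if nothing matches."""
    match = re.match(DATE_PATTERN, text)
    return match.group(1) if match else None


for sample in samples:
    print(sample, "->", extract_birth_date(sample))
```

The alternative `|$` in the inner group is what lets the year-and-month form ("2001-06") match without a day.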

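Depth-first and breadth-first traversal

1. The tree structure of a website

2. The depth-first algorithm and its implementation

3. The breadth-first algorithm and its implementation

Taking Jobbole (伯乐在线) as an example, a site's URLs can be organized by level into a tree. Because pages link to one another, keep a list of visited URLs to avoid crawling the same page repeatedly and looping forever.

Depth-first and breadth-first

Depth-first traversal, the default, is implemented with recursion.

Breadth-first traversal is implemented with a queue.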

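from collections import deque

# Both functions assume nodes with elem, lchild and rchild attributes.


def depth_tree(tree_node):
    """Depth-first (pre-order) traversal of a binary tree using recursion."""
    if tree_node is not None:
        print(tree_node.elem)
        # Recurse without returning early, so the right subtree is still
        # visited after the left subtree has been fully explored.
        if tree_node.lchild is not None:
            depth_tree(tree_node.lchild)
        if tree_node.rchild is not None:
            depth_tree(tree_node.rchild)


def level_queue(root):
    """Breadth-first traversal of a binary tree using a queue."""
    if root is None:
        return
    my_queue = deque([root])
    while my_queue:
        node = my_queue.popleft()  # O(1), unlike list.pop(0)
        print(node.elem)
        if node.lchild is not None:
            my_queue.append(node.lchild)
        if node.rchild is not None:
            my_queue.append(node.rchild)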
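
The tree traversals above can be carried over to a site whose pages link back to one another. The following is a minimal sketch under stated assumptions: `LINKS` is a made-up link graph standing in for real pages, and the function names are hypothetical. A set of seen URLs is what prevents the endless loop mentioned earlier.

```python
from collections import deque

# Hypothetical link graph: each URL maps to the URLs found on that page.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1", "/"],   # links back to the home page (a cycle)
    "/a/1": [],
    "/a/2": ["/a"],        # links back to its parent
    "/b/1": [],
}


def crawl_depth_first(url, seen=None, order=None):
    """Visit pages depth-first with recursion, skipping URLs already seen."""
    if seen is None:
        seen, order = set(), []
    if url in seen:
        return order
    seen.add(url)
    order.append(url)
    for link in LINKS.get(url, []):
        crawl_depth_first(link, seen, order)
    return order


def crawl_breadth_first(start):
    """Visit pages level by level with a queue, skipping URLs already seen."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


print(crawl_depth_first("/"))
print(crawl_breadth_first("/"))
```

Depth-first finishes everything under "/a" before touching "/b", while breadth-first visits both second-level pages before any third-level page.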

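Crawler deduplication strategies

1. Save visited URLs in a database and check each new URL against it.

2. Save visited URLs in a set, so that checking a URL costs only O(1).

100,000,000 × 2 bytes × 50 characters / 1024 / 1024 / 1024 ≈ 9 GB

That is 100 million URLs at 50 characters each and 2 bytes per character.

3. Hash each URL with MD5 or a similar function and save the digest in a set. One URL then takes 16 bytes (128 bits), roughly a 6x reduction compared with 100 bytes per URL.

4. Use a bitmap: a hash function maps each visited URL to a single bit.

One byte holds 8 bits and each bit represents one URL, which compresses the data much further. However, a hash function can map several URLs to the same bit, which causes collisions.

100,000,000 bits / 8 / 1024 / 1024 ≈ 12 MB

5. Use a Bloom filter, which improves on the bitmap by using multiple hash functions to reduce collisions.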
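
The memory estimates in the list above can be reproduced with a few lines of arithmetic. This is only a sketch of the raw payload sizes under the same assumptions (100 million URLs, 50 characters each, 2 bytes per character); it ignores the per-object overhead of a real Python set, and the constant names are illustrative.

```python
import hashlib

URL_COUNT = 100_000_000
AVG_URL_CHARS = 50
BYTES_PER_CHAR = 2  # assumption carried over from the estimate above

GIB = 1024 ** 3
MIB = 1024 ** 2

# Approach 2: store every URL string as is.
raw_set_bytes = URL_COUNT * AVG_URL_CHARS * BYTES_PER_CHAR
# Approach 3: store a 16-byte MD5 digest per URL.
md5_set_bytes = URL_COUNT * hashlib.md5().digest_size
# Approach 4: one bit per URL.
bitmap_bytes = URL_COUNT // 8

print(f"raw URLs:    {raw_set_bytes / GIB:.2f} GB")
print(f"MD5 digests: {md5_set_bytes / GIB:.2f} GB")
print(f"bitmap:      {bitmap_bytes / MIB:.2f} MB")
```

The ratio between the first two figures is 100 / 16, which is where the roughly sixfold saving of hashing comes from.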

Scrapy uses the third approach, but we can use the fifth instead.
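
As an illustration of the fifth approach, here is a toy Bloom filter. It is a sketch and not Scrapy's implementation: the class, its default sizes, and the way the hash positions are derived (salting MD5 with an index) are all assumptions chosen for brevity, and the URLs are placeholders.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: each item sets hash_count bits in a bit array."""

    def __init__(self, size_bits=1 << 20, hash_count=5):
        self.size_bits = size_bits
        self.hash_count = hash_count
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting MD5 with the hash index.
        for i in range(self.hash_count):
            digest = hashlib.md5(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every position is set; a false positive is possible,
        # a false negative is not.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add("https://example.com/page/1")
print("https://example.com/page/1" in seen)
print("https://example.com/page/2" in seen)
```

A URL that was added is always reported as seen. A URL that was never added is reported as unseen unless all of its bit positions happen to collide with bits already set, which becomes more likely as the filter fills up, so the bit array must be sized for the expected number of URLs.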

String encoding