Only kids make choices; I'll take both PHP and Python.
1.1 A simple introduction to spiders (web crawlers)
We'll dissect how a crawler works through a simple example, along with a few of its characteristics and the points to watch out for while learning. The goal is to extrapolate from one case, expand on the knowledge points, and build a knowledge structure and mental model.
The example here scrapes div data from a certain website; it's quite fun.
Bilibili video: https://www.bilibili.com/video/BV1J7411i7NY/?p=10&t=42
2 Requirements analysis for the crawler
3 Fuzz code
4 Code functionality
- 0x01 Scrape all the content with a single greedy pattern, no fine-grained matching
from urllib.request import urlopen as uo
import re
url = "http://quotes.toscrape.com/page/1/"
response = uo(url)
html_content = response.read().decode("UTF-8")
pattern = '<span class="text" itemprop="text">(.*)</span>'
quotes10_span = re.findall(pattern, html_content)  # re.findall returns every match as a list of strings
print(quotes10_span)
- RUNNING
Although I can't read English, it does look like it returned all of the quotes:
So here's the question: why did it only match the itemprop span?
There's another line right below it. Why didn't the author's name get matched? It's actually simple: that text lives in a different tag. Damn.
- 0x01 Scrape the first page's author names and quotes
# -*- encoding:utf-8 -*-
from urllib.request import urlopen as uo
import re
url = "http://quotes.toscrape.com/page/1/"
response = uo(url)
html_content = response.read().decode("UTF-8")
pattern = '<span class="text" itemprop="text">(.*?)</span>'
quotes10_span = re.findall(pattern, html_content)  # re.findall returns every match as a list of strings
authors_10 = re.findall('<small class="author" itemprop="author">(.*)</small>', html_content)  # match the authors
# print(quotes10_span)
print(len(quotes10_span))
print(quotes10_span[0])
print(len(authors_10))
print(authors_10[0])
PS: I just wasted ten damn minutes because I misspelled it as "auther"...
running:
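Since `re.findall` keeps document order, the i-th quote lines up with the i-th author, so the two lists can be paired with `zip`. A minimal sketch, using a small made-up snippet in place of the downloaded `html_content` (the quote texts here are placeholders, not real data from the site):

```python
import re

# Hypothetical two-quote snippet standing in for the fetched page
html_content = (
    '<span class="text" itemprop="text">First quote</span>\n'
    '<small class="author" itemprop="author">Albert Einstein</small>\n'
    '<span class="text" itemprop="text">Second quote</span>\n'
    '<small class="author" itemprop="author">Jane Austen</small>\n'
)

quotes = re.findall('<span class="text" itemprop="text">(.*?)</span>', html_content)
authors = re.findall('<small class="author" itemprop="author">(.*)</small>', html_content)

# zip pairs item i of one list with item i of the other
for quote, author in zip(quotes, authors):
    print(author, '->', quote)
```

This only works because both patterns find exactly one match per quote block; if one pattern missed an entry, everything after it would pair up wrong.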
- 0x02 Scrape all of the tags
Code:
from urllib.request import urlopen as uo
import re
url = "http://quotes.toscrape.com/page/1/"
response = uo(url)
html_content = response.read().decode("UTF-8")
pattern = '<span class="text" itemprop="text">(.*?)</span>'
quotes10_span = re.findall(pattern, html_content)  # re.findall returns every match as a list of strings
authors_10 = re.findall('<small class="author" itemprop="author">(.*)</small>', html_content)  # match the authors
tags = re.findall('<a class="tag" href=".*">(.*)</a>', html_content)
print(tags[0])
running:
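To see why a page-wide tag search is a dead end, here is a toy sketch (made-up snippet, not real page data) where two quotes have two tags and one tag respectively. The page-wide `findall` flattens them into one list, and the count alone can't tell us where one quote's tags end and the next begins:

```python
import re

# Hypothetical snippet: quote 1 has tags love+life, quote 2 has tag humor
snippet = (
    '<div class="tags"><a class="tag" href="/tag/love/">love</a>'
    '<a class="tag" href="/tag/life/">life</a></div>'
    '<div class="tags"><a class="tag" href="/tag/humor/">humor</a></div>'
)

# One flat list of three tags; the quote boundaries are gone
all_tags = re.findall('<a class="tag" href=".*?">(.*?)</a>', snippet)
print(all_tags)  # ['love', 'life', 'humor']
```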
- 0x03 There's actually a problem here: these tags can't be traced back to the quote they belong to. So let's change the approach:
First grab every tags div, then enumerate its contents and group them.
from urllib.request import urlopen as uo
import re
url = "http://quotes.toscrape.com/page/1/"
response = uo(url)
html_content = response.read().decode("UTF-8")
pattern = '<span class="text" itemprop="text">(.*?)</span>'
quotes10_span = re.findall(pattern, html_content)  # re.findall returns every match as a list of strings
authors_10 = re.findall('<small class="author" itemprop="author">(.*)</small>', html_content)  # match the authors
# tags = re.findall('<a class="tag" href=".*">(.*)</a>', html_content)
div_10 = re.findall('<div class="tags">.*</div>', html_content, re.RegexFlag.DOTALL)
for div in div_10:
    print(div)
print(len(div_10))
Why is len(div_10) only 1?
Greedy matching.
a* can match any of [aaa, aaaa, aaaaaaaaaa, '']; here a* means "a" repeated zero or more times, and a greedy quantifier always takes the longest match it can. So the pattern '<div class="tags">.*</div>' stretches from the first opening div all the way to the last </div> on the page, producing a single giant match.
So to match each div separately we have to use the lazy quantifier .*?
There's also a special flag, re.RegexFlag.DOTALL, which makes . match newlines too (normally . stops at a line break).
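The greedy/lazy difference and the DOTALL flag can both be seen on a two-line toy string (made-up data, same div structure as the page):

```python
import re

html = '<div class="tags">love life</div>\n<div class="tags">humor</div>'

# Greedy .* runs on to the LAST </div>, swallowing both divs into one match
greedy = re.findall('<div class="tags">.*</div>', html, re.DOTALL)

# Lazy .*? stops at the FIRST </div> it reaches, so each div matches separately
lazy = re.findall('<div class="tags">.*?</div>', html, re.DOTALL)

# Without DOTALL, . cannot cross the newline, so even greedy .* stays on one line
no_dotall = re.findall('<div class="tags">.*</div>', html)

print(len(greedy))     # 1
print(len(lazy))       # 2
print(len(no_dotall))  # 2
```

On the real page this matters because each tags div spans several lines: we need DOTALL so `.` can cross line breaks, and `.*?` so the match doesn't run past the first closing `</div>`.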
Now let's add a second loop: iterate inside each div and print out that quote's tags.
from urllib.request import urlopen as uo
import re
url = "http://quotes.toscrape.com/page/1/"
response = uo(url)
html_content = response.read().decode("UTF-8")
pattern = '<span class="text" itemprop="text">(.*?)</span>'
quotes10_span = re.findall(pattern, html_content)  # re.findall returns every match as a list of strings
authors_10 = re.findall('<small class="author" itemprop="author">(.*)</small>', html_content)  # match the authors
# tags = re.findall('<a class="tag" href=".*">(.*)</a>', html_content)
div_10 = re.findall('<div class="tags">.*?</div>', html_content, re.RegexFlag.DOTALL)  # lazy .*? so each div is a separate match
for div in div_10:
    tags_each_quote = []  # holds all the tags for one quote
    a_tags = re.findall('<a class="tag" href=".*">(.*)</a>', div)
    for tag in a_tags:
        tags_each_quote.append(tag)  # append each new tag to the list
    print(tags_each_quote)
Output:
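Putting the pieces together, the quote list, author list, and per-div tag lists all come out in document order, so one `zip` can assemble a record per quote. A sketch on a small inline snippet standing in for the fetched page (the quote texts and tags below are placeholders, not real site data):

```python
import re

# Hypothetical snippet mirroring the page structure: two quotes with tags
html_content = '''
<span class="text" itemprop="text">Quote one</span>
<small class="author" itemprop="author">Author One</small>
<div class="tags">
<a class="tag" href="/tag/a/">a</a>
<a class="tag" href="/tag/b/">b</a>
</div>
<span class="text" itemprop="text">Quote two</span>
<small class="author" itemprop="author">Author Two</small>
<div class="tags">
<a class="tag" href="/tag/c/">c</a>
</div>
'''

quotes = re.findall('<span class="text" itemprop="text">(.*?)</span>', html_content)
authors = re.findall('<small class="author" itemprop="author">(.*)</small>', html_content)
divs = re.findall('<div class="tags">.*?</div>', html_content, re.DOTALL)

records = []
for quote, author, div in zip(quotes, authors, divs):
    tags = re.findall('<a class="tag" href=".*?">(.*?)</a>', div)
    records.append({"quote": quote, "author": author, "tags": tags})

print(records[0])
# {'quote': 'Quote one', 'author': 'Author One', 'tags': ['a', 'b']}
```

Regex scraping like this is fragile (it breaks the moment attribute order or whitespace changes), which is why real projects usually reach for an HTML parser instead; but for learning how matching works, it does the job.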