Only kids make choices; I'll take both PHP and Python.
1.1 A simple introduction to spiders (web crawlers)
We'll dissect how a crawler works through a simple example, along with a few of its characteristics and the points to watch out for while learning. The goal is to extrapolate from one case, expand on the knowledge points, and build a knowledge structure and mental model.
The example here scrapes div data from a certain website; it's quite fun.
Bilibili video: https://www.bilibili.com/video/BV1J7411i7NY/?p=10&t=42
2 Requirements analysis for the crawler
3 Fuzz code
4 Code functionality
- 0x01 Scrape all the content with a single greedy pattern, no fine-grained matching
from urllib.request import urlopen as uo
import re
url = "http://quotes.toscrape.com/page/1/"
response = uo(url)
html_content = response.read().decode("UTF-8")
pattern = '<span class="text" itemprop="text">(.*)</span>'
quotes10_span = re.findall(pattern, html_content)  # re.findall returns every match as a list of strings
print(quotes10_span)
- RUNNING
Although I can't read English, it does look like it returned all of the quotes:
So here's the question: why did it only match the itemprop span?
There's another line right below it. Why didn't the author's name get matched? It's actually simple: that text lives in a different tag. Damn.
- 0x01 Scrape the first page's author names and quotes
# -*- encoding:utf-8 -*-
from urllib.request import urlopen as uo
import re
url = "http://quotes.toscrape.com/page/1/"
response = uo(url)
html_content = response.read().decode("UTF-8")
pattern = '<span class="text" itemprop="text">(.*?)</span>'
quotes10_span = re.findall(pattern, html_content)  # re.findall returns every match as a list of strings
authors_10 = re.findall('<small class="author" itemprop="author">(.*)</small>', html_content)  # match the authors
# print(quotes10_span)
print(len(quotes10_span))
print(quotes10_span[0])
print(len(authors_10))
print(authors_10[0])
PS: I just wasted ten damn minutes because I misspelled it as "auther"...
running:
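Since `re.findall` keeps document order, the i-th quote lines up with the i-th author, so the two lists can be paired with `zip`. A minimal sketch, using a small made-up snippet in place of the downloaded `html_content` (the quote texts here are placeholders, not real data from the site):

```python
import re

# Hypothetical two-quote snippet standing in for the fetched page
html_content = (
    '<span class="text" itemprop="text">First quote</span>\n'
    '<small class="author" itemprop="author">Albert Einstein</small>\n'
    '<span class="text" itemprop="text">Second quote</span>\n'
    '<small class="author" itemprop="author">Jane Austen</small>\n'
)

quotes = re.findall('<span class="text" itemprop="text">(.*?)</span>', html_content)
authors = re.findall('<small class="author" itemprop="author">(.*)</small>', html_content)

# zip pairs item i of one list with item i of the other
for quote, author in zip(quotes, authors):
    print(author, '->', quote)
```

This only works because both patterns find exactly one match per quote block; if one pattern missed an entry, everything after it would pair up wrong.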
- 0x02 Scrape all of the tags
Code:
from urllib.request import urlopen as uo
import re
url = "http://quotes.toscrape.com/page/1/"
response = uo(url)
html_content = response.read().decode("UTF-8")
pattern = '<span class="text" itemprop="text">(.*?)</span>'
quotes10_span = re.findall(pattern, html_content)  # re.findall returns every match as a list of strings
authors_10 = re.findall('<small class="author" itemprop="author">(.*)</small>', html_content)  # match the authors
tags = re.findall('<a class="tag" href=".*">(.*)</a>', html_content)
print(tags[0])
running:
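To see why a page-wide tag search is a dead end, here is a toy sketch (made-up snippet, not real page data) where two quotes have two tags and one tag respectively. The page-wide `findall` flattens them into one list, and the count alone can't tell us where one quote's tags end and the next begins:

```python
import re

# Hypothetical snippet: quote 1 has tags love+life, quote 2 has tag humor
snippet = (
    '<div class="tags"><a class="tag" href="/tag/love/">love</a>'
    '<a class="tag" href="/tag/life/">life</a></div>'
    '<div class="tags"><a class="tag" href="/tag/humor/">humor</a></div>'
)

# One flat list of three tags; the quote boundaries are gone
all_tags = re.findall('<a class="tag" href=".*?">(.*?)</a>', snippet)
print(all_tags)  # ['love', 'life', 'humor']
```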
- 0x03 There's actually a problem here: these tags can't be traced back to the quote they belong to. So let's change the approach:
First grab every tags div, then enumerate its contents and group them.
from urllib.request import urlopen as uo
import re
url = "http://quotes.toscrape.com/page/1/"
response = uo(url)
html_content = response.read().decode("UTF-8")
pattern = '<span class="text" itemprop="text">(.*?)</span>'
quotes10_span = re.findall(pattern, html_content)  # re.findall returns every match as a list of strings
authors_10 = re.findall('<small class="author" itemprop="author">(.*)</small>', html_content)  # match the authors
# tags = re.findall('<a class="tag" href=".*">(.*)</a>', html_content)
div_10 = re.findall('<div class="tags">.*</div>', html_content, re.RegexFlag.DOTALL)
for div in div_10:
    print(div)
print(len(div_10))
Why is len(div_10) only 1?
Greedy matching.
a* can match any of [aaa, aaaa, aaaaaaaaaa, '']; here a* means "a" repeated zero or more times, and a greedy quantifier always takes the longest match it can. So the pattern '<div class="tags">.*</div>' stretches from the first opening div all the way to the last </div> on the page, producing a single giant match.
So to match each div separately we have to use the lazy quantifier .*?
There's also a special flag, re.RegexFlag.DOTALL, which makes . match newlines too (normally . stops at a line break).
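The greedy/lazy difference and the DOTALL flag can both be seen on a two-line toy string (made-up data, same div structure as the page):

```python
import re

html = '<div class="tags">love life</div>\n<div class="tags">humor</div>'

# Greedy .* runs on to the LAST </div>, swallowing both divs into one match
greedy = re.findall('<div class="tags">.*</div>', html, re.DOTALL)

# Lazy .*? stops at the FIRST </div> it reaches, so each div matches separately
lazy = re.findall('<div class="tags">.*?</div>', html, re.DOTALL)

# Without DOTALL, . cannot cross the newline, so even greedy .* stays on one line
no_dotall = re.findall('<div class="tags">.*</div>', html)

print(len(greedy))     # 1
print(len(lazy))       # 2
print(len(no_dotall))  # 2
```

On the real page this matters because each tags div spans several lines: we need DOTALL so `.` can cross line breaks, and `.*?` so the match doesn't run past the first closing `</div>`.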
Now let's add a second loop: iterate inside each div and print out that quote's tags.
from urllib.request import urlopen as uo
import re
url = "http://quotes.toscrape.com/page/1/"
response = uo(url)
html_content = response.read().decode("UTF-8")
pattern = '<span class="text" itemprop="text">(.*?)</span>'
quotes10_span = re.findall(pattern, html_content)  # re.findall returns every match as a list of strings
authors_10 = re.findall('<small class="author" itemprop="author">(.*)</small>', html_content)  # match the authors
# tags = re.findall('<a class="tag" href=".*">(.*)</a>', html_content)
div_10 = re.findall('<div class="tags">.*?</div>', html_content, re.RegexFlag.DOTALL)  # lazy .*? so each div is a separate match
for div in div_10:
    tags_each_quote = []  # holds all the tags for one quote
    a_tags = re.findall('<a class="tag" href=".*">(.*)</a>', div)
    for tag in a_tags:
        tags_each_quote.append(tag)  # append each new tag to the list
    print(tags_each_quote)
Output:
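Putting the pieces together, the quote list, author list, and per-div tag lists all come out in document order, so one `zip` can assemble a record per quote. A sketch on a small inline snippet standing in for the fetched page (the quote texts and tags below are placeholders, not real site data):

```python
import re

# Hypothetical snippet mirroring the page structure: two quotes with tags
html_content = '''
<span class="text" itemprop="text">Quote one</span>
<small class="author" itemprop="author">Author One</small>
<div class="tags">
<a class="tag" href="/tag/a/">a</a>
<a class="tag" href="/tag/b/">b</a>
</div>
<span class="text" itemprop="text">Quote two</span>
<small class="author" itemprop="author">Author Two</small>
<div class="tags">
<a class="tag" href="/tag/c/">c</a>
</div>
'''

quotes = re.findall('<span class="text" itemprop="text">(.*?)</span>', html_content)
authors = re.findall('<small class="author" itemprop="author">(.*)</small>', html_content)
divs = re.findall('<div class="tags">.*?</div>', html_content, re.DOTALL)

records = []
for quote, author, div in zip(quotes, authors, divs):
    tags = re.findall('<a class="tag" href=".*?">(.*?)</a>', div)
    records.append({"quote": quote, "author": author, "tags": tags})

print(records[0])
# {'quote': 'Quote one', 'author': 'Author One', 'tags': ['a', 'b']}
```

Regex scraping like this is fragile (it breaks the moment attribute order or whitespace changes), which is why real projects usually reach for an HTML parser instead; but for learning how matching works, it does the job.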