Day12 -- Hands-On -- A Native Crawler
🧸 1. Analyze the crawling goal and determine the page to crawl
Goal: crawl the streamer popularity ranking.
🧸 2. Outline the general crawler workflow
# Clarify the goal
# Find the web page that holds the data
# Analyze the page structure to locate the tags containing the data
# Simulate an HTTP request to the server and obtain the HTML it returns
# Use regular expressions to extract the data we need (streamer names and popularity)
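The last two steps of this workflow can be sketched on a hard-coded HTML snippet, so the regex extraction is testable without a network request. The markup below is illustrative (made-up class names, not Huajiao's real page structure):

```python
import re

# Illustrative HTML snippet (hypothetical markup for demonstration only)
html = '''
<div class="anchor"><p class="name">Alice</p><span class="watches">12.5万</span></div>
<div class="anchor"><p class="name">Bob</p><span class="watches">9800</span></div>
'''

# ([\s\S]*?) is a non-greedy capture group that also matches newlines,
# which '.' alone would not without the re.S flag
blocks = re.findall(r'<div class="anchor">([\s\S]*?)</div>', html)
anchors = []
for block in blocks:
    name = re.findall(r'<p class="name">([\s\S]*?)</p>', block)[0]
    number = re.findall(r'<span class="watches">([\s\S]*?)</span>', block)[0]
    anchors.append({'name': name, 'number': number})

print(anchors)
# [{'name': 'Alice', 'number': '12.5万'}, {'name': 'Bob', 'number': '9800'}]
```

The outer pattern narrows the search to one block per streamer first; the inner patterns then only have to match within that small block, which keeps the regexes simple.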
🧸 3. Data-extraction hierarchy and principles; regex parsing of the HTML to pull out names and viewer counts (data refinement, sorting with sorted)
from urllib import request
import re


class Spider():
    url = 'https://www.huajiao.com/category/1000'
    # Raw strings avoid invalid-escape warnings; ([\s\S]*?) is a
    # non-greedy capture group that also matches newlines
    root_pattern = r'<div class="username clearfix">([\s\S]*?)</div>'
    name_pattern = r'<p class="name fl">([\s\S]*?)</p>'
    number_pattern = r'<span class="watches fr">([\s\S]*?)</span>'

    def __fetch_content(self):  # __xxx marks a private method
        r = request.urlopen(Spider.url)
        # r.read() returns bytes
        htmls = r.read()
        htmls = str(htmls, encoding='utf-8')  # decode the bytes to text
        return htmls

    # Data analysis
    def __analysis(self, htmls):
        root_html = re.findall(Spider.root_pattern, htmls)
        anchors = []
        for html in root_html:
            name = re.findall(Spider.name_pattern, html)
            number = re.findall(Spider.number_pattern, html)
            anchor = {'name': name, 'number': number}
            anchors.append(anchor)
        return anchors

    # Data refinement
    def __refine(self, anchors):
        l = lambda anchor: {'name': anchor['name'][0].strip(),
                            'number': anchor['number'][0].strip()}
        return map(l, anchors)

    # Sorting
    def __sort(self, anchors):
        anchors = sorted(anchors, key=self.__sort_seed, reverse=True)
        return anchors

    # Sort key
    def __sort_seed(self, anchor):
        # Extract the number, keeping any decimal part ('12.5万' -> 12.5)
        r = re.findall(r'\d+\.?\d*', anchor['number'])
        number = float(r[0])
        # Handle the '万' (ten thousand) suffix
        if '万' in anchor['number']:
            number *= 10000
        return number

    # Display
    def __show(self, anchors):
        for rank in range(0, len(anchors)):
            print('rank' + str(rank + 1)
                  + ' : ' + anchors[rank]['name']
                  + ' : ' + anchors[rank]['number'])

    def go(self):  # entry point (main method)
        htmls = self.__fetch_content()
        anchors = self.__analysis(htmls)
        anchors = list(self.__refine(anchors))
        anchors = self.__sort(anchors)
        self.__show(anchors)


spider = Spider()
spider.go()
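The sort key is the subtle part: viewer counts are strings like '9800' or '12.5万', so they cannot be compared directly. The standalone helper below mirrors the __sort_seed logic (the function name popularity and the sample counts are made up for illustration) and shows the resulting order:

```python
import re

def popularity(number_text):
    # Pull out the leading number, keeping any decimal part ('12.5万' -> 12.5)
    value = float(re.findall(r'\d+\.?\d*', number_text)[0])
    # '万' means ten thousand, so scale the value accordingly
    if '万' in number_text:
        value *= 10000
    return value

counts = ['9800', '12.5万', '3万']
ranked = sorted(counts, key=popularity, reverse=True)
print(ranked)  # ['12.5万', '3万', '9800']
```

Note that a plain `\d*` pattern would match only '12' in '12.5万' and silently drop the fractional part; `\d+\.?\d*` keeps it, so '12.5万' scores 125000 rather than 120000.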