Web scraping: steps for scraping the Maoyan Movies Top 100 and writing the results to a file as JSON

无比性感的程序媛

Article link: https://blog.csdn.net/panjunxiao/article/details/101438460

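Project requirements: preview the site to understand what is needed, determine whether the request uses POST or GET, identify the base URL, and analyze how the URL changes from page to page.
URL: http://maoyan.com/board (the Top 100 board on that page). Fields to collect: movie title, actors, release time, and so on.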
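As a rough sketch of the URL analysis, the snippet below builds the address of every page. It assumes the Top 100 board is paginated through an `offset` query parameter that advances by 10 per page, which is the pattern the full code relies on; `build_page_urls` is a hypothetical helper introduced only for illustration.

```python
# Assumption: pages differ only in the `offset` query parameter,
# which grows by 10 per page (0, 10, ..., 90).
BASE_URL = 'https://maoyan.com/board/4?offset=%s'


def build_page_urls(base_url=BASE_URL, pages=10, page_size=10):
    """Return the URL of every page of the board (hypothetical helper)."""
    return [base_url % (page * page_size) for page in range(pages)]


page_urls = build_page_urls()
print(page_urls[0])   # first page, offset 0
print(page_urls[-1])  # last page, offset 90
```

Only the offset changes between pages, so a plain GET request per generated URL should be enough to walk the whole board.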
Use regular expressions to extract the data:
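1. Narrow the scope and extract only the useful part first, for example the <dl> container that holds the data.
2. Then extract the <dd> elements from inside the <dl>.
3. Check whether there is one <dd> or several; if there are several, iterate over them to reach the next level.
4. There are several <dd> elements, so iterate over them and extract the next level of data from each one.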
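The steps above can be sketched on a small piece of markup. `SAMPLE_HTML` is made-up sample data shaped like the structure described here; the real page is larger and may differ.

```python
import re

# Hypothetical sample markup, not a copy of the real page.
SAMPLE_HTML = '''
<div class="main">
<dl class="board-wrapper">
    <dd>
        <a href="/films/1" title="Movie A" class="image-link"></a>
        <p class="star">Starring: Actor One</p>
    </dd>
    <dd>
        <a href="/films/2" title="Movie B" class="image-link"></a>
        <p class="star">Starring: Actor Two</p>
    </dd>
</dl>
</div>
'''

# Step 1: narrow the scope to the <dl> container.
dl_match = re.search(r'<dl class="board-wrapper">.*?</dl>', SAMPLE_HTML, re.S)
dl_content = dl_match.group() if dl_match else ''

# Steps 2-3: pull every <dd> out of the container; findall returns a list.
dd_blocks = re.findall(r'<dd>.*?</dd>', dl_content, re.S)

# Step 4: iterate over the <dd> blocks and extract the next level.
titles = []
for dd in dd_blocks:
    m = re.search(r'title="(.*?)" class="image-link"', dd, re.S)
    titles.append(m.group(1) if m else '')

print(titles)
```

Working from the outer container inward keeps each pattern short and reduces the chance of matching unrelated parts of the page.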
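Write the result to a file as JSON. JSON can store list-and-dict content directly, whereas a txt file can only hold strings, so the data would have to be iterated over and converted before it could be saved.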

def write_jaon(self, text, filename):
    # Write the data to a file as JSON; ensure_ascii=False keeps Chinese text readable
    with open(filename, 'w', encoding='utf-8') as s:
        json.dump(text, s, ensure_ascii=False)

# Read the JSON file back
with open('maoyan2.json', 'r', encoding='utf-8') as js:
    info = json.load(js)
print(info, 'oo')
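To illustrate the difference between JSON and plain text, here is a minimal round-trip sketch. The records and the temporary file name are made up for the example and are not taken from the real site.

```python
import json
import os
import tempfile

# Made-up sample records.
movies = [
    {'title': 'Movie A', 'score': '9.5'},
    {'title': 'Movie B', 'score': '9.0'},
]

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'sample.json')

with open(path, 'w', encoding='utf-8') as f:
    json.dump(movies, f, ensure_ascii=False)

with open(path, 'r', encoding='utf-8') as f:
    loaded = json.load(f)

# The structure survives the round trip: still a list of dicts.
print(type(loaded), loaded == movies)

# A text file only accepts strings, so the list would first have to be
# converted, and what comes back is a string rather than a list.
as_text = str(movies)
print(type(as_text))
```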

Full code

import json
import re

import requests


class Maoyan:
    def __init__(self, url):
        self.url = url
        self.paqu()

    def get_data(self, data):
        # Return the first captured group, or '' when the pattern did not match
        if data:
            return data.group(1)
        return ''

    def write_jaon(self, text, filename):
        # Write the data to a file as JSON; ensure_ascii=False keeps Chinese text readable
        with open(filename, 'w', encoding='utf-8') as s:
            json.dump(text, s, ensure_ascii=False)

    def paqu(self):
        move_list = []
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
        }
        # The <dl> tag that wraps all the data on a page
        dl_p = re.compile(r'<dl class="board-wrapper">.*?</dl>', re.S)
        # Every <dd> inside the <dl>; the <dd> elements hold one page's worth of data
        dd_p = re.compile(r'<dd>.*?</dd>', re.S)
        # Fetch the HTML of the 10 pages, take the <dl> from each page, then all of its <dd> elements
        for i in range(10):
            response = requests.get(self.url % (i * 10), headers=headers)
            # search returns a Match object (or None), so group() is needed to get a string
            dl_match = dl_p.search(response.text)
            if not dl_match:
                continue  # skip a page that does not contain the expected <dl>
            dl_content = dl_match.group()
            # Equivalent alternative: findall returns a list, so [0] gives the string
            # dl_content = dl_p.findall(response.text)[0]
            dd_content = dd_p.findall(dl_content)  # all <dd> elements
            for dd in dd_content:  # handle one <dd> (one movie) at a time
                # Movie title
                title = self.get_data(re.search(r'title="(.*?)" class="image-link"', dd, re.S))
                # Actors
                actor = self.get_data(re.search(r'<p class="star">(.*?)</p>', dd, re.S))
                # Release time
                publish_time = self.get_data(re.search(r'<p class="releasetime">(.*?)</p>', dd, re.S))
                # Score: integer part plus fractional part
                score = re.search(r'<p class="score"><i class="integer">(.*?)</i><i class="fraction">(.*?)</i></p>', dd,
                                  re.S)
                if score:
                    score = score.group(1) + score.group(2)
                # Build a new dict per movie; reusing a single dict would make
                # every list entry refer to the same (last) movie
                dic = {
                    'title': title,
                    'actor': actor.strip(),
                    'publish_time': publish_time,
                    'score': score,
                }
                move_list.append(dic)
        # Write once, after all pages have been collected
        self.write_jaon(move_list, 'maoyan2.json')


if __name__ == '__main__':
    filename = 'maoyan2.json'
    base_url = 'https://maoyan.com/board/4?offset=%s'
    Maoyan(base_url)

    # Read the JSON file back
    with open(filename, 'r', encoding='utf-8') as js:
        info = json.load(js)
    # print(info, 'oo')

Points to note:
dl_content = dl_p.search(response.text).group()  # search returns a Match object, so group() is needed to get a string
dl_content = dl_p.findall(response.text)[0]  # findall returns a list, so [0] is needed to get a string
dd_p = re.compile(r'<dd>.*?</dd>', re.S)
dd_content = dd_p.findall(dl_content)  # the content being searched must be a single string
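A small sketch of these points, using a made-up one-line HTML string rather than the real page. The `if ... else ''` guards are an addition for the case where nothing matches; they are not part of the code above.

```python
import re

# Made-up sample markup.
html = '<dl class="board-wrapper"><dd>first</dd><dd>second</dd></dl>'
dl_p = re.compile(r'<dl class="board-wrapper">.*?</dl>', re.S)

match = dl_p.search(html)       # a Match object, or None when nothing matches
via_search = match.group() if match else ''

found = dl_p.findall(html)      # a list of strings, possibly empty
via_findall = found[0] if found else ''

dd_p = re.compile(r'<dd>.*?</dd>', re.S)
dd_content = dd_p.findall(via_search)   # the argument must be a str
print(via_search == via_findall, dd_content)
```

Both routes yield the same string here; `search` stops at the first match, while `findall` collects every match before the first one is picked.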
