Scraping the Maoyan Movie Rankings with Python

nuancolor

Categories: MongoDB, python

Article link: https://blog.csdn.net/duanyijiangzhi/article/details/94593807

This scraper uses the requests library to fetch pages and regular expressions (the re module) to parse them. A brief walkthrough follows.

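1. Project workflow

1. Fetch one page of the Maoyan ranking board with requests.get.

2. Parse that page with a regular expression to extract the required fields.

3. Yield the parsed records from a generator and repeat this for multiple pages.

4. Store each record in a dictionary and print it.

5. Save the records to a MongoDB database.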
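The multi-page part of step 3 can be sketched as follows. This is a minimal illustration, assuming the board is paginated in steps of 10 through the `offset` query parameter, as the URL in the project code suggests. `BOARD_URL`, `PAGE_SIZE`, and `page_urls` are names introduced here and do not appear in the project.

```python
BOARD_URL = 'https://maoyan.com/board/4'  # same endpoint as the project code
PAGE_SIZE = 10  # assumption: each page lists 10 films


def page_urls(pages):
    """Yield the URL of each ranking page, one page at a time."""
    for page in range(pages):
        yield f'{BOARD_URL}?offset={page * PAGE_SIZE}'


for url in page_urls(3):
    print(url)
```

Because `page_urls` is a generator, each URL is built only when the loop asks for it, which is the same lazy pattern the project uses for parsed records.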

2. Project results

The scrape succeeded and the records were stored in MongoDB.

[Screenshot: MongoDB query results]
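One thing to be aware of: `insert_one` adds a new document on every call, so running the script twice stores each film twice. A possible variation, which is not part of the original project, is to upsert by the film's URL instead. The helper name `save_movie` is hypothetical, and `collection` is assumed to be a pymongo collection such as `db[MONGO_TABLE]`.

```python
def save_movie(collection, movie):
    """Insert the film, or update it in place if its URL is already stored."""
    # update_one with upsert=True inserts when no document matches the filter.
    return collection.update_one(
        {'url': movie['url']},  # filter: treat the detail-page URL as the key
        {'$set': movie},        # write all scraped fields
        upsert=True,
    )
```

With the connection from the project code, this would be called as `save_movie(db[MONGO_TABLE], item)` in place of `insert_one`.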

3. Project code

#!/usr/bin/env python 
# -*- coding:utf-8 -*-
# Author: nuancolor
# URL: none yet


import requests
from requests.exceptions import RequestException
import re
import pymongo

# MongoDB configuration
MONGO_URL = 'localhost'  # host of the MongoDB server (default port 27017)
MONGO_DB = 'test'  # database name
MONGO_TABLE = 'movies'  # collection name

# Connect to the database
client = pymongo.MongoClient(MONGO_URL)
db = client[MONGO_DB]


# Save one record to MongoDB
def save_url_to_Mongo(result):
    try:
        if db[MONGO_TABLE].insert_one(result):
            print('Saved to MongoDB:', result)
    except Exception:
        print('Failed to save to MongoDB:', result)


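# Fetch one page and return its HTML, or None on failure
def get_one_page(url):
    # The browser-like User-Agent and the timeout are defensive additions:
    # the site may reject the default requests User-Agent.
    headers = {'User-Agent': 'Mozilla/5.0'}
    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None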


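# Parse one page and yield one dictionary per film
def parse_one_page(html):
    # Raw strings keep \d and \s from being treated as string escapes
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?<p.*?name">'
                         + r'<a\shref="(.*?)".*?>(.*?)</a></p>.*?star">(.*?)</p>'
                         + r'.*?>(.*?)</p>.*?integer">(.*?)</i>'
                         + r'.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    # The page only contains relative links, so prepend the site root
    headurl = 'https://maoyan.com'
    for item in items:
        yield {
            'index': item[0],
            'url': headurl + item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # drop the 3-character "主演：" prefix
            'time': item[4].strip()[5:],  # drop the 5-character "上映时间：" prefix
            'score': item[5] + item[6]  # integer part + fractional part
        }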


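def main(offset):
    url = 'https://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:
        # The request failed; skip this page instead of passing None to re.findall
        print('Failed to fetch', url)
        return
    for item in parse_one_page(html):
        print(item)
        save_url_to_Mongo(item)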


if __name__ == '__main__':
    for i in range(3):
        main(i * 10)
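The parsing step can be checked offline before any request is sent. The sketch below runs the same pattern against a small hand-written fragment. `SAMPLE_HTML` is hypothetical markup shaped like what the pattern expects, not a copy of the live page. It also shows two details that are easy to miss: the pattern finds nothing without `re.S`, and `urllib.parse.urljoin` is an alternative to string concatenation for completing the relative link.

```python
import re
from urllib.parse import urljoin

# Hypothetical markup shaped like what the pattern expects; not from the live site.
SAMPLE_HTML = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <p class="name"><a href="/films/1203" title="Example Film">Example Film</a></p>
  <p class="star">
    主演：Actor A,Actor B
  </p>
  <p class="releasetime">上映时间：1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''

PATTERN = (r'<dd>.*?board-index.*?>(\d+)</i>.*?<p.*?name">'
           r'<a\shref="(.*?)".*?>(.*?)</a></p>.*?star">(.*?)</p>'
           r'.*?>(.*?)</p>.*?integer">(.*?)</i>'
           r'.*?fraction">(.*?)</i>.*?</dd>')

# With re.S, "." also matches newlines, so the pattern can span the whole <dd> block.
matches = re.findall(PATTERN, SAMPLE_HTML, re.S)
# Without re.S the pattern cannot cross a line break and nothing is found.
no_matches = re.findall(PATTERN, SAMPLE_HTML)

index, href, title, star, release, integer, fraction = matches[0]
full_url = urljoin('https://maoyan.com', href)  # complete the relative link

print(index, title, full_url, integer + fraction)
print(no_matches)
```

If the live markup changes, the pattern will return an empty list in the same way, so a fragment like this is a quick way to tell a pattern problem from a network problem.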

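4. Problems encountered and fixes

1. The regular expression used for page parsing has to be written carefully; otherwise it either raises an error or returns an empty list.

2. The film links scraped from the page are relative (only the path part of the URL); prepending the site root during data processing completes them.

3. Importing pymongo failed with a module-not-found error when the project was run from Spyder; it worked after switching to PyCharm. A likely cause, though not verified here, is that pymongo was not installed in the Python environment Spyder was using.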
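For the third issue, a quick way to see which interpreter an IDE is actually running, and whether that interpreter can find a package, is sketched below. It relies only on the standard library; the helper name `describe_module` is introduced here for illustration.

```python
import importlib.util
import sys


def describe_module(name):
    """Report whether the running interpreter can locate the given module."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return f'{name} is not installed for {sys.executable}'
    return f'{name} found at {spec.origin}'


print(describe_module('pymongo'))
```

Running this in both IDEs shows whether they point at different Python installations. If the module is reported missing, installing it with that same interpreter (for example `python -m pip install pymongo`, run with the executable printed above) is usually enough.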

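Summary

This project was mainly practice with the requests and re libraries; the problems encountered along the way deepened my understanding of how scrapers work in practice.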
