[Python] Notes on Writing My First Python Crawler (Maoyan Movies TOP 10)

zytjasper

Category: Python. Tags: crawler

Article link: https://blog.csdn.net/zytjasper/article/details/80100582

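The crawl target is the Maoyan movie ranking TOP 10, at http://maoyan.com/board

The approach and the program are based on Lesson 14: Requests + regular expressions to scrape Maoyan movies.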

First, here is the crawl result: [screenshot not preserved]

And here is the code:

import json
import requests
from requests.exceptions import RequestException
import re

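def get_one_page(url):
    # Send a browser-like User-Agent; without it the site returns its access-control page.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
    }
    try:
        # The timeout keeps the script from hanging if the server never answers.
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None  # any non-200 status is treated as a failed fetch
    except RequestException:
        return None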

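def parse_one_page(html):
    # Raw strings keep '\d' from being read as a string escape.
    # re.S lets '.' match newlines, so one pattern can span a whole <dd> block.
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>'
                         r'.*?<p.*?"name"><.*?title="(.*?)"'
                         r'.*?"star">(.*?)</p>'
                         r'.*?"releasetime">(.*?)</p>'
                         r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
                         re.S)
    res = re.findall(pattern, html)

    for item in res:
        yield {
            'index': item[0],
            'title': item[1],
            'actor': item[2].strip()[3:],  # drop the 3-character prefix '主演：'
            'time': item[3].strip()[5:],   # drop the 5-character prefix '上映时间：'
            'score': item[4] + item[5]     # integer part + fractional digit
        }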

def write_to_file(content):
    # Append one JSON object per line; the with block closes the file itself.
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

def main():
    url = 'http://maoyan.com/board'
    html = get_one_page(url)
    if html is None:
        # The fetch failed; there is nothing to parse.
        print('Request failed')
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)


if __name__ == '__main__':
    main()
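
To see what the regular expression captures without touching the network, the sketch below runs the same pattern against a small HTML fragment. The fragment is hypothetical: it was written to match the class names the pattern looks for and is not copied from the live site, whose markup may differ. The names `PATTERN`, `SAMPLE_HTML` and `parse` are illustrative only.

```python
import re

# Same pattern as parse_one_page, written with raw strings.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>'
    r'.*?<p.*?"name"><.*?title="(.*?)"'
    r'.*?"star">(.*?)</p>'
    r'.*?"releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
    re.S,
)

# Hypothetical markup, invented to fit the pattern (an assumption, not real site HTML).
SAMPLE_HTML = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <p class="name"><a href="/films/0" title="Example Film">Example Film</a></p>
  <p class="star">
    主演：Actor A,Actor B
  </p>
  <p class="releasetime">上映时间：2018-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''


def parse(html):
    """Yield one dict per <dd> block matched by PATTERN."""
    for index, title, star, released, integer, fraction in PATTERN.findall(html):
        yield {
            'index': index,
            'title': title,
            'actor': star.strip()[3:],     # drop the prefix '主演：'
            'time': released.strip()[5:],  # drop the prefix '上映时间：'
            'score': integer + fraction,
        }


movies = list(parse(SAMPLE_HTML))
print(movies)
```

Testing the pattern on a saved fragment like this makes it easier to tell a regex problem from a blocked or failed request.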

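Problems and solutions:

(1) IndentationError: unexpected indent

Analysis: a spelling mistake (`exceptions`, note the trailing `s`) and indentation problems. I ran into indentation errors many times; pressing Enter at the end of the line above the reported one lets the editor show the correct indentation level.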
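
The indentation error can be reproduced in isolation with the built-in `compile`, which reports the offending line without running anything. This is only a sketch; `check_syntax` and `BAD_SOURCE` are hypothetical names introduced for the example.

```python
# The second line is indented although no block was opened above it.
BAD_SOURCE = "x = 1\n    y = 2\n"


def check_syntax(source):
    """Return None if the source compiles, otherwise a short error description."""
    try:
        compile(source, '<demo>', 'exec')
    except SyntaxError as err:  # IndentationError is a subclass of SyntaxError
        return f'{type(err).__name__} on line {err.lineno}: {err.msg}'
    return None


print(check_syntax(BAD_SOURCE))
print(check_syntax("x = 1\ny = 2\n"))
```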

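(2) takes 0 positional arguments but 1 was given

Analysis: the function was defined without parameters but called with one. The original program scraped a 100-entry ranking and had to page through it, so its main function took a parameter (offset). My program does not need paging (Maoyan now only shows a TOP 10), so main must be both defined and called without arguments.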
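
A minimal reproduction of this mismatch, assuming nothing beyond plain Python: the definition and the call have to agree on the number of arguments. `main_with_offset` is a hypothetical stand-in for the paged version of the original program.

```python
def main():
    """Single-page version: takes no arguments."""
    return 'single page'


def main_with_offset(offset=0):
    """Paged version: accepts an offset (hypothetical, for illustration)."""
    return f'page starting at offset {offset}'


try:
    main(10)  # called with one argument, defined with none
except TypeError as err:
    message = str(err)
    print(message)

print(main())
print(main_with_offset(10))
```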

(3) 'yield' outside function

Analysis: yield can only be used inside a function body, never at module level. Check that the line is actually indented under a def.
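
The sketch below contrasts the two cases. A `yield` inside a `def` turns that function into a generator, while a module-level `yield` is rejected at compile time. `numbered` is a hypothetical helper used only for this illustration.

```python
def numbered(titles):
    """Generator: yields one dict per title, as parse_one_page does per match."""
    for index, title in enumerate(titles, start=1):
        yield {'index': index, 'title': title}


# A module-level yield fails to compile, before any code runs.
try:
    compile("yield 1\n", '<demo>', 'exec')
    outside_error = None
except SyntaxError as err:
    outside_error = err.msg

print(outside_error)
print(list(numbered(['A', 'B'])))
```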

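(4) <title>猫眼访问控制</title>

<h3><span class="icon">⛔️</span>很抱歉，您的访问被禁止了</h3>

Analysis: the site returned its access-control page ("Sorry, your access has been denied") instead of the ranking, so the request has to look like it comes from a browser. Put a 'User-Agent' entry in a headers dict as follows:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

and pass it to the request:

response = requests.get(url, headers=headers)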
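
Because a blocked response is still an HTML page, it can help to check for the access-control title before parsing. The helper below is a sketch under that assumption; `looks_blocked` and `BLOCK_TITLE` are hypothetical names, and the title text is the one quoted in the blocked response above.

```python
import re

# Title text taken from the blocked response quoted in problem (4).
BLOCK_TITLE = '猫眼访问控制'


def looks_blocked(html):
    """Return True when the page title matches the access-control page."""
    if not html:
        return False
    match = re.search(r'<title>(.*?)</title>', html, re.S)
    return bool(match) and match.group(1).strip() == BLOCK_TITLE


blocked_page = '<title>猫眼访问控制</title><h3>很抱歉，您的访问被禁止了</h3>'
normal_page = '<title>Example board</title><dd>...</dd>'
print(looks_blocked(blocked_page), looks_blocked(normal_page), looks_blocked(None))
```

Calling such a check right after fetching the page would give a clear message instead of a silently empty result when the regular expression finds nothing.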

My Prey Is Near.

Partly based on: https://blog.csdn.net/wenboyu/article/details/78166713

https://blog.csdn.net/u013205877/article/details/70332612
