Scraping Douban's 2016 Movies / Genres with Python


Description


OK, let's keep this one simple.
I suddenly felt like watching a movie, so I grabbed Python and scraped Douban's year-in-movies list, and tallied up the rating rankings and genre counts while I was at it. Pretty simple overall.
The 2016 movies are (more or less) all behind this link:

'https://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&sort=time&page_limit=365&page_start=0'

You can actually hit Douban directly with a GET request here, and the link works in a browser too. page_limit controls how many entries come back, so set it to a reasonably large number and you get the whole list.

The response looks roughly like this: (screenshot from the original post not preserved)
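If you'd rather poke at the endpoint from Python than from a browser, here is a minimal sketch with requests; the parameter values simply mirror the URL above, and like the rest of this post it is Python 2:

# -*- coding: utf-8 -*-
import requests

# GET parameters mirroring the listing URL above.
params = {
    'type': 'movie',
    'tag': '热门',        # the "hot" tag
    'sort': 'time',
    'page_limit': '365',  # bump this up to get the whole year back
    'page_start': '0',
}
resp = requests.get('https://movie.douban.com/j/search_subjects', params=params)
page = resp.content
print page[:200]  # peek at the raw JSON text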
I considered BeautifulSoup, but it's no use here: the response is a JSON string, not HTML, so it's honest re matching instead.
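Since the payload is just text, a non-greedy pattern pulls out each movie's rate/title/url. This is the same pattern used in the full code further down, applied to the page fetched in the sketch above:

import re

# Each subject appears as {"rate":"...", ..., "title":"...", "url":"..."}.
pattern = re.compile(r'"rate":"(.+?)",.+?"title":"(.+?)","url":"(.+?)"')
for rate, title, url in re.findall(pattern, page):
    print rate, title, url.replace('\\', '')  # urls arrive with escaped slashes ("\/")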

Once scraped, everything is stored in a dict. Sorting by key is the fun part: take the dict's keys as a list, sort that list, and then walking the sorted list and looking up each key gives you the values in key order.
Concretely:

d = {}
d['olahiuj'] = 'handsome'
for key in sorted(d.keys()):
    print d[key]

I recommend sorted over sort, because it doesn't modify the original list.
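A quick demonstration of the difference:

a = [3, 1, 2]
print sorted(a)  # [1, 2, 3]
print a          # [3, 1, 2] -- the original is untouched
a.sort()
print a          # [1, 2, 3] -- sort() changed it in place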

Next comes visiting each scraped URL and pulling out its genres; no surprise, it's re matching again. This part is really slow, so multithreading helps, but keep the request rate down and try to look like a real human (laughs); a throttled sketch follows.
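A minimal sketch of that idea; it is essentially the same loop as in init() in the full code below (work, html, headers and urls are defined there), and the half-second delay is an arbitrary politeness choice:

import threading
import time

jobs = []
for i, url in enumerate(urls):  # urls: the list scraped from the JSON above
    job = threading.Thread(target=work, args=(html, url, headers, i))
    job.start()
    jobs.append(job)
    time.sleep(0.5)  # pause between launches so we look less like a bot
for job in jobs:
    job.join()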
Then we once again use a dict to hold each genre and its count, and write it out to a CSV file for safekeeping.
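The counting itself is just a dict tally; dict.get with a default is a slightly tidier alternative to the has_key check in the full code:

rec = {}
for item in types:  # types: genres parsed from one movie page
    rec[item] = rec.get(item, 0) + 1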
Python has a built-in csv module, so just import it:

import csv

I picked CSV over other formats mainly because CSV files can be viewed and edited in Excel.
Writing goes like this:

with open('filename.csv', 'wb') as csvfile:
    blah = csv.writer(csvfile, dialect = 'excel')
    blah.writerow([1, 2, 3])

To make sure each item of the list lands in its own column, set dialect to 'excel'; also, the thing you pass to writerow has to be a list (probably?).
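For completeness (not from the original post), reading the file back is the mirror image, using the standard csv.reader:

import csv

with open('filename.csv', 'rb') as csvfile:  # 'rb' to match the 'wb' write on Python 2
    for row in csv.reader(csvfile, dialect='excel'):
        print row  # each row comes back as a list of strings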

I was also planning to visualize the data and build some charts, but that can wait until tomorrow. By the way, what's with the 同性 (same-sex) genre having 11 movies, and what on earth is the thing ranked first?

Code


# -*- coding: utf-8 -*-
from requests.packages.urllib3.exceptions import InsecureRequestWarning
import threading
import requests
import time
import csv
import re

def getPage(html, url, headers, params = {}, referer = ''):
    # Disable TLS certificate verification for https URLs
    # (the matching urllib3 warning is silenced in __main__).
    flags = True
    if (url[:5] == 'https'):
        flags = False
    headers['Referer'] = referer
    response = html.get(url, headers = headers, params = params, verify = flags)
    page = response.content
    return page

def find(string, page, flags = 0):
    pattern = re.compile(string, flags = flags)
    results = re.findall(pattern, page)
    return results

def work(html, url, headers, cnt):
    # urls scraped from the JSON contain escaped slashes ("\/");
    # strip the backslashes to recover a usable URL.
    tmp = ''
    for q in url:
        if q != '\\':
            tmp = tmp + q
    url = tmp
    page = getPage(html, url, headers)
    types = find(r'<span property="v:genre">(.+?)</span>', page)
    global mutex, rec
    mutex.acquire()
    print cnt  # progress indicator
    for item in types:
        if rec.has_key(item):
            rec[item] += 1
        else:
            rec[item] = 1
    mutex.release()

def init():
    html = requests.session()
    doubanUrl = 'https://movie.douban.com'
    headers={'User-Agent':'Mozilla/5.0 (compatible; MSIE 5.5; Windows NT)'}
    page = getPage(html, 'https://movie.douban.com/j/search_subjects', headers, params = {'type': 'movie', 'tag': '热门', 'sort': 'time', 'page_limit': '400', 'page_start': '0'})
    results = find(r'"rate":"(.+?)",.+?"title":"(.+?)","url":"(.+?)"', page)
    urls = [item[2] for item in results]
    rates = [item[0] for item in results]
    titles = [item[1] for item in results]

    # Selection-style sort: highest rating first. Compare numerically,
    # since as strings "10.0" would sort below "9.0".
    for i in xrange(len(urls)):
        for j in xrange(i + 1, len(urls)):
            if (float(rates[i]) < float(rates[j])):
                rates[i], rates[j] = rates[j], rates[i]
                urls[i], urls[j] = urls[j], urls[i]
                titles[i], titles[j] = titles[j], titles[i]

    with open('douban.csv', 'wb') as csvfile:
        spamwriter = csv.writer(csvfile, dialect = 'excel')
        for i in xrange(len(rates)):
            spamwriter.writerow([titles[i], urls[i], rates[i]])
    global mutex, rec
    mutex = threading.Lock()
    rec = {}
    jobs = []
    cnt = 0
    for i in xrange(len(urls)):
        cnt += 1
        job = threading.Thread(target = work, args = (html, urls[i], headers, cnt))
        job.start()
        jobs.append(job)
        time.sleep(0.1)  # brief pause between launches, per the politeness note above

    for job in jobs:
        job.join()

    with open('douban_type.csv', 'wb') as csvfile:
        spamwriter = csv.writer(csvfile, dialect = 'excel')
        # Note: sorted by genre name (reverse alphabetical), not by count.
        for key in sorted(rec.keys(), reverse = True):
            spamwriter.writerow([key, rec[key]])

if __name__ == '__main__':
    requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
    init()