[Crawler Practice] Scraping Douban Users' Follower Information

There was a requirement like this: get the movie-watching counts of a Douban influencer's followers. At first I assumed this would be easy to get: a few lines of shell with grep and awk should do it. Then I found that you have to be logged in to see a Douban user's follower list. I tried logging in with curl plus a cookie, but that failed, so I used Python to handle the login and then called a shell script to pull out the user information. The code is below.
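
Before the scripts themselves, a note on the login part: it only amounts to sending the browser's session cookies with every request. As a minimal sketch (the cookie string below is a placeholder, not a real session), the raw Cookie header copied from the browser's developer tools can be split into a dict for requests:

# -*- coding: utf-8 -*-
import requests

raw_cookie = 'bid=xxx; dbcl2="xxx"; ck=xxx'  # placeholder: paste your own browser cookie string here
cookies = dict(item.split('=', 1) for item in raw_cookie.split('; '))

r = requests.get('https://www.douban.com/people/37136489/rev_contacts?start=0',
                 cookies=cookies,
                 headers={'User-Agent': 'Mozilla/5.0'})
print(r.status_code)  # 200 with a valid cookie; a redirect or 403 usually means not logged in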

Python script:

# -*- coding: utf-8 -*-
import requests
import time
import os

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'}
cookies = {'cookie': 'XXXXX'}  # XXXXX is the cookie string saved earlier; paste it here
url1 = 'https://www.douban.com/people/37136489/rev_contacts?start='  # URL prefix of the follower list to crawl

for i in range(0, 444, 70):
    num = str(i)
    url = url1 + num  # append the paging offset to get the full follower-list URL
    r = requests.get(url, cookies=cookies, headers=headers)
    time.sleep(3)
    with open('douban.txt', 'wb') as f:
        f.write(r.content)
    os.system('bash test.sh')  # run after the file is closed, so the shell script sees the full page

Shell script:

#!/bin/bash

# Parse the follower-list page saved by the Python loop, then visit each
# follower's profile page and extract a few basic fields.

cat douban.txt |grep '<dd><a href="https://www.douban.com/people/' |awk -F '/people/' '{print $2}' |awk -F '/">' '{print $1}' > ids.txt
cat douban.txt |grep '<dd><a href="https://www.douban.com/people/' |awk -F '/">' '{print $2}' |awk -F '</a>' '{print $1}' >> name.txt
while read line; do
    echo "$line"
    id=$line
    curl -s "https://www.douban.com/people/$id/" > html
    name=`cat html |grep "的主页</a></li>" |awk -F '/">' '{print $2}' |awk -F '的主页</' '{print $1}'`
    address=`cat html |grep '常居:&nbsp' |awk -F '/">' '{print $2}' |awk -F '</' '{print $1}'`
    time=`cat html |grep '加入</' |awk -F '<br/>' '{print $2}' |awk -F '加入<' '{print $1}'`
    movie_num=`cat html |grep '看过</a>' |awk -F '部看过</' '{print $1}' |awk -F 'collect" target="_blank">' '{print $2}'`
    picture_url=`cat html |grep 'class="userface"' |awk -F 'src="' '{print $2}' |awk -F '" class' '{print $1}'`

    picture_name="${id}_${name}.jpg"   # braces so the underscore is not read as part of the variable name
    echo "| $name | $id | $address | $time | $movie_num " >> information
    wget -q -c -O "$picture_name" "$picture_url"
    sleep 3
done < ids.txt

This gives each user's name, personal domain (ID), place of residence, registration date, and number of movies watched.
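
Each record appended to the information file is a pipe-separated line in the order name | id | residence | join date | movies watched; with made-up values it looks like this:

| some_user | 1234567 | 北京 | 2012-09-14 | 321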

One problem, though: I found that some newer users' profile pages apparently can't be accessed without a cookie, so in the end I reimplemented the whole thing in Python.
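
A quick check makes the problem visible. This is only a sketch, assuming Douban answers anonymous visitors with a redirect to its login page or a 403 for such profiles:

import requests

url = 'https://www.douban.com/people/37136489/'  # example profile URL
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, allow_redirects=False)

if r.status_code in (301, 302, 403):
    # bounced: retry this URL with the saved cookies instead
    print('login required:', r.status_code, r.headers.get('Location', ''))
else:
    print('accessible without cookies')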

Complete Python code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup
import re
import csv
import time

# name:         get_douban_user_fans_info.py
# version:      1.0
# createTime:   2018-11-12
# description:  manually supply the follower-list URL of the target user and the
#               follower count; fetches each follower's name, profile URL and
#               number of movies watched
# author:       mengyanhuangchao
# email:        406993906@qq.com
# adaptation:   CentOS 7.5, Ubuntu 16.04, macOS 10.13

cookies = {
    'cookies': 'bid=fHap7LDQSG4; douban-fav-remind=1; __yadk_uid=U3rzmxaVjT9if7PVGwvevnzFCuFdTX98; push_doumail_num=0; ps=y; ue="406993906@qq.com"; __utmv=30149280.6257; _vwo_uuid_v2=D99C59093603A87BE66AAD4399EE7FD09|346053f8c63b3e66d43cc976afba85d7; __utmz=30149280.1540857646.4.4.utmcsr=baidu.com|utmccn=(referral)|utmcmd=referral|utmcct=/; _ga=GA1.2.1939184369.1533477711; gr_user_id=541cc591-0932-489a-9c83-6e5bad679989; douban-profile-remind=1; ct=y; dbcl2="62579451:XO4qG5tnovQ"; loc-last-index-location-id="118172"; ll="118172"; push_noty_num=0; __utma=30149280.1939184369.1533477711.1541842210.1541861713.18; ck=Tl0S; ap_v=0,6.0; _pk_ref.100001.8cb4=%5B%22%22%2C%22%22%2C1541923055%2C%22https%3A%2F%2Fwww.google.com%2F%22%5D; _pk_id.100001.8cb4=49d509047855b846.1533477711.44.1541923055.1541920123.'}
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}


# Fetch one page of the follower list, then visit every follower's profile
# and read the "N部看过" (movies watched) counter from it.
def movie_list(url):
    response = requests.get(url, cookies=cookies, headers=headers)
    response.encoding = 'utf-8'
    html = BeautifulSoup(response.text, 'html.parser')
    html1 = html.find('div', {'class': 'article'})
    tags = html1.find_all('dl', {'class': 'obu'})
    info = []
    for tag in tags:
        print('---------------------')
        uid = tag.a['href']       # follower's profile URL
        print(uid)
        name = tag.a.img['alt']   # follower's display name
        print(name)
        time.sleep(3)             # pause between requests to avoid getting blocked
        response2 = requests.get(uid, cookies=cookies, headers=headers)
        response2.encoding = 'utf-8'
        html2 = BeautifulSoup(response2.text, 'html.parser')
        html3 = html2.find('div', {'id': 'movie'})
        try:
            html4 = html3.find('span', {'class': 'pl'}).get_text()
        except:
            # no movie section on the profile page
            movie_num = "0部看过"
            print(movie_num)
        else:
            try:
                movie_num = re.findall(r"\d+部看过", html4)[0]
            except:
                # counter text not in the expected "N部看过" form
                movie_num = "0部看过"
                print(movie_num)
            else:
                print(movie_num)
        tup = (name, uid, movie_num)
        # print(tup)
        info.append(tup)
        # print(info)
    return info

# Save the data in CSV format
def save_data():

    csv_headers = ['豆瓣账号', '账号地址', '所看电影部数']
    with open('zhangxiaobei_fans_movies_number.csv', encoding='UTF-8', mode='w', newline='') as f:
        f_csv = csv.writer(f)
        f_csv.writerow(csv_headers)
        for url in urls:
            data_list = movie_list(url)
            for data in data_list:
                f_csv.writerow(data)


# One URL per page of 70 followers; 57054 is the target user's follower count
urls = ['https://www.douban.com/people/xzfd/rev_contacts?start=%d' %
        index for index in range(0, 57054, 70)]


if __name__ == '__main__':
    save_data()

The resulting data is structured roughly as follows:
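
A sketch of the CSV layout written by save_data(); the header row comes from the code, the data rows are made up for illustration:

豆瓣账号,账号地址,所看电影部数
user_a,https://www.douban.com/people/user_a/,123部看过
user_b,https://www.douban.com/people/user_b/,0部看过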

In the end it was taking too long and I never finished crawling all of the data. Questions, bug reports and feature requests are all welcome!
