I got a request like this: get the movie-watch counts of a big-name Douban user's followers. At first I figured this would be easy to grab, a few lines of shell with grep and awk, but it turned out that fetching a Douban user's follower list requires being logged in. I planned to log in with curl plus a cookie, but that failed, so I used Python to handle the login and then called out to a shell script to extract the user information. The code is as follows:
Python script:
# -*- coding: utf-8 -*-
import requests
import time
import os

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'}
cookies = {'cookie': 'XXXXX'}  # XXXXX is the cookie string saved earlier; paste it here
url1 = 'https://www.douban.com/people/37136489/rev_contacts?start='  # prefix of the follower-list pages to crawl
for i in range(0, 444, 70):
    num = str(i)
    url = url1 + num  # prefix plus offset gives the full page URL to crawl
    r = requests.get(url, cookies=cookies, headers=headers)
    time.sleep(3)
    with open('douban.txt', 'wb+') as f:
        f.write(r.content)
    os.system('bash test.sh')
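A side note on the cookies dict above: it passes the entire browser cookie string under a single key, which only works because the server happens to tolerate it. A more robust approach is to give requests each cookie as its own name/value pair. This helper is a sketch, not part of the original script; `parse_cookie_string` is a made-up name:

```python
def parse_cookie_string(raw):
    """Split a raw browser Cookie header string into a dict for requests."""
    cookies = {}
    for pair in raw.split(';'):
        pair = pair.strip()
        if not pair:
            continue
        # partition on the first '=' so values containing '=' survive intact
        name, _, value = pair.partition('=')
        cookies[name] = value
    return cookies

print(parse_cookie_string('bid=abc123; ll="118172"'))
```

The resulting dict can be passed directly as `requests.get(url, cookies=parse_cookie_string(raw), ...)`.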
Shell script:
#!/bin/bash
cat douban.txt |grep '<dd><a href="https://www.douban.com/people/' |awk -F '/people/' '{print $2}' |awk -F '/">' '{print $1}' > ids.txt
cat douban.txt |grep '<dd><a href="https://www.douban.com/people/' |awk -F '/">' '{print $2}' |awk -F '</a>' '{print $1}' >> name.txt
while read line; do
    echo "$line"
    id=$line
    curl -s "https://www.douban.com/people/$id/" > html
    name=`cat html |grep "的主页</a></li>" |awk -F '/">' '{print $2}' |awk -F '的主页</' '{print $1}'`
    address=`cat html |grep '常居: ' |awk -F '/">' '{print $2}' |awk -F '</' '{print $1}'`
    time=`cat html |grep '加入</' |awk -F '<br/>' '{print $2}' |awk -F '加入<' '{print $1}'`
    movie_num=`cat html |grep '看过</a>' |awk -F '部看过</' '{print $1}' |awk -F 'collect" target="_blank">' '{print $2}'`
    picture_url=`cat html |grep 'class="userface"' |awk -F 'src="' '{print $2}' |awk -F '" class' '{print $1}'`
    picture_name="${id}_${name}.jpg"
    echo "| $name | $id | $address | $time | $movie_num " >> information
    wget -q -c -O "$picture_name" "$picture_url"
    sleep 3
done < ids.txt
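The first two grep/awk pipelines are essentially pulling the user ID and display name out of each `<dd><a href=...>` line. For illustration, the same extraction can be done in a few lines of Python with one regex (the HTML fragment below is made up to match the shape the pipeline expects, not Douban's actual markup):

```python
import re

# Made-up fragment in the shape the shell pipeline matches on.
html = '''<dd><a href="https://www.douban.com/people/alice123/">Alice</a></dd>
<dd><a href="https://www.douban.com/people/bob456/">Bob</a></dd>'''

# Capture group 1: user ID; capture group 2: display name.
pattern = re.compile(r'<dd><a href="https://www\.douban\.com/people/([^/]+)/">([^<]+)</a>')
for uid, name in pattern.findall(html):
    print(uid, name)
```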
This gets each user's name, vanity URL (ID), current city, registration date, and number of movies watched.
One problem remained, though: some newer users' profile pages apparently cannot be accessed at all without a cookie, so in the end I reimplemented the whole thing in Python.
The complete Python code:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import re
import csv
import time
# name: get_douban_user_fans_info.py
# version: 1.0
# createTime: 2018-11-12
# description: enter the user whose fan list you want to crawl and that user's
#              fan count; fetches each fan's name, profile URL and movie-watch count
# author: mengyanhuangchao
# email: 406993906@qq.com
# tested on: CentOS 7.5, Ubuntu 16.04, macOS 10.13
cookies = {
'cookies': 'bid=fHap7LDQSG4; douban-fav-remind=1; __yadk_uid=U3rzmxaVjT9if7PVGwvevnzFCuFdTX98; push_doumail_num=0; ps=y; ue="406993906@qq.com"; __utmv=30149280.6257; _vwo_uuid_v2=D99C59093603A87BE66AAD4399EE7FD09|346053f8c63b3e66d43cc976afba85d7; __utmz=30149280.1540857646.4.4.utmcsr=baidu.com|utmccn=(referral)|utmcmd=referral|utmcct=/; _ga=GA1.2.1939184369.1533477711; gr_user_id=541cc591-0932-489a-9c83-6e5bad679989; douban-profile-remind=1; ct=y; dbcl2="62579451:XO4qG5tnovQ"; loc-last-index-location-id="118172"; ll="118172"; push_noty_num=0; __utma=30149280.1939184369.1533477711.1541842210.1541861713.18; ck=Tl0S; ap_v=0,6.0; _pk_ref.100001.8cb4=%5B%22%22%2C%22%22%2C1541923055%2C%22https%3A%2F%2Fwww.google.com%2F%22%5D; _pk_id.100001.8cb4=49d509047855b846.1533477711.44.1541923055.1541920123.'}
headers = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'zh-CN,zh;q=0.8',
'Connection': 'keep-alive',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
'(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
def movie_list(url):
    response = requests.get(url, cookies=cookies, headers=headers)
    response.encoding = 'utf-8'
    html = BeautifulSoup(response.text, 'html.parser')
    html1 = html.find('div', {'class': 'article'})
    tags = html1.find_all('dl', {'class': 'obu'})
    info = []
    for tag in tags:
        print('---------------------')
        uid = tag.a['href']
        print(uid)
        name = tag.a.img['alt']
        print(name)
        time.sleep(3)
        response2 = requests.get(uid, cookies=cookies, headers=headers)
        response2.encoding = 'utf-8'
        html2 = BeautifulSoup(response2.text, 'html.parser')
        html3 = html2.find('div', {'id': 'movie'})
        try:
            html4 = html3.find('span', {'class': 'pl'}).get_text()
        except AttributeError:  # profile has no movie section at all
            movie_num = '0部看过'
            print(movie_num)
        else:
            try:
                movie_num = re.findall(r'\d+部看过', html4)[0]
            except IndexError:  # movie section exists but no watch count
                movie_num = '0部看过'
                print(movie_num)
            else:
                print(movie_num)
        tup = (name, uid, movie_num)
        info.append(tup)
    return info
# Save the data in CSV format
def save_data():
    headers = ['豆瓣账号', '账号地址', '所看电影部数']
    with open('zhangxiaobei_fans_movies_number.csv', encoding='UTF-8', mode='w', newline='') as f:
        f_csv = csv.writer(f)
        f_csv.writerow(headers)
        for url in urls:
            data_list = movie_list(url)
            for data in data_list:
                f_csv.writerow(data)

urls = ['https://www.douban.com/people/xzfd/rev_contacts?start=%d' %
        index for index in range(0, 57054, 70)]

if __name__ == '__main__':
    save_data()
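Once the CSV is written, the watch counts are still strings like "123部看过". A small sketch (assuming the three-column layout above; `watch_count` and the sample rows are made up for illustration) for turning them into integers so the fans can be sorted or summed:

```python
import re

def watch_count(value):
    """Extract the integer from a string like '123部看过'; 0 if none found."""
    match = re.search(r'\d+', value)
    return int(match.group()) if match else 0

# Sample rows in the same (name, profile URL, watch-count string) shape as the CSV.
rows = [('Alice', 'https://www.douban.com/people/a/', '123部看过'),
        ('Bob', 'https://www.douban.com/people/b/', '0部看过')]
rows.sort(key=lambda r: watch_count(r[2]), reverse=True)
print(rows[0][0])  # prints the fan with the most movies watched
```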
The resulting data is structured roughly as described above: Douban account name, profile URL, and number of movies watched.
In the end the run took too long and I never finished collecting all the data. Questions, bug reports, and feature requests are all welcome!
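Since the run never finished, one practical tweak is to checkpoint progress so a restart resumes instead of re-crawling from offset 0. This is a sketch under the assumption that pages are processed in `start`-offset order; the file name `last_start.txt` and both helper functions are made up for illustration:

```python
import os

CHECKPOINT = 'last_start.txt'

def load_last_start():
    """Return the next offset to crawl, or 0 on a fresh run."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return int(f.read().strip())
    return 0

def save_last_start(start):
    """Record the next offset so a restart can skip completed pages."""
    with open(CHECKPOINT, 'w') as f:
        f.write(str(start))

# Resume the offset loop from where the previous run stopped.
for start in range(load_last_start(), 57054, 70):
    # ... fetch and parse the page for this offset ...
    save_last_start(start + 70)
```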