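### Purpose of this series
The plan is to start from the GitHub project huatian-funny and use Python to crawl registered-user data from Huatian (花田, love.163.com) as a small experiment, then upload my modified version of the huatian-funny project.
The huatian-funny project page describes what the project does: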
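### 1. Preparation
#### Requirements:
requests>=2.7.0, pymongo>=3.2.2, matplotlib>=1.4.3, Pillow>=3.2.0
#### (1) Install requests (2.7.0 or later)
requests is an HTTP client library for Python.
It can be installed from source, or with pip or easy_install:
>pip install requests
The version that got installed here is 2.10.0.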
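
The installed 2.10.0 satisfies the `requests>=2.7.0` requirement because version components are compared as numbers, not as text. A minimal sketch of that comparison, assuming plain dotted numeric versions; `parse_version` and `meets_minimum` are hypothetical helpers written for illustration, not part of pip or requests:

```python
def parse_version(text):
    """Turn a dotted numeric version such as '2.10.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, required):
    """Return True when the installed version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


# Versions mentioned in this post: the required minimum and what pip installed.
print(meets_minimum("2.10.0", "2.7.0"))  # compares 10 with 7 as integers
print("2.10.0" >= "2.7.0")               # plain string comparison compares '1' with '7'
```

The same check applies to the other three requirements in the list.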
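#### (2) Install matplotlib
See part 4, "Installing matplotlib", of the earlier post "python实践之准备 (一)" (Python practice: preparation, part 1). The steps are not repeated here.
#### (3) Install Pillow
>pip install pillow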
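#### (4) Install MongoDB
Download it from the MongoDB download page.
When the download finishes, run mongodb-win32-x86_64-2008plus-ssl-3.2.6-signed.msi and accept the defaults through to the end.
By default MongoDB is installed under C:\Program Files\MongoDB.
On Windows the default MongoDB data directory is C:\data\db; create it before starting the server.
#### · Start the mongod service: double-click mongod.exe, or start it from a command prompt with extra options,
mongod.exe --journal --rest
To use a directory other than the default C:\data\db, pass the --dbpath option when starting the server, for example,
mongod.exe --dbpath yourpath
The startup options are:
--dbpath: the database directory;
--logpath: the log file path;
--journal: enable journaling;
--rest: allow clients to access the MongoDB server through the REST API;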
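
The options above can be assembled from Python before launching the server. This is only a sketch: `build_mongod_command` is a hypothetical helper, it assumes mongod.exe is on the PATH, and it covers just the `--dbpath`, `--logpath`, and `--journal` options described in this post.

```python
import os


def build_mongod_command(dbpath, logpath=None, journal=True):
    """Create dbpath if it is missing and return the mongod argument list."""
    # The data directory has to exist before mongod is started.
    os.makedirs(dbpath, exist_ok=True)
    command = ["mongod.exe", "--dbpath", dbpath]
    if logpath:
        command += ["--logpath", logpath]
    if journal:
        command.append("--journal")
    return command
```

The returned list can be passed to `subprocess.Popen` to start the server, for example `subprocess.Popen(build_mongod_command(r"C:\data\db"))`.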
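After startup, the last line in the command window shows that mongod is waiting for connections.
#### · Connect
Double-click mongo.exe, or open another command prompt and run mongo.exe, to connect to the database.
Two commands to try (search for more as needed); both list all databases:
show dbs
show databases
In the mongod.exe window opened earlier, the connection count has now gone up to 1.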
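#### (5) Install pymongo
pymongo is the Python driver for MongoDB; the crawler uses it to store the data it collects in MongoDB.
Install pymongo:
>pip install pymongo
Upgrade pymongo:
>pip install --upgrade pymongo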
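#### (6) Install a MongoDB GUI tool: Robomongo
Robomongo is a GUI management tool for MongoDB.
Download it from the Robomongo site. I used robomongo-0.9.0-rc8-windows-x86_64-c113244.exe: double-click it, choose an install directory (mine is D:\softwares_diy\Robomongo 0.9.0-RC8\), and continue through the few remaining steps, choosing to run Robomongo immediately at the end. In the dialog that opens, click Create to define a new connection.
Make sure the mongod service is running (mongod.exe has been started and its last line shows it waiting for connections on port 27017), then go back to Robomongo and click Test.
The connection succeeds. For a local MongoDB, just click "Close" and then "Save".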
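
If the Test button fails, it helps to confirm that something is actually listening on port 27017. A small standard-library sketch of such a check; `port_is_open` is a hypothetical helper, and the host and port defaults assume a local mongod with default settings:

```python
import socket


def port_is_open(host="127.0.0.1", port=27017, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    print("mongod reachable:", port_is_open())
```

This only shows that the port accepts connections, not that the process behind it is MongoDB.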
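In the Robomongo main window, click File -> Connect to see the connection just created.
Select it and click "Connect" to manage that connection.
To connect to a MongoDB instance that is not on the local machine, connect over SSH by entering the IP address, user name, and password.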
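### 2. Crawling the data
All the required components are now installed and the MongoDB connection is open.
Download the huatian-funny project from GitHub, unzip it, and put it in a directory; mine is D:\pythonExperiments\huatian-funny-master.
My changes: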
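- spider.py and mark.py: my environment is Python 3.4, while the project's author used Python 2.x. Because some syntax and library names differ between Python 2.x and 3.x, I made small changes to spider.py, mark.py, and the other .py files so that they run.
- The author's spider.py finished a single crawl quickly and then stopped. After my changes, spider.py runs automatically every 5 minutes, so it keeps collecting data.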
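
Two of the Python 2 to 3 differences that matter for a crawler like this are that `urlencode` lives in `urllib.parse` and that `print` is a function. The sketch below is illustrative only: `search_payloads` is a hypothetical helper, and the field names and ranges are taken from the spider.py listing in this post (city codes 1 to 19 and three two-year age brackets).

```python
from urllib.parse import urlencode  # Python 2 spelled this urllib.urlencode


def search_payloads(province="2", cities=range(1, 20), ages=range(22, 27, 2)):
    """Yield one URL-encoded form body per (city, age bracket) pair."""
    for city in cities:
        for age in ages:
            yield urlencode({
                "province": province,
                "city": str(city),
                "age": "{}-{}".format(age, age + 1),
                "condition": "1",
            })


payloads = list(search_payloads())
print(len(payloads))  # 19 city codes x 3 age brackets
print(payloads[0])

# print is a function in Python 3, so format() is applied to the string
# inside the parentheses, not to whatever print() returns.
print("city:{} age:{} failed".format(1, 22))
```

Each crawl pass therefore sends one search request per payload.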
The modified spider.py (the crawler):

    # -*- coding=utf-8 -*-
    import os
    import urllib.parse

    from apscheduler.schedulers.blocking import BlockingScheduler
    from requests import Session

    from extension import mongo_collection

    session = Session()

    LOGIN_HEADERS = {
        'Host': 'reg.163.com',
        'Connection': 'keep-alive',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,'
                  'image/webp,*/*;q=0.8',
        'Origin': 'http://love.163.com',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/49.0.2623.110 Safari/537.36',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Referer': 'http://love.163.com/',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4',
        'Cookie': '_ntes_nnid=d53195032b58604628528cd6a374d63f,1460206631682; '
                  '_ntes_nuid=d53195032b58604628528cd6a374d63f',
    }

    SEARCH_HEADERS = {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4',
        'Cache-Control': 'no-cache',
        'Connection': 'keep-alive',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Host': 'love.163.com',
        'Origin': 'http://love.163.com',
        'Pragma': 'no-cache',
        'Referer': 'http://love.163.com/search/user',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/49.0.2623.110 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
    }


    def login():
        """Log in to Huatian."""
        data = {
            'username': 'yourEmailaddress@163.com',
            'password': 'yourPassword',
            'url': 'http://love.163.com/?checkUser=1&vendor=love.pLogin',
            'product': 'ht',
            'type': '1',
            'append': '1',
            'savelogin': '1',
        }
        response = session.post('https://reg.163.com/logins.jsp',
                                headers=LOGIN_HEADERS,
                                data=urllib.parse.urlencode(data))
        assert response.ok


    def search():
        """Search by Shanghai district and age bracket."""
        for city in range(1, 20):
            for age in range(22, 27, 2):
                data = {
                    'province': '2',
                    'city': str(city),
                    'age': '{}-{}'.format(age, age + 1),
                    'condition': '1',
                }
                response = session.post('http://love.163.com/search/user/list',
                                        headers=SEARCH_HEADERS,
                                        data=urllib.parse.urlencode(data))
                if not response.ok:
                    # format() belongs on the string, not on print()'s return value
                    print('city:{} age:{} failed'.format(city, age))
                    continue
                users = response.json()['list']
                for user in users:
                    # replace_one (pymongo >= 3.0) stands in for the deprecated update()
                    mongo_collection.replace_one({'id': user['id']}, user, upsert=True)


    def loginAndSearch():
        login()
        search()


    if __name__ == '__main__':
        # Run every 5 minutes; change the interval to suit your needs.
        scheduler = BlockingScheduler()
        scheduler.add_job(loginAndSearch, 'interval', minutes=5)
        print('Press Ctrl+{0} to exit'.format('Pause/Break' if os.name == 'nt' else 'C'))
        try:
            scheduler.start()
        except (KeyboardInterrupt, SystemExit):
            scheduler.shutdown()
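
For readers who do not want to install APScheduler, the every-five-minutes behaviour can be approximated with the standard library. This is a swapped-in technique, not what spider.py does: `run_every` is a hypothetical helper built on `time.monotonic` and `time.sleep`, and unlike an APScheduler interval job (which normally fires first one interval after the scheduler starts) it runs the job once immediately.

```python
import time


def run_every(interval_seconds, job, max_runs=None):
    """Call job() repeatedly; return the number of runs once max_runs is reached."""
    runs = 0
    next_start = time.monotonic()
    while max_runs is None or runs < max_runs:
        job()
        runs += 1
        next_start += interval_seconds
        if max_runs is None or runs < max_runs:
            # Sleep only for the time that is left, so the job's own
            # duration does not add to the interval.
            time.sleep(max(0.0, next_start - time.monotonic()))
    return runs
```

With the functions from spider.py, `run_every(300, loginAndSearch)` would log in and search every five minutes until the process is interrupted.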
The modified mark.py (the subjective scoring program):

    # -*- coding=utf-8 -*-
    """Scoring program."""
    import io
    from urllib import request
    from tkinter import messagebox, Tk, font, Label, Button, Radiobutton, IntVar
    #import tkinter.font as Font
    #from tkinter import *

    from PIL import Image, ImageTk

    from extension import mongo_collection, BUY_HOUSE, BUY_CAR,\
        EDUCATION, INDUSTRY, SALARY, POSITION

    master = None
    tk_image = None
    offset = 0
    user, photo, url, buy_house, buy_car, age, height, salary, education, company, \
        industry, school, position, satisfy, appearance = [None for i in range(15)]


    def get_user(offset=0):
        """Read one user record from MongoDB."""
        global user
        user = mongo_collection.find_one({}, skip=offset, limit=1, sort=[('url', -1)])


    def init_master():
        """Initialize the main window."""
        global master
        master = Tk()
        master.title(u'花田')
        master.geometry(u'630x530')
        master.resizable(width=False, height=False)


    def place_image(image_url):
        """Download the user's avatar and keep it as a Tk image."""
        global tk_image
        image_bytes = request.urlopen(image_url).read()
        data_stream = io.BytesIO(image_bytes)
        pil_image = Image.open(data_stream)
        tk_image = ImageTk.PhotoImage(pil_image)


    def set_appearance():
        """Save the appearance score."""
        # update_one (pymongo >= 3.0) stands in for the deprecated update()
        mongo_collection.update_one({'url': user['url']},
                                    {'$set': {'appearance': appearance.get()}})


    def set_satisfy():
        """Save whether the profile is satisfactory."""
        mongo_collection.update_one({'url': user['url']},
                                    {'$set': {'satisfy': satisfy.get()}})


    def update():
        """Refresh the page for the current user."""
        global user, offset, photo, url, buy_house, buy_car, age, height, salary, \
            education, company, industry, school, position, satisfy, appearance
        image_url = u'{}&quality=85&thumbnail=410y410'.format(user['avatar'])
        place_image(image_url)
        print(offset)
        photo['image'] = tk_image
        url['text'] = user['url']
        buy_house['text'] = BUY_HOUSE.get(user['house']) or user['house']
        buy_car['text'] = BUY_CAR.get(user['car']) or user['car']
        age['text'] = user['age']
        height['text'] = user['height']
        salary['text'] = SALARY.get(user['salary']) or user['salary']
        education['text'] = EDUCATION.get(user['education']) or user['education']
        company['text'] = user['company'] if user['company'] else u'--'
        industry['text'] = INDUSTRY.get(user['industry']) or user['industry']
        school['text'] = user['school'] if user['school'] else u'--'
        # Set the label text; assigning to position itself would replace the Label
        position['text'] = POSITION.get(user['position']) or user['position']
        satisfy.set(int(user.get(u'satisfy', -1)))
        appearance.set(int(user.get(u'appearance', -1)))


    def init():
        """Build the page for the first user."""
        global user, offset, photo, url, buy_house, buy_car, age, height, salary, \
            education, company, industry, school, position, satisfy, appearance
        get_user(offset)
        image_url = u'{}&quality=85&thumbnail=410y410'.format(user['avatar'])
        place_image(image_url)
        photo = Label(master, image=tk_image)
        photo.place(anchor=u'nw', x=10, y=40)
        #url = Label(master, text=user['url'],font=Font(size=20, weight='bold'))
        url = Label(master, font=("20"), text=user['url'])
        url.place(anchor=u'nw', x=10, y=5)
        buy_house = Label(master, text=BUY_HOUSE.get(user['house']) or user['house'])
        buy_house.place(anchor=u'nw', x=490, y=50)
        buy_car = Label(master, text=BUY_CAR.get(user['car']) or user['car'])
        buy_car.place(anchor=u'nw', x=490, y=75)
        age = Label(master, text=user['age'])
        age.place(anchor=u'nw', x=490, y=100)
        height = Label(master, text=user['height'])
        height.place(anchor=u'nw', x=490, y=125)
        salary = Label(master, text=SALARY.get(user['salary']) or user['salary'])
        salary.place(anchor=u'nw', x=490, y=150)
        education = Label(master, text=EDUCATION.get(user['education']) or user['education'])
        education.place(anchor=u'nw', x=490, y=175)
        company = Label(master, text=user['company'] if user['company'] else u'--')
        company.place(anchor=u'nw', x=490, y=200)
        industry = Label(master, text=INDUSTRY.get(user['industry']) or user['industry'])
        industry.place(anchor=u'nw', x=490, y=225)
        school = Label(master, text=user['school'] if user['school'] else u'--')
        school.place(anchor=u'nw', x=490, y=250)
        position = Label(master, text=POSITION.get(user['position']) or user['position'])
        position.place(anchor=u'nw', x=490, y=275)
        satisfy = IntVar()
        satisfy.set(int(user.get(u'satisfy', -1)))
        satisfied = Radiobutton(master, text=u"满意", variable=satisfy,
                                value=1, command=set_satisfy)
        not_satisfied = Radiobutton(master, text=u"不满意", variable=satisfy,
                                    value=0, command=set_satisfy)
        satisfied.place(anchor=u'nw', x=450, y=10)
        not_satisfied.place(anchor=u'nw', x=510, y=10)
        appearance = IntVar()
        appearance.set(int(user.get(u'appearance', -1)))
        for i in range(1, 11):
            score_i = Radiobutton(master, text=str(i), variable=appearance,
                                  value=i, command=set_appearance)
            score_i.place(anchor=u'nw', x=i * 40 - 30, y=460)


    def handle_previous():
        """Show the previous user."""
        global offset
        if offset <= 0:
            messagebox.showwarning(u'error', u'已经是第一个')
            return  # stay on the first user instead of going to a negative offset
        offset -= 1
        get_user(offset)
        update()


    def handle_next():
        """Show the next user."""
        global offset
        offset += 1
        get_user(offset)
        if not user:
            messagebox.showwarning(u'error', u'已经是最后一个')
            return
        update()


    def add_assembly():
        """Add the widgets."""
        init()
        #buy_house_static = Label(master, text=u'购房: ', fontt=font(size=15))
        buy_house_static = Label(master, font=("15"), text=u'购房: ')
        buy_house_static.place(anchor=u'nw', x=440, y=50)
        buy_car_static = Label(master, font=("15"), text=u'购车: ')
        buy_car_static.place(anchor=u'nw', x=440, y=75)
        age_static = Label(master, font=("15"), text=u'年龄: ')
        age_static.place(anchor=u'nw', x=440, y=100)
        height_static = Label(master, font=("15"), text=u'身高: ')
        height_static.place(anchor=u'nw', x=440, y=125)
        salary_static = Label(master, font=("15"), text=u'工资: ')
        salary_static.place(anchor=u'nw', x=440, y=150)
        education_static = Label(master, font=("15"), text=u'学历: ')
        education_static.place(anchor=u'nw', x=440, y=175)
        company_static = Label(master, font=("15"), text=u'公司: ')
        company_static.place(anchor=u'nw', x=440, y=200)
        industry_static = Label(master, font=("15"), text=u'行业: ')
        industry_static.place(anchor=u'nw', x=440, y=225)
        school_static = Label(master, font=("15"), text=u'学校: ')
        school_static.place(anchor=u'nw', x=440, y=250)
        position_static = Label(master, font=("15"), text=u'职位: ')
        position_static.place(anchor=u'nw', x=440, y=275)
        previous = Button(master, text=u'上一个', command=handle_previous)
        previous.place(anchor=u'nw', x=10, y=490)
        next_button = Button(master, text=u'下一个', command=handle_next)
        next_button.place(anchor=u'nw', x=520, y=490)


    if __name__ == '__main__':
        init_master()
        add_assembly()
        master.mainloop()
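
spider.py writes each crawled user as a whole-document upsert, while mark.py saves scores with `$set`. The plain-Python model below uses a dict in place of the collection. The helper names are hypothetical and this is not pymongo code, so the result is worth checking against a real database: it suggests that a later crawl pass can overwrite a record and drop scores that were already saved.

```python
def replace_upsert(store, key, document):
    """Model of a whole-document upsert: the stored record is overwritten."""
    store[document[key]] = dict(document)


def set_fields(store, key_value, fields):
    """Model of a $set update: only the named fields change."""
    if key_value in store:
        store[key_value].update(fields)


users = {}
replace_upsert(users, "id", {"id": 1, "age": 24})  # first crawl pass
set_fields(users, 1, {"appearance": 8})            # score saved while marking
print(users[1])                                    # the score is present
replace_upsert(users, "id", {"id": 1, "age": 24})  # next crawl pass
print(users[1])                                    # the score is gone
```

If that matters, one option is to finish crawling before scoring, or to have the crawler write its fields with `$set` as well.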
I have not yet modified or debugged train.py, so I have not tried the decision-tree training part.
References: