Python in Practice Series (2): Scraping Data with Python (Part 1)

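###Goal of this series:

The plan is to use the huatian-funny project on GitHub to scrape registered-user data from the Huatian site (花田网) with Python, run a small experiment, and then upload my modified version of huatian-funny.

The huatian-funny page gives the project's description: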

[Screenshots of the huatian-funny project description]


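###1. Preparation
####Requirements:
requests>=2.7.0, pymongo>=3.2.2, matplotlib>=1.4.3, Pillow>=3.2.0
####(1) Install requests (2.7.0 or later)
requests is an HTTP client library for Python.
It can be installed from source, or with pip or easy_install. With pip:

>pip install requests

The pip output shows that the installed version is 2.10.0.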
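
To confirm that an installed version really satisfies a minimum such as requests>=2.7.0, the version numbers have to be compared as numbers, not as strings ("2.10.0" sorts before "2.7.0" as text). The sketch below is one way to do that. It assumes Python 3.8 or later for `importlib.metadata` (newer than the Python 3.4 used in this post), and the helper names `parse_version`, `meets_minimum`, and `check` are made up for this illustration.

```python
from importlib import metadata


def parse_version(text):
    """Turn '2.10.0' into (2, 10, 0); only the leading digits of each part count."""
    parts = []
    for piece in text.split('.'):
        digits = ''
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two dotted versions numerically, padding the shorter with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check(package, minimum):
    """Return a one-line status string for one requirement."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return '{}: not installed'.format(package)
    status = 'ok' if meets_minimum(installed, minimum) else 'too old'
    return '{} {} ({})'.format(package, installed, status)


# The four requirements listed above.
for name, minimum in [('requests', '2.7.0'), ('pymongo', '3.2.2'),
                      ('matplotlib', '1.4.3'), ('Pillow', '3.2.0')]:
    print(check(name, minimum))
```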

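####(2) Install matplotlib
See section 4, "Installing matplotlib", of part (1) of this series, "Python in Practice: Preparation". The steps are not repeated here.
####(3) Install Pillow
>pip install pillow

####(4) Install MongoDB

Download the installer from the MongoDB download page.
When the download finishes, run mongodb-win32-x86_64-2008plus-ssl-3.2.6-signed.msi and accept the defaults all the way through.
By default MongoDB is installed under C:\Program Files\MongoDB.
On Windows, MongoDB's default data directory is C:\data\db, and it must be created in advance.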
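
The data directory can be created by hand in Explorer, but since the rest of this series is in Python, here is a minimal sketch that does the same thing from a script. The function name `ensure_data_dir` is made up for this example; the default path is the one named above.

```python
import os


def ensure_data_dir(path=r'C:\data\db'):
    """Create the directory (and any missing parents); do nothing if it exists."""
    os.makedirs(path, exist_ok=True)
    return os.path.isdir(path)


# Usage on the Windows machine that will run mongod.exe:
#     ensure_data_dir()                 # creates C:\data\db
#     ensure_data_dir(r'D:\mongo\db')   # hypothetical custom location
```

`exist_ok=True` makes the call safe to repeat, so it can sit at the top of a setup script.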

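####· Start the mongod service: double-click mongod.exe, or start it with extra options:

mongod.exe --journal --rest

To use a directory other than the default C:\data\db, pass the --dbpath option when starting the server, for example:

mongod.exe --dbpath yourpath
Startup options include:
--dbpath: the database directory;
--logpath: the log file to write to;
--journal: enables journaling;
--rest: allows clients to access the MongoDB server through the REST API;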
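
If the server is launched from a script, the options above translate directly into an argument list. The following is only a sketch: the function name `mongod_command` and the example paths are made up, and it assumes mongod.exe is on the PATH.

```python
def mongod_command(dbpath=None, logpath=None, journal=False, exe='mongod.exe'):
    """Build the argument list for starting mongod with the options described above."""
    args = [exe]
    if dbpath:
        args += ['--dbpath', dbpath]
    if logpath:
        args += ['--logpath', logpath]
    if journal:
        args.append('--journal')
    return args


# Hypothetical paths, for illustration only.
print(mongod_command(dbpath=r'D:\mongo\db', journal=True))

# To actually start the server from Python:
#     import subprocess
#     server = subprocess.Popen(mongod_command(dbpath=r'D:\mongo\db'))
```

Passing a list instead of one command string avoids quoting problems when a path contains spaces, such as C:\Program Files\MongoDB.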

After startup, the last line in the command window shows that the server is waiting for connections.
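
Besides reading the console, you can check from Python whether anything is accepting connections on MongoDB's port (27017, as the server window reports). This sketch uses only the standard library; the function name `port_open` is made up for this example, and a successful TCP connection only shows that something is listening, not that it is MongoDB.

```python
import socket


def port_open(host='127.0.0.1', port=27017, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print('mongod reachable:', port_open())
```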

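####· Connect

Double-click mongo.exe, or open another command prompt and run mongo.exe to connect to the database.

A few commands to try (look up others as needed):

show dbs
show databases
#list all databases

Back in the mongod.exe window, the connection count has now become 1.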

####(5) Install pymongo
The data scraped by the crawler is stored in MongoDB through pymongo, the Python driver for MongoDB.
Install pymongo:

>pip install pymongo

Upgrade pymongo:

>pip install --upgrade pymongo
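
The scripts later in this post import a `mongo_collection` object from the project's `extension` module, which is not reproduced here. As a rough idea of what such an object could look like, here is a sketch. The database name `huatian` and collection name `users` are guesses for illustration, not the project's actual names, and `mongo_uri` and `open_collection` are made-up helpers.

```python
def mongo_uri(host='localhost', port=27017):
    """Build a connection string for a MongoDB server without authentication."""
    return 'mongodb://{}:{}/'.format(host, port)


def open_collection(db_name='huatian', collection_name='users',
                    host='localhost', port=27017):
    """Return a pymongo collection; db_name and collection_name are hypothetical."""
    # Imported here so that mongo_uri() can be used even without pymongo installed.
    from pymongo import MongoClient
    client = MongoClient(mongo_uri(host, port))
    return client[db_name][collection_name]


print(mongo_uri())  # mongodb://localhost:27017/
```

MongoDB creates the database and collection on the first write, so nothing has to be set up in advance.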

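####(6) Install a MongoDB GUI tool: Robomongo
Robomongo is a GUI management tool for MongoDB.
Download it from the Robomongo site; I downloaded robomongo-0.9.0-rc8-windows-x86_64-c113244.exe. Double-click it and choose an installation directory (mine is D:\softwares_diy\Robomongo 0.9.0-RC8\). The installer has only a few steps; at the end, choose to run Robomongo immediately. In the window that opens, click Create to set up a new connection. Before clicking Test, make sure the mongod service is running (that is, mongod.exe has been started).

The last line in the mongod window shows that it is waiting for connections on port 27017. Go back to Robomongo and click Test.

The connection succeeds. If you are connecting to a local MongoDB, just click "Close" and then "Save".
In the Robomongo main window, choose File -> Connect to see the connection just created.

Select the connection and click "Connect" to manage it.

If MongoDB is not running locally, connect through SSH instead by entering the IP address, user name, and password.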


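###2. Scraping the data
All the required components are now installed, and the MongoDB connection is open.

Download the huatian-funny project from GitHub, unzip it, and put it in a directory; mine is D:\pythonExperiments\huatian-funny-master.

My changes:

  • spider.py and mark.py
    My environment is Python 3.4, while the project's author used Python 2.x. The two differ in some syntax and library names, so I made small changes to spider.py, mark.py, and the other .py files so that they run correctly.

  • The author's spider.py finishes a single crawl quickly and then stops. After my change, spider.py runs automatically every 5 minutes, so it keeps collecting data.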
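
To give a feel for the kind of Python 2 to Python 3 changes involved, here is a small sketch of the renames that matter for these scripts. The original Python 2 source is not shown in this post, so the "Python 2" column reflects the standard library of that version, not necessarily the author's exact lines.

```python
# Python 2                               Python 3
# --------------------------------       --------------------------------
# print 'text'                           print('text')
# urllib.urlencode(data)                 urllib.parse.urlencode(data)
# urllib2.urlopen(url)                   urllib.request.urlopen(url)
# import Tkinter, tkMessageBox           import tkinter, tkinter.messagebox

from urllib.parse import urlencode

# Form fields in the style of the search request used by spider.py.
fields = [('province', '2'), ('city', '1'), ('age', '22-23'), ('condition', '1')]
encoded = urlencode(fields)
print(encoded)  # province=2&city=1&age=22-23&condition=1
```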

The modified spider.py, the scraping program:

# -*- coding=utf-8 -*-
import os
import urllib.parse

from apscheduler.schedulers.blocking import BlockingScheduler
from requests import Session

from extension import mongo_collection

session = Session()
LOGIN_HEADERS = {
    'Host': 'reg.163.com',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,'
              'image/webp,*/*;q=0.8',
    'Origin': 'http://love.163.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/49.0.2623.110 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'http://love.163.com/',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4',
    'Cookie': '_ntes_nnid=d53195032b58604628528cd6a374d63f,1460206631682; '
              '_ntes_nuid=d53195032b58604628528cd6a374d63f',
}
SEARCH_HEADERS = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Host': 'love.163.com',
    'Origin': 'http://love.163.com',
    'Pragma': 'no-cache',
    'Referer': 'http://love.163.com/search/user',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/49.0.2623.110 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}


def login():
    """Log in to Huatian."""
    data = {
        'username': 'yourEmailaddress@163.com',
        'password': 'yourPassword',
        'url': 'http://love.163.com/?checkUser=1&vendor=love.pLogin',
        'product': 'ht',
        'type': '1',
        'append': '1',
        'savelogin': '1',
    }
    response = session.post('https://reg.163.com/logins.jsp',
                            headers=LOGIN_HEADERS, data=urllib.parse.urlencode(data))
    assert response.ok


def search():
    """Search by Shanghai district and age range."""
    for city in range(1, 20):
        for age in range(22, 27, 2):
            data = {
                'province': '2',
                'city': str(city),
                'age': '{}-{}'.format(age, age + 1),
                'condition': '1',
            }
            response = session.post('http://love.163.com/search/user/list',
                                    headers=SEARCH_HEADERS, data=urllib.parse.urlencode(data))
            if not response.ok:
                # format() must be called on the string, not on print()'s return value
                print('city:{} age:{} failed'.format(city, age))
                continue

            users = response.json()['list']
            for user in users:
                # Insert the user, or replace the stored record with the same id.
                # replace_one() needs pymongo >= 3.0; the older update() was removed in pymongo 4.
                mongo_collection.replace_one({'id': user['id']}, user, upsert=True)

def loginAndSearch():
    login()
    search()


if __name__ == '__main__':
    # Run every 5 minutes; change the interval as needed.
    scheduler = BlockingScheduler()
    scheduler.add_job(loginAndSearch, 'interval', minutes=5)
    print('Press Ctrl+{0} to exit'.format('Pause/Break' if os.name == 'nt' else 'C'))
    try:
        scheduler.start()
    except (KeyboardInterrupt, SystemExit):
        scheduler.shutdown()
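
To see how much work one scheduled run does, it helps to list the form payloads that the two nested loops in search() produce. The sketch below reproduces only that loop logic, without any network access; the function name `search_payloads` is made up for this illustration.

```python
def search_payloads(cities=range(1, 20), ages=range(22, 27, 2)):
    """Yield one form payload per (city, age range) pair, as search() builds them."""
    for city in cities:
        for age in ages:
            yield {
                'province': '2',
                'city': str(city),
                'age': '{}-{}'.format(age, age + 1),
                'condition': '1',
            }


payloads = list(search_payloads())
print(len(payloads))                                # 57
print(sorted({p['age'] for p in payloads}))         # ['22-23', '24-25', '26-27']
```

So each run sends 19 x 3 = 57 search requests, covering city codes 1 to 19 and the age ranges 22-23, 24-25, and 26-27. Because the results are written with an upsert keyed on the user's id, repeating the same searches every 5 minutes refreshes existing records and does not create duplicates.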

The modified mark.py, the subjective scoring program:

# -*- coding=utf-8 -*-
"""Scoring program."""

import io
from urllib import request
from tkinter import messagebox, Tk, font, Label, Button, Radiobutton, IntVar
from PIL import Image, ImageTk
from extension import mongo_collection, BUY_HOUSE, BUY_CAR,\
    EDUCATION, INDUSTRY, SALARY, POSITION

master = None
tk_image = None

offset = 0
user, photo, url, buy_house, buy_car, age, height, salary, education, company, \
industry, school, position, satisfy, appearance = [None for i in range(15)]


def get_user(offset=0):
    """Read one user record from MongoDB."""
    global user
    user = mongo_collection.find_one({}, skip=offset, limit=1, sort=[('url', -1)])


def init_master():
    """Initialize the main window."""
    global master
    master = Tk()
    master.title(u'花田')
    master.geometry(u'630x530')
    master.resizable(width=False, height=False)


def place_image(image_ur):
    """Fetch the user's avatar image."""
    global tk_image
    image_bytes = request.urlopen(image_ur).read()
    data_stream = io.BytesIO(image_bytes)
    pil_image = Image.open(data_stream)
    tk_image = ImageTk.PhotoImage(pil_image)


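def set_appearance():
    """Save the appearance score."""
    # update_one() needs pymongo >= 3.0; the older update() was removed in pymongo 4.
    mongo_collection.update_one({'url': user['url']},
                                {'$set': {'appearance': appearance.get()}})


def set_satisfy():
    """Save the satisfied / not satisfied choice."""
    mongo_collection.update_one({'url': user['url']},
                                {'$set': {'satisfy': satisfy.get()}})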


def update():
    """Refresh the page."""
    global user, offset, photo, url, buy_house, buy_car, age, height, salary, \
        education, company, industry, school, position, satisfy, appearance
    image_url = u'{}&quality=85&thumbnail=410y410'.format(user['avatar'])
    place_image(image_url)

    print (offset)

    photo['image'] = tk_image
    url['text'] = user['url']
    buy_house['text'] = BUY_HOUSE.get(user['house']) or user['house']
    buy_car['text'] = BUY_CAR.get(user['car']) or user['car']
    age['text'] = user['age']
    height['text'] = user['height']
    salary['text'] = SALARY.get(user['salary']) or user['salary']
    education['text'] = EDUCATION.get(user['education']) or user['education']
    company['text'] = user['company'] if user['company'] else u'--'
    industry['text'] = INDUSTRY.get(user['industry']) or user['industry']
    school['text'] = user['school'] if user['school'] else u'--'
    # Update the label text; assigning to `position` itself would replace the Label.
    position['text'] = POSITION.get(user['position']) or user['position']

    satisfy.set(int(user.get(u'satisfy', -1)))
    appearance.set(int(user.get(u'appearance', -1)))


def init():
    """Build the page."""
    global user, offset, photo, url, buy_house, buy_car, age, height, salary, \
        education, company, industry, school, position, satisfy, appearance
    get_user(offset)
    image_url = u'{}&quality=85&thumbnail=410y410'.format(user['avatar'])
    place_image(image_url)
    photo = Label(master, image=tk_image)
    photo.place(anchor=u'nw', x=10, y=40)
    #url = Label(master, text=user['url'],font=Font(size=20, weight='bold'))
    url = Label(master, font=("20"), text=user['url'])
    url.place(anchor=u'nw', x=10, y=5)
    buy_house = Label(master, text=BUY_HOUSE.get(user['house']) or user['house'])
    buy_house.place(anchor=u'nw', x=490, y=50)
    buy_car = Label(master, text=BUY_CAR.get(user['car']) or user['car'])
    buy_car.place(anchor=u'nw', x=490, y=75)
    age = Label(master, text=user['age'])
    age.place(anchor=u'nw', x=490, y=100)
    height = Label(master, text=user['height'])
    height.place(anchor=u'nw', x=490, y=125)
    salary = Label(master, text=SALARY.get(user['salary']) or user['salary'])
    salary.place(anchor=u'nw', x=490, y=150)
    education = Label(master, text=EDUCATION.get(user['education']) or user['education'])
    education.place(anchor=u'nw', x=490, y=175)
    company = Label(master, text=user['company'] if user['company'] else u'--')
    company.place(anchor=u'nw', x=490, y=200)
    industry = Label(master, text=INDUSTRY.get(user['industry']) or user['industry'])
    industry.place(anchor=u'nw', x=490, y=225)
    school = Label(master, text=user['school'] if user['school'] else u'--')
    school.place(anchor=u'nw', x=490, y=250)
    position = Label(master, text=POSITION.get(user['position']) or user['position'])
    position.place(anchor=u'nw', x=490, y=275)

    # Radio buttons: 满意 = satisfied, 不满意 = not satisfied.
    satisfy = IntVar()
    satisfy.set(int(user.get(u'satisfy', -1)))
    satisfied = Radiobutton(master, text=u"满意", variable=satisfy,
                            value=1, command=set_satisfy)
    not_satisfied = Radiobutton(master, text=u"不满意", variable=satisfy,
                                value=0, command=set_satisfy)
    satisfied.place(anchor=u'nw', x=450, y=10)
    not_satisfied.place(anchor=u'nw', x=510, y=10)

    appearance = IntVar()
    appearance.set(int(user.get(u'appearance', -1)))
    for i in range(1, 11):
        score_i = Radiobutton(master, text=str(i), variable=appearance,
                              value=i, command=set_appearance)
        score_i.place(anchor=u'nw', x=i * 40 - 30, y=460)


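def handle_previous():
    """Previous user."""
    global offset
    if offset <= 0:
        messagebox.showwarning(u'error', u'已经是第一个')  # "Already at the first user"
        return

    offset -= 1
    get_user(offset)
    update()


def handle_next():
    """Next user."""
    global offset

    offset += 1
    get_user(offset)
    if not user:
        # Step back and reload so `user` stays on the last valid record.
        offset -= 1
        get_user(offset)
        messagebox.showwarning(u'error', u'已经是最后一个')  # "Already at the last user"
        return
    update()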


def add_assembly():
    """Add the widgets."""
    init()

    # Static captions, top to bottom: house, car, age, height, salary,
    # education, company, industry, school, position.
    buy_house_static = Label(master, font=("15"), text=u'购房: ')
    buy_house_static.place(anchor=u'nw', x=440, y=50)
    buy_car_static = Label(master, font=("15"), text=u'购车: ')
    buy_car_static.place(anchor=u'nw', x=440, y=75)
    age_static = Label(master, font=("15"), text=u'年龄: ')
    age_static.place(anchor=u'nw', x=440, y=100)
    height_static = Label(master, font=("15"), text=u'身高: ')
    height_static.place(anchor=u'nw', x=440, y=125)
    salary_static = Label(master, font=("15"), text=u'工资: ')
    salary_static.place(anchor=u'nw', x=440, y=150)
    education_static = Label(master, font=("15"), text=u'学历: ')
    education_static.place(anchor=u'nw', x=440, y=175)
    company_static = Label(master, font=("15"), text=u'公司: ')
    company_static.place(anchor=u'nw', x=440, y=200)
    industry_static = Label(master, font=("15"), text=u'行业: ')
    industry_static.place(anchor=u'nw', x=440, y=225)
    school_static = Label(master, font=("15"), text=u'学校: ')
    school_static.place(anchor=u'nw', x=440, y=250)
    position_static = Label(master, font=("15"), text=u'职位: ')
    position_static.place(anchor=u'nw', x=440, y=275)
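    # Navigation buttons: 上一个 = previous, 下一个 = next.
    previous = Button(master, text=u'上一个', command=handle_previous)
    previous.place(anchor=u'nw', x=10, y=490)
    # Named next_button so the built-in next() is not shadowed.
    next_button = Button(master, text=u'下一个', command=handle_next)
    next_button.place(anchor=u'nw', x=520, y=490)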


if __name__ == '__main__':
    init_master()
    add_assembly()
    master.mainloop()
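
mark.py repeats two display patterns many times: `MAPPING.get(code) or code` for coded fields, and `value if value else u'--'` for free-text fields. As a sketch of how they could be folded into one helper, consider the following. The helper `display_value` and the sample mapping are made up for this example; the real BUY_HOUSE, SALARY, and other mappings live in the project's extension module and are not shown in this post.

```python
def display_value(mapping, code, placeholder=u'--'):
    """Return the label for a code, the raw code if unmapped, or a placeholder if empty."""
    if code is None or code == '':
        return placeholder
    return mapping.get(code) or code


# Hypothetical mapping, for illustration only.
SAMPLE_EDUCATION = {1: 'bachelor', 2: 'master'}

print(display_value(SAMPLE_EDUCATION, 2))    # master
print(display_value(SAMPLE_EDUCATION, 9))    # 9
print(display_value({}, ''))                 # --
```

Falling back to the raw code keeps unknown values visible in the window, which makes gaps in a mapping easy to spot while scoring.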

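I have not yet modified or debugged train.py, so I have not tried the decision-tree training part.

References:

  1. MongoDB与PyMongo的安装(Linux/Windows XP) (Installing MongoDB and PyMongo on Linux/Windows XP)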