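A milestone practice project that almost everyone learning data mining ends up doing: a crawler for job postings on 51job (前程无忧)!
Once you have this one working, you can move on to crawlers for sites with some anti-scraping measures, for example sites that require you to log in before the server responds any further, or sites that pop up a CAPTCHA every so often. Give it a try if you're interested!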
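One detail of the script that follows is worth seeing in isolation: it passes the search keyword through `parse.quote` twice before putting it into the URL. For a plain ASCII keyword such as `python` this changes nothing, but for a non-ASCII keyword the second pass re-encodes every `%` from the first pass as `%25`. The sketch below only illustrates that behavior; the helper name `double_quote` and the Chinese sample keyword are made up for the example and are not part of the original script, and why the site expects this form is not stated in the source.

```python
from urllib import parse


def double_quote(keyword):
    """Percent-encode a keyword twice (hypothetical helper, for illustration only)."""
    once = parse.quote(keyword)   # non-ASCII characters become %XX escapes of their UTF-8 bytes
    twice = parse.quote(once)     # every '%' produced by the first pass becomes '%25'
    return twice


if __name__ == '__main__':
    # Sample keywords are assumptions chosen to show the ASCII and non-ASCII cases.
    for kw in ('python', '数据挖掘'):
        print(kw, '->', double_quote(kw))
```

Decoding twice with `parse.unquote` recovers the original keyword, which is a quick way to check an encoded URL by hand.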
# -*- coding:utf-8 -*-
import urllib
import re, codecs
import time, random
import requests
from lxml import html
from urllib import parse
key = 'python'
key = parse.quote(parse.quote(key))
headers = {
    'Host': 'search.51job.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
}
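def get_links(page):
    # Build the search-results URL for the given page number.
    url = 'http://search.51job.com/list/000000,000000,0000,00,9,99,' + key + ',2,' + str(page) + '.html'
    # Pass headers by keyword: the second positional argument of requests.get is params.
    r = requests.get(url, headers=headers, timeout=10)
    s = requests.session()
    s.keep_alive = False
    r.encoding = 'gb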