Today I scraped HDU (杭电) next door. HDU's graduate-advisor site feels much worse-built than 理工's. 理工's layout is nicer: every advisor gets their own URL. At HDU, one URL corresponds to a whole school, with all of that school's advisors dumped onto a single page, and the markup around each advisor's fields (name, email, phone, etc.) follows no discernible pattern, which is rough for a newbie who just learned XPath. After fiddling for half a morning I gave up and decided to simply scrape every advisor in each school as one big blob.
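The trick that makes this workable is lxml's XPath string() function: it flattens an entire subtree into one text blob, so you don't need per-field selectors at all. A minimal sketch on a made-up HTML fragment (the markup, names, and values here are invented, not HDU's real page):

```python
from lxml import etree

# Toy fragment mimicking the messy advisor markup: loose <p> tags
# with no per-field classes or ids (hypothetical sample HTML).
html = """
<div id="list">
  <p>Name: Zhang San</p>
  <p>Email: zhangsan@example.com</p>
  <p>Tel: 0571-12345678</p>
</div>
"""
tree = etree.HTML(html)

# string(...) collapses the whole node into one text blob, so you can
# grab everything even when individual fields have no stable selectors.
blob = tree.xpath('string(//*[@id="list"])')
print(' '.join(blob.split()))
```

This is exactly why the script below settles for one string per school instead of clean name/email/phone fields.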
Here's the code:
import time

import requests
from lxml import etree
from pymongo import MongoClient


def download(url):
    """Fetch a page (politely, with a 1s delay) and return a parsed lxml tree."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
    time.sleep(1)
    r = requests.get(url, headers=headers)
    r.encoding = 'utf-8'
    return etree.HTML(r.text)


def write_in_sql(st, xueyuan):
    """Store one school's advisor blob as a single MongoDB document."""
    client = MongoClient()
    db = client.hangdian_yandao_data
    collection = db.xueyuan_imf
    collection.insert_one({xueyuan: st})
    print('Downloading: ' + xueyuan)
    print(st)


def spider_deep(url, num, xueyuan):
    """Scrape the <p> entries of one school's page into a single string."""
    lines = []  # don't shadow the built-in `list`
    selector = download(url)
    for i in range(1, num):
        name = selector.xpath('string(//*[@id="wrapper"]/div[3]/div/div/div[2]/div/div/div[1]/div/div[2]/div/p[{}])'.format(i))
        lines.append(name)
    st = '\n'.join(lines)
    write_in_sql(st, xueyuan)


def spider_url():
    """Collect the per-school URLs from the list page, then scrape each school."""
    x_url = []
    url = 'http://mechanical.hdu.edu.cn/53/list.htm'
    selector = download(url)
    add = selector.xpath('//*[@target="_blank"]/@href')[0:-1]  # keep all but the last link
    for i in add:
        x_url.append('http://mechanical.hdu.edu.cn' + i)
    page = [115, 55, 54, 97, 38, 16]  # <p> counts per school page
    xueyuan = ['机械制造', '机械设计', '海洋工程', '机电工程', '车辆工程', '磁性材料']
    for i in range(0, 6):
        spider_deep(x_url[i], page[i], xueyuan[i])


spider_url()
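Since each school is stored as one unstructured blob, any structure has to be recovered afterwards. If I later want just the emails or phone numbers, plain regexes over the stored string should catch most of them. A sketch on invented sample text (the names, addresses, and numbers below are made up):

```python
import re

# Hypothetical blob in the shape spider_deep produces: one field per line.
st = """Zhang San
Email: zhangsan@hdu.edu.cn
Tel: 0571-86915000
Li Si
Email: lisi@hdu.edu.cn"""

# Recover whatever structure we can after the fact.
emails = re.findall(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', st)  # rough email pattern
phones = re.findall(r'\d{3,4}-\d{7,8}', st)               # area code + number

print(emails)
print(phones)
```

Good enough for a quick pass; a stricter pattern would be needed if the pages mix in other dashed numbers.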
Note that the code performs MongoDB operations, so before running it you need a MongoDB server up and reachable (start mongod from the command line first).
Say no more. Next target: advisor info for every university in Xiasha.
Charge!