Last time I tried to crawl Lagou by simulating its Ajax requests with the requests library, and it failed: the site kept blocking my crawler. This time I will use Selenium to crawl Lagou, collect the Python job listings, and store them in a database.
Finding Lagou and analyzing the page source
The data we need to extract is:
So how do we get to the next page? See here:
After the analysis, let's write code to collect the links.
First, get the links on the first page:
Code:
from selenium import webdriver
from lxml import etree

# Open the Lagou search page for "python"
browser = webdriver.Edge()
browser.get("https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=")

# Parse the rendered page with lxml and pull every job-detail link
page = browser.page_source
html = etree.HTML(page)
urls = html.xpath("//a[@class = 'position_link']/@href")
print(urls)
I did not use Selenium's built-in element lookups here because lxml's XPath queries are noticeably faster.
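To illustrate, here is the same XPath run over a tiny hand-written HTML snippet standing in for browser.page_source; the markup is hypothetical and only mimics Lagou's job-list structure:

```python
from lxml import etree

# Hypothetical markup mirroring Lagou's job-list structure; in the real
# crawler this string comes from browser.page_source.
sample = """
<div>
  <a class="position_link" href="https://www.lagou.com/jobs/1.html">Job 1</a>
  <a class="position_link" href="https://www.lagou.com/jobs/2.html">Job 2</a>
</div>
"""

html = etree.HTML(sample)
# Grab the href of every link whose class is exactly 'position_link'
urls = html.xpath("//a[@class = 'position_link']/@href")
print(urls)
```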
The result:
That completes the first page. Now, how do we reach the second page? And the third?
We have to click the "Next page" button, and while doing so we need page waits, so the program does not throw an exception by trying to read information before the page has finished loading.
The code to locate the "Next page" button:
while True:
    button = browser.find_element_by_xpath("//div[@class = 'pager_container']//span[last()]")
    # Stop when the "next" button is disabled, i.e. we are on the last page
    if "pager_next_disabled" in button.get_attribute("class"):
        break
    button.click()
Key point: the termination condition is that pager_next_disabled appears in the class attribute of the "Next page" button!
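The termination check can be exercised without a browser. The pager markup below is a hypothetical sketch of the last page, where the "next" span carries the pager_next_disabled class (with a plain lxml element we use .get("class") instead of Selenium's get_attribute):

```python
from lxml import etree

# Hypothetical pager markup for the last page of results
pager = """
<div class="pager_container">
  <span class="pager_prev">prev</span>
  <span class="pager_next pager_next_disabled">next</span>
</div>
"""

html = etree.HTML(pager)
# span[last()] selects the "next" button, just like in the crawler
button = html.xpath("//div[@class = 'pager_container']//span[last()]")[0]
is_last_page = "pager_next_disabled" in button.get("class")
print(is_last_page)
```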
The full code for collecting every job's URL:
browser = webdriver.Edge()
browser.implicitly_wait(10)  # wait up to 10s for elements to appear
browser.get(url)
job_urls = []
while True:
    # Collect the job links on the current page
    page = browser.page_source
    html = etree.HTML(page)
    job_url = html.xpath("//a[@class = 'position_link']/@href")
    job_urls.extend(job_url)
    # Click "Next page" until the button is disabled
    button = browser.find_element_by_xpath("//div[@class = 'pager_container']//span[last()]")
    if "pager_next_disabled" in button.get_attribute("class"):
        break
    button.click()
browser.close()
With that, we can collect the URLs of every job posting.
Extracting information from the URLs
Once we have all the URLs, we can write XPath expressions against a single job page and use them to extract each field.
Extraction code:
from selenium import webdriver
from lxml import etree

browser = webdriver.Edge()
browser.implicitly_wait(10)
browser.get("https://www.lagou.com/jobs/2108656.html")
page = browser.page_source
html = etree.HTML(page)
name = html.xpath("//div[@class = 'job-name']//span[@class = 'name']/text()")
salary = html.xpath("//dd[@class = 'job_request']//span[1]/text()")
didian = html.xpath("//dd[@class = 'job_request']//span[2]/text()")   # location
jingyan = html.xpath("//dd[@class = 'job_request']//span[3]/text()")  # experience
xueli = html.xpath("//dd[@class = 'job_request']//span[4]/text()")    # education
zhiwei = html.xpath("//dd[@class = 'job_request']//span[5]/text()")   # job type
job_advantage = html.xpath("//dd[@class = 'job-advantage']//p/text()")
# The description comes back as many text fragments; strip and join them
job_detail = html.xpath("//div[@class = 'job-detail']//text()")
job_detail_new = ",".join(map(lambda x: x.strip(", \n"), job_detail)).strip(",")
print(job_detail_new)
# Strip whitespace and drop the "查看地图" (view map) link text
work_addr = html.xpath("//div[@class = 'work_addr']//text()")
work_addr_new = "".join(map(lambda x: x.strip(), work_addr)).replace("查看地图", "")
Honestly, the information I extract still contains a lot of noise; if you are interested, try making the fields cleaner, and feel free to comment! With this, we can extract all the information from one job page.
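As one small cleanup illustration, here is the strip/join step run on made-up text fragments (the input list is hypothetical), plus a variant that drops the empty fragments responsible for some of the noise:

```python
# Hypothetical text fragments, similar to what
# //div[@class = 'job-detail']//text() returns (whitespace nodes included)
job_detail = ["\n  ", "1. Solid Python basics,\n", "  ", "2. Familiar with crawlers\n", " "]

# The cleanup used in the post: strip each fragment, join with commas.
# Empty fragments survive as stray double commas.
job_detail_new = ",".join(map(lambda x: x.strip(", \n"), job_detail)).strip(",")
print(job_detail_new)

# A slightly tighter variant that filters out empty fragments first
job_detail_clean = ",".join(s for s in (x.strip(", \n") for x in job_detail) if s)
print(job_detail_clean)
```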
Putting it together: saving to the database
I used Navicat beforehand to create a database (lagou) and a table (job), as shown below:
Note: when creating the table, think carefully about the data types!
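For reference, here is one possible schema sketch for the job table. The column names match the job_dict keys used later, but the types are my assumptions (VARCHAR for short fields, TEXT for long descriptions), not the exact Navicat setup:

```python
# A guessed schema for the `job` table; column names follow job_dict,
# column types are assumptions. Run once with cur.execute(create_job_table).
create_job_table = """
CREATE TABLE IF NOT EXISTS job (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100),
    salary VARCHAR(50),
    didian VARCHAR(50),
    jingyan VARCHAR(50),
    xueli VARCHAR(50),
    zhiwei VARCHAR(50),
    job_advantage TEXT,
    job_detail TEXT,
    work_addr VARCHAR(255)
)
"""
print(create_job_table)
```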
The code for saving to the database:
import pymysql

conn = pymysql.connect(host="localhost", user="root", password="yanzhiguo140710", port=3306, db="lagou")
cur = conn.cursor()
# Build a parameterized INSERT from the dict's keys
keys = ",".join(job_dict.keys())
values = ",".join(['%s'] * len(job_dict))
sql = "insert into job ({keys}) values ({values})".format(keys=keys, values=values)
cur.execute(sql, tuple(job_dict.values()))
conn.commit()
conn.close()
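To see what the dynamically built statement looks like, here is the same key/placeholder assembly run on a small sample dict (the values are made up):

```python
# Sample record standing in for the real job_dict returned by the crawler
job_dict = {"name": "Python Engineer", "salary": "15k-30k", "didian": "Beijing"}

# Column list and one %s placeholder per column
keys = ",".join(job_dict.keys())
values = ",".join(['%s'] * len(job_dict))
sql = "insert into job ({keys}) values ({values})".format(keys=keys, values=values)
print(sql)  # insert into job (name,salary,didian) values (%s,%s,%s)
```

Passing the values separately via cur.execute(sql, tuple(job_dict.values())) lets the driver escape them, instead of interpolating them into the SQL string.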
Summary
That basically completes everything. The final full code:
from selenium import webdriver
from lxml import etree
import pymysql

# Collect the URLs of all job postings
def get_URLS(url):
    browser = webdriver.Edge()
    browser.implicitly_wait(10)
    browser.get(url)
    job_urls = []
    while True:
        page = browser.page_source
        html = etree.HTML(page)
        job_url = html.xpath("//a[@class = 'position_link']/@href")
        job_urls.extend(job_url)
        # Click "Next page" until the button is disabled
        button = browser.find_element_by_xpath("//div[@class = 'pager_container']//span[last()]")
        if "pager_next_disabled" in button.get_attribute("class"):
            break
        button.click()
    browser.close()
    return job_urls

# Extract the fields from a single job page
def getJob(job_url):
    browser = webdriver.Edge()
    browser.implicitly_wait(10)
    browser.get(job_url)
    page = browser.page_source
    html = etree.HTML(page)
    name = html.xpath("//div[@class = 'job-name']//span[@class = 'name']/text()")[0]
    salary = html.xpath("//dd[@class = 'job_request']//span[1]/text()")[0]
    didian = html.xpath("//dd[@class = 'job_request']//span[2]/text()")[0]   # location
    jingyan = html.xpath("//dd[@class = 'job_request']//span[3]/text()")[0]  # experience
    xueli = html.xpath("//dd[@class = 'job_request']//span[4]/text()")[0]    # education
    zhiwei = html.xpath("//dd[@class = 'job_request']//span[5]/text()")[0]   # job type
    job_advantage = html.xpath("//dd[@class = 'job-advantage']//p/text()")[0]
    job_detail = html.xpath("//div[@class = 'job-detail']//text()")
    job_detail_new = ",".join(map(lambda x: x.strip(", \n"), job_detail)).strip(",")
    work_addr = html.xpath("//div[@class = 'work_addr']//text()")
    work_addr_new = "".join(map(lambda x: x.strip(), work_addr)).replace("查看地图", "")
    job_dict = {
        "name": name,
        "salary": salary,
        "didian": didian,
        "jingyan": jingyan,
        "xueli": xueli,
        "zhiwei": zhiwei,
        "job_advantage": job_advantage,
        "job_detail": job_detail_new,
        "work_addr": work_addr_new
    }
    browser.close()
    return job_dict

# Save one record to the database
def push_data(job_dict):
    conn = pymysql.connect(host="localhost", user="root", password="yanzhiguo140710", port=3306, db="lagou")
    cur = conn.cursor()
    keys = ",".join(job_dict.keys())
    values = ",".join(['%s'] * len(job_dict))
    sql = "insert into job ({keys}) values ({values})".format(keys=keys, values=values)
    cur.execute(sql, tuple(job_dict.values()))
    conn.commit()
    conn.close()

if __name__ == '__main__':
    url = "https://www.lagou.com/jobs/list_python?city=全国&cl=false&fromSearch=true&labelWords=&suginput="
    job_urls = get_URLS(url)
    for item in job_urls:
        push_data(getJob(item))
The result in the database looks like this:
And that completes the project!
As always, if you are interested, feel free to reach out and discuss!