1. Target website:
url=https://sc.chinaz.com/jianli/free.html
2. Get the URL and name of each resume template
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
url = 'https://sc.chinaz.com/jianli/free.html'
res = requests.get(url, headers=headers)
res.encoding = 'utf-8'  # the page is UTF-8; set it explicitly to avoid garbled text
page_text = res.text
tree = etree.HTML(page_text)  # one tree is enough; no need to parse the page twice
url_list = tree.xpath('//*[@id="container"]/div/a/@href')      # detail-page links
name_list = tree.xpath('//*[@id="container"]/div/a/img/@alt')  # template names
for name, detail_url in zip(name_list, url_list):
    print(name + ':', detail_url)
Note: send a User-Agent header to get past the UA anti-scraping check. XPath is used here to parse the page data and extract the URLs.
To get an XPath expression, right-click the element in the browser's developer tools and choose Copy XPath.
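One caveat: depending on the listing page, the hrefs extracted above may come back relative or protocol-relative (starting with //) rather than as full URLs. That is an assumption about the page markup, not something the original code checks; if it applies, urljoin from the standard library normalizes all the cases uniformly, as in this minimal sketch:

from urllib.parse import urljoin

base = 'https://sc.chinaz.com/jianli/free.html'
# urljoin leaves absolute URLs untouched and completes relative or
# protocol-relative hrefs against the base page
url_list = [urljoin(base, href) for href in url_list]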
3. Get each resume's download URL
Here we send a GET request to each resume's detail-page URL and wait for the remote server to return the page.
res2 = requests.get(detail_url, headers=headers).text
tree = etree.HTML(res2)
# xpath() returns a list; take the first match (raises IndexError if nothing matched)
download_url = tree.xpath('//*[@id="down"]/div[2]/ul/li[1]/a/@href')[0]
print("Download link:", download_url)
Look closely and you'll notice that every download link on the detail page points to the same file, so grabbing just the first one is enough.
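The post only prints the link; if you want to actually save the template archive to disk, here is a minimal sketch. It assumes download_url is a direct link to the file, and the file-naming logic is my own illustration, not part of the original:

# Hedged sketch: download the file behind download_url.
# Assumes download_url is a direct file link (e.g. a .rar archive).
file_data = requests.get(download_url, headers=headers).content
file_name = download_url.split('/')[-1]  # take the file name from the URL itself
with open(file_name, 'wb') as f:
    f.write(file_data)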
4. Complete version
import requests
from lxml import etree
import time

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
url = 'https://sc.chinaz.com/jianli/free.html'
res = requests.get(url, headers=headers)
res.encoding = 'utf-8'
page_text = res.text
tree = etree.HTML(page_text)
url_list = tree.xpath('//*[@id="container"]/div/a/@href')      # detail-page links
name_list = tree.xpath('//*[@id="container"]/div/a/img/@alt')  # template names
for name, detail_url in zip(name_list, url_list):
    print(name + ':', detail_url)
    # fetch the detail page and pull out the first download link
    res2 = requests.get(detail_url, headers=headers).text
    detail_tree = etree.HTML(res2)
    download_url = detail_tree.xpath('//*[@id="down"]/div[2]/ul/li[1]/a/@href')[0]
    print("Download link:", download_url)
    time.sleep(1)  # pause between requests so the site doesn't blacklist your IP
And that's a complete, simple scraper for the resume template listings.
5. Summary
None of the code above has to deal with cookie-based anti-scraping or JavaScript reverse engineering. We also never log in, so no Ajax requests are sent to the server, which means the site has no easy way to confirm whether we're a bot. Finally, remember to scrape just one page of data; don't keep crawling, or the site will blacklist your IP.
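To make the "don't get blacklisted" advice concrete, here is a minimal sketch of a polite fetch helper. The helper name, delay, and timeout values are my own assumptions, not part of the original post:

import time
import requests

def polite_get(session, url, headers, delay=1.0, timeout=10):
    """Hypothetical helper: one GET with a timeout, then a courtesy pause."""
    res = session.get(url, headers=headers, timeout=timeout)
    res.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    time.sleep(delay)       # space requests out so the site isn't hammered
    return res

# usage: reuse one Session so the TCP connection is kept alive between requests
# session = requests.Session()
# res = polite_get(session, url, headers)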