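A practice exercise: a Python crawler working against a PHP backend.
The setup:
- PHP acts as the server.
- Python acts as the crawler.
- The crawler fetches data from one of the server's endpoints.
- The server checks the user-agent: it responds with 403 when the header is missing or is a client library's default, and returns the data otherwise.
- Endpoint data: a few made-up records.
PHP server endpoint code:
Endpoint: xx/paqustuinfo.php?name=siri
It looks up a record by name.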
<?php
// The response body is JSON, so declare it as such.
header("Content-Type: application/json; charset=utf-8");
date_default_timezone_set("PRC");

// Read and trim the "name" query parameter.
$name = isset($_GET['name']) ? $_GET['name'] : "";
$name = trim($name);
if (empty($name) or $name == '""') {
    $err = array('status' => 'error', 'msg' => 'Invalid parameter!');
    echo json_encode($err, JSON_UNESCAPED_UNICODE);
    die();
}

// Made-up sample records.
$arrstuinfo = array(
    array('id' => 1, 'name' => '小王', 'age' => 14, 'sex' => 1),
    array('id' => 2, 'name' => '小李', 'age' => 14, 'sex' => 1),
    array('id' => 3, 'name' => '小明', 'age' => 14, 'sex' => 1),
    array('id' => 4, 'name' => '小爱', 'age' => 14, 'sex' => 1),
    array('id' => 5, 'name' => '小华', 'age' => 14, 'sex' => 1),
    array('id' => 6, 'name' => '小黄', 'age' => 14, 'sex' => 1),
    array('id' => 7, 'name' => 'siri', 'age' => 14, 'sex' => 1),
);

// Default to an error payload; replace it when a record matches.
$currStu = array('status' => 'error', 'msg' => 'No such person!');
foreach ($arrstuinfo as $stu) {
    if ($stu['name'] === $name) {
        $currStu = $stu;
        break;
    }
}
echo json_encode($currStu, JSON_UNESCAPED_UNICODE);
?>
The Python crawler code:
import json

import requests


def show(dic):
    # An error payload carries only 'status' and 'msg'.
    if len(dic) <= 2:
        print(dic['msg'])
        print()
        return
    num = 33
    print("*" * num)
    print("| %s\t" * 4 % ('ID ', 'Name ', 'Age ', 'Sex '), "|", sep="")
    print("-" * num)
    print("| %s\t" % dic['id'], end="")
    print("|%s\t" % dic['name'], end="")
    print("| %s\t" % dic['age'], end="")
    sex = "male" if dic['sex'] == 1 else "female"
    print("| %s\t|" % sex, end="")
    print()
    print("-" * num)
    print()


while True:
    name = input("Name: ")
    # "xx" is a placeholder for the server address.
    url = "xx/paqustuinfo.php?name=%s" % name
    response = requests.get(url)
    data = json.loads(response.text)
    show(data)
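The crawler above builds the query string with plain string formatting. As a small sketch (not part of the original script), the name can be percent-encoded first so that spaces, `&`, or non-ASCII names survive the trip. `BASE_URL` and `build_url` are hypothetical names introduced here, and "xx" is the same placeholder host used in the article.

```python
from urllib.parse import urlencode

# "xx" is the article's placeholder host; replace it with your server address.
BASE_URL = "xx/paqustuinfo.php"


def build_url(name):
    """Return the endpoint URL with `name` percent-encoded in the query string."""
    return BASE_URL + "?" + urlencode({"name": name})


if __name__ == "__main__":
    print(build_url("siri"))   # xx/paqustuinfo.php?name=siri
    print(build_url("小王"))    # xx/paqustuinfo.php?name=%E5%B0%8F%E7%8E%8B
```

With requests, passing the value through the `params` argument, as in `requests.get(base_url, params={"name": name})`, is another way to get the encoding done for you.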
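Running it gives the following result:
As the server side, though, we should not hand data over to a crawler this easily, so the server needs some restrictions. For example, when the server detects that the crawler did not send a browser-like user-agent, it can respond with 403 or return some other message. How do we do that?
The PHP endpoint needs to handle this:
Check the user-agent: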
<?php
header("Content-Type: application/json; charset=utf-8");
date_default_timezone_set("PRC");

// The header may be absent, so fall back to an empty string.
$ua = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : "";
// Deliberately simple check; nothing more elaborate here.
if ($ua === "" or substr($ua, 0, 16) == "python-requests/" or substr($ua, 0, 14) == "Python-urllib/") {
    $err = array('status' => 'error', 'msg' => 'bug bug bug bug');
    http_response_code(403);
    die(json_encode($err, JSON_UNESCAPED_UNICODE));
}

// Read and trim the "name" query parameter.
$name = isset($_GET['name']) ? $_GET['name'] : "";
$name = trim($name);
if (empty($name) or $name == '""') {
    $err = array('status' => 'error', 'msg' => 'Invalid parameter!');
    echo json_encode($err, JSON_UNESCAPED_UNICODE);
    die();
}

// Made-up sample records.
$arrstuinfo = array(
    array('id' => 1, 'name' => '小王', 'age' => 14, 'sex' => 1),
    array('id' => 2, 'name' => '小李', 'age' => 14, 'sex' => 1),
    array('id' => 3, 'name' => '小明', 'age' => 14, 'sex' => 1),
    array('id' => 4, 'name' => '小爱', 'age' => 14, 'sex' => 1),
    array('id' => 5, 'name' => '小华', 'age' => 14, 'sex' => 1),
    array('id' => 6, 'name' => '小黄', 'age' => 14, 'sex' => 1),
    array('id' => 7, 'name' => 'siri', 'age' => 14, 'sex' => 1),
);

// Default to an error payload; replace it when a record matches.
$currStu = array('status' => 'error', 'msg' => 'No such person!');
foreach ($arrstuinfo as $stu) {
    if ($stu['name'] === $name) {
        $currStu = $stu;
        break;
    }
}
echo json_encode($currStu, JSON_UNESCAPED_UNICODE);
?>
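For illustration only, the prefix check in the PHP script above can be mirrored in Python to see which user-agent strings it rejects. This is a sketch, not part of the server: `BLOCKED_PREFIXES` and `is_blocked_user_agent` are hypothetical names, and treating a missing header as blocked is an assumption based on the stated goal of answering 403 when no user-agent is present.

```python
# Illustrative mirror of the PHP prefix check; the real check runs in PHP.
BLOCKED_PREFIXES = ("python-requests/", "Python-urllib/")


def is_blocked_user_agent(ua):
    """Return True when the User-Agent is missing or starts with a client-library prefix."""
    if not ua:
        # Assumption: an absent or empty header is rejected as well.
        return True
    # str.startswith accepts a tuple and is case-sensitive, like the PHP comparison.
    return ua.startswith(BLOCKED_PREFIXES)


if __name__ == "__main__":
    samples = [
        "python-requests/2.28.1",
        "Python-urllib/3.11",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "",
    ]
    for ua in samples:
        print(repr(ua), is_blocked_user_agent(ua))
```

A check this simple only stops clients that keep their library's default header, which is why setting a browser user-agent is enough to get through.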
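The Python crawler sends the request again.
Without a custom user-agent (requests then sends its default python-requests/<version> value):
Note: requests does not raise an exception on an HTTP error status by itself, so we call raise_for_status() to turn the 403 into an error.
Modify the crawler as follows (key part):
    # ....
    response = requests.get(url)
    response.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx status
    data = json.loads(response.text)
    show(data)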
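Checking the result again: the server answers 403, and raise_for_status() raises an HTTPError.
The Python crawler sends the request once more.
With a browser user-agent:
    # ....
    hd = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"}
    response = requests.get(url, headers=hd)  # send the user-agent
    response.raise_for_status()  # no exception this time, the status is 200
    data = json.loads(response.text)
    show(data)
Result: the server returns the data again instead of 403.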
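To try both outcomes without a PHP server, the exchange can be approximated locally. The sketch below makes several assumptions: `StubHandler` is a hypothetical Python stand-in for paqustuinfo.php that only imitates the user-agent check, and urllib from the standard library replaces requests so that nothing extra needs installing. Unlike requests, `urlopen` raises `HTTPError` for a 403 on its own, so the sketch catches it to read the status code.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BLOCKED_PREFIXES = ("python-requests/", "Python-urllib/")
BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36")


class StubHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the PHP endpoint: 403 for library UAs, JSON otherwise."""

    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        if not ua or ua.startswith(BLOCKED_PREFIXES):
            status, body = 403, {"status": "error", "msg": "bug bug bug bug"}
        else:
            status, body = 200, {"id": 7, "name": "siri", "age": 14, "sex": 1}
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_status(url, headers=None):
    """Return the HTTP status code, catching the HTTPError urllib raises on 4xx/5xx."""
    # An empty ProxyHandler keeps the local request away from any configured proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with opener.open(request) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def run_demo():
    """Start the stub server, request it with and without a browser UA, return both codes."""
    server = HTTPServer(("127.0.0.1", 0), StubHandler)  # port 0: the OS picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/paqustuinfo.php?name=siri" % server.server_port
    try:
        default_status = fetch_status(url)  # urllib's default Python-urllib/x.y UA
        browser_status = fetch_status(url, {"User-Agent": BROWSER_UA})
    finally:
        server.shutdown()
        server.server_close()
    return default_status, browser_status


if __name__ == "__main__":
    print(run_demo())  # (403, 200)
```

The first request keeps the library's default user-agent and is rejected; the second sends the browser string and gets the record back.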
---- End ----
For learning purposes only.