Web Scraping 1

Latest recommended article published on 2024-09-27 10:11:28

qq_45849275

Article tags: python xpath json

Article link: https://blog.csdn.net/qq_45849275/article/details/103534378

Exceptions, imports, files, web scraping, automation, directories, databases, frameworks
Web scraping
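import requests
from lxml import etree
url = "http://www.baidu.cn"
headers = {}  # e.g. User-Agent; a Referer is also sent as a header
cookies = {}
referer = ""
params = {}  # query-string parameters
proxies = {"http": "http://192.168.10.12:8830"}
# Be very careful that the headers are written correctly, especially any spaces inside them
# Send the request (requests.get has no referer argument, so the Referer goes into headers)
if referer:
    headers["Referer"] = referer
response = requests.get(url=url, headers=headers, cookies=cookies, proxies=proxies, params=params)
# response = requests.post(url, data=data, headers=headers, cookies=cookies, proxies=proxies)
# Receive the response to the request
data = response.text  # body already decoded to str by requests
data = response.content  # raw bytes
# response.text is a str and has no decode() method, so only the bytes are decoded by hand
data = response.content.decode("utf-8")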
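To make the difference between response.text and response.content concrete, here is a minimal sketch that needs no network access. It assumes the server sent UTF-8 bytes; raw_body is a made-up stand-in for response.content and is not part of the original script.

```python
# Hypothetical stand-in for response.content: the raw bytes a server might send.
raw_body = "<title>爬虫</title>".encode("utf-8")

# Bytes must be decoded to get a str; this mirrors response.content.decode("utf-8").
page_text = raw_body.decode("utf-8")

# A str has no decode() method in Python 3, which is why
# calling response.text.decode(...) raises AttributeError.
text_has_decode = hasattr(page_text, "decode")

print(type(raw_body).__name__, type(page_text).__name__, text_has_decode)
```

In practice, response.text is usually enough; decoding response.content by hand is mainly useful when the encoding guessed by requests is wrong.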
# Filter the data
1. Regular expressions
2. beautifulsoup4
3. lxml
4. json
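As a small illustration of the first option, the sketch below pulls image addresses out of an HTML string with the standard-library re module. The sample markup and the pattern are assumptions made up for this example. Regular expressions are fragile on real-world HTML, so this is only a quick approach for simple pages.

```python
import re

# Made-up sample markup, standing in for the downloaded page text.
sample_html = '<div><img src="/a.png"><img src="/b.jpg" alt="b"></div>'

# Capture the value of the src attribute of every <img> tag.
img_pattern = re.compile(r'<img[^>]*\ssrc="([^"]+)"')
img_urls = img_pattern.findall(sample_html)

print(img_urls)
```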

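html = etree.HTML(data)
img_list = html.xpath("//img/@src")  # <img> tags carry their address in src, not href
print(img_list)
for i in img_list:
    img_dict = {"img": i}
    print(img_dict)
# If there are several lists, you can combine them with zip(), e.g. zip(list1, list2)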
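The zip() remark can be sketched as follows. The two lists are made-up stand-ins for what two separate xpath queries over the same page might return; the names img_srcs, img_alts and records are hypothetical.

```python
# Hypothetical results of two separate queries over the same page.
img_srcs = ["/a.png", "/b.jpg"]
img_alts = ["first image", "second image"]

# zip() pairs the items position by position and stops at the shorter list.
records = [{"img": src, "alt": alt} for src, alt in zip(img_srcs, img_alts)]

for record in records:
    print(record)
```

Because zip() stops at the shorter input, it is worth checking that both queries returned the same number of items before pairing them.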

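import json
import requests
from jsonpath import jsonpath
url = ""
referer = ""
headers = {}  # User-Agent:
proxies = {}
cookies = {}

if referer:
    headers["Referer"] = referer  # the Referer is sent as a header
response = requests.get(url, headers=headers, proxies=proxies, cookies=cookies)
data = response.text.strip("fetch(")  # strip the leading callback wrapper
information = data.strip(")")  # strip the closing parenthesis of the wrapper
json_data = json.loads(information)
# If you wrote "import jsonpath" instead, call jsonpath.jsonpath(json_data, "$..img")
img_data = jsonpath(json_data, "$..img")
print(img_data)
for i in img_data or []:  # jsonpath returns False when nothing matches
    print(i)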
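Note that str.strip() removes a set of characters from both ends rather than a literal prefix, so it can eat more than intended. The sketch below shows a more explicit way to unwrap such a response using only the standard library. The wrapper name fetch, the sample payload, and the helper unwrap_jsonp are assumptions made up for illustration.

```python
import json


def unwrap_jsonp(text: str) -> str:
    """Return the text between the first '(' and the last ')'."""
    start = text.index("(") + 1
    end = text.rindex(")")
    return text[start:end]


# Made-up response body in the form callback(<json>).
sample_body = 'fetch({"items": [{"img": "/a.png"}, {"img": "/b.jpg"}]})'

json_data = json.loads(unwrap_jsonp(sample_body))

# Plain-Python equivalent of collecting every "img" value under "items".
img_data = [item["img"] for item in json_data["items"]]
print(img_data)
```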

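That is today's web scraping summary. It is not aimed at beginners, and if you spot any mistakes, please point them out. It is getting late, so I will stop here for today. Tomorrow I will continue with beautifulsoup4, selenium, and appium for automated scraping.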