Common Tools for Data Scraping

Latest recommended article published on 2024-08-26 10:16:46

倪畅

Category: Web crawlers

Article link: https://blog.csdn.net/qq_34838643/article/details/105787301

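What web crawlers are used for:

Building datasets for data analysis and artificial intelligence
Providing seed content for the cold start of a social app
Public opinion monitoring
Competitor monitoring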

Steps for writing a crawler:

Data fetching

Libraries: requests, urllib, pycurl
Tools: curl, wget, httpie

Data analysis
Data storage
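
The three steps above can be sketched with only the Python standard library. This is a minimal illustration, not a complete crawler: the function names `fetch`, `analyze` and `store`, the choice of the page title as the extracted field, and the JSON Lines output format are all assumptions made for the example.

```python
import json
import urllib.request
from html.parser import HTMLParser


def fetch(url, timeout=10):
    """Step 1, fetching: download the raw page and decode it as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def analyze(html):
    """Step 2, analysis: reduce the page to the fields we care about."""
    parser = TitleParser()
    parser.feed(html)
    return {"title": parser.title.strip(), "length": len(html)}


def store(record, path):
    """Step 3, storage: append the record to a JSON Lines file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

A run would chain the steps, for example `store(analyze(fetch("http://www.baidu.com")), "pages.jsonl")`. Only `fetch` touches the network, so `analyze` and `store` can be tried on a local HTML string.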

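How to use the common tools:

1. curl:

Installation:

apt install curl

Installation may fail with an error, possibly because OpenSSL is not installed:

apt install openssl
apt install libssl-dev

Usage:

curl www.baidu.com

The terminal prints the data returned by the server, so curl can be thought of as a browser in the terminal, except that it does not parse or render the response.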

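Options:

Option	Description	Example
-A	Set the User-Agent	curl -A "Chrome" http://www.baidu.com
-X	Send the request with the specified method	curl -X POST http://www.httpbin.org/post
-I	Return only the response headers	curl -I http://www.baidu.com
-d	Send a POST request to the URL with the given data	-d a=1 -d b=2; -d "a=1&b=2"; -d @filename
-O	Download the file and save it under its remote name	curl -O http://www.httpbin.org/image/jpeg
-o	Download the file and save it under the specified name	curl -o fox.jpeg http://www.httpbin.org/image/jpeg
-L	Follow redirects	curl -IL http://www.baidu.com
-H	Set a request header	curl -o image.png -H "accept:image/png" http://www.httpbin.org/image
-k	Allow insecure SSL connections (skip certificate verification)	curl -k https://www.12306.cn
-b	Set cookies	curl -b a=test http://www.httpbin.org/cookies
-v	Show verbose details of the whole connection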
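
When moving from the command line to a script, several of the curl options above have rough counterparts in Python's `urllib`. The sketch below is only an approximation under stated assumptions: `build_request` is a hypothetical helper written for this example, it builds the request without sending it, and it covers only `-A`, `-X`, `-H`, `-b` and `-d`.

```python
import urllib.parse
import urllib.request


def build_request(url, user_agent=None, method=None, data=None,
                  headers=None, cookies=None):
    """Build (but do not send) a request mirroring a few curl options."""
    req = urllib.request.Request(url, method=method)          # like -X
    if user_agent:
        req.add_header("User-Agent", user_agent)              # like -A
    for name, value in (headers or {}).items():
        req.add_header(name, value)                           # like -H
    if cookies:
        cookie = "; ".join(f"{k}={v}" for k, v in cookies.items())
        req.add_header("Cookie", cookie)                      # like -b
    if data is not None:
        # Like -d: form-encode the fields. With data set and no explicit
        # method, urllib sends the request as POST.
        req.data = urllib.parse.urlencode(data).encode("ascii")
    return req


# Roughly: curl -A "Chrome" -b a=test -d a=1 -d b=2 http://www.httpbin.org/post
req = build_request(
    "http://www.httpbin.org/post",
    user_agent="Chrome",
    data={"a": 1, "b": 2},
    cookies={"a": "test"},
)
print(req.get_method(), req.data)
```

Passing the result to `urllib.request.urlopen(req)` would actually send it. A `-I` style request corresponds to `method="HEAD"`.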

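2. wget:

Installation:

apt install wget

Options:

Option	Description	Example
-O	Save the download under the specified file name	wget -O test.png http://www.httpbin.org/image/png
--limit-rate	Limit the download speed to the specified rate	--limit-rate=200k
-c	Resume an interrupted download
-b	Download in the background
-U	Set the User-Agent
--mirror	Mirror a target website
-p	Download all resources the page needs

Example: mirror an entire website to local disk and convert the links in the downloaded pages so that they work for local viewing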

wget -c --mirror -U "Mozilla" -p --convert-links http://doc.python-requests.org
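
To show what resuming with `-c` involves at the HTTP level, here is a small Python sketch of the idea: ask the server only for the bytes that are still missing by sending a `Range` header. It illustrates the technique and is not how wget itself is implemented; the names `resume_request` and `download` are hypothetical.

```python
import os
import urllib.request


def resume_request(url, path, user_agent="Mozilla"):
    """Build a request that asks only for the bytes missing from `path`."""
    req = urllib.request.Request(url)
    req.add_header("User-Agent", user_agent)        # like wget -U
    have = os.path.getsize(path) if os.path.exists(path) else 0
    if have:
        # Like wget -c: request the rest of the file, from byte `have` onward
        req.add_header("Range", f"bytes={have}-")
    return req


def download(url, path, chunk_size=64 * 1024):
    """Fetch `url` into `path`, appending when the server honours Range."""
    req = resume_request(url, path)
    with urllib.request.urlopen(req) as resp:
        # 206 means a partial response; 200 means the whole file was sent
        mode = "ab" if resp.status == 206 else "wb"
        with open(path, mode) as f:
            while chunk := resp.read(chunk_size):
                f.write(chunk)
```

If the local file is already complete, a server will typically answer the range request with status 416, which `urlopen` raises as an `HTTPError`. A fuller version would catch that case.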

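3. httpie

A more feature-rich tool:

Intuitive syntax
Formatted and colorized terminal output
Built-in JSON support
Forms and file uploads
HTTPS, proxies, and authentication
Arbitrary request data
Custom headers
Persistent sessions
Wget-like downloads
Python 2.6, 2.7 and 3.x support
Linux, Mac OS X and Windows support
Plugins
Documentation
Test coverage

Installation

apt install httpie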

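Basic operations

Simulate a form submission
http -f POST yhz.me username=nate

Show the full request and response
http -v yhz.me

Show only the response headers
http -h yhz.me

Show only the response body
http -b yhz.me

Download a file
http -d yhz.me

Send a request with the DELETE method
http DELETE yhz.me

Send JSON data (JSON is the default for data items)
http PUT yhz.me name=nate password=nate_password
For JSON values that are not strings, use := as the separator, for example
http PUT yhz.me name=nate password=nate_password age:=28 a:=true streets:='["a", "b"]'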
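
To make the difference between `=` and `:=` concrete, the following Python sketch builds the JSON body that the data items in the last command describe. It is a simplified model and not httpie's own parser: `build_json_body` is a hypothetical helper, and it handles only `key=value` and `key:=value` items (not headers, query parameters or file fields).

```python
import json


def build_json_body(items):
    """Turn httpie-style data items into the JSON object they describe."""
    body = {}
    for item in items:
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"not a data item: {item!r}")
        if key.endswith(":"):
            # key:=value, the value is raw JSON (number, boolean, list, ...)
            body[key[:-1]] = json.loads(value)
        else:
            # key=value, the value is always sent as a string
            body[key] = value
    return body


body = build_json_body([
    "name=nate",
    "password=nate_password",
    "age:=28",
    "a:=true",
    'streets:=["a", "b"]',
])
print(json.dumps(body))
```

Here `age` becomes a number, `a` a boolean and `streets` a list, while `name` and `password` stay strings.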
 
Simulate a form POST, Content-Type: application/x-www-form-urlencoded; charset=utf-8
http --form POST yhz.me name='nate'
Simulate a form file upload, Content-Type: multipart/form-data
http -f POST example.com/jobs name='John Smith' file@~/test.pdf

Modify request headers, using : as the separator
http yhz.me  User-Agent:Yhz/1.0  'Cookie:a=b;b=c'  Referer:http://yhz.me/

Authentication
http -a username:password yhz.me
http --auth-type=digest -a username:password yhz.me

Use an HTTP proxy
http --proxy=http:http://192.168.1.100:8060 yhz.me
http --proxy=http:http://user:pass@192.168.1.100:8060 yhz.me
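
For comparison, the custom header, basic authentication and proxy examples above could look roughly like this with Python's `urllib`. This is a sketch under assumptions: `basic_auth_header` and `make_opener` are hypothetical helper names, the host, credentials and proxy address are the placeholders from the commands above, and the request is built but deliberately not sent.

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Value of the Authorization header for HTTP Basic authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def make_opener(proxy_url=None):
    """Opener that routes plain-HTTP requests through `proxy_url` if given."""
    handlers = []
    if proxy_url:
        handlers.append(urllib.request.ProxyHandler({"http": proxy_url}))
    return urllib.request.build_opener(*handlers)


req = urllib.request.Request("http://yhz.me")
req.add_header("User-Agent", "Yhz/1.0")                      # custom header
req.add_header("Authorization",
               basic_auth_header("username", "password"))    # like -a
opener = make_opener("http://192.168.1.100:8060")            # like --proxy
# opener.open(req) would send the request through the proxy; it is left
# out here so the sketch runs without network access.
```

Digest authentication, as in `--auth-type=digest`, needs a challenge from the server first, so it cannot be precomputed like the Basic header. In `urllib` it is handled by `urllib.request.HTTPDigestAuthHandler`.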