[Web scraping] A summary of requests / requests-html / curl / chrome js / wget

长虹剑

Last modified on 2023-03-28 16:19:36

First published on 2020-11-05 22:34:14

Article link: https://blog.csdn.net/hongmaodaxia/article/details/109521924

Contents

wget
- Common options
curl
chrome js
requests
- JSON submission, cookies, headers
- Recursively crawling SVN
requests-html

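Compared with a Python program, curl is actually quite fast, but it is less convenient to write than Python.
Below are examples of the tools I commonly use when scraping.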

wget

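wget is mainly a download tool. It can send simple POST requests with --post-data or --post-file, but it cannot do multipart file uploads, so curl is the better choice for submitting data.

Common options

--load-cookies: load cookies from a file, e.g. wget --load-cookies=$fcookie
-c: resume an interrupted download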

wget --no-check-certificate --load-cookies=$fcookie -O $file $url
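
The -c option resumes a download by asking the server only for the bytes that are still missing. Below is a rough Python sketch of that idea, assuming the server honors HTTP Range requests. The helper name `resume_headers` and the file name are hypothetical, and a temporary file stands in for a partial download.

```python
import os
import tempfile

def resume_headers(path):
    """Return the HTTP headers needed to resume a partial download of `path`."""
    # If nothing has been downloaded yet, request the whole file.
    if not os.path.exists(path):
        return {}
    done = os.path.getsize(path)
    if done == 0:
        return {}
    # Ask only for the bytes after what is already on disk.
    return {"Range": f"bytes={done}-"}

# Demo on a temporary file standing in for a partial download
with tempfile.TemporaryDirectory() as tmp:
    part = os.path.join(tmp, "file.part")
    with open(part, "wb") as fp:
        fp.write(b"x" * 1024)
    print(resume_headers(part))
```

In a real download these headers would be passed to requests.get(url, headers=..., stream=True) and the body appended to the file opened in "ab" mode. A 206 status means the server accepted the range.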

curl

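Common options

-L: follow redirects
-s: silent mode
-C -: resume an interrupted transfer; the trailing - tells curl to work out the resume offset automatically
-b: send cookies, either read from a cookie file or given inline as name=value
-d: send data in the request body, e.g. key=value; it can be repeated, e.g. -d xx -d yy
-X: set the request method, typically -X POST for a POST submission
-H: add a request header
-F: submit a multipart form; files are uploaded as name=@path, and ordinary fields can be sent as name=value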
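
When -d is given several times, curl joins the pieces with & into one request body, which is sent as application/x-www-form-urlencoded by default. Here is a small Python sketch of the equivalent body, using made-up field names:

```python
from urllib.parse import urlencode

# Hypothetical form fields, standing in for: curl -d "user=alice" -d "page=2"
fields = [("user", "alice"), ("page", "2")]

# urlencode joins the pairs with & (and percent-encodes special characters)
body = urlencode(fields)
print(body)
```

One difference: curl -d sends the text exactly as given, while urlencode also percent-encodes special characters. curl does the same encoding with --data-urlencode.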

Examples

Basic fetch

curl -L -s -b $fcookie $url

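The cookie file here is in the Netscape HTTP cookie format. curl also accepts a file of plain HTTP headers (Set-Cookie style).
The Netscape format looks like this: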

.xxx.com   TRUE    /   FALSE   163496 client_key  65890b
.xxx.com   TRUE    /   FALSE   163732  clientid    3
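
Each line of this format has seven tab-separated fields: domain, subdomain flag, path, secure flag, expiry time, name, and value. Below is a minimal sketch of reading one line, assuming tab separators as in a real cookie file. `parse_cookie_line` is a hypothetical helper, and the sample line reuses the values from the first line above.

```python
FIELDS = ("domain", "include_subdomains", "path", "secure", "expires", "name", "value")

def parse_cookie_line(line):
    """Split one Netscape-format cookie line into a dict, or return None."""
    line = line.rstrip("\n")
    # curl marks HttpOnly cookies with this prefix; the rest is a normal line
    if line.startswith("#HttpOnly_"):
        line = line[len("#HttpOnly_"):]
    # Blank lines and ordinary comment lines carry no cookie
    if not line or line.startswith("#"):
        return None
    parts = line.split("\t")
    if len(parts) != len(FIELDS):
        return None
    cookie = dict(zip(FIELDS, parts))
    cookie["include_subdomains"] = cookie["include_subdomains"] == "TRUE"
    cookie["secure"] = cookie["secure"] == "TRUE"
    cookie["expires"] = int(cookie["expires"])
    return cookie

sample = ".xxx.com\tTRUE\t/\tFALSE\t163496\tclient_key\t65890b"
print(parse_cookie_line(sample))
```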

Imitating a browser, JSON

curl -s -C - -X POST -b $fcookie -H "Content-Type: application/json"\
    -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36" \
    -d "{ \"description\": \"登录\", \"expire\": 18  }" $url

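This information has to be obtained by analyzing the request in Chrome. Scraping often means inspecting requests in the Chrome console.

Note: sometimes the cookie has to be passed in a header instead: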

-H "Cookie: $cookie"
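
The value of that header is the cookies written as name=value pairs joined by "; ". Below is a sketch of building it in Python. `cookie_header` is a hypothetical helper, and the pairs are taken from the sample cookie file shown earlier.

```python
def cookie_header(cookies):
    """Format a dict of cookie names and values as a Cookie header value."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())

cookies = {"client_key": "65890b", "clientid": "3"}
headers = {"Cookie": cookie_header(cookies)}
print(headers)
```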

Uploading files

curl -s -b $fcookie -F 'file[]=@run.sh' -F 'file[]=@output.txt' $url -o /dev/null

chrome js

Because the code runs in the Chrome console on the target site, your cookies are already available.

fetch(new Request(your_url_XXX,{
    method:'POST',
    headers: {'Content-Type': 'application/json'},
    body:JSON.stringify({description:"登录", expire:18})
})).then(resp=>resp.json()).then(json=>{console.log(json)})

requests

JSON submission, cookies, headers

import json
from http import cookiejar

import requests

data = {"description": "登录", "expire": 18}
headers = {
    'Content-Type': 'application/json',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
}

# fcookie: path to a Netscape-format cookie file (the same kind of file curl -b reads)
cj = cookiejar.MozillaCookieJar(fcookie)
cj.load()

res = requests.post(your_url_XXX, cookies=cj, headers=headers, data=json.dumps(data))

print(res.text)
print(res.json()["value"])
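
requests can also do the JSON encoding itself: passing the dict through the json= parameter serializes it and sets the Content-Type header. The sketch below only prepares the request, without sending it, so the result can be inspected. The URL is a placeholder, not a real endpoint.

```python
import json

import requests

data = {"description": "登录", "expire": 18}

# Build the request locally; nothing is sent over the network here.
req = requests.Request("POST", "https://example.com/api", json=data)
prepared = req.prepare()

# The header was filled in by requests, and the body is the encoded dict
print(prepared.headers["Content-Type"])
print(json.loads(prepared.body) == data)
```

Against a live endpoint the same thing would be written as requests.post(url, json=data, cookies=cj, headers=headers).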

Recursively crawling SVN

import requests
from bs4 import BeautifulSoup

rooturl = "https://svn.blender.org/svnroot/bf-blender/trunk/lib/linux_centos7_x86_64/"

# Breadth-first walk over the directory listing: directories go back on the
# queue, file URLs are written to download/urls.txt (download/ must already exist).
queue = [rooturl]
with open("download/urls.txt", "w") as fp:
    while queue:
        url = queue.pop(0)
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'lxml')
        for link in soup.find_all("a"):
            name = link.text
            # Skip the parent-directory entry and links with no text
            if not name or name == "..":
                continue
            nurl = url + name
            if name.endswith("/"):
                queue.append(nurl)
            else:
                fp.write(nurl + "\n")
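
The loop above builds child URLs from the link text. A variant is to read each link's href and resolve it against the current URL, which also copes with names that the listing displays differently from how they are linked. The sketch below uses only the standard library. The HTML snippet and base URL are made-up stand-ins for a directory listing page, and `LinkCollector` and `split_listing` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def split_listing(html, base_url):
    """Return (directory URLs, file URLs) found in a listing page."""
    parser = LinkCollector()
    parser.feed(html)
    dirs, files = [], []
    for href in parser.hrefs:
        if href.startswith(".."):
            continue  # skip the parent directory
        full = urljoin(base_url, href)
        (dirs if full.endswith("/") else files).append(full)
    return dirs, files

page = ('<ul><li><a href="../">..</a></li>'
        '<li><a href="lib/">lib/</a></li>'
        '<li><a href="a.txt">a.txt</a></li></ul>')
print(split_listing(page, "https://example.com/root/"))
```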

requests-html

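Comparison with requests

requests only fetches the HTML that the server returns. Most pages today are rendered dynamically by JavaScript after loading, so the real page has not been rendered at that point and requests gets almost none of the content.
The library covered in this post can define a session and render the page with the Chrome engine.
requests-html has to be installed before use: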
pip install requests-html --index-url https://pypi.douban.com/simple

Loading a page

from requests_html import HTMLSession
session = HTMLSession()
url = "xxx"
r = session.get(url)
# render() runs the page's JavaScript in Chromium (downloaded on first use)
r.html.render()
with open("res.html", "w", encoding="utf-8") as fp:
    fp.write(r.html.html)

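When the Chromium download fails

Inspect the file anaconda3/lib/python3.7/site-packages/pyppeteer/chromium_downloader.py to see what is being downloaded,
download it yourself,
and then change the executable path in that file along these lines:
chromiumExecutable["mac"] = Path("xxx/software/Chromium.app/Contents/MacOS/Chromium")

For example, a build can be downloaded from
https://storage.googleapis.com/chromium-browser-snapshots/Linux_x64/588429/chrome-linux.zip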

POST requests

from requests_html import HTMLSession
session = HTMLSession()
myobj = {}  # the request data; perform the request once in Chrome and read it from there
r = session.post(url, data=myobj)

# The response can be read with
# r.text / r.json()

Downloading a file

def save_file(fnm, X):
    # X is a response object; write its raw bytes to fnm
    with open(fnm, "wb") as fp:
        fp.write(X.content)

r = session.get(url)
save_file("res.acc", r)

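Uploading a file

import requests

# The upload name used below is unrelated to the real image name;
# if the server cares about the name, change the first element of the tuple.
def get_obj(fimg, tp="png"):
    data = {"typeId": "1"}
    if tp == "png":
        files = {"file": ("a.png", open(fimg, "rb"), "image/png")}
    elif tp == "jpeg":
        files = {"file": ("a.jpg", open(fimg, "rb"), "image/jpeg")}
    else:
        raise ValueError("unsupported type: " + tp)
    return data, files

data, files = get_obj(fimg)
r = requests.post(url, data, files=files)
print(r.json())
print(r.text)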

An image held as a NumPy array can also be uploaded directly:
https://stackoverflow.com/questions/52908088/send-an-image-as-numpy-array-requests-post/52920154

import io

import cv2
import matplotlib.pyplot as plt
import requests

def get_obj(fimg, tp="png"):
    data = {"typeId": "1"}
    if tp == "png":
        files = {"file": ("a.png", open(fimg, "rb"), "image/png")}
    elif tp == "np":
        # Encode the array as PNG in memory instead of writing a temporary file
        buf = io.BytesIO()
        plt.imsave(buf, fimg, format='png')
        files = {"file": buf.getvalue()}
    else:
        raise ValueError("unsupported type: " + tp)
    return data, files

# cv2.imread returns BGR while plt.imsave expects RGB, so convert first
img = cv2.cvtColor(cv2.imread(fimg), cv2.COLOR_BGR2RGB)
data, files = get_obj(img, "np")
r = requests.post(url, data, files=files)