Scraping Images with a Python Crawler: A Detailed Walkthrough

Latest recommended article published on 2022-04-17 17:35:10

weixin_39647180

Article tags: Python crawler, image scraping

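Next I will work through three case studies in turn. Mastering every point will probably take about a month; by "mastering" I mean being able to write the code yourself without looking anything up. The material below has not been tidied up yet: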

import requests, threading  # threading: multithreaded processing and control
from lxml import etree
from bs4 import BeautifulSoup

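# Fetch the page source
def get_html(url):
    # The address is hard-coded for now because multi-page support has not been added yet.
    # Note that this overrides the url argument, so every call fetches the same page.
    url = 'http://www.doutula.com/?qqdrsign=01495'
    # Imitate a browser's User-Agent; the format is fixed, so it is worth memorizing
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36'}
    request = requests.get(url=url, headers=headers)  # send a GET request to the URL
    response = request.content  # page source as raw bytes; slightly better than .text here
    # print(response)
    return response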

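# Next, fetch the linked pages, i.e. the source of each image's own page
def get_img_html(html):
    soup = BeautifulSoup(html, 'lxml')  # parser to use; Python's built-in alternative is 'html.parser'
    # class is a Python keyword, so BeautifulSoup takes class_ (with a trailing underscore)
    all_a = soup.find_all('a', class_='list-group-item randomlist')
    for i in all_a:
        print(i)  # i is one matched <a> tag
        img_html = get_html(i['href'])  # fetch the source of the page this link points to
        print(img_html)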

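# http://www.doutula.com/article/list/?page=2
a = get_html('http://www.doutula.com/?qqdrsign=01495')  # a must hold the page source before it is parsed
get_img_html(a)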
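The `href` values read from the `<a>` tags may be relative paths rather than full URLs. A minimal sketch of normalizing them with the standard library's `urljoin` before requesting them, assuming the site's base address is the one used above. `absolute_links` and the sample hrefs are hypothetical, introduced only for illustration:

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.doutula.com/'  # assumption: base address taken from the URL used above


def absolute_links(hrefs, base=BASE_URL):
    """Turn every href into an absolute URL; full URLs are left unchanged."""
    return [urljoin(base, href) for href in hrefs]


# Hypothetical href values, standing in for what i['href'] might return
sample = ['/article/detail/1', 'http://www.doutula.com/article/detail/2']
links = absolute_links(sample)
print(links)
```

Each entry of `links` could then be passed to `get_html` once the hard-coded address inside it is removed.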

Good, we can now fetch part of the page source. The next task is to start handling multiple pages.

import requests, threading  # threading: multithreaded processing and control
from lxml import etree
from bs4 import BeautifulSoup

def get_html(url):
    # url = 'http://www.doutula.com/?qqdrsign=01495'  # previously hard-coded because multi-page support was missing
    # Imitate a browser's User-Agent; the format is fixed, so it is worth memorizing
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36'}
    request = requests.get(url=url, headers=headers)  # send a GET request to the URL
    response = request.content  # page source as raw bytes; slightly better than .text here
    # print(response)
    return response

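# Next, fetch the linked pages, i.e. the source of each image's own page
def get_img_html(html):
    soup = BeautifulSoup(html, 'lxml')  # parser to use; Python's built-in alternative is 'html.parser'
    # class is a Python keyword, so BeautifulSoup takes class_ (with a trailing underscore)
    all_a = soup.find_all('a', class_='list-group-item randomlist')
    for i in all_a:
        print(i)  # i is one matched <a> tag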
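The multi-page step could be sketched as follows, assuming list pages follow the `http://www.doutula.com/article/list/?page=2` pattern noted in the earlier comment and that pages are numbered from 1. `page_urls`, `fetch_all`, and the `fetch` parameter are hypothetical names, not part of the original code. `fetch` stands in for `get_html` so that the sketch runs without network access:

```python
import threading

# Assumption: list pages follow the pattern from the comment in the first listing
LIST_URL = 'http://www.doutula.com/article/list/?page={}'


def page_urls(first, last):
    """Build the list-page URLs for pages first..last (inclusive)."""
    return [LIST_URL.format(n) for n in range(first, last + 1)]


def fetch_all(urls, fetch):
    """Call fetch(url) for every URL, one thread per URL, and collect the results."""
    results = {}
    lock = threading.Lock()

    def worker(url):
        data = fetch(url)
        with lock:  # guard the shared dict against concurrent writes
            results[url] = data

    threads = [threading.Thread(target=worker, args=(u,)) for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every page has been fetched
    return results


# Stand-in fetcher for demonstration; in the crawler this would be get_html
def fake_fetch(url):
    return b'<html>' + url.encode() + b'</html>'


urls = page_urls(1, 3)
pages = fetch_all(urls, fake_fetch)
print(len(pages))
```

One thread per URL is probably fine for a handful of pages; for a long page range, a bounded number of workers would likely be gentler on the site.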
