Python爬虫详解：爬虫是什么，如何做爬虫？

最新推荐文章于 2023-12-28 13:55:31 发布

__Azure__

最新推荐文章于 2023-12-28 13:55:31 发布

阅读量2.2k

点赞数 4

分类专栏：笔记文章标签： python

本文链接：https://blog.csdn.net/qq_36837340/article/details/107703073

版权

笔记专栏收录该内容

1 篇文章 0 订阅

订阅专栏

Python爬虫详解：爬虫是什么，如何做爬虫？

读前必看：

本篇文章为教学向文章，通俗易懂的语言难免有不合适的地方，请大佬自行退场

爬虫是什么？

去查网，你会看到这样的解释：

爬虫是一种按照一定的规则，自动地抓取万维网信息的程序或者脚本

这看起来很晦涩难懂，让我用通俗易懂的语言来解释一下：

爬虫是可以把网上的东西给你的程序，你想要的东西

废话不多说，我们打开python开始编

爬虫是怎么把你想要的东西给你的？

这里我们爬一爬B站的首页上的图片吧

第一步.得到所有的链接

首先爬虫要拿到链接，在这里我们要用到的模块是requests

requests是一个第三方库，这里我们要下载这个库

打开cmd，直接输入pip install requests就好了
在这里插入图片描述
然后导入模块

import requests

接下来把你要爬的网址写成变量，这里拿B站举例

import requests

url = "http://www.bilibili.com"

然后告诉机器你要爬这个网址，并把爬下来的数据写在变量里

import requests

url = "http://www.bilibili.com"
html = requests.get(url)

完事，这样机器就得到了网上的数据，就这么简单
接下来就该把你想要数据给你了

import requests

url = "http://www.bilibili.com"
html = requests.get(url)
html = html.text #把数据转为文本形式

切记！！！ 一定要把数据转化成文本形式，不然返回的只是请求值
我们打印一下看看

import requests

url = "http://www.bilibili.com"
html = requests.get(url)
html = html.text
print(html)

返回值：

<!DOCTYPE html><html lang="zh-CN"><head><meta charset="utf-8"><title>哔哩哔哩 (゜-゜)つロ 干杯~-bilibili</title><meta name="description" content="bilibili是国内知名的视频弹幕网站，这里有最及时的动漫新番，最棒的ACG氛围，最有创意的Up主。大家可以在这里找到许多欢乐。"><meta name="keywords" content="Bilibili,哔哩哔哩,哔哩哔哩动画,哔哩哔哩弹幕网,弹幕视频,B站,弹幕,字幕,AMV,MAD,MTV,ANIME,动漫,动漫音乐,游戏,游戏解说,二次元,游戏视频,ACG,galgame,动画,番组,新番,初音,洛天依,vocaloid,日本动漫,国产动漫,手机游戏,网络游戏,电子竞技,ACG燃曲,ACG神曲,追新番,新番动漫,新番吐槽,巡音,镜音双子,千本樱,初音MIKU,舞蹈MMD,MIKUMIKUDANCE,洛天依原创曲,洛天依翻唱曲,洛天依投食歌,洛天依MMD,vocaloid家族,OST,BGM,动漫歌曲,日本动漫音乐,宫崎骏动漫音乐,动漫音乐推荐,燃系mad,治愈系mad,MAD MOVIE,MAD高燃"><meta name="renderer" content="webkit"><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta name="spm_prefix" content="333.851"><link rel="dns-prefetch" href="//s1.hdslb.com"><script type="text/javascript">function getIEVersion(){var e=99;if("Microsoft Internet Explorer"==navigator.appName){var t=navigator.userAgent;null!=new RegExp("MSIE ([0-9]{1,}[.0-9]{0,})").exec(t)&&(e=parseFloat(RegExp.$1))}return e}getIEVersion()<11&&(window.location.href="https://www.bilibili.com/blackboard/activity-I7btnS22Z.html")</script><script type="text/javascript">!function(){for(var ......

一千多行！
花花绿绿的都是链接和代码
对，我们要的就是链接
接下来怎么找图片的链接呢？

第二步.找图片链接

为了教学，不用正则表达式
首先得把他们分开

import requests

url = "http://www.bilibili.com"
html = requests.get(url)
html = html.text

list_html = html.split('"')

如果要让机器提取B站图片链接
那就得让他知道B站图片链接长啥样
B站图片链接长这样：

http://i0.hdslb.com/bfs/archive/0bf1a101af3a0014def2a3978ff68101c7002106.jpg
http://i1.hdslb.com/bfs/face/f0de237671f10a9ef735636250ca89bd50c180fb.jpg

看看它们有什么特点
他们都是http协议的链接
他们一定有"hdslb.com"
他们的后缀是".jpg"
那么我们就可以开干了！

import requests

url = "http://www.bilibili.com"
html = requests.get(url)
html = html.text

list_html = html.split('"')
urls = []
for i in list_html:
    if "http://" in i:
        if "hdslb.com" in i and ".jpg" in i:
            urls.append(i)

写个for循环依次检查每个链接是不是我们要找的链接
如果是就加进列表里
打印一下看看

['https://i0.hdslb.com/bfs/sycp/creative_img/202007/d07403ce674fb827c654006f44380879.jpg', 'https://i0.hdslb.com/bfs/sycp/creative_img/202007/d1b177d716d28038dfde1f7677ca0dbb.jpg', 'https://i0.hdslb.com/bfs/sycp/creative_img/202007/d1b177d716d28038dfde1f7677ca0dbb.jpg@412w_232h_1c_100q.jpg', 'http://i2.hdslb.com/bfs/face/6de7150ba0e78a33860538da88a6c5b787024579.jpg', 'http://i0.hdslb.com/bfs/archive/589488d99e16347892f94d56ec6e179843198318.jpg', '//i0.hdslb.com/bfs/archive/589488d99e16347892f94d56ec6e179843198318.jpg@412w_232h_1c_100q.jpg', 'http://i0.hdslb.com/bfs/face/34d7ccfe7fbfe16e2e764b7bd543d1bb566ff208.jpg', 'http://i0.hdslb.com/bfs/archive/cacb9830431d35e543438aef602900c8d911960f.jpg', '//i0.hdslb.com/bfs/archive/cacb9830431d35e543438aef602900c8d911960f.jpg@412w_232h_1c_100q.jpg', 'http://i0.hdslb.com/bfs/face/1a2554b2b54c17694fb7aa37900b28f7b7c96449.jpg', 'http://i0.hdslb.com/bfs/archive/f9c64dcf5d383b8340c1a237f8ea4d4a9dcef00e.jpg', '//i0.hdslb.com/bfs/archive/f9c64dcf5d383b8340c1a237f8ea4d4a9dcef00e.jpg@412w_232h_1c_100q.jpg', 'http://i0.hdslb.com/bfs/face/569dec5aeab5ad650ecdee0c1d8965512d32657c.jpg', 'http://i0.hdslb.com/bfs/archive/3c8bad9fb462fe76f2d8644a08ec3296ba932804.jpg', '//i0.hdslb.com/bfs/archive/3c8bad9fb462fe76f2d8644a08ec3296ba932804.jpg@412w_232h_1c_100q.jpg', 'http://i0.hdslb.com/bfs/face/1aa5fb297a1dafee8c12c809bed7eb9e8059a929.jpg', 'http://i0.hdslb.com/bfs/archive/1bd5e2c9c6d49df89a5e1d96ca7c7715c37a210c.jpg', '//i0.hdslb.com/bfs/archive/1bd5e2c9c6d49df89a5e1d96ca7c7715c37a210c.jpg@412w_232h_1c_100q.jpg', 'http://i2.hdslb.com/bfs/face/26c84ebb81c5bd1c1381bd75090bd7e39e0535fd.jpg', 'https://i0.hdslb.com/bfs/sycp/creative_img/202007/2cd187b3d851c2cb0906d12bfaa868d3.jpg', 'https://i0.hdslb.com/bfs/sycp/creative_img/202007/fc81f5f324f6f44a233272dd5c1e9f65.jpg', 'http://i0.hdslb.com/bfs/archive/28b001cd91e5b900f02bca8c93dfd1de609e8cfe.jpg', 'https://i0.hdslb.com/bfs/sycp/creative_img/202007/e2f0bd3a91bcadcd0d1a31c0cada55d1.jpg', 'http://i0.hdslb.com/bfs/archive/ab80cfa04e04a5a3e7e0c2604d2958b094f72e03.jpg', 'http://i0.hdslb.com/bfs/archive/a8144c19e221e3aca37c4c4baff31de0770f10db.jpg', 'http://i0.hdslb.com/bfs/archive/d348642d611e98021c4c13c30cf5b588a9cf5abb.jpg', 'http://i0.hdslb.com/bfs/archive/bb5f66e5f61d6b626fac43897fb9dd03c257820c.jpg', 'http://i0.hdslb.com/bfs/archive/bb5f66e5f61d6b626fac43897fb9dd03c257820c.jpg', 'http://i0.hdslb.com/bfs/face/cb620bbb9071974f37843134875d472b47532a97.jpg', 'http://i0.hdslb.com/bfs/archive/becf9e8d300838b7310bd26a7ddb25c627225a09.jpg', 'http://i2.hdslb.com/bfs/archive/becf9e8d300838b7310bd26a7ddb25c627225a09.jpg', 'http://i0.hdslb.com/bfs/face/cb620bbb9071974f37843134875d472b47532a97.jpg', '

OHHHHHHHHHH!!!
可是里面有好多重复的，这没关系
我们再改改

import requests

url = "http://www.bilibili.com"
html = requests.get(url)
html = html.text

list_html = html.split('"')
urls = []
for i in list_html:
    if "http://" in i:
        if "hdslb.com" and ".jpg" in i:
            urls.append(i)
photo_urls = []
for i in urls:
    if i not in photo_urls:
        photo_urls.append(i)

['https://i0.hdslb.com/bfs/sycp/creative_img/202007/d07403ce674fb827c654006f44380879.jpg', 'https://i0.hdslb.com/bfs/sycp/creative_img/202007/d1b177d716d28038dfde1f7677ca0dbb.jpg@412w_232h_1c_100q.jpg', 'http://i2.hdslb.com/bfs/face/6de7150ba0e78a33860538da88a6c5b787024579.jpg', 'http://i0.hdslb.com/bfs/archive/589488d99e16347892f94d56ec6e179843198318.jpg', '//i0.hdslb.com/bfs/archive/589488d99e16347892f94d56ec6e179843198318.jpg@412w_232h_1c_100q.jpg', 'http://i0.hdslb.com/bfs/face/34d7ccfe7fbfe16e2e764b7bd543d1bb566ff208.jpg', 'http://i0.hdslb.com/bfs/archive/cacb9830431d35e543438aef602900c8d911960f.jpg'

这下好了

第三步.保存

用with open的写入模式保存二进制数据：

import requests

url = "http://www.bilibili.com"
html = requests.get(url)
html = html.text

list_html = html.split('"')
urls = []
for i in list_html:
    if "http://" in i:
        if "hdslb.com" and ".jpg" in i:
            urls.append(i)
photo_urls = []
for i in urls:
    if i not in photo_urls:
        photo_urls.append(i)
cnt = 0
for i in photo_urls:
    cnt += 1
    img = requests.get(i)
    with open("Photo_{}.jpg".format(cnt),"wb") as f:
        f.write(img.content)