Python Web Crawler

ekkie

Article link: https://blog.csdn.net/ekkie/article/details/51167116

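Real commercial crawlers are very complex. Google's crawler, for example, handles so much data that it has to run as a distributed system. It must maintain the set of URLs it has already crawled, detect duplicate page content, and honor each site's requirements, and the crawler itself has to be very stable and resilient to interference. None of that is covered in this article. The crawler introduced here is very simple: by the end we will have written one that downloads the comic 《十万个冷笑话》 (One Hundred Thousand Bad Jokes) to the local disk.

We will use two third-party modules, requests and beautifulsoup. Rather than downloading them from the web by hand, we normally install them with pip, Python's package manager. At the command line, enter: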

pip install requests

pip install beautifulsoup4

After that, both third-party modules are ready to use.

Let's see how to fetch a web page with Python. Run the following program:

import requests

url = 'http://www.u17.com/comic/5553.html'
# Download the page
resp = requests.get(url)
# resp.text is the response body decoded to a string
print(resp.text)

It prints output like the following:

<!doctype html>

<html>

<head>

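Only part of the output is shown here. We can see that this is the source of the page http://www.u17.com/comic/5553.html, which means we have successfully fetched the page to the local machine.

But what we have is only raw HTML. To extract the parts we need, we use beautifulsoup. Now let's use it to extract the address of every chapter of the comic from the page http://www.u17.com/comic/5553.html.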

import requests
from bs4 import BeautifulSoup

url = 'http://www.u17.com/comic/5553.html'
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'html.parser')

# Every <a> inside <ul id="chapter"> links to one chapter
aa = soup.find('ul', {'id': 'chapter'}).find_all('a')
chapters = []
for a in aa:
    chapters.append(a['href'])

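All the addresses are now stored in the chapters list. As you can see, beautifulsoup is very convenient to use: once you find the tag that holds the content, you can extract it.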
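The loop above stores each href exactly as it appears in the page. If a site uses relative links (an assumption; the hrefs on this page may already be absolute), they have to be resolved against the page URL before they can be fetched. A minimal sketch using the standard library's urljoin, with made-up sample hrefs and a hypothetical helper name absolute_urls:

```python
from urllib.parse import urljoin

def absolute_urls(page_url, hrefs):
    """Resolve each href against the URL of the page it came from."""
    return [urljoin(page_url, href) for href in hrefs]

page_url = 'http://www.u17.com/comic/5553.html'
# Hypothetical hrefs for illustration, not copied from the real page
sample_hrefs = ['/chapter/1.html', 'http://www.u17.com/chapter/2.html']
print(absolute_urls(page_url, sample_hrefs))
```

An href that is already absolute passes through unchanged, so the helper is safe to apply to every link.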

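When we fetch the HTML source at a chapter's address, we find that it contains no links to the comic images. That is because the comics on 有妖气 (u17.com) are generated dynamically by JavaScript. However, inside one of the <script> blocks in the HTML there is an image_list whose entries have a src attribute. Its value is base64-encoded (an encoding, not real encryption), and decoding it gives the address of the corresponding image.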
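To make the decoding step concrete, here is a small sketch that runs on a made-up line of script. The sample text, its layout, and the example.com address are assumptions for illustration; the real page's script may be formatted differently.

```python
import base64
import re

# Build a fake script line whose "src" value is a base64-encoded URL
real_url = 'http://example.com/images/001.jpg'
encoded = base64.b64encode(real_url.encode('utf-8')).decode('ascii')
sample_line = 'image_list: {"1": {"src": "%s", "page": 1}}' % encoded

# Pull out every "src" value, then decode it back to a readable URL
matches = re.findall(r'"src"\s*:\s*"([^"]+)"', sample_line)
decoded = [base64.b64decode(m).decode('utf-8') for m in matches]
print(decoded)
```

The decoded list contains the original address again, which is the round trip the crawler relies on.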

I put this process into a function. Add the following below the earlier code:

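import re
import base64

def image_urls(html):
    # Accept either raw bytes (resp.content) or text
    if isinstance(html, bytes):
        html = html.decode('utf-8', 'ignore')
    # Grab the rest of the line that starts at image_list
    s = re.search(r'image_list[^\n\r]+', html).group(0)
    # Each "src" value is a base64-encoded image address
    urls_bs64 = re.findall(r'"src"\s*:\s*"([^"]+)"', s)
    urls = []
    for url in urls_bs64:
        urls.append(base64.b64decode(url).decode('utf-8'))
    return urls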

Finally, use the image addresses to download the images and save them. Add the following to the code:

import os

for i in range(len(chapters)):
    # One folder per chapter: 1, 2, 3, ...
    folder = str(i+1)
    if not os.path.exists(folder):
        os.makedirs(folder)
    resp = requests.get(chapters[i])
    urls = image_urls(resp.content)
    for j in range(len(urls)):
        url = urls[j]
        # Use the text after the last dot as the file extension
        suffix = url.split('.')[-1]
        filename = str(j+1) + '.' + suffix
        resp = requests.get(url)
        # Write the raw image bytes to disk
        with open(os.path.join(folder, filename), 'wb') as f:
            f.write(resp.content)
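Taking the text after the last dot works when an address ends in a plain file name, but it would pick up extra text if an address carried a query string. A hedged sketch of a stricter helper follows; image_suffix is a hypothetical name, the fallback to jpg is an assumption, and the sample addresses are made up.

```python
import os
from urllib.parse import urlparse

def image_suffix(url, default='jpg'):
    """Return the file extension taken from the path part of a URL."""
    path = urlparse(url).path          # drops any ?query or #fragment
    ext = os.path.splitext(path)[1]    # e.g. '.png', or '' if none
    return ext.lstrip('.') or default

print(image_suffix('http://example.com/a/001.png'))
print(image_suffix('http://example.com/a/002.jpg?v=3'))
print(image_suffix('http://example.com/a/003'))
```

In the download loop, suffix = image_suffix(url) could stand in for the split on the dot.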

Here are a few more advanced ways to use requests:

Changing the User-Agent
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36'}
resp = requests.get(url, headers=headers)

Using an HTTP proxy
proxies = {'http': 'http://127.0.0.1:1080',
           'https': 'https://127.0.0.1:1080'}
resp = requests.get(url, proxies=proxies)

Using a SOCKS5 proxy
This originally required the third-party module requesocks (import requesocks as requests). Since version 2.10, requests supports SOCKS itself once the optional dependency is installed with pip install requests[socks].
import requests
proxies = {'http': 'socks5://127.0.0.1:1080',
           'https': 'socks5://127.0.0.1:1080'}
resp = requests.get(url, proxies=proxies)
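When many requests share the same headers or proxy, a requests.Session can hold those settings once instead of passing them on every call. A minimal sketch, assuming the same placeholder proxy address used above; make_session is a hypothetical helper name, and nothing is sent over the network here:

```python
import requests

def make_session(user_agent, proxies=None):
    """Create a Session that reuses headers and proxies for every request."""
    session = requests.Session()
    session.headers.update({'User-Agent': user_agent})
    if proxies:
        session.proxies.update(proxies)
    return session

session = make_session(
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3)',
    proxies={'http': 'http://127.0.0.1:1080'},
)
# Later calls such as session.get(url, timeout=10) use these settings
print(session.headers['User-Agent'])
```

A Session also reuses the underlying connection to the same host, which helps when downloading many images in a row.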
