Python 3 Web Scraping: Summary and References

iamiman

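Python 3 scrapers typically use urllib.request to fetch pages and download data, then parse the HTML with BeautifulSoup from bs4. Below are the reference posts I drew on while learning, along with my notes on a few specific problems.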

1. A detailed introduction to BeautifulSoup's search features

http://blog.csdn.net/abclixu123/article/details/38502993

2. Processing web pages with BeautifulSoup

Full details are at http://www.tuicool.com/articles/RNFVrm

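You are probably already familiar with Python's BeautifulSoup package:

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, 'html.parser')  # html holds the page source

First use the package to strip the script and style elements out of the HTML:

for tag in soup.find_all(['script', 'style']):
    tag.extract()

# After the cleanup, prettify() returns the markup in a normalized layout:

soup.prettify()

Then use a regular expression to remove all remaining HTML tags:

reg1 = re.compile("<[^>]*>")

content = reg1.sub('', soup.prettify())

What remains is plain text, usually one line after another. Once the blank lines are dropped you know how many lines there are in total and how many characters each line has. I compiled per-line character-count statistics in Excel.

[Figure: per-line character-count statistics]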
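The line-counting step described above can also be done without Excel. This is only a minimal sketch using the standard library; the helper name `line_lengths` and the sample string are made up for illustration, and in practice you would pass in the tag-free `content` from the regex step.

```python
def line_lengths(content):
    """Return (line, character count) pairs for every non-blank line."""
    lines = [line.strip() for line in content.splitlines()]
    return [(line, len(line)) for line in lines if line]


# Hypothetical sample standing in for the text left after stripping tags.
sample = "Title\n\n   \nFirst paragraph of body text.\n\nfooter\n"
stats = line_lengths(sample)
print(len(stats))  # number of non-blank lines
for line, count in stats:
    print(count, line)
```

Lines with a large character count are usually body text, while short lines tend to be navigation or footer residue.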

3. Several ways to download a file from a URL

http://www.open-open.com/lib/view/open1420378937984.html

urllib.request.urlretrieve(url, 'example.pdf')
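Besides urlretrieve, another way is to open the URL yourself and stream the response into a file. The following is a sketch, not the method from the linked post; the function name `download` and the `timeout` default are my own assumptions.

```python
import shutil
import urllib.request


def download(url, save_path, timeout=10):
    """Stream the response body at `url` into the file `save_path`."""
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(save_path, 'wb') as out_file:
        # Copy in chunks so a large file is never held in memory at once.
        shutil.copyfileobj(response, out_file)
    return save_path
```

A call would look like `download(url, 'example.pdf')`. Compared with urlretrieve, this form makes it easy to pass a per-request timeout or a Request object with custom headers.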

Example: downloading the PDF from a draft-announcement page on cninfo (巨潮资讯网)

from bs4 import BeautifulSoup
import urllib.request
import re
import io
import sys

# Re-wrap stdout with gb18030 so that Chinese text prints correctly
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')
urlhand = urllib.request.urlopen('http://www.cninfo.com.cn/cninfo-new/disclosure/sse/bulletin_detail/true/1202367076?announceTime=2016-06-14')
content = urlhand.read()  # raw bytes of the page
soup = BeautifulSoup(content, 'html.parser')
# print(soup.get_text())  # still garbled when shown in cmd
getContent = soup.find(attrs={'class': r'btn-blue bd-btn'})
# print(getContent)
tags = getContent('a')  # calling a tag is shorthand for find_all
# print(len(tags))
pattern = r'"/(cninfo-new.+?)"'
stringFinded = re.findall(pattern, str(tags))  # search() results are awkward to handle; findall returns a list
downloadUrl = 'http://www.cninfo.com.cn/' + stringFinded[0]
print(downloadUrl)

## Get the file name
getName = soup.find('h2')
print(getName.encode('utf-8'))
# [\u4e00-\u9fa5\w] matches Chinese and word characters, so it stops at symbols such as %;
# the dot (.) matches anything except \n
pattern = r'(\d+?.+?)<br/>[\n\t]+?([\u4e00-\u9fa5\w]+?.+?)[\n\t]'
stringFinded = re.findall(pattern, str(getName))
print(stringFinded)
saveName = stringFinded[0][0] + stringFinded[0][1] + '.pdf'
urllib.request.urlretrieve(downloadUrl, saveName)

# Get all of the text
# fin = open('./textResult.txt', 'w', encoding='utf-8')  # writing a str, so write it as utf-8
# fin.write(soup.get_text())  # the argument is a str
# fin.close()

4. Running a scraper in parallel

http://www.oschina.net/translate/python-parallelism-in-one-line
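A common way to parallelize I/O-bound work such as downloads is to map a function over a thread pool. The sketch below uses the standard library's multiprocessing.dummy; `fetch_length` and the example.com URLs are placeholders I made up so that it runs without touching the network.

```python
from multiprocessing.dummy import Pool  # thread-based pool with the multiprocessing.Pool API


def fetch_length(url):
    """Stand-in for a real download; returns (url, length of the URL string)."""
    return url, len(url)


urls = ['http://example.com/a', 'http://example.com/bb', 'http://example.com/ccc']

with Pool(4) as pool:
    # map() keeps results in the same order as the input list.
    results = pool.map(fetch_length, urls)
print(results)
```

In a real scraper the stand-in function would be replaced by one that fetches and parses a page.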

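5. Handling download failures

Use try/except to catch errors and record the failed pages in the notOkList list, while keeping the download addresses that were fetched successfully.

Notes:

1. The statement that raises the exception must be inside the try block, otherwise it cannot be caught.

2. The exception type after except may be omitted, in which case every exception is caught. If a type is given, any exception outside that type will abort the program.

For example: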

# Process cninfo pages that contain a PDF and collect the PDF download addresses
import re
import socket
import urllib.error
import urllib.request
from bs4 import BeautifulSoup

socket.setdefaulttimeout(10)  # set the timeout in seconds

# Page URLs and names (originally read from Excel, hard-coded here)
pageList = ['http://www.cninfo.com.cn/cninfo-new/disclosure/sse/bulletin_detail/true/1202367076?announceTime=2016-06-14', 'http://www.cninfo.com.cn/cninfo-new/disclosure/szse_sme/bulletin_detail/true/1202534048?announceTime=2016-08-03%2008:15']
pageName = ['haha', 'xixi']
# PDF download addresses that were found, for automatic downloading
downloadName = []
downloadList = []
# Pages whose PDF address could not be fetched, for manual downloading
notOkList = []
notOkName = []
for index in range(len(pageList)):
    url = pageList[index]
    print(index, url)
    name = pageName[index]
    # print('i am try' + str(index))
    try:
        html = urllib.request.urlopen(url)
        # print('i am here' + str(index))
        content = html.read()
        soup = BeautifulSoup(content, 'html.parser')
        getContent = soup.find(attrs={'class': r'btn-blue bd-btn'})
        tags = getContent('a')
        pattern = r'"/(cninfo-new.+?)"'
        stringFinded = re.findall(pattern, str(tags))  # search() results are awkward to handle; findall returns a list
        dlAddress = 'http://www.cninfo.com.cn/' + stringFinded[0]
        downloadList.append(dlAddress)
        downloadName.append(name)
    except (socket.timeout, urllib.error.URLError):
        # urlopen wraps connection timeouts in URLError; read() raises socket.timeout
        notOkList.append(url)
        notOkName.append(name)
        continue

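More on catching exceptions:

If no exception type is specified, every exception raised inside the try block is caught, including, say, one from the findall step, so the program keeps running. The drawback is that you cannot tell whether an error other than a timeout occurred. So it is generally better to keep the exception type while debugging and remove it for the real run, which guarantees the run completes; all failures then end up in notOkList to be handled separately.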
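One way to soften the drawback mentioned above is to catch everything but also record what kind of error it was. This is a sketch under my own assumptions: `collect_failures` and `fake_fetch` are hypothetical names, and `fake_fetch` only simulates a timeout and an empty findall result.

```python
def collect_failures(urls, fetch):
    """Run fetch(url) for every URL; keep results and failures separately."""
    ok, not_ok = [], []
    for url in urls:
        try:
            ok.append((url, fetch(url)))
        except Exception as err:
            # Keep the exception type so non-timeout errors stay visible later.
            not_ok.append((url, type(err).__name__, str(err)))
    return ok, not_ok


def fake_fetch(url):
    """Simulated fetch: one URL times out, one yields no regex match."""
    if 'slow' in url:
        raise TimeoutError('timed out')
    if 'empty' in url:
        return [][0]  # IndexError, like stringFinded[0] when findall finds nothing
    return 'pdf-address-for-' + url


ok, not_ok = collect_failures(['good', 'slow', 'empty'], fake_fetch)
print(ok)
print(not_ok)
```

Using `except Exception` rather than a bare `except` still lets Ctrl+C interrupt a long run.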

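6. Wrong file name when saving a data stream

If a custom file name contains a character that is not allowed in file names (such as \ / * ? " < > |), only the part to the left of that character is kept. For example, abc?d.pdf ends up saved as just abc.

The fix: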

# A ^ inside [] negates the character class, so this finds every run of legal characters; join the runs afterwards
patternIllegal = r'[^\\/\*\?"<>|]+'
results = re.findall(patternIllegal, preDifinedName)
fileName = docAddress + ''.join(results)
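The same cleanup can be written as a substitution instead of findall plus join. This is a small sketch; the function name `sanitize_filename` is hypothetical, and adding the colon to the character set is my assumption (it is also not allowed in Windows file names), not something the original pattern covers.

```python
import re

# Characters to strip; the colon is an addition to the original pattern.
ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|]')


def sanitize_filename(name, replacement=''):
    """Replace every illegal character in `name` with `replacement`."""
    return ILLEGAL_CHARS.sub(replacement, name)


print(sanitize_filename('abc?d.pdf'))  # abcd.pdf
```

Passing `replacement='_'` keeps the position of each removed character visible in the saved name.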

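7. Retrying downloads automatically in a loop

Catch the failed requests with except and keep them in a failure list, then use a while loop that runs until the round limit is exceeded or the failure list is empty.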

UseDownloadList.py

# Uses the download list obtained from the parallel run
import urllib.request
import re
import socket

socket.setdefaulttimeout(10)
currRound = 20  # maximum number of retry rounds
print(currRound)
while currRound > 0:
    with open('./downloadFile.txt', 'r', encoding='utf-8') as fhand:
        fContent = fhand.read()
    patternUrl = r'downloadList\n(.+?)\ndownloadName'
    patternName = r'downloadName\n(.+)'
    urlList = re.findall(patternUrl, fContent, re.S)
    nameList = re.findall(patternName, fContent, re.S)
    urls = urlList[0].splitlines()
    names = nameList[0].splitlines()
    docAddress = './pdfStores/'

    failUrlList = []
    failNameList = []
    for index in range(len(urls)):
        url = urls[index]
        saveName = docAddress + names[index]
        try:
            print('hi', index)
            urllib.request.urlretrieve(url, saveName)
        except Exception:  # record any failure and move on to the next file
            print(url)
            print(saveName)
            failUrlList.append(url)
            # store the bare name, because docAddress is prepended again next round
            failNameList.append(names[index])

    # stop once at most 5 downloads are still failing
    if len(failUrlList) <= 5:
        break

    # write the failures back so the next round retries only those
    with open('./downloadFile.txt', 'w', encoding='utf-8') as downloadFile:
        downloadFile.write('\ndownloadList\n')
        downloadFile.write('\n'.join(failUrlList))
        downloadFile.write('\ndownloadName\n')
        downloadFile.write('\n'.join(failNameList))
    currRound -= 1
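The retry idea can also be sketched without the intermediate text file, keeping the pending list in memory. The names `retry_until_done` and `flaky_download` are hypothetical; `flaky_download` just fails on the first attempt for each URL so the loop has something to retry.

```python
def retry_until_done(tasks, download, max_rounds=20):
    """Retry failed (url, save_name) pairs until none are left or rounds run out.

    Returns the pairs that were still failing at the end.
    """
    pending = list(tasks)
    for _ in range(max_rounds):
        if not pending:
            break
        failed = []
        for url, save_name in pending:
            try:
                download(url, save_name)
            except Exception:
                failed.append((url, save_name))
        pending = failed  # the next round retries only the failures
    return pending


attempts = {}


def flaky_download(url, save_name):
    """Simulated downloader: the first attempt for each URL fails."""
    attempts[url] = attempts.get(url, 0) + 1
    if attempts[url] < 2:
        raise OSError('simulated failure')


left = retry_until_done([('u1', 'a.pdf'), ('u2', 'b.pdf')], flaky_download)
print(left)
```

In a real run the `download` argument could be `urllib.request.urlretrieve`, which accepts a URL and a file name in the same order.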

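8. Downloads are too slow

The post below does not give a complete solution, but it suggests several things to try, one of which is disguising the request with a User-Agent.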

http://www.crifan.com/use_python_urllib-urlretrieve_download_picture_speed_too_slow_add_user_agent_for_urlretrieve/comment-page-1/

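A Zhihu discussion thread; the last few answers offer some methods.

Finally, one last plug for my scraper-related blog posts, which are no longer updated:

 Essential for scraping: requests

 01. Preparation

 02. A simple first attempt

 Extra: setting up a comfortable Python development environment

 05. Storage

 09. Using a scraper to find the shortest follow chain between me and 轮子哥

Author: xlzd
Link: https://www.zhihu.com/question/28168585/answer/120205863
Source: Zhihu
Copyright belongs to the author. Contact the author for permission to reprint.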

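9. Pausing the scraper automatically

To avoid being detected for downloading too frequently, you can make the scraper pause.

import time

time.sleep(500)  # pause for 500 seconds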
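A fixed pause is easy to spot, so a randomized delay between requests is a common variation. This is only a sketch; the function name `polite_pause` and the 1 to 3 second range are assumptions of mine, not values from any of the referenced posts.

```python
import random
import time


def polite_pause(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Sleep for a random duration in [min_seconds, max_seconds]; return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Typical use inside a download loop:
# for url in urls:
#     fetch(url)        # hypothetical fetch function
#     polite_pause()
```

The `sleep` parameter exists only so the function can be tried out without actually waiting.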

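10. Disguising the scraper: using a User-Agent and adding headers

Press F12 in the browser and look at the requests in the Network tab.

The request headers look something like this: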

Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8

Accept-Encoding:gzip, deflate, sdch

Accept-Language:zh-CN,zh;q=0.8

Connection:keep-alive

Host:www.cninfo.com.cn

Upgrade-Insecure-Requests:1

User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36

For details, see the explanation at the link below:

https://jecvay.com/2014/09/python3-web-bug-series3.html

The code is as follows:

import urllib.request

url = 'http://www.baidu.com/'
req = urllib.request.Request(url, headers={
    'Connection': 'Keep-Alive',
    'Accept': 'text/html, application/xhtml+xml, */*',
    'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko'
})
oper = urllib.request.urlopen(req)
data = oper.read()  # the response can only be read once, so keep the bytes
print(data.decode())

# To save the result: urlopen returns a byte stream, so write the bytes directly
# and open the file in 'wb' mode. saveName is the target path, defined elsewhere.
with open(saveName, 'wb') as fhand:
    fhand.write(data)
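Before sending anything, it can help to check which headers a Request object will actually carry. The sketch below builds a request without opening a connection; `build_request` and `BROWSER_HEADERS` are names I made up, and the User-Agent string is the one from the browser capture above.

```python
import urllib.request

# Header values copied from a browser session.
BROWSER_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'),
}


def build_request(url, headers=BROWSER_HEADERS):
    """Return a Request carrying browser-like headers; nothing is sent yet."""
    return urllib.request.Request(url, headers=headers)


req = build_request('http://www.cninfo.com.cn/')
# Request stores header names via str.capitalize(), hence 'User-agent'.
print(req.get_header('User-agent'))
print(req.header_items())
```

The resulting `req` is passed to `urllib.request.urlopen(req)` in the same way as in the code above.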

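You can use type() to inspect the type of the data stream, and decode() (which decodes bytes into a str) to convert it.

>>> type(contents)
<class 'bytes'>

>>> type(contents.decode('utf-8'))
<class 'str'>

>>> type(contents.decode('utf-8', 'ignore'))
<class 'str'>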

Author: Fel Peter

Link: https://www.zhihu.com/question/35838789/answer/65794367

Source: Zhihu

Copyright belongs to the author. Contact the author for permission to reprint.
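To see what the 'ignore' argument to decode() changes, here is a small sketch with a deliberately damaged byte string. The sample text is my own; it cuts the last byte off a UTF-8 string so that strict decoding fails.

```python
# '爬虫' is 6 bytes in UTF-8; dropping the last byte leaves '虫' incomplete.
raw = '爬虫'.encode('utf-8')[:-1]

try:
    raw.decode('utf-8')  # strict mode is the default
except UnicodeDecodeError as err:
    print('strict decoding failed:', err.reason)

text = raw.decode('utf-8', 'ignore')  # silently drops the undecodable bytes
print(type(raw), type(text), text)
```

Because 'ignore' discards data without warning, it is better suited to quick inspection than to saving files.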