[python] Scraping the text and images of 《活学活用wxPython》 from the Woodpecker community (啄木鸟社区)

Latest recommended article published on 2020-11-11 15:55:16

seraph021724

Article link: https://blog.csdn.net/seraph021724/article/details/8446265

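See crifan's blog post "How to scrape static pages, scrape dynamic pages, and simulate website logins in Python, C#, and other languages". It is the most thorough series on crawling and simulated login that I have come across; it pulls a lot of material together, and I learned a great deal from it. (Edited 2013-01-05, 瑾诚)

I won't always have internet access over the New Year holiday, so I decided to scrape 《活学活用wxPython》 for offline reading and get some practice along the way.

Thanks to the Woodpecker community and pyug; I have been teaching myself from the open-source books the community hosts.

Two other blog posts deserve a mention: AstralWind's Python threading guide and deerchao's 30-minute introduction to regular expressions were both very helpful while I was writing this script.

The code is not very well written, and I would welcome pointers from anyone who reads it.

The full code is on GitHub.

The script was written specifically for 《活学活用wxPython》, so it is narrowly targeted: every regular expression is tailored to that wiki and will not work in general.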

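Approach

1. Get the list of URLs for all chapters.

2. Get the addresses of all images in each chapter.

3. Download each chapter. To avoid downloading anything twice, the URLs go into a dictionary, {url: True}; the flag is set to False once a URL has been claimed.

4. Download the images, which are tracked in a dictionary in the same way.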
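The `{url: True}` bookkeeping from steps 3 and 4 can be sketched roughly as follows. This is a minimal Python 3 illustration rather than the script's own code: `claim_next`, `pending`, and the two chapter paths are hypothetical, and sharing a single lock between all workers is an assumption about how the flags ought to be protected.

```python
import threading

# Hypothetical lock shared by every worker thread.
_claim_lock = threading.Lock()

def claim_next(pending):
    """Return one key whose flag is still True and flip it to False.

    Returns None once every entry has been claimed.
    """
    with _claim_lock:
        for key, todo in pending.items():
            if todo:
                pending[key] = False
                return key
    return None

# Made-up sample paths, for illustration only.
pending = {
    "/moin/WxPythonInAction/ChapterOne": True,
    "/moin/WxPythonInAction/ChapterTwo": True,
}
first = claim_next(pending)
second = claim_next(pending)
third = claim_next(pending)   # nothing left to claim
```

Because the check and the flip happen under the same lock, two threads should never be handed the same URL.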

Code

0. Libraries used

# -*- coding:gb2312 -*-
from sgmllib import SGMLParser
import urllib2,re
from urllib import urlretrieve
import threading
import time
import os

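1. Get the chapter list

class URLLister(SGMLParser):
    """
    urls:chapter url
    """
    match = ''
    def reset(self):
        # SGMLParser.__init__ calls reset(), so each instance gets its own dict
        SGMLParser.reset(self)
        self.dicurl = {}
    def start_a(self, attrs):
        for k, v in attrs:
            if k == 'href' and re.match(self.match,v):
                self.dicurl[v]=True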

This subclasses SGMLParser and overrides its start_a handler, using a regular expression to pick out the URLs we want.
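The `sgmllib` module was removed in Python 3, so the same idea would have to be expressed differently there. The following is a rough sketch using the standard `html.parser.HTMLParser`; the class name `LinkCollector` and the sample HTML are made up for illustration and are not part of the original script.

```python
import re
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values of <a> tags whose start matches a regex."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.links = {}   # href -> True, mirroring dicurl

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value can be None for attributes written without a value
            if name == "href" and value and self.pattern.match(value):
                self.links[value] = True

# Hypothetical page fragment; only the first link is a chapter link.
sample = ('<a href="/moin/WxPythonInAction/ChapterOne">1</a>'
          '<a href="/moin/FrontPage">home</a>')
collector = LinkCollector(r"/moin/WxPythonInAction/Chapter")
collector.feed(sample)
print(sorted(collector.links))
```

Where `SGMLParser` dispatches to one `start_<tag>` method per tag, `HTMLParser` routes every tag through `handle_starttag`, hence the explicit check on the tag name.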

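2. Get the image list

class IMGLister(SGMLParser):
    search = ''
    def reset(self):
        # A fresh dict per instance, so one chapter's images do not leak into the next
        SGMLParser.reset(self)
        self.dicimg = {}
    def start_img(self, attrs):
        for k, v in attrs:
            if k == 'src' and re.search(self.search,v):
                self.dicimg[v] = True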

3. Download the chapter content

# Thread lock shared by all chapter workers. A lock created inside
# spider() would be private to each thread and protect nothing.
chapter_lock = threading.Lock()

def spider(baseurl,dicurl,filepath):
    url = ''

    # Claim the first chapter that has not been downloaded yet
    with chapter_lock:
        for k in dicurl:
            if dicurl[k]:
                dicurl[k]=False
                url = baseurl + k
                break
    if url != '':
        content = urllib2.urlopen(url).read()
        # The last path segment of the URL becomes the file name
        filename = url.split('/')[-1]
        filepath += filename+'.html'
        try:
            fw = open(filepath,'w')
            fw.write(content)
            print filepath + ' written successfully'
            fw.close()
        except IOError, e:
            print e
    time.sleep(1)

4. Download the image files

# Thread lock shared by all image workers
image_lock = threading.Lock()

def spiderimg(baseurl,dicimg,filepath):
    url = ''
    # Claim the first image that has not been downloaded yet
    with image_lock:
        for k in dicimg:
            if dicimg[k]:
                dicimg[k]=False
                url = baseurl + k
                break
    if url != '':
        downfile(url,filepath)
    time.sleep(1)

def downfile(netpath,localpath):
    # The file name is whatever follows "target=" in the attachment URL
    filenamerule = re.compile(r'(?<=\btarget\b=)(.*\..*)$')
    filenameres = re.search(filenamerule, netpath)
    if filenameres is None:
        print 'No file name found in ' + netpath
        return
    filename = filenameres.group(0)

    try:
        urlretrieve(netpath,localpath + filename)
        print localpath + filename + ' saved successfully'
    except IOError, e:
        print e
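To illustrate how the look-behind in `downfile` extracts the file name, here is a small Python 3 sketch that runs the same pattern against a sample attachment URL. It also shows an alternative that reads the `target` query parameter with `urllib.parse`, which is a swapped-in technique and not what the script does. The function names and the file name `figure1.gif` are hypothetical.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Same pattern as in downfile(): match everything after "target=".
FILENAME_RULE = re.compile(r'(?<=\btarget\b=)(.*\..*)$')

def filename_from_regex(url):
    """Return the text after 'target=' if it looks like name.ext, else None."""
    m = FILENAME_RULE.search(url)
    return m.group(0) if m else None

def filename_from_query(url):
    """Alternative: read the 'target' query parameter directly."""
    values = parse_qs(urlsplit(url).query).get("target")
    return values[0] if values else None

# Made-up attachment URL in the shape the script expects.
sample = ("http://wiki.woodpecker.org.cn/moin/WxPythonInAction/ChapterOne"
          "?action=AttachFile&do=get&target=figure1.gif")
print(filename_from_regex(sample))
print(filename_from_query(sample))
```

The query-string version does not depend on `target` being the last parameter, which the `$`-anchored regex quietly assumes.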

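5. Calling code

The code below is messy. It starts threads to download the chapters and then starts threads to download the images; to save effort I did not wrap it in functions.

#begin 
url_base = 'http://wiki.woodpecker.org.cn'

print 'Opening page...'+url_base+'/moin/WxPythonInAction'
content = urllib2.urlopen(url_base+'/moin/WxPythonInAction').read()

print 'Searching for hrefs...'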

lister=URLLister()
lister.match = '/moin/WxPythonInAction/Chapter'
lister.feed(content)

# This is already module scope, so no "global" declarations are needed
dicurl = lister.dicurl

threadpool = []

print 'Save directory (such as d:\docs\)'
filepath = raw_input()

'''
filepathrule = re.compile(r'\\$')
res = re.search(filepathrule,filepath)
if res.group(0):
filepath += '\\'

print filepath
'''

try:
    os.makedirs(filepath)
    print 'Folder did not exist; created it'
except OSError:
    print 'Folder already exists; continuing'


for k in lister.dicurl:
    th = threading.Thread(target = spider, args = (url_base,dicurl,filepath))
    threadpool.append(th)
    
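for th in threadpool:
    th.start()
for th in threadpool:
    th.join()

print 'Chapters downloaded; starting image downloads'

folder = 'images\\'
# Only create the folder if it is missing, so the script can be re-run
if not os.path.exists(filepath + folder):
    os.makedirs(filepath + folder)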

for k in dicurl:
    
    url = url_base + k
    content = urllib2.urlopen(url).read()
    imglister = IMGLister()
    imglister.search = r'/moin/WxPythonInAction/\bChapter\w+\b\?action=AttachFile\&do=get\&target=(.*\..*)$'
    imglister.feed(content)

    dicimg = imglister.dicimg
    '''
    folderrule = re.compile(r'\bChapter\w+\b')
    for val in dicimg:        
        folderres = re.search(folderrule, val)
        folder = folderres.group(0)
        folder += '\\'
        break
    if not os.path.exists(filepath + folder):
        os.makedirs(filepath + folder)
    '''
    
    threadpool2 = []
    for val in imglister.dicimg:
        th = threading.Thread(target = spiderimg, args = (url_base, dicimg, filepath + folder))
        threadpool2.append(th)
    for th in threadpool2:
        th.start()
    for th in threadpool2:
        th.join()
    print k + ' images downloaded'
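The calling code starts one thread per URL and joins them by hand. As a possible alternative, a fixed-size pool can take over that bookkeeping. The sketch below uses Python 3's `concurrent.futures.ThreadPoolExecutor`, which is a different technique from the one in the script. `download_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real network call so that the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(paths, fetch, max_workers=4):
    """Run fetch(path) for every path on a small thread pool.

    Returns a dict mapping each path to whatever fetch returned.
    """
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs
        results = list(pool.map(fetch, paths))
    return dict(zip(paths, results))

def fake_fetch(path):
    # Stand-in for urllib.request.urlopen(url_base + path).read()
    return "<html>%s</html>" % path

pages = download_all(["/a", "/b", "/c"], fake_fetch)
```

Since the pool hands each path to exactly one worker, the `{url: True}` flags and the lock around them would no longer be needed in this variant.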

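6. Known issues

1. Ideally the images should be downloaded at the same time as the chapters, but my attempt at that failed and needs more investigation.

2. Images should go into a folder for each chapter instead of all being placed in a single images folder.

3. The save directory has to be entered as d:\docs\ and not d:\docs; I never added the check for the trailing backslash.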
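Issues 2 and 3 could probably be handled together by building paths with `os.path.join`, which inserts the separator only when it is missing. The following Python 3 sketch is one way to do that under stated assumptions: `image_path` is a hypothetical helper, the sample `href` is made up, and the chapter pattern is borrowed from the commented-out block in the calling code.

```python
import os
import re

CHAPTER_RULE = re.compile(r'\bChapter\w+\b')
TARGET_RULE = re.compile(r'(?<=\btarget\b=)(.*\..*)$')

def image_path(save_dir, img_href):
    """Build save_dir/<chapter>/<file name> for an attachment link.

    Returns None if the link has no chapter name or no target file.
    """
    chapter = CHAPTER_RULE.search(img_href)
    target = TARGET_RULE.search(img_href)
    if chapter is None or target is None:
        return None
    # join() copes with save_dir both with and without a trailing separator
    return os.path.join(save_dir, chapter.group(0), target.group(0))

# Made-up attachment link in the shape the script expects.
href = ("/moin/WxPythonInAction/ChapterOne"
        "?action=AttachFile&do=get&target=figure1.gif")
print(image_path("docs", href))
print(image_path("docs" + os.sep, href))
```

Before saving, the chapter folder would still need to exist; in Python 3, `os.makedirs(os.path.dirname(path), exist_ok=True)` is one way to create it.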

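Appendix

A quick reference table of basic Python regular expression syntax. Last modified 2013-01-15.

Syntax	Description	Example
.	Matches any character except the newline \n	b.c matches bac, bdc
*	Matches the preceding character 0 or more times	b*c matches c or bbbc
+	Matches the preceding character 1 or more times	b+c matches bc or bbbc
?	Matches the preceding character 0 or 1 times	b?c matches c or bc
{m}	Matches the preceding character exactly m times	b{2}c matches bbc
{m,n}	Matches the preceding character m to n times	b{2,5}c matches bbc or bbbbc
[abc]	Matches any one character listed inside []	[bc] matches b or c
\d	Matches a digit, [0-9]	b\dc matches b1c, etc.
\D	Matches a non-digit, equivalent to [^\d]	b\Dc matches bAc
\s	Matches a whitespace character	b\sc matches b c
\S	Matches a non-whitespace character, equivalent to [^\s]	b\Sc matches bac
\w	Matches [A-Za-z0-9_]	b\wc matches bAc, etc.
\W	Equivalent to [^\w]	b\Wc matches b c
\	Escape character	b\\c matches b\c
^	Matches the start of the string	^bc matches bc at the start
$	Matches the end of the string	bc$ matches a string ending in bc
\A	Matches only at the start of the string	\Abc matches bc at the start of the string
\Z	Matches only at the end of the string	bc\Z matches bc at the end of the string
|	Matches either the left or the right expression	b|c matches b or c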
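A few rows of the table can be checked directly with the `re` module. The Python 3 snippet below is only a quick sanity check; the `checks` list is made up for illustration, and `re.fullmatch` is used so that each pattern has to match the whole sample string.

```python
import re

# (pattern, sample text, whether the whole text is expected to match)
checks = [
    (r"b.c", "bac", True),
    (r"b*c", "c", True),
    (r"b+c", "c", False),       # + needs at least one b
    (r"b?c", "bc", True),
    (r"b{2}c", "bbc", True),
    (r"b{2,5}c", "bbbbc", True),
    (r"b\dc", "b1c", True),
    (r"b\Dc", "b1c", False),    # \D rejects digits
    (r"b\sc", "b c", True),
    (r"b\\c", "b\\c", True),    # escaped backslash matches a literal backslash
    (r"b|c", "c", True),
]

results = [(p, t, re.fullmatch(p, t) is not None) for p, t, _ in checks]
for pattern, text, matched in results:
    print(pattern, repr(text), matched)
```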