A first try at a Python crawler: a very simple one

Latest recommended article published on 2024-08-06 09:49:30

5354xyz


Category: Python learning. Tags: crawler, regex, threads

Article link: https://blog.csdn.net/xyz5354/article/details/38322417


The goal is to crawl some sentences from this site: http://www.duanwenxue.com/

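I have only just started with crawlers, so the results are not great. In principle the crawler could follow every link on a page and jump anywhere, but for now it only follows the links found in one block of the page that the regular expression matches. See the screenshot for details.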
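The "follow every link" idea could look roughly like the sketch below. It is written in Python 3 (the code in this post is Python 2) and assumes that article links always have the form `/article/<digits>.html`, as the start URL in this post does. The function name `extract_article_ids` and the sample HTML are made up for illustration and are not part of the original crawler.

```python
import re

# Assumption: article URLs have the form .../article/<digits>.html,
# like the start URL used in this post.
ARTICLE_LINK = re.compile(r'href=["\']?[^"\'>\s]*/article/(\d+)\.html')


def extract_article_ids(html):
    """Return the ids of all article links in html, in order, without duplicates."""
    seen = set()
    ids = []
    for article_id in ARTICLE_LINK.findall(html):
        if article_id not in seen:
            seen.add(article_id)
            ids.append(article_id)
    return ids


# Hypothetical sample page, not real content from the site.
sample_html = '''
<p><a href="http://www.duanwenxue.com/article/307683.html" target="_blank">one</a></p>
<p><a href="/article/307684.html">two</a></p>
<p><a href="/article/307683.html">one again</a></p>
<p><a href="/about.html">about</a></p>
'''
print(extract_article_ids(sample_html))
```

Because this pattern does not depend on the surrounding page layout, it would keep working if the site changed the markup around the links, at the cost of also picking up links from sidebars and footers.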

Here is the code. I am a beginner, so suggestions are welcome:

# -*- coding: utf-8 -*-
import urllib2  
import urllib  
import re  
import threading  
import time
import Queue

class Wenxue_Spider_Model(threading.Thread):
    "If an id is already in the set, do nothing; otherwise put it on the queue and then add it to the set."
    def __init__(self,queue,firstSourceId,sett,count):
        self.ok = False
        self.firstSourceId=firstSourceId
        self.myqueue = queue
        self.sett = sett
        self.count = count
        self.first_page_url = "http://www.duanwenxue.com/article/307683.html"
        threading.Thread.__init__(self)
        
    def getHtml(self,url):
        webPage=urllib.urlopen(url)
        html=webPage.read()
        webPage.close()
        return html
    def processContent(self,content):
        "Strip HTML tags from the content"
        re_h=re.compile('</?\w+[^>]*>')
        content=re_h.sub('',content)
        return content
    def getSourceId(self,url):
        "Extract the article id from a link"
        re_id='.*/article/(.*?)\.html'
        sourceId = re.compile(re_id).findall(url)
        #print type(sourceId),"---"
        return sourceId
    def getContent(self,sourceId):
        url = "".join(["http://www.duanwenxue.com/article/",str(sourceId),".html"])
        #print url
        html = self.getHtml(url)
        # Regex that captures the article text and the links to other articles
        reg='<div id=.*?class=.*?>\s*<div id="s-article-main01" class=.*?></div>\s*<p>(.*)</p>\s*<div class=.*?>\s*<h3>.*</h3>\s*<p>.*<a href="(.*?)" target="_blank">.*</a></p>\s*<p>.*<a href="(.*?)" target="_blank">.*</a></p>\s*</div>\s*<div id="s-article-main02" class="content-in-bottom"></div>\s*</div>\s*<div class=.*?>\s*<span>.*<a href="(.*?)">.*</a> </span> <span>(.*|.*<a href=(.*?)>.*</a> )</span>'
        self.wenxueContent=re.compile(reg).findall(html)
        for res in self.wenxueContent:
            content = self.processContent(res[0])
            print content
            for v in range(1,len(res)):
                thisSourceId=self.getSourceId(res[v])
                #print thisSourceId,"/"
                for _id in thisSourceId:
                    if _id not in self.sett:
                        #print _id
                        self.sett.add(_id)
                        self.myqueue.put(_id)
                        self.count = self.count+1
                    else:
                        pass
        self.ok=True
    
    def run(self):
        while True:
            print self.count
            try:
                # Wait for the next id instead of spinning; stop once the queue stays empty
                sourceid = self.myqueue.get(timeout=5)
            except Queue.Empty:
                break
            self.getContent(sourceid)
        

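# Starting article id
sourceId = "307683"
# Ids that have been discovered but not yet fetched go in this queue
q=Queue.Queue(maxsize = 0)
# The set records which ids have already been collected; anyone who has used another language will know what a set is for
sett = set(["307683"])
# Counter
count = 1
q.put(sourceId)
#wenxueModel = Wenxue_Spider_Model(q,sourceId)
# I originally wanted to start several threads here, but the number of items fetched was not stable and I do not know why.
# Also, this crawler only reaches 515 items, fewer than collecting page by page would, so treat it only as a demonstration of the principle.
Wenxue_Spider_Model(q,sourceId,sett,count).start()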
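On the unstable counts with several threads: one plausible cause (an assumption, not verified against the original run) is that the check `_id not in self.sett` and the following `add` are two separate steps, so two threads can both decide the same id is new. In addition, `count` is a plain integer copied into each thread object, so every thread keeps its own tally. The sketch below, in Python 3, shows one way to make the check-and-add atomic with a lock and to stop cleanly with `Queue.join()`. The link graph `FAKE_LINKS` and the names `fetch_linked_ids` and `crawl` are hypothetical stand-ins, so nothing here touches the network.

```python
import queue
import threading

# Hypothetical link graph standing in for the site: id -> ids linked from that page.
FAKE_LINKS = {
    "1": ["2", "3"],
    "2": ["3", "4"],
    "3": ["1", "4"],
    "4": [],
}


def fetch_linked_ids(article_id):
    """Stand-in for downloading a page and extracting its article ids."""
    return FAKE_LINKS.get(article_id, [])


def crawl(start_id, num_threads=4):
    """Visit every id reachable from start_id exactly once, using several threads."""
    pending = queue.Queue()
    seen = {start_id}
    seen_lock = threading.Lock()
    pending.put(start_id)

    def worker():
        while True:
            article_id = pending.get()
            if article_id is None:        # sentinel: no more work
                pending.task_done()
                return
            try:
                for linked in fetch_linked_ids(article_id):
                    # Check and add under one lock so two threads cannot
                    # both decide that the same id is new.
                    with seen_lock:
                        is_new = linked not in seen
                        if is_new:
                            seen.add(linked)
                    if is_new:
                        pending.put(linked)
            finally:
                pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    pending.join()                         # returns once every queued id is processed
    for _ in threads:
        pending.put(None)                  # wake each worker so it can exit
    for t in threads:
        t.join()
    return seen


print(sorted(crawl("1")))
```

With this layout the total is simply `len(seen)` after `crawl` returns, so no shared counter is needed.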

This is the crawler's output: