Multithreaded Python web crawler gets stuck

I am writing a Python web crawler and want to make it multithreaded. I have finished the basic part, and this is what it does:

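>A thread takes a URL from the queue;

>The thread extracts the links from the page, checks whether each link is already in a pool (a set), and puts the new links into both the queue and the pool;

>The thread writes the URL and the HTTP response to a CSV file.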
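
The second step (check the set, then enqueue) can be sketched as follows. This is a minimal Python 3 illustration, not the code from this post: the class `Frontier` and its method `add_if_new` are hypothetical names, and the sketch assumes the membership check and the insert should happen under a single lock so that two threads cannot both enqueue the same URL.

```python
import queue
import threading


class Frontier:
    """A URL queue plus a 'seen' set that are always updated together."""

    def __init__(self):
        self.pending = queue.Queue()   # URLs waiting to be crawled
        self.seen = set()              # every URL ever enqueued
        self._lock = threading.Lock()  # guards the check-then-add sequence

    def add_if_new(self, url):
        """Enqueue url unless it was seen before; return True if enqueued."""
        with self._lock:
            if url in self.seen:
                return False
            self.seen.add(url)
            self.pending.put(url)
            return True


frontier = Frontier()
results = [
    frontier.add_if_new(u)
    for u in ("http://example.com/", "http://example.com/a", "http://example.com/")
]
```

Doing the check inside the lock matters: if the `not in` test runs before the lock is taken, two threads can both pass it for the same URL and the page is queued twice.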

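However, when I run the crawler it always ends up getting stuck and never exits cleanly. I have read the official Python documentation but still have no clue.

Here is the code: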

#!/usr/bin/env python
#!coding=utf-8

import requests, re, urlparse
import threading
from Queue import Queue
from bs4 import BeautifulSoup

#custom modules and files
from setting import config


class Page:

    def __init__(self, url):
        self.url = url
        self.status = ""
        self.rawdata = ""
        self.error = False
        r = ""
        try:
            r = requests.get(self.url, headers={'User-Agent': 'random spider'})
        except requests.exceptions.RequestException as e:
            self.status = e
            self.error = True
        else:
            if not r.history:
                self.status = r.status_code
            else:
                self.status = r.history[0]
        self.rawdata = r

    def outlinks(self):
        self.outlinks = []
        #links, contains URL, anchor text, nofollow
        raw = self.rawdata.text.lower()
        soup = BeautifulSoup(raw)
        outlinks = soup.find_all('a', href=True)
        for link in outlinks:
            d = {"follow":"yes"}
            d['url'] = urlparse.urljoin(self.url, link.get('href'))
            d['anchortext'] = link.text
            if link.get('rel'):
                if "nofollow" in link.get('rel'):
                    d["follow"] = "no"
            if d not in self.outlinks:
                self.outlinks.append(d)


pool = Queue()
exist = set()
thread_num = 10
lock = threading.Lock()
output = open("final.csv", "a")

#the domain is the start point
domain = config["domain"]
pool.put(domain)
exist.add(domain)


def crawl():
    while True:
        p = Page(pool.get())
        #write data to output file
        lock.acquire()
        output.write(p.url+" "+str(p.status)+"\n")
        print "%s crawls %s" % (threading.currentThread().getName(), p.url)
        lock.release()
        if not p.error:
            p.outlinks()
            outlinks = p.outlinks
            if urlparse.urlparse(p.url)[1] == urlparse.urlparse(domain)[1] :
                for link in outlinks:
                    if link['url'] not in exist:
                        lock.acquire()
                        pool.put(link['url'])
                        exist.add(link['url'])
                        lock.release()
        pool.task_done()


for i in range(thread_num):
    t = threading.Thread(target = crawl)
    t.start()

pool.join()
output.close()

Any help would be greatly appreciated!

Thanks

Marcus
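
Two properties of the code as posted can each produce the hang described. First, the worker threads are not daemon threads and loop forever on `pool.get()`, so once the queue is drained they block there and the interpreter waits for them even after `pool.join()` has returned. Second, any uncaught exception inside `crawl()` kills that worker before it reaches `pool.task_done()`, after which `pool.join()` can never return. The sketch below shows one way to address both. It is written for Python 3 and is an illustration under assumptions rather than a drop-in replacement: the `LINKS` table and `fetch_links` are hypothetical stand-ins for the HTTP fetching and parsing, and `crawled` and `run` are names introduced only for this example.

```python
import queue
import threading

# Hypothetical in-memory "site": page -> links found on it.
# It stands in for requests + BeautifulSoup so the sketch runs offline.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": ["/missing"],
}

pool = queue.Queue()
exist = set()
lock = threading.Lock()
crawled = []


def fetch_links(url):
    """Stand-in for downloading and parsing a page; raises for unknown pages."""
    return LINKS[url]


def crawl():
    while True:
        url = pool.get()
        try:
            links = fetch_links(url)
            with lock:
                crawled.append(url)
                for link in links:
                    # Check and add under the same lock to avoid duplicates.
                    if link not in exist:
                        exist.add(link)
                        pool.put(link)
        except Exception:
            # A failed page is recorded and skipped instead of killing the worker.
            with lock:
                crawled.append(url + " (error)")
        finally:
            # Always balance get() with task_done(), or pool.join() never returns.
            pool.task_done()


def run(start, thread_num=4):
    exist.add(start)
    pool.put(start)
    for _ in range(thread_num):
        # daemon=True: workers blocked in pool.get() do not keep the process alive.
        threading.Thread(target=crawl, daemon=True).start()
    pool.join()  # returns once every queued URL has been marked done
    return sorted(crawled)


result = run("/")
```

In Python 2 the same effect is obtained by setting `t.daemon = True` before `t.start()`. An alternative to daemon threads is to put one sentinel value per worker on the queue after `pool.join()` and have each worker return when it receives it, which lets the threads finish normally.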
