Crawling a Story Website with Multiple Threads

ao91544

Tags: database, python

Original link: http://www.cnblogs.com/vorphan/p/7468727.html

Preface:
To crawl more efficiently, I tried using multiple threads.
The reference code and the project idea for this post come from: http://kexue.fm/archives/4385/

Source code:

# -*- coding: utf-8 -*-
import requests as rq
import re
import time
import datetime
import pymysql
from multiprocessing.dummy import Pool, Queue  # multiprocessing.dummy is the thread-based version of the API
import html
from urllib.request import urlopen
from bs4 import BeautifulSoup
unescape = html.unescape  # unescapes HTML character entities (not used below)

tasks = Queue()  # queue of links to crawl
tasks_pass = set()  # links that have already been queued
results = {}  # result variable
count = 0  # total number of pages crawled
tasks.put('/index.html')  # add the home page to the link queue
tasks_pass.add('/index.html')  # mark the home page as already queued

def main(tasks):
    global results, count, tasks_pass  # threads can share module-level variables easily
    while True:
        url = tasks.get()  # take one link off the queue
        url = 'http://wap.xigushi.com' + url
        page = urlopen(url)  # named `page` so the html module is not shadowed
        bsObj = BeautifulSoup(page.read(), "lxml")
        # The first <meta> tag may be missing or may lack a charset attribute
        charset = bsObj.meta.attrs.get('charset', '') if bsObj.meta else ''
        if charset.lower() == 'gb2312':
            web = rq.get(url).content.decode('gbk')  # the right encoding depends on the actual page
        else:
            web = rq.get(url).content.decode('utf8')  # the right encoding depends on the actual page

        urls = re.findall('href="(/.*?)"', web)  # find all same-site links
        for u in urls:
            if u not in tasks_pass:  # queue only links that have not been queued before
                if re.search('images', u) is None:  # test the link itself, not the current page URL
                    tasks.put(u)
                    tasks_pass.add(u)
                else:
                    print(u, '---------------------------skipping--------------------------------------------')

        text = bsObj.title.get_text()
        print(datetime.datetime.now(), '   ', url, '   ', text)
        # Create the connection and the cursor inside the thread (see the bug notes below).
        # Recent PyMySQL versions accept these arguments as keywords only.
        db = pymysql.connect(host="localhost", user="testuser", password="test123",
                             database="TESTDB", charset='gbk')
        dbc = db.cursor()
        sql = "insert ignore into data1(url,title) values(%s,%s);"
        data = (url, text)
        dbc.execute(sql, data)
        dbc.close()
        db.commit()  # required: autocommit is off by default
        db.close()
        count += 1
        if count % 100 == 0:
            print('%s done.' % count)

pool = Pool(10, main, (tasks,))  # multithreaded crawl; 10 is the number of threads
total = 0
while True:  # if no new link has been queued within 60 seconds, end the script
    time.sleep(60)
    if len(tasks_pass) > total:
        total = len(tasks_pass)
    else:
        break

pool.terminate()
print("terminated normally")
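
The script above starts its workers through the `Pool` initializer and decides it is finished by polling once a minute. As a point of comparison only, and not the approach used in the referenced project, the same queue-plus-seen-set idea can be sketched with `threading` and `Queue.join()`, which returns as soon as every queued link has been processed. The link graph `FAKE_SITE` and the function name `crawl` are hypothetical, chosen so the sketch runs offline.

```python
import threading
from queue import Queue

# Hypothetical link graph standing in for the real site, so no network is needed.
FAKE_SITE = {
    '/index.html': ['/a.html', '/b.html'],
    '/a.html': ['/b.html', '/c.html'],
    '/b.html': ['/index.html'],
    '/c.html': [],
}

def crawl(start, num_threads=4):
    tasks = Queue()            # links waiting to be crawled
    seen = {start}             # links that have already been queued
    visited = []               # links that have been processed
    lock = threading.Lock()    # guards `seen` and `visited`
    tasks.put(start)

    def worker():
        while True:
            url = tasks.get()
            if url is None:    # sentinel: no more work
                tasks.task_done()
                return
            try:
                links = FAKE_SITE.get(url, [])   # stands in for fetch + parse
                with lock:
                    visited.append(url)
                    for link in links:
                        if link not in seen:
                            seen.add(link)
                            tasks.put(link)      # queued before task_done() below
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(num_threads)]
    for t in threads:
        t.start()
    tasks.join()               # blocks until every queued link was processed
    for _ in threads:
        tasks.put(None)        # one sentinel per worker
    for t in threads:
        t.join()
    return sorted(visited)
```

Because new links are put on the queue before the parent page calls `task_done()`, the unfinished-task counter cannot drop to zero while work remains, so no sleep-based idle check is needed.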

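Bugs:

Concurrent database writes:
Source of the answer: https://stackoverflow.com/questions/6650940/interfaceerror-0
Creating the cursor inside the thread and closing it there gave better results than before, but oddly, concurrency errors still appeared after a few more rows. Strange that only some of the writes would fail. My guess was that moving the connection into the thread as well would help, or perhaps dropping the commit altogether? It turned out that without commit nothing was written to the database at all; I had assumed autocommit was enabled.
Resolved: create both the database connection and the cursor inside the thread.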
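
The fix described above, one connection and one cursor per thread plus an explicit commit, can be sketched as follows. To keep the example runnable without a MySQL server, the standard-library `sqlite3` module stands in for `pymysql` here; that swap, the temporary file `DB_PATH`, and the helper names `init_db`, `save_page`, and `count_rows` are assumptions made for illustration. With `pymysql` the shape would be the same: call `pymysql.connect(...)` inside the worker rather than sharing one connection.

```python
import os
import sqlite3
import tempfile
import threading

# Hypothetical stand-in database file; the post itself writes to MySQL via pymysql.
DB_PATH = os.path.join(tempfile.mkdtemp(), "crawl_demo.db")

def init_db():
    conn = sqlite3.connect(DB_PATH)
    conn.execute("CREATE TABLE IF NOT EXISTS data1 (url TEXT PRIMARY KEY, title TEXT)")
    conn.commit()
    conn.close()

def save_page(url, title):
    # The connection is created, used, and closed inside the calling thread,
    # so no connection or cursor object is ever shared between threads.
    conn = sqlite3.connect(DB_PATH, timeout=30)
    try:
        conn.execute("INSERT OR IGNORE INTO data1 (url, title) VALUES (?, ?)", (url, title))
        conn.commit()  # without an explicit commit the row is rolled back on close
    finally:
        conn.close()

def count_rows():
    conn = sqlite3.connect(DB_PATH)
    try:
        return conn.execute("SELECT COUNT(*) FROM data1").fetchone()[0]
    finally:
        conn.close()

init_db()
threads = [
    threading.Thread(target=save_page, args=("/page%d.html" % i, "title %d" % i))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Opening a connection per page is simple but not cheap; opening one connection per worker thread and reusing it for that thread's writes would be a natural refinement.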
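Some links that were not yet in the database were mysteriously skipped:
It turned out to be a filtering problem... my skills really are...
Resolved: changed how URLs are filtered.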
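
The post does not say how the filter was changed, so the following is only one plausible shape for it: apply the skip test to each extracted link rather than to the URL of the page being parsed, and record a link as seen only when it is actually queued. The pattern `SKIP_PATTERN` and the function name `extract_new_links` are hypothetical.

```python
import re

# Assumed skip rule: anything under an images path, plus common static-file suffixes.
SKIP_PATTERN = re.compile(r'images|\.(?:jpg|jpeg|png|gif|css|js)$', re.IGNORECASE)

def extract_new_links(page_html, seen):
    """Return same-site links from page_html that are new and not filtered out.

    `seen` is updated in place with every link that is returned.
    """
    new_links = []
    for link in re.findall(r'href="(/.*?)"', page_html):
        if link in seen:
            continue                    # already queued earlier
        if SKIP_PATTERN.search(link):   # test the link itself, not the current page URL
            continue
        seen.add(link)
        new_links.append(link)
    return new_links
```

Filtering on the current page URL instead of the link would drop every link found on a matching page, which is one way valid links can end up silently skipped.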
The encoding problems are a real headache...
```
encoding error : input conversion failed due to input error, bytes 0xB1 0x80 0xB5 0xC4
```
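Why does this still show up after so many changes? It is clearly a gbk-to-utf8 conversion problem, but I do check the declared charset; some pages are simply messy.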
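
The message above appears to come from the HTML parser rather than from Python's own `decode`, which suggests that decoding the bytes in Python first and handing the parser a `str` may sidestep it. A minimal sketch under that assumption follows; the function name `decode_page` and the fallback order are hypothetical. It treats a declared `gb2312` or `gbk` as `gb18030`, a superset of both, and falls back to replacing undecodable bytes instead of raising.

```python
def decode_page(raw, declared=None):
    """Decode page bytes to str, tolerating wrong or missing charset declarations."""
    candidates = []
    if declared:
        # gb18030 is a superset of gb2312 and gbk, so it accepts more real-world pages.
        if declared.lower() in ('gb2312', 'gbk'):
            candidates.append('gb18030')
        else:
            candidates.append(declared)
    candidates += ['utf-8', 'gb18030']
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next candidate
    # Last resort: keep what can be decoded and mark the rest with U+FFFD.
    return raw.decode('gb18030', errors='replace')
```

The returned string can then be passed to `BeautifulSoup(text, "lxml")` and reused for link extraction, so each page is downloaded and decoded once.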

Another encoding problem:

UnicodeEncodeError: 'gbk' codec can't encode character '\u30fb' in position 86: illegal multibyte sequence
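
This error means some text containing `'\u30fb'` was being encoded as gbk, which has no mapping for that character. The post does not say whether the sink was the console or the database connection opened with `charset='gbk'`, so the sketch below only shows a generic workaround: replace characters gbk cannot represent before handing the text to a gbk-only destination. The helper name `gbk_safe` is hypothetical. Switching the destination to a Unicode-capable encoding would avoid the loss entirely, but whether that suits the author's table is an assumption.

```python
def gbk_safe(text, errors='replace'):
    """Return text with characters that gbk cannot encode replaced (by '?' by default)."""
    return text.encode('gbk', errors=errors).decode('gbk')

title = 'a\u30fbb'  # contains the character from the error message
safe_title = gbk_safe(title)
```

Passing `errors='ignore'` drops the offending characters instead of substituting a question mark.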

Network error:

urllib.error.HTTPError: HTTP Error 503: Forwarding failure
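
A 503 is usually temporary, so one common response is to retry the request a few times with a growing pause instead of letting the worker thread die. The sketch below is one such approach, not something taken from the original post; the function name `fetch_with_retry`, its parameters, and the default timeout are hypothetical. The `fetch` argument exists so the retry logic can be exercised without a network.

```python
import time
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def fetch_with_retry(url, fetch=None, retries=3, delay=1.0):
    """Fetch url, retrying on 5xx responses and connection errors."""
    if fetch is None:
        def fetch(u):
            return urlopen(u, timeout=10).read()
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except HTTPError as err:
            # 4xx errors will not go away on retry; 5xx errors often do.
            if err.code < 500 or attempt == retries:
                raise
        except URLError:
            if attempt == retries:
                raise
        time.sleep(delay * (2 ** attempt))  # exponential backoff between attempts
```

In the crawler, a page that still fails after the last retry could be logged and skipped so that one bad URL does not stop the rest of the crawl.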

Reposted from: https://www.cnblogs.com/vorphan/p/7468727.html