聚茶吧: a place where tea content gathers
I have been thinking about building a website around the theme of tea: scrape articles from other sites on the same topic (tea), aggregate them into one place, and let the content itself do the work of attracting visitors. The site is called 聚茶吧 (the domain is jucha8.com, which plays on the pronunciation of 聚茶吧, with the 8 read as "ba"). The plan is to crawl roughly one million tea-related articles to fill out the site and pull in as much traffic as possible...
Implementing the Python crawler
So how is the 聚茶吧 crawler implemented? Suppose the site to be crawled is A; then:
import requests
from lxml import etree
import time
from threading import Thread

from app.models import Article


class Spider(Thread):

    def __init__(self, name, start, end):
        super(Spider, self).__init__()
        self.articles = []   # buffer of parsed articles, flushed in batches
        self._start = start  # first article id (inclusive); _start/_end avoid shadowing Thread.start()
        self._end = end      # last article id (exclusive)
        self.name = name

    def run(self):
        for i in range(self._start, self._end):
            url = 'http://www.a.com/category/show-{0}.html'.format(i)
            self.crawl(url)
        # flush whatever is still sitting in the buffer
        Article.save_many(self.articles)
        self.articles = []

    def crawl(self, url):
        print('thread-{0} {1}'.format(self.name, url))
        try:
            resp = requests.get(url, timeout=10)  # timeout so a dead page cannot hang the thread
        except requests.RequestException:
            return  # skip pages that fail to download
        html = etree.HTML(resp.text)
        # the article title is the page's <h1>
        title_e = html.xpath('//h1')
        if not title_e:
            return
        title = title_e[0].text
        if not title:  # guard: the <h1> may carry no direct text
            return
        # the article body lives in <div id="article">
        ps_list = html.xpath("//div[@id='article']")
        if not ps_list:
            return
        body = []
        # string(.) collapses the div to plain text; rebuild a <p> per line
        ps = ps_list[0].xpath('string(.)')
        if ps.strip():
            paras = ps.strip().split('\n')
            for p in paras:
                body.append('<p>{0}</p>'.format(p))
        # cap the title at 128 characters to fit the db column
        if len(title) > 128:
            title = title[:128]
        self.articles.append(Article(**{'title': title, 'body': ''.join(body)}).data)
        # write to the database in batches of ~20 rows
        if len(self.articles) > 20:
            Article.save_many(self.articles)
            self.articles = []
        time.sleep(1)  # be polite: at most one request per second per thread


if __name__ == '__main__':
    # each thread takes a disjoint slice of the article id space
    s1 = Spider(1, 2, 8600)
    s2 = Spider(2, 8600, 17200)
    s3 = Spider(3, 17200, 25800)
    s4 = Spider(4, 25800, 34300)
    s1.start()
    s2.start()
    s3.start()
    s4.start()
    s1.join()
    s2.join()
    s3.join()
    s4.join()
    print('Done')
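The four ranges above split the id space 2..34300 into roughly equal, disjoint slices, so no page is ever fetched twice. If the id range or the thread count changes, the slices can be computed instead of hardcoded; a minimal sketch, where make_spiders is a hypothetical helper reusing the Spider class above:

def make_spiders(n_threads, first_id, last_id):
    # split [first_id, last_id) into n_threads contiguous, disjoint slices
    step = (last_id - first_id + n_threads - 1) // n_threads
    spiders = []
    for k in range(n_threads):
        start = first_id + k * step
        end = min(start + step, last_id)
        spiders.append(Spider(k + 1, start, end))
    return spiders

spiders = make_spiders(4, 2, 34300)
for s in spiders:
    s.start()
for s in spiders:
    s.join()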
The database operations (the Article model and Article.save_many) are built on Hare, an open-source project of mine.
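Hare's own model API isn't shown in this post, so as a rough, hypothetical stand-in for what Article.save_many has to do under the hood, here is a batched insert written directly against pymysql (the article table and its title/body columns are assumptions for illustration, not Hare's actual interface):

# Illustrative stand-in only; the real project goes through the Hare ORM.
import pymysql

conn = pymysql.connect(host='localhost', user='root', password='***',
                       database='jucha', charset='utf8mb4')

def save_many(articles):
    # articles: a list of dicts, e.g. [{'title': ..., 'body': ...}, ...]
    if not articles:
        return
    with conn.cursor() as cur:
        cur.executemany(
            'INSERT INTO article (title, body) VALUES (%s, %s)',
            [(a['title'], a['body']) for a in articles],
        )
    conn.commit()

Batching like this cuts round trips compared with one INSERT per article, which is presumably why the crawler buffers about 20 articles before each save. Note that in the threaded crawler each thread would need its own connection, since a pymysql connection is not thread-safe.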