I used Scrapy on a project a while back, so here is a simple demo: crawling listings from Anjuke (安居客) with Scrapy.
For how to set up a Scrapy project and the basic Scrapy concepts, see https://blog.csdn.net/mingover/article/details/80717974
Full source code:
https://github.com/huawumingguo/scrapy_demo
Analyzing the Anjuke pages
Do we need to log in?
It turns out Anjuke listings can be browsed without logging in. To keep things simple we only crawl a single city, e.g. Guangzhou:
https://guangzhou.anjuke.com/sale/rd1/?kw=&from=sugg
This is a listing page with a "next page" link. We parse the first page, then add both the detail-page URLs and the next-page URL to the crawl queue.
The rough flow:

def parse():
    # get the list of detail items
    for itemnode in items:
        yield Request(url=parse.urljoin(response.url, itemurl[0]), meta={"refer_url": response.url},
                      callback=self.parse_detail, headers=self.headers)
    # get the next-page link nexturl
    yield Request(url=parse.urljoin(response.url, nexturl), callback=self.parse, headers=self.headers)

def parse_detail():
    # parse the detail page, then hand the item over to the pipeline
    item = citemloader.load_item()
    yield item

Once the pipeline receives the item, it writes it to MySQL or some other persistence layer.
Handling the Anjuke responses
The spider code:
# -*- coding: utf-8 -*-
import datetime
import logging
import re
from urllib import parse

import scrapy
from scrapy.http import Request
from scrapy.loader import ItemLoader

from anjuke.items import *
from anjuke.utils.LoggerUtil import LogUtils

loggerDataName = "anjuke"
log_dataInfo_path = "logs/anjuke.log"
log = LogUtils.createLogger(loggerDataName, log_dataInfo_path)


class AnjukeGzSpider(scrapy.Spider):
    name = 'anjuke_gz'
    allowed_domains = ['anjuke.com']
    start_urls = ['https://guangzhou.anjuke.com/sale/']
    headers = {
        "HOST": "guangzhou.anjuke.com",
        "Referer": "https://guangzhou.anjuke.com",
        'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0"
    }
    def parse(self, response):
        items = response.css("#houselist-mod-new > li")
        log.info("processing list page: %s, %s items" % (response.url, len(items)))
        # walk the listing and queue each detail page
        for itemnode in items:
            itemurl = itemnode.css('a.houseListTitle::attr(href)').extract()
            itemtitle = itemnode.css('a.houseListTitle::attr(title)').extract()
            log.info("list entry: %s, title: %s" % (itemurl, itemtitle))
            yield Request(url=parse.urljoin(response.url, itemurl[0]), meta={"refer_url": response.url},
                          callback=self.parse_detail, headers=self.headers)
        # queue the next page, if there is one
        nexturlArr = response.css('#content > div.sale-left > div.multi-page > a.aNxt::attr(href)').extract()
        if nexturlArr:
            nexturl = nexturlArr[0]
            yield Request(url=parse.urljoin(response.url, nexturl), callback=self.parse, headers=self.headers)
    def parse_detail(self, response):
        # e.g. "https://guangzhou.anjuke.com/prop/view/A1285389340?from=filter&spread=commsearch_p&position=361&kwtype=filter&now_time=1529215280"
        refer_url = response.meta.get("refer_url", '')
        log.info("processing detail page: %s, refer_url: %s" % (response.url, refer_url))
        match_re = re.match(r".*view/(.*)\?.*", response.url)
        id = match_re.group(1)
        citemloader = OrderItemLoader(item=OrderItem(), response=response)
        citemloader.add_value("id", id)
        citemloader.add_css("title", "#content > div.clearfix.title-guarantee > h3::text")
        citemloader.add_css("community_id",
                            "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.first-col.detail-col > dl:nth-child(1) > dd > a::attr(href)")
        citemloader.add_css("community_name",
                            "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.first-col.detail-col > dl:nth-child(1) > dd > a::text")
        citemloader.add_css("area1",
                            "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.first-col.detail-col > dl:nth-child(2) > dd > p > a:nth-child(1)::text")
        citemloader.add_css("area2",
                            "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.first-col.detail-col > dl:nth-child(2) > dd > p > a:nth-child(2)::text")
        citemloader.add_css("build_time",
                            "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.first-col.detail-col > dl:nth-child(3) > dd::text")
        citemloader.add_css("address",
                            "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.first-col.detail-col > dl:nth-child(2) > dd > p::text")
        citemloader.add_css("housetype",
                            "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.first-col.detail-col > dl:nth-child(4) > dd::text")
        citemloader.add_css("housestructure",
                            "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.second-col.detail-col > dl:nth-child(1) > dd::text")
        citemloader.add_css("space",
                            "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.second-col.detail-col > dl:nth-child(2) > dd::text")
        citemloader.add_css("building_floors",
                            "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.second-col.detail-col > dl:nth-child(4) > dd::text")
        citemloader.add_css("house_floor",
                            "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.second-col.detail-col > dl:nth-child(4) > dd::text")
        citemloader.add_css("direction_face",
                            "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.second-col.detail-col > dl:nth-child(3) > dd::text")
        citemloader.add_css("unit_price",
                            "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.third-col.detail-col > dl:nth-child(1) > dd::text")
        citemloader.add_css("consult_first_pay",
                            "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.third-col.detail-col > dl:nth-child(2) > dd::text")
        citemloader.add_css("decoration_degree",
                            "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.third-col.detail-col > dl:nth-child(4) > dd::text")
        nowTime = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        citemloader.add_value("refer_url", refer_url)
        citemloader.add_value("now_time", nowTime)
        item = citemloader.load_item()
        yield item
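With the spider in place it can be run from the project root with scrapy crawl anjuke_gz (see the jobs section further down for persisting state between runs).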
The most time-consuming part is working out all these CSS selectors; most of the value cleanup is then done with the regular expressions defined on the item.
Defining the item
import re

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst


def re_community_id(value):
    match_re = re.match(r".*view/(\d+)$", value)
    result = match_re.group(1)
    return result


def re_address(value):
    try:
        result = re.match(r".*(\s+)(.*)", value).group(2)
    except Exception as e:
        print(e)
        result = 'aaaa'  # placeholder fallback when the address cannot be parsed
    return result


def re_build_time(value):
    return re.match("(.*)年", value).group(1)


def re_space(value):
    return re.match("(.*)平方米", value).group(1)


def re_building_floors(value):
    return re.match(r".*共(\d+)层", value).group(1)


def re_house_floor(value):
    # usually "高层(共10层)", but sometimes just "共5层"
    target = re.match(r"(.*)\(", value)
    if target:
        return target.group(1)
    else:
        return value.strip()


def re_unit_price(value):
    return re.match("(.*)元.*", value).group(1)


def get_second(arr):
    if len(arr) > 0:
        if len(arr) > 1:
            return arr[1]
        else:
            return arr[0]
    else:
        return "cccc"  # placeholder fallback when nothing was extracted


def re_consult_first_pay(value):
    try:
        target = value.strip()
        result = re.match("(.*)万", target).group(1)
    except Exception as e:
        print(e)
        result = 0
    return int(float(result) * 10000)


def re_consult_month_pay(value):
    return re.match("(.*)元.*", value).group(1)


def getval(value):
    return value


def getridof_blank(value):
    # output processor: take the first extracted value and strip all whitespace
    if value:
        valstr = value[0]
        target = ''.join(valstr.split())
        return target


class OrderItemLoader(ItemLoader):
    # custom ItemLoader: by default only keep the first extracted value
    default_output_processor = TakeFirst()
class OrderItem(scrapy.Item):
    id = scrapy.Field()
    title = scrapy.Field(
        output_processor=getridof_blank
    )
    community_id = scrapy.Field(
        input_processor=MapCompose(re_community_id),
    )
    community_name = scrapy.Field()
    area1 = scrapy.Field()
    area2 = scrapy.Field()
    address = scrapy.Field(
        input_processor=MapCompose(re_address),
        output_processor=get_second
    )
    build_time = scrapy.Field(
        input_processor=MapCompose(re_build_time),
    )
    housetype = scrapy.Field()
    housestructure = scrapy.Field(
        output_processor=getridof_blank
    )
    space = scrapy.Field(
        input_processor=MapCompose(re_space),
    )
    building_floors = scrapy.Field(
        input_processor=MapCompose(re_building_floors),
    )
    house_floor = scrapy.Field(
        input_processor=MapCompose(re_house_floor)
    )
    direction_face = scrapy.Field()
    unit_price = scrapy.Field(
        input_processor=MapCompose(re_unit_price)
    )
    consult_first_pay = scrapy.Field(
        input_processor=MapCompose(re_consult_first_pay)
    )
    decoration_degree = scrapy.Field()
    hosedesc = scrapy.Field()
    refer_url = scrapy.Field()
    now_time = scrapy.Field()
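To make the processor flow concrete: the input processor runs on every extracted value, and the output processor then reduces the resulting list. Roughly, for the space field (the sample value here is made up for illustration):

from scrapy.loader.processors import MapCompose, TakeFirst

# assuming re_space from the item definitions above
values = MapCompose(re_space)(["89平方米"])  # -> ["89"]
print(TakeFirst()(values))                   # -> "89"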
The ItemLoader is where the HTML actually gets parsed. Be patient and tune the selectors one by one; this is where most of the time goes.
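A convenient way to iterate on selectors is the Scrapy shell (see the User-Agent note further down), for example with the listing selector from the spider above:

scrapy shell "https://guangzhou.anjuke.com/sale/"
>>> response.css("#houselist-mod-new > li a.houseListTitle::attr(href)").extract_first()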
Dumping into MySQL
There are several options for persistence; JSON, MySQL and so on all work. Here are a few pipelines:
import codecs
import json

import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi


class JsonWithEncodingPipeline(object):
    # export items to a custom JSON file
    def __init__(self):
        self.file = codecs.open('anjuke_orderlist.json', 'w', encoding="utf-8")

    def process_item(self, item, spider):
        lines = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(lines)
        self.file.flush()
        return item

    def spider_closed(self, spider):
        self.file.close()
class MysqlPipeline(object):
    # write to MySQL synchronously
    # note: the table/columns below are a generic example; adapt them to OrderItem before use
    def __init__(self):
        self.conn = MySQLdb.connect('192.168.0.106', 'root', 'root', 'article_spider', charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        insert_sql = """
            insert into jobbole_article(title, url, create_date, fav_nums)
            VALUES (%s, %s, %s, %s)
        """
        self.cursor.execute(insert_sql, (item["title"], item["url"], item["create_date"], item["fav_nums"]))
        self.conn.commit()
        return item
class MysqlTwistedPipline(object):
    # write to MySQL asynchronously via Twisted's adbapi connection pool
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWORD"],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            # use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # run the MySQL insert asynchronously through Twisted
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle failures
        return item

    def handle_error(self, failure, item, spider):
        # errors raised by the asynchronous insert end up here
        print(failure)

    def do_insert(self, cursor, item):
        # build the item-specific SQL and execute the insert
        insert_sql, params = item.get_insert_sql()
        insert_sql = insert_sql.strip()
        try:
            cursor.execute(insert_sql, params)
        except Exception as e:
            print(e)
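do_insert above expects the item to provide a get_insert_sql() method, which is not shown in this post (it lives in the repo). A minimal sketch of what such a method on OrderItem might look like, assuming a hypothetical anjuke_order table whose columns mirror the fields:

    def get_insert_sql(self):
        # hypothetical table and columns; adjust to the real schema
        insert_sql = """
            insert into anjuke_order(id, title, community_id, community_name, unit_price, now_time)
            values (%s, %s, %s, %s, %s, %s)
        """
        params = (self.get("id"), self.get("title"), self.get("community_id"),
                  self.get("community_name"), self.get("unit_price"), self.get("now_time"))
        return insert_sql, params

Whichever pipeline is used has to be enabled in settings.py, for example (assuming the pipelines live in anjuke/pipelines.py):

ITEM_PIPELINES = {
    'anjuke.pipelines.MysqlTwistedPipline': 300,
}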
How do we avoid re-crawling or re-inserting data we already have?
Within a run, Scrapy's scheduler already filters duplicate requests by fingerprint, and with a JOBDIR (see the jobs section below) that seen-request state survives a restart.
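On the database side, one simple option (a suggestion, not something in the original code) is to make the listing id the primary key and switch the insert to INSERT ... ON DUPLICATE KEY UPDATE, so a re-crawled listing updates the existing row instead of creating a duplicate:

        insert_sql = """
            insert into anjuke_order(id, title, unit_price, now_time)
            values (%s, %s, %s, %s)
            on duplicate key update title=values(title), unit_price=values(unit_price), now_time=values(now_time)
        """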
Miscellaneous notes
How do I move the workon (virtualenvwrapper) directory?
The default workon directory is C:\Users\xxxx\Envs
Just point the WORKON_HOME environment variable at the directory you want.
Existing virtualenvs can simply be copied from the old directory into the new one.
The scrapy shell needs a User-Agent
Reference: http://www.bubuko.com/infodetail-2145085.html
scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36" "https://guangzhou.anjuke.com/prop/view/A1275284529?from=filter&spread=commsearch_p&position=1&kwtype=filter&now_time=1529290160"
In the spider itself we let a third-party library (fake_useragent) handle this, via a downloader middleware:
from fake_useragent import UserAgent


class RandomUserAgentMiddlware(object):
    # rotate the User-Agent on every request
    def __init__(self, crawler):
        super(RandomUserAgentMiddlware, self).__init__()
        self.ua = UserAgent()
        self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random")

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        def get_ua():
            return getattr(self.ua, self.ua_type)

        tmpua = get_ua()
        request.headers.setdefault('User-Agent', tmpua)
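For the middleware to take effect, register it in settings.py (assuming it lives in anjuke/middlewares.py) and disable the built-in UserAgentMiddleware:

DOWNLOADER_MIDDLEWARES = {
    'anjuke.middlewares.RandomUserAgentMiddlware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
RANDOM_UA_TYPE = "random"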
How do I use a proxy?
A proxy is really just request.meta["proxy"] = "xxxx"
Set it on the request and you're done. In practice, free proxies found online tend to expire quickly, so we need a proxy pool that fetches proxies dynamically and skips the ones that stop working.
See GetProxy.py in the source code.
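A minimal sketch of how such a pool can be plugged into a downloader middleware; get_random_proxy and its import path are hypothetical stand-ins for whatever GetProxy.py actually exposes:

from anjuke.utils.GetProxy import get_random_proxy  # hypothetical import path


class RandomProxyMiddleware(object):
    # attach a (hopefully still alive) proxy to every outgoing request
    def process_request(self, request, spider):
        proxy = get_random_proxy()  # e.g. "http://1.2.3.4:8080"
        if proxy:
            request.meta["proxy"] = proxy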
How do I extract raw HTML?
With response.css, ::text only returns the text nodes, so the HTML markup itself is not captured.
To get the full HTML of an element, simply leave off ::text.
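For example (the selector is just an illustration):

response.css("div.houseInfo-wrap::text").extract()  # text nodes only
response.css("div.houseInfo-wrap").extract()        # full HTML of the matched elements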
Running pip fails with: Fatal error in launcher: Unable to create process using '"'
https://blog.csdn.net/cjeric/article/details/73518782
On string encoding:
https://www.cnblogs.com/geekard/archive/2012/10/04/python-string-endec.html
Persisting crawl state with jobs
scrapy crawl anjuke_gz -s JOBDIR=jobs/job1
To stop the crawl, press Ctrl+C once and let Scrapy shut down gracefully.
How do I pass data along to the next step?
Add a meta dict to the Request; it is handed to the response object in the callback:
yield Request(url=parse.urljoin(response.url, itemurl[0]), meta={"refer_url": response.url},
              callback=self.parse_detail, headers=self.headers)
....
Then read it from the response:
refer_url = response.meta.get("refer_url", '')
With a job, can we avoid repeating work that was already done?
Yes. After shutting down and restarting with the same job directory, the crawl resumes on top of what was already fetched.
Postscript
One odd thing about Anjuke: I checked several cities and every single one shows exactly 50 pages of listings, so the listing data does not look very trustworthy. This code is best kept just for learning.
https://www.zhihu.com/question/20594581/answer/15585452
Installing the packages on Linux
Installing sqlite3
sqlite3 has a compatibility issue that pip alone cannot solve:
https://blog.csdn.net/sparkexpert/article/details/69944835
https://github.com/sloria/TextBlob/issues/173
In short, install libsqlite3-dev and then rebuild Python 3:
1. sudo apt-get install libsqlite3-dev
2. (I skipped this step.) Or install the extra packages suggested on the pyenv wiki: apt-get install -y make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev xz-utils tk-dev
3. In the downloaded Python source tree, rebuild and install Python with: ./configure --enable-loadable-sqlite-extensions && make && sudo make install
Installing Twisted: this one also has a quirk and could not be installed directly with pip.
Download it from https://pypi.org/project/Twisted/#files and install it manually; search online for the exact steps.
pip install scrapy
pip install fake_useragent
pip install requests
pip install kafka
pip install redis
pip install pillow
apt-get install libmysqlclient-dev
pip install mysqlclient