Fixing Sentry Not Collecting Exceptions Generated by Web Crawlers

Recently the business team reported that the numbers in the Sentry SMS alerts did not match what they saw on the Sentry web UI.
The alert value is read from this API:

http://101.1.251.101:9100/api/0/projects/sentry/mail/stats/?stat=received

I installed Sentry in a test environment and reproduced the issue. See "Sentry Installation and Configuration": https://clevercode.blog.csdn.net/article/details/105880652 .

Looking at the nginx access log, the Sentry clients report events to http://10.1.20.101:9100/api/4/store/ , and a large number of those requests were answered with HTTP 403. In the source, a 403 here comes from raising APIForbidden, but there are roughly ten places in the code that can raise it. The question was how to find which one was producing these errors.
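Before digging further, it helped to get a feel for how many store requests were being rejected. A rough tally script (my own helper, not from the original setup; it assumes nginx's default combined log format, where the status code is the ninth whitespace-separated field, and a hypothetical log path):

# Count HTTP status codes of /store/ requests in the nginx access log.
from collections import Counter

counts = Counter()
with open('/Data/logs/nginx/access.log') as f:  # hypothetical path
    for line in f:
        if '/store/' not in line:
            continue
        parts = line.split()
        if len(parts) > 8:
            counts[parts[8]] += 1  # $status field in the combined format

print(counts)  # e.g. Counter({'200': 12345, '403': 678})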

This is where a traffic capture tool helped: GoReplay (see "Traffic Recording and Replay Tool - GoReplay": https://blog.csdn.net/CleverCode/article/details/101423570).
I captured the HTTP requests and responses on port 9100 of 10.1.20.101:

/Data/apps/gor/gor --input-raw :9100 --output-file /tmp/sentry.gor --input-raw-track-response --http-allow-url /store/
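Here --input-raw :9100 captures the raw traffic on port 9100, --output-file writes the recorded traffic to /tmp/sentry.gor, --input-raw-track-response records the responses together with the requests, and --http-allow-url /store/ keeps only requests whose URL matches /store/.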

Analyzing sentry.gor shows that the 403 responses carry the body {"error":"Event dropped due to filter"}:

2 eb2958ceb5974ace946df51e345cf0edaa8476bf 1588233398959704818 30501033
HTTP/1.0 403 FORBIDDEN
Content-Length: 39
Expires: Thu, 30 Apr 2020 07:56:38 GMT
X-Content-Type-Options: nosniff
Content-Language: en
X-Sentry-Error: Event dropped due to filter
Vary: Accept-Language, Cookie
Last-Modified: Thu, 30 Apr 2020 07:56:38 GMT
X-XSS-Protection: 1; mode=block
Cache-Control: max-age=0
X-Frame-Options: deny
Content-Type: application/json

{"error":"Event dropped due to filter"}

Searching the source for "Event dropped due to filter" leads to /Data/apps/ops4env/lib/python2.7/site-packages/sentry/web/api.py, which contains raise APIForbidden('Event dropped due to filter'). The route that maps to this view is registered in urls.py:

vi /Data/apps/ops4env/lib/python2.7/site-packages/sentry/web/urls.py

url(r'^api/(?P<project_id>[\w_-]+)/store/$', api.StoreView.as_view(),
    name='sentry-api-store'),

Check the corresponding view:
vi /Data/apps/ops4env/lib/python2.7/site-packages/sentry/web/api.py

class StoreView(APIView):
   
    def post(self, request, **kwargs):
        try:
            data = request.body
        except Exception as e:
            logger.exception(e)
            # We were unable to read the body.
            # This would happen if a request were submitted
            # as a multipart form for example, where reading
            # body yields an Exception. There's also not a more
            # sane exception to catch here. This will ultimately
            # bubble up as an APIError.
            data = None

        response_or_event_id = self.process(request, data=data, **kwargs)
        if isinstance(response_or_event_id, HttpResponse):
            return response_or_event_id
        return HttpResponse(json.dumps({
            'id': response_or_event_id,
        }), content_type='application/json')

   

    def process(self, request, project, auth, helper, data, **kwargs):
        
        #.......

        if helper.should_filter(project, data, ip_address=remote_addr):
            app.tsdb.incr_multi([
                (app.tsdb.models.project_total_received, project.id),
                (app.tsdb.models.project_total_blacklisted, project.id),
                (app.tsdb.models.organization_total_received, project.organization_id),
                (app.tsdb.models.organization_total_blacklisted, project.organization_id),
            ])
            metrics.incr('events.blacklisted')
            event_filtered.send_robust(
                ip=remote_addr,
                project=project,
                sender=type(self),
            )
            raise APIForbidden('Event dropped due to filter')

        #.......

        return event_id

This shows that raise APIForbidden('Event dropped due to filter') is reached when helper.should_filter(project, data, ip_address=remote_addr) returns True, i.e. the event matched an inbound filter rule. Note in the code above that a filtered event still increments project_total_received (the counter behind the stat=received API the alert reads) as well as project_total_blacklisted, while the event itself is never stored. That is exactly why the SMS alert numbers are higher than what the Sentry page shows.
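This can be verified against the same stats API the alert uses, by comparing the received and blacklisted counters. A minimal sketch (my own, not part of the original setup; the Bearer-token auth and the [timestamp, count] response shape are assumptions about this Sentry version):

# Compare "received" vs "blacklisted" for the same project via the stats API.
import requests

BASE = 'http://101.1.251.101:9100/api/0/projects/sentry/mail/stats/'
HEADERS = {'Authorization': 'Bearer <AUTH_TOKEN>'}  # placeholder token

def total(stat):
    resp = requests.get(BASE, params={'stat': stat}, headers=HEADERS)
    resp.raise_for_status()
    # Each point is a [timestamp, count] pair; sum the counts over the window.
    return sum(count for _, count in resp.json())

received = total('received')        # everything that hit /api/N/store/
blacklisted = total('blacklisted')  # events dropped by inbound filters
print('received=%d blacklisted=%d stored~=%d'
      % (received, blacklisted, received - blacklisted))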

In the test environment I modified should_filter in the Sentry source to log which filter classes are evaluated and which of them are enabled.
vi /Data/apps/ops4env/lib/python2.7/site-packages/sentry/coreapi.py

    def add_log(self, msg):
        """Append a debug message to a temporary log file."""
        f = open("/tmp/sentry.log", 'a+')
        f.write(msg)
        f.close()



    def should_filter(self, project, data, ip_address=None):
        # TODO(dcramer): read filters from options such as:
        # - ignore errors from spiders/bots
        # - ignore errors from legacy browsers
        if ip_address and not is_valid_ip(ip_address, project):
            return True

        for filter_cls in filters.all():
            filter_obj = filter_cls(project)
            self.add_log("s1 should_filter:" + str(filter_obj) + "\n")
            if filter_obj.is_enabled():
                self.add_log("m1 is_enabled:" + str(filter_obj) + "\n")
            if filter_obj.is_enabled() and filter_obj.test(data):
                return True

        return False

The log shows four filter classes, but only WebCrawlersFilter is enabled:

s1 should_filter:<sentry.filters.browser_extensions.BrowserExtensionsFilter object at 0x7fce50056350>
s1 should_filter:<sentry.filters.web_crawlers.WebCrawlersFilter object at 0x7fce500564d0>
m1 is_enabled:<sentry.filters.web_crawlers.WebCrawlersFilter object at 0x7fce500564d0>
s1 should_filter:<sentry.filters.legacy_browsers.LegacyBrowsersFilter object at 0x7fce50056350>
s1 should_filter:<sentry.filters.localhost.LocalhostFilter object at 0x7fce500564d0>

Checking the WebCrawlersFilter source shows default = True, i.e. unless a project explicitly turns this inbound filter off, crawler filtering is enabled by default.

vim /Data/apps/ops4env/lib/python2.7/site-packages/sentry/filters/web_crawlers.py

from __future__ import absolute_import

import re

from .base import Filter

# not all of these agents are guaranteed to execute JavaScript, but to avoid
# overhead of identifying which ones do, and which ones will over time we simply
# target all of the major ones
CRAWLERS = re.compile(r'|'.join((
   # various Google services
   r'AdsBot',
   # Google Adsense
   r'Mediapartners',
   # Google+ and Google web search
   r'Google',
   # Bing search
   r'BingBot',
   # Baidu search
   r'Baiduspider',
   # Yahoo
   r'Slurp',
   # Sogou
   r'Sogou',
   # facebook
   r'facebook',
   # Alexa
   r'ia_archiver',
   # Generic bot
   r'bot[\/\s\)\;]',
   # Generic spider
   r'spider[\/\s\)\;]',
)), re.I)


class WebCrawlersFilter(Filter):
   id = 'web-crawlers'
   name = 'Filter out known web crawlers'
   description = 'Some crawlers may execute pages in incompatible ways which then cause errors that are unlikely to be seen by a normal user.'
   default = True

   def get_user_agent(self, data):
       try:
           for key, value in data['sentry.interfaces.Http']['headers']:
               if key.lower() == 'user-agent':
                   return value
       except LookupError:
           return ''

   def test(self, data):
       # TODO(dcramer): we could also look at UA parser and use the 'Spider'
       # device type
       user_agent = self.get_user_agent(data)
       if not user_agent:
           return False
       return bool(CRAWLERS.search(user_agent))
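To double-check which user agents this pattern actually catches, here is a quick standalone test (my own snippet; it only reuses the CRAWLERS regex above, and the sample user-agent strings are merely illustrative):

# Verify that CRAWLERS matches typical crawler user agents but not a browser.
from sentry.filters.web_crawlers import CRAWLERS

samples = [
    ('Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)', True),
    ('Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)', True),
    ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0 Safari/537.36', False),
]
for ua, expected in samples:
    matched = bool(CRAWLERS.search(ua))
    print(matched, ua)
    assert matched == expected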

Change the source to default = False to disable the crawler filter, then restart Sentry:

# export SENTRY_CONF="/Data/apps/sentry"
# sentry run web
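To confirm the change, one can replay a request with a crawler User-Agent against the store endpoint and check that it is no longer answered with 403. A rough sketch (my own, not from the article; the DSN key/secret are placeholders, and the X-Sentry-Auth format and payload shape follow the old Sentry protocol, so they may need adjusting for your version):

# Send a minimal test event whose HTTP interface carries a crawler User-Agent.
import json
import uuid
import requests

url = 'http://10.1.20.101:9100/api/4/store/'
auth = ('Sentry sentry_version=7, sentry_client=test/0.1, '
        'sentry_key=<PUBLIC_KEY>, sentry_secret=<SECRET_KEY>')  # placeholders
event = {
    'event_id': uuid.uuid4().hex,
    'message': 'crawler filter test',
    'platform': 'python',
    'sentry.interfaces.Http': {
        'url': 'http://example.com/',
        'method': 'GET',
        'headers': [['User-Agent', 'Baiduspider/2.0']],
    },
}
resp = requests.post(url, data=json.dumps(event), headers={
    'X-Sentry-Auth': auth,
    'Content-Type': 'application/json',
})
# Before the change this returned 403 "Event dropped due to filter";
# afterwards it should return 200 with the new event id.
print(resp.status_code, resp.text)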

Technical Exchange

CleverCode is an architect. For technical discussion or questions, join the QQ group he created (Architects' Club): 517133582, and talk with architects from Tencent, Alibaba, Baidu, Sina and other companies. The goal of the Architects' Club is to help you grow into an architect!
