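Today writes to our production InfluxDB started failing. The logs showed a flood of HTTP 500 responses, all of them against a single database, which was frustrating: if every database returned 500 the cause would be easier to guess, but 500s on just one database are harder to track down.

The writes normally come from Python code, which is awkward to debug, so I simply used curl to write a point to a test measurement.

The log entries looked like this (note: when something goes wrong, check the logs; the log location can be found in the InfluxDB configuration file):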

[httpd] xxxxxxx - - [24/May/2019:14:01:08 +0800] "POST /write?db=api HTTP/1.1" 500 0 "-" "python-requests/2.19.0" 52f34cb3-7de9-11e9-bb7d-000000000000 34190
[httpd] xxxxxxx - - [24/May/2019:14:01:08 +0800] "POST /write?db=api HTTP/1.1" 500 0 "-" "python-requests/2.18.4" 52f82dfd-7de9-11e9-bb7e-000000000000 6943
[httpd] xxxxxxx - - [24/May/2019:14:01:08 +0800] "POST /write?db=api HTTP/1.1" 500 0 "-" "python-requests/2.18.4" 53078b69-7de9-11e9-bb7f-000000000000 24363
[httpd] xxxxxxx - - [24/May/2019:14:01:08 +0800] "POST /write?db=api HTTP/1.1" 500  0 "-" "python-requests/2.18.4" 53091b36-7de9-11e9-bb80-000000000000 21805
[httpd] xxxxxxx - - [24/May/2019:14:01:08 +0800] "POST /write?db=api HTTP/1.1" 500  0 "-" "python-requests/2.18.4" 53096c47-7de9-11e9-bb81-000000000000 31066
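
To confirm that the 500s really are confined to one database, the access-log lines can be tallied per `db=` parameter. The following is only a minimal sketch: the regular expressions are my own assumption about the log layout shown above, and `count_errors_by_db` is a hypothetical helper, not part of InfluxDB.

```python
import re
from collections import Counter

# Matches the request and status fields of an httpd access-log line,
# e.g. "POST /write?db=api HTTP/1.1" 500
LOG_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+"\s+(?P<status>\d{3})'
)
# Extracts the database name from the query string.
DB_PATTERN = re.compile(r"[?&]db=([^&\s]+)")


def count_errors_by_db(lines, status="500"):
    """Count log lines with the given status code, grouped by the db= parameter."""
    counts = Counter()
    for line in lines:
        m = LOG_PATTERN.search(line)
        if not m or m.group("status") != status:
            continue
        db = DB_PATTERN.search(m.group("path"))
        counts[db.group(1) if db else "(no db)"] += 1
    return counts


sample = [
    '[httpd] xxxxxxx - - [24/May/2019:14:01:08 +0800] "POST /write?db=api HTTP/1.1" 500 0 "-" "python-requests/2.19.0" 52f34cb3-7de9-11e9-bb7d-000000000000 34190',
    '[httpd] xxxxxxx - - [24/May/2019:14:01:08 +0800] "POST /write?db=api HTTP/1.1" 500  0 "-" "python-requests/2.18.4" 53091b36-7de9-11e9-bb80-000000000000 21805',
]
print(count_errors_by_db(sample))
```

In practice the lines would come from the real log file instead of the `sample` list.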


0> curl -i -XPOST 'http://i.influxdb:8086/write?db=api' -u user: --data-binary 'test,hostname=xxxxxxx Nginx_connection=11,Nginx_Waiting=8,Nginx_request=77,Php_connection=24,Php_listen_queue=0,Php_idle_process=299,Php_active_process=1'
HTTP/1.1 500 Internal Server Error
Content-Type: application/json
Request-Id: 8d6bac64-7dd1-11e9-8378-000000000000
X-Influxdb-Version: 1.3.6
Date: Fri, 24 May 2019 03:11:01 GMT
Content-Length: 76


{"error":"engine: cache-max-memory-size exceeded: (1073741936/1073741824)"}
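
The two numbers in the error body are worth a closer look. Here is a small sketch that pulls them out; `parse_cache_error` is a hypothetical helper, and reading the pair as (reported size / configured limit) is my interpretation of the message.

```python
import json
import re

ERROR_PATTERN = re.compile(r"cache-max-memory-size exceeded: \((\d+)/(\d+)\)")


def parse_cache_error(body):
    """Return (reported_bytes, limit_bytes), or None if the body is a different error."""
    message = json.loads(body).get("error", "")
    m = ERROR_PATTERN.search(message)
    return (int(m.group(1)), int(m.group(2))) if m else None


body = '{"error":"engine: cache-max-memory-size exceeded: (1073741936/1073741824)"}'
reported, limit = parse_cache_error(body)
# The limit is exactly 1 GiB, and the reported size is only 112 bytes over it.
print(limit == 1024 ** 3, reported - limit)  # True 112
```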

The response was a 500, and the important part is this error message:

cache-max-memory-size exceeded

The linked page explains it:


Write-Retries on Errors

Telegraf retries writing on certain types of failures, such as cache-max-memory-size. InfluxDB maintains a cache on input (configured by the cache-max-memory-size option). When the limit is reached, InfluxDB rejects any further writes with the following message: ("cache-max-memory-size exceeded: (%d/%d)", n, limit)

This is how the response looks – InfluxDB configuration parameter cache-max-memory-size = 104:

HTTP/1.1 500 Internal Server Error
Content-Encoding: gzip
Content-Type: application/json
Request-Id: 582b0222-f064-11e7-801f-000000000000
X-Influxdb-Version: 1.2.2
Date: Wed, 03 Jan 2018 08:59:02 GMT
Content-Length: 87

{"error":"engine: cache-max-memory-size exceeded: (1440/104)"}

It makes sense to retry the request when the cache gets emptied. Here is a list of errors returned by InfluxDB to help indicate that it does not make sense to retry the write. This is due to the points having already been written or the repeated write would fail again.


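Roughly speaking, cache-max-memory-size is the maximum size of the in-memory write cache. Once the cache reaches that limit, InfluxDB rejects further writes until the cache has been flushed, which is why clients are expected to retry. Our cache had hit the limit, so the quick, brute-force fix is to raise it and see what happens.

Let's try it: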
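
On the client side, the retry behaviour described in the quoted passage could look roughly like the sketch below. It is not Telegraf's implementation: `write_with_retry` and the `send` callable (returning a status code and a body string) are hypothetical names introduced for illustration, and the backoff values are arbitrary.

```python
import time


def write_with_retry(send, payload, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call send(payload) until it succeeds or a non-retryable error occurs.

    send is a hypothetical callable returning (status_code, body_text).
    """
    for attempt in range(attempts):
        status, body = send(payload)
        if 200 <= status < 300:
            return True
        if "cache-max-memory-size exceeded" not in body:
            return False  # other errors: this sketch does not retry them
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff
    return False


# Demo with a fake sender: the first call fails on the cache limit, the second succeeds.
responses = iter([
    (500, '{"error":"engine: cache-max-memory-size exceeded: (1440/104)"}'),
    (204, ""),
])
delays = []
ok = write_with_retry(lambda p: next(responses), "test value=1", sleep=delays.append)
print(ok, delays)  # True [0.5]
```

Retrying only helps while the cache is draining. If the limit is simply too low for the write load, as it was here, the setting itself has to change.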


  # cache-max-memory-size = 1048576000
   cache-max-memory-size = 2048576000

Roughly double the value, then restart InfluxDB.
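
For a sense of scale, here are the two values converted to mebibytes. This assumes the setting is a plain byte count, which the byte figures in the error message suggest; `to_mib` is just an illustrative helper.

```python
def to_mib(n_bytes):
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 * 1024)


old, new = 1048576000, 2048576000
print(to_mib(old), round(to_mib(new), 1), round(new / old, 2))  # 1000.0 1953.7 1.95
```

So the change goes from 1000 MiB to about 1954 MiB, slightly less than a strict doubling.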

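After the restart the test write succeeded and writes recovered. Problem solved: the setting was simply too small. It also exposed another issue:

InfluxDB is a nice piece of software, and my usage of it is simple and fairly careless: the configuration file is entirely defaults.

So I looked up what the configuration options mean: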
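
A quick way to see which settings a configuration file actually overrides is to list the lines that are not commented out. The sketch below is a deliberately simple line scanner rather than a real TOML parser (it ignores inline comments and multi-line values), and `overridden_settings` is a hypothetical helper.

```python
def overridden_settings(config_text):
    """Return {section: {key: raw_value}} for lines that are not commented out."""
    result, section = {}, ""
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a commented-out default
        if line.startswith("[") and line.endswith("]"):
            section = line.strip("[]")
            result.setdefault(section, {})
            continue
        if "=" in line:
            key, value = line.split("=", 1)
            result.setdefault(section, {})[key.strip()] = value.strip()
    return result


sample_config = """
[data]
  dir = "/var/lib/influxdb/data"
  # cache-max-memory-size = 1048576000
  cache-max-memory-size = 2048576000
"""
print(overridden_settings(sample_config))
```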

### Welcome to the InfluxDB configuration file.

 
# The values in this file override the default values used by the system if
# a config option is not specified. The commented out lines are the configuration
# field and the default value used. Uncommenting a line and changing the value
# will change the value used at runtime when the process is restarted.
 
#
# Removing the comment marker from a line and changing its value makes the new value take effect
# the next time the influxdb process restarts. Lines left commented out show the default value.
 
# Once every 24 hours InfluxDB will report usage data to usage.influxdata.com
# The data includes a random ID, os, arch, version, the number of series and other
# usage data. No data from user databases is ever transmitted.
# Change this option to true to disable reporting.
# reporting-disabled = false
 
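# Note that once InfluxDB is running, it reports some usage data to usage.influxdata.com every 24 hours.
# The reported data includes a random ID, the OS, the InfluxDB version, and the number of series in the databases.
# Data from user databases is never sent. To disable the reporting, uncomment the reporting-disabled line above and set it to true.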
  
# Bind address to use for the RPC service for backup and restore.
# bind-address = "127.0.0.1:8088"
 
# Sets the address of the RPC service used for backup and restore.
  
###
### [meta]
###
### Controls the parameters for the Raft consensus group that stores metadata
### about the InfluxDB cluster.
###
 
# Metadata (meta)
# The parameters below control the Raft consensus group that stores metadata about the InfluxDB cluster.
  
[meta]
   # Where the metadata/raft database is stored
   dir = "/var/lib/influxdb/meta"
   # Path where the metadata/raft database is stored, i.e. the meta directory
 
   # Automatically create a default retention policy when creating a database.
   # retention-autocreate = true
   # Automatically create a default retention policy when a new database is created
 
   # If log messages are printed for the meta service
   # logging-enabled = true
   # Whether to print log messages for the meta service
 
###
### [data]
###
### Controls where the actual shard data for InfluxDB lives and how it is
### flushed from the WAL. "dir" may need to be changed to a suitable place
### for your system, but the WAL settings are an advanced configuration. The
### defaults should work for most systems.
###
 
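# Data (data)
# These parameters control where InfluxDB's shard data is stored and how it is flushed from
# the WAL (Write Ahead Log). Change "dir" to a path that suits your system (in other words,
# estimate how much disk space your data needs under your retention policy). The WAL settings
# are advanced configuration, and the defaults should suit most use cases.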
 
[data]
   # The directory where the TSM storage engine stores TSM files.
   dir = "/var/lib/influxdb/data"
   # Directory where the TSM storage engine stores TSM files
 
   # The directory where the TSM storage engine stores WAL files.
   wal-dir = "/var/lib/influxdb/wal"
   # Directory where the TSM storage engine stores WAL files
 
   # The amount of time that a write will wait before fsyncing.  A duration
   # greater than 0 can be used to batch up multiple fsync calls.  This is useful for slower
   # disks or when WAL write contention is seen.  A value of 0s fsyncs every write to the WAL.
   # Values in the range of 0-100ms are recommended for non-SSD disks.
   # wal-fsync-delay = "0s"
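   # How long a write waits before fsyncing (fsync flushes a file's modified in-memory data to disk).
   # A value greater than 0 can be used to batch multiple fsync calls. Consider setting it when disks are slow or WAL write contention appears.
   # For non-SSD disks, the recommended range is 0-100 ms.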
  
   # The type of shard index to use for new shards.  The default is an in-memory index that is
   # recreated at startup.  A value of "tsi1" will use a disk based index that supports higher
   # cardinality datasets.
   # index-version = "inmem"
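   # Sets the index type used for newly created shards. The default is an in-memory index; changing it to
   # tsi1 uses a disk-based index that handles high-cardinality data better. (Cardinality here measures the number of series formed by combinations of measurement, retention policy, and tags.)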
 
   # Trace logging provides more verbose output around the tsm engine. Turning
   # this on can provide more useful output for debugging tsm engine issues.
   # trace-logging-enabled = false
   # Trace logging, i.e. debug logging. When enabled, it produces a large volume of verbose logs about the internals of the TSM storage engine.
  
   # Whether queries should be logged before execution. Very useful for troubleshooting, but will
   # log any sensitive data contained within a query.
   # query-log-enabled = true
   # Whether queries are logged before execution. Very helpful for troubleshooting,
   # but any sensitive data contained in a query will be logged as well.