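While running a Spark Streaming job recently, I ran into a strange problem: on the Streaming tab of the web UI, the Input Rate stayed at 0 and the number of records received also stayed at 0, yet the job was producing output, and that output was correct.
Here is how I tracked the problem down: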
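Verify the input and output data: I validated the job's output and found that it was generated correctly. I then checked the job's input data and found that it was fine as well.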
Understand what the metrics on the monitoring page mean: Input Rate refers to the amount of data received by receivers:
In Input Rate row, you can show and hide details of each input stream.
If there are input streams with receivers, the numbers of all the receivers and active ones are displayed. The average event rate for all registered streams is displayed (as Avg: [avg] events/sec).
Why was the output correct when no receiver had received any data? It turns out that a file stream has no corresponding receiver, so no data is ever counted as received through a receiver.
[...] DStream as it represented the stream of data received from the netcat server. Every input DStream (except file stream, discussed later in this section) is associated with a Receiver (Scala doc, Java doc) object which receives the data from a source and stores it in Spark’s memory for processing. File streams do not require running a receiver so there is no need to allocate any cores for receiving file data.
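Because a file stream has no receiver behind it, one way to see how many records each batch actually contains is to count them inside the job. The following is a minimal sketch, assuming PySpark with the DStream API is installed; the names `format_batch_count`, `log_batch_count`, and `main`, the application name, and the 10-second batch interval are all illustrative choices and not part of the original job.

```python
def format_batch_count(batch_time, record_count):
    """Build the log line for one batch (hypothetical helper)."""
    return "batch {}: {} records".format(batch_time, record_count)


def log_batch_count(batch_time, rdd):
    """Count the records of one batch and print the result.

    rdd.count() runs a Spark job, so this adds some work to every batch.
    """
    print(format_batch_count(batch_time, rdd.count()))


def main(input_dir):
    """Start a file-stream job that logs its own per-batch record count."""
    # Imported here so the helpers above can be used without PySpark.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="FileStreamRecordCount")  # assumed app name
    ssc = StreamingContext(sc, 10)  # assumed 10-second batch interval

    # textFileStream watches a directory and needs no receiver,
    # unlike receiver-based sources such as socketTextStream.
    lines = ssc.textFileStream(input_dir)

    # A two-argument function passed to foreachRDD gets (time, rdd).
    lines.foreachRDD(log_batch_count)

    ssc.start()
    ssc.awaitTermination()
```

To try it, append a call such as `main("/tmp/stream-input")` (a hypothetical directory) and submit the script with `spark-submit`. The counts then appear in the driver's output for each batch, independently of what the Input Rate chart shows.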
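Summary: the root cause was my unfamiliarity with the Spark framework. When you use a framework, you still need to understand why it behaves the way it does!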