1. Introduction
The goal of Bug Report Classification (BRC) is to classify bug reports based on the software report log, determining whether a given report describes an anomaly.
2. Exploration
2.1 Data Preprocessing
The preprocessing tricks for the BRC task can be summarized as follows.
- Convert all letters in the log to lowercase.
- Extract the root of every word so that we can compress the vocabulary, for example: [interesting -> interest, interested -> interest]. The Porter Stemmer algorithm is one standard implementation of this (see reference blogs).
- Other methods you can try (blog).
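A minimal sketch of the preprocessing steps above. The `crude_stem` function here is a toy suffix stripper standing in for the full Porter Stemmer (the real algorithm has many more rules; `nltk.stem.PorterStemmer` is a common ready-made implementation), and `preprocess` is a hypothetical helper name:

```python
import re

def crude_stem(word):
    # Toy suffix stripping, NOT the full Porter algorithm:
    # it only handles a few common English suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

def preprocess(log_line):
    # 1) lowercase the whole log line
    # 2) keep alphanumeric tokens only
    # 3) stem each token to shrink the vocabulary
    tokens = re.findall(r"[a-z0-9]+", log_line.lower())
    return [crude_stem(t) for t in tokens]

print(preprocess("ERROR: Interesting timeouts interested the watchdog"))
# -> ['error', 'interest', 'timeout', 'interest', 'the', 'watchdog']
```

Note how "interesting" and "interested" collapse to the same vocabulary entry, which is exactly the compression effect described above.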
2.2 Classification Model
So far I have tried many methods, so below I list the main approaches that achieved notably better performance. Note that some of the methodology is inspired by the anomaly detection chapter of Hung-yi Lee's course (link).
- Treat the BRC task as a binary classification task. However, the class distribution may be highly imbalanced; in particular, anomaly reports are rare, so we must deal with the class imbalance. In Keras I have tried two methods. The first is setting the `class_weight` argument in `model.fit()`, so that the two classes' examples are weighted more evenly during training. The second is to design a cost-sensitive loss function, giving the anomaly class a larger weight in its contribution to the loss. (Note: with a cost-sensitive loss you should not also use balanced weighting/sampling; you can confirm this conclusion experimentally.)
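The `class_weight` dictionary for `model.fit()` can be derived from the label counts. Below is a sketch using the common "balanced" heuristic (`n_samples / (n_classes * n_c)`), assuming labels 0 = normal and 1 = anomaly; `balanced_class_weights` is a hypothetical helper name:

```python
from collections import Counter

def balanced_class_weights(labels):
    # "Balanced" heuristic: weight_c = n_samples / (n_classes * n_c),
    # so the rare anomaly class contributes more to the loss.
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

# 90 normal reports (label 0) vs. 10 anomaly reports (label 1)
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)  # the anomaly class gets ~9x the weight of the normal class
# In Keras this dictionary would be passed as:
#   model.fit(x, y, class_weight=weights)
```

With a 9:1 imbalance, the anomaly class ends up weighted nine times as heavily, which compensates for its rarity in the loss.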
- In the method above, the output of the neural network is a single real number, so we must set a threshold to split the results into two classes. That threshold becomes a hyperparameter, and some skill is needed to tune it. You can avoid this tuning entirely by turning the binary classification into a two-class (multi-class style) classification: change the network's output to two neurons, add a softmax at the end of the output to get a probability score for each class, and then use `argmax()` to obtain the prediction.
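The two-neuron softmax + argmax idea can be sketched in plain NumPy (in Keras the softmax would simply be the final activation layer, and the logits below are made-up example values):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical raw outputs (logits) of a two-neuron final layer
# for three bug reports: columns = [normal, anomaly].
logits = np.array([[ 2.0, -1.0],
                   [ 0.3,  0.4],
                   [-2.0,  3.0]])

probs = softmax(logits)        # probability score for each class
preds = probs.argmax(axis=-1)  # 0 = normal, 1 = anomaly, no threshold needed
print(preds)  # [0 1 1]
```

Because `argmax` just picks the larger of the two probabilities, the decision boundary is fixed at 0.5 and no threshold hyperparameter remains to tune.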