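I have large SVC models (~50 MB cPickles) for text classification, and I am trying out various ways to use them in a production environment. Classifying batches of documents works very well (about 1k documents per minute with both predict and predict_proba).
However, predicting on a single document is another story, as explained in a comment on this question: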
Are you doing predictions in batches? The SVC.predict method, unfortunately, incurs a lot of overhead because it has to reconstruct a LibSVM data structure similar to the one that the training algorithm produced, shallow-copy in the support vectors, and convert the test samples to a LibSVM format that may be different from the NumPy/SciPy formats. Therefore, prediction on a single sample is bound to be slow. – 07001
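I am already serving the SVC models from a Flask web application, so part of the overhead (unpickling) is gone, but the prediction time for a single document is still on the high side (0.25 s).
I have looked at the code in the predict methods, but I cannot figure out whether there is a way to "pre-warm" them, that is, to rebuild the LibSVM data structure in advance at server startup. Any ideas?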
def predict(self, X):
    """Perform classification on samples in X.

    For an one-class model, +1 or -1 is returned.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape = [n_samples, n_features]

    Returns
    -------
    y_pred : array, shape = [n_samples]
        Class labels for samples in X.
    """
    y = super(BaseSVC, self).predict(X)
    return self.classes_.take(y.astype(np.int))
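
For anyone who wants to reproduce the gap between batch and single-document prediction described above, here is a minimal timing sketch. It is not taken from my application: `chunked`, `time_predictions` and `fake_predict` are hypothetical names, and `fake_predict` is only a stand-in so the snippet runs without a trained model. It assumes the documents are held in a plain list.

```python
import time


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def time_predictions(predict_fn, docs, batch_size):
    """Call predict_fn on docs in chunks; return (n_calls, seconds_per_doc)."""
    n_calls = 0
    started = time.perf_counter()
    for batch in chunked(docs, batch_size):
        predict_fn(batch)
        n_calls += 1
    elapsed = time.perf_counter() - started
    return n_calls, elapsed / len(docs)


def fake_predict(batch):
    """Hypothetical stand-in for model.predict: returns one label per input."""
    return [0 for _ in batch]


docs = ["doc %d" % i for i in range(100)]

# One call per document versus one call per 50 documents.
single_calls, single_cost = time_predictions(fake_predict, docs, batch_size=1)
batch_calls, batch_cost = time_predictions(fake_predict, docs, batch_size=50)
print(single_calls, batch_calls, single_cost, batch_cost)
```

With a real model, `fake_predict` would be replaced by something like the fitted model's `predict`, and `docs` by the already vectorized inputs. If those are a sparse matrix rather than a list, `len(docs)` would need to become `docs.shape[0]`. Comparing the two per-document costs should show how much of the 0.25 s is fixed per-call overhead.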