How to fix the array overflow when sklearn loads libsvm-format data?

wh_springer

Category: Data Mining. Tags: load_svmlight_file, sklearn, overflow

Article link: https://blog.csdn.net/wh_springer/article/details/85007921


While loading a large libsvm-format file (more than 10 million samples) with sklearn's load_svmlight_file, I ran into an integer overflow error.

The exact error:

OverflowError: signed integer is greater than maximum.

This was puzzling: loading had always worked before, and the error only appeared after a fixed block of 128 features was added to every sample.

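Reading the sklearn source shows the cause. sklearn stores the data in scipy's CSR sparse format, and the loader builds the index arrays with the C int type, so no entry can exceed 2147483647. Because indptr holds the running count of stored non-zero entries, once the total (roughly number of samples * features per sample) exceeds 2147483647, the append overflows and raises the error above. The following code from _svmlight_format.pyx defines the arrays; both indices and indptr use the "i" (int) typecode.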

# Special-case float32 but use float64 for everything else;
# the Python code will do further conversions.
if dtype == np.float32:
    data = array.array("f")
else:
    dtype = np.float64
    data = array.array("d")
indices = array.array("i")
indptr = array.array("i", [0])
query = np.arange(0, dtype=np.int64)
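The limit is easy to reproduce without sklearn. The sketch below is only an illustration: the helper name nnz_overflows_int32 and the sample counts are hypothetical, not taken from the data set described here. It assumes, as on mainstream platforms, that the "i" typecode is a 32-bit C int.

```python
from array import array

INT32_MAX = 2**31 - 1  # 2147483647, the largest value a 32-bit C int can hold


def nnz_overflows_int32(n_samples, nnz_per_sample):
    """Return True if the total non-zero count no longer fits in a C int."""
    return n_samples * nnz_per_sample > INT32_MAX


# Hypothetical sizes: 10 million samples with 250 stored features each.
print(nnz_overflows_int32(10_000_000, 250))

# Appending a too-large running total to an "i" array fails the same way
# the loader's indptr does.
indptr = array("i", [0])
try:
    indptr.append(INT32_MAX + 1)
    raised = False
except OverflowError:
    raised = True
print(raised)
```

A quick check like this on the expected non-zero count can tell you up front whether a file is likely to hit the limit.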

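To work around this, you can borrow liblinear's loader for libsvm-format files. Open liblinear-2.21/python/commonutil.py, which provides the svm_read_problem function. It uses the long type (the "l" typecode) for the index arrays, which avoids the overflow on large data wherever C long is 64 bits wide.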

def svm_read_problem(data_file_name,return_scipy=False):
    """
    svm_read_problem(data_file_name, return_scipy=False) -> [y, x], y: list, x: list of dictionary
    svm_read_problem(data_file_name, return_scipy=True)  -> [y, x], y: ndarray, x: csr_matrix

    Read LIBSVM-format data from data_file_name and return labels y
    and data instances x.
    """
    if scipy != None and return_scipy:
        prob_y = array('d')
        prob_x = array('d')
        row_ptr = array('l', [0])
        col_idx = array('l')
    else:
        prob_y = []
        prob_x = []
        row_ptr = [0]
        col_idx = []
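One caveat: the "l" typecode maps to C long, whose width depends on the platform (typically 8 bytes on 64-bit Linux and macOS, 4 bytes on Windows). A minimal sketch of a guard, assuming you are free to change the typecode; pick_index_typecode is a hypothetical helper, not part of liblinear:

```python
from array import array


def pick_index_typecode():
    """Return an array typecode whose items are at least 8 bytes wide."""
    # "q" is C long long, which is at least 64 bits everywhere.
    return "l" if array("l").itemsize >= 8 else "q"


typecode = pick_index_typecode()
row_ptr = array(typecode, [0])
row_ptr.append(2**31)  # one past the 32-bit limit; fits in a 64-bit item
print(typecode, row_ptr[-1])
```

If the typecode is changed this way, the dtype used when converting the arrays to NumPy later has to match it.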

In this code, col_idx and row_ptr correspond to indices and indptr in the sklearn code above. Unlike load_svmlight_file, svm_read_problem has no n_features parameter; you can add one by modifying the function as follows:

from array import array

# Guarded imports so the list-based path still works without NumPy/SciPy.
try:
    import numpy as np
    from scipy import sparse
except ImportError:
    np = None
    sparse = None

def svm_read_problem(data_file_name, n_features, return_scipy=False):
    """
    svm_read_problem(data_file_name, n_features, return_scipy=False) -> [y, x], y: list, x: list of dictionary
    svm_read_problem(data_file_name, n_features, return_scipy=True)  -> [y, x], y: ndarray, x: csr_matrix

    Read LIBSVM-format data from data_file_name and return labels y
    and data instances x. n_features sets the column count of the
    csr_matrix; it is only used when return_scipy=True.
    """
    use_scipy = sparse is not None and return_scipy
    if use_scipy:
        prob_y = array('d')
        prob_x = array('d')
        # 'l' is a C long, so these arrays are 64-bit only where long is 64-bit.
        row_ptr = array('l', [0])
        col_idx = array('l')
    else:
        prob_y = []
        prob_x = []
        row_ptr = [0]
        col_idx = []
    indx_start = 1  # assume 1-based feature indices until an index 0 is seen
    with open(data_file_name) as data_file:
        for line in data_file:
            line = line.split(None, 1)
            # In case an instance with all zero features
            if len(line) == 1: line += ['']
            label, features = line
            prob_y.append(float(label))
            if use_scipy:
                nz = 0
                for e in features.split():
                    ind, val = e.split(":")
                    if ind == '0':
                        indx_start = 0
                    val = float(val)
                    if val != 0:
                        col_idx.append(int(ind)-indx_start)
                        prob_x.append(val)
                        nz += 1
                # Running total of stored non-zeros: the value that overflows an int
                row_ptr.append(row_ptr[-1]+nz)
            else:
                xi = {}
                for e in features.split():
                    ind, val = e.split(":")
                    xi[int(ind)] = float(val)
                prob_x += [xi]
    if use_scipy:
        # Use NumPy directly; newer SciPy no longer re-exports frombuffer.
        prob_y = np.frombuffer(prob_y, dtype='d')
        prob_x = np.frombuffer(prob_x, dtype='d')
        col_idx = np.frombuffer(col_idx, dtype='l')
        row_ptr = np.frombuffer(row_ptr, dtype='l')
        prob_x = sparse.csr_matrix((prob_x, col_idx, row_ptr),(row_ptr.shape[0]-1,n_features))
    return (prob_y, prob_x)
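To see what the inner loop does to a single record, here is a small standalone sketch of the per-line parsing. parse_libsvm_line is a hypothetical helper written for illustration; it assumes 1-based feature indices by default and, like the code above, drops explicit zeros.

```python
def parse_libsvm_line(line, index_start=1):
    """Split one libsvm record into (label, column indices, values)."""
    parts = line.split(None, 1)
    if len(parts) == 1:
        parts.append("")  # a sample whose features are all zero
    label, features = parts
    cols, vals = [], []
    for item in features.split():
        ind, val = item.split(":")
        val = float(val)
        if val != 0:  # zeros are not stored in a sparse matrix
            cols.append(int(ind) - index_start)
            vals.append(val)
    return float(label), cols, vals


label, cols, vals = parse_libsvm_line("1 1:0.5 3:0 128:2.0")
print(label, cols, vals)
```

Each stored value adds one entry to col_idx and increases the running total in row_ptr, which is why the total non-zero count, not the number of rows, decides whether a 32-bit index is enough.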
