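1. Method description
Datasets are sometimes stored in ARFF format (the format used by Weka). Typical machine-learning work relies on numpy, pandas, and sklearn, none of which read ARFF files directly, so scipy.io.arff.loadarff serves as a bridge.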
2. Code example
from scipy.io import arff
import pandas as pd
file_name='/Users/schillerxu/Documents/sourcecode/python/pandas/CM1.arff'
data,meta=arff.loadarff(file_name)
#print(data)
print(meta)
df=pd.DataFrame(data)
print(df.head())
#print(df)
# Save as a CSV file (df is already a DataFrame, so it can be written directly)
# out_file='/Users/schillerxu/Documents/sourcecode/python/pandas/CM1.csv'
# df.to_csv(out_file,index=False)
Running the program produces the following output:
[Running] python -u "/Users/schillerxu/Documents/sourcecode/python/pandas/arff_to_csv.py"
Dataset: CM1
LOC_BLANK's type is numeric
BRANCH_COUNT's type is numeric
CALL_PAIRS's type is numeric
LOC_CODE_AND_COMMENT's type is numeric
LOC_COMMENTS's type is numeric
CONDITION_COUNT's type is numeric
CYCLOMATIC_COMPLEXITY's type is numeric
CYCLOMATIC_DENSITY's type is numeric
DECISION_COUNT's type is numeric
DECISION_DENSITY's type is numeric
DESIGN_COMPLEXITY's type is numeric
DESIGN_DENSITY's type is numeric
EDGE_COUNT's type is numeric
ESSENTIAL_COMPLEXITY's type is numeric
ESSENTIAL_DENSITY's type is numeric
LOC_EXECUTABLE's type is numeric
PARAMETER_COUNT's type is numeric
HALSTEAD_CONTENT's type is numeric
HALSTEAD_DIFFICULTY's type is numeric
HALSTEAD_EFFORT's type is numeric
HALSTEAD_ERROR_EST's type is numeric
HALSTEAD_LENGTH's type is numeric
HALSTEAD_LEVEL's type is numeric
HALSTEAD_PROG_TIME's type is numeric
HALSTEAD_VOLUME's type is numeric
MAINTENANCE_SEVERITY's type is numeric
MODIFIED_CONDITION_COUNT's type is numeric
MULTIPLE_CONDITION_COUNT's type is numeric
NODE_COUNT's type is numeric
NORMALIZED_CYLOMATIC_COMPLEXITY's type is numeric
NUM_OPERANDS's type is numeric
NUM_OPERATORS's type is numeric
NUM_UNIQUE_OPERANDS's type is numeric
NUM_UNIQUE_OPERATORS's type is numeric
NUMBER_OF_LINES's type is numeric
PERCENT_COMMENTS's type is numeric
LOC_TOTAL's type is numeric
Defective's type is nominal, range is ('Y', 'N')
LOC_BLANK BRANCH_COUNT CALL_PAIRS ... PERCENT_COMMENTS LOC_TOTAL Defective
0 6.0 9.0 2.0 ... 4.00 25.0 b'N'
1 15.0 7.0 3.0 ... 39.22 32.0 b'Y'
2 27.0 9.0 1.0 ... 47.27 33.0 b'Y'
3 7.0 3.0 2.0 ... 0.00 12.0 b'N'
4 51.0 25.0 13.0 ... 11.67 106.0 b'N'
[5 rows x 38 columns]
[Done] exited with code=0 in 0.664 seconds
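As the output shows, meta holds the dataset's basic information: the dataset name, plus each attribute's name and type (and, for nominal attributes, the range of allowed values).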
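One detail in the output is worth handling before saving to CSV: the nominal column Defective comes back as byte strings (b'N', b'Y'), and these would be written to the CSV literally. The sketch below is one possible way to decode such columns, using meta.names() and meta.types() to find them. The in-memory SAMPLE text and the helper name load_arff_as_frame are assumptions made for illustration; they stand in for CM1.arff and are not part of the original script.

```python
from io import StringIO

import pandas as pd
from scipy.io import arff

# Hypothetical two-row ARFF text standing in for CM1.arff.
SAMPLE = """@relation demo
@attribute LOC_TOTAL numeric
@attribute Defective {Y,N}
@data
25,N
32,Y
"""


def load_arff_as_frame(source):
    """Load an ARFF source and decode nominal columns from bytes to str."""
    data, meta = arff.loadarff(source)
    df = pd.DataFrame(data)
    # meta.names() and meta.types() are parallel lists describing each attribute.
    for name, kind in zip(meta.names(), meta.types()):
        if kind == "nominal":
            df[name] = df[name].str.decode("utf-8")
    return df, meta


df, meta = load_arff_as_frame(StringIO(SAMPLE))
print(meta.names())
print(meta.types())
print(df)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)
```

To apply this to a real file, pass the file path instead of the StringIO object, and pass an output path to to_csv.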