This article is original work by Subson; please credit the source when reposting.
Preface
This method improves the quality of a dataset made up of documents with known class labels; the cost is that the dataset becomes smaller.
A dataset consists of a large number of documents, and a document consists of words; all the distinct words together form the dataset's dictionary, and the dataset is grouped into multiple clusters.
Main Idea
Compute the entropy of every word in the dataset's dictionary: the smaller a word's entropy, the more strongly it discriminates between classes. Then, for each word, compare its frequency in each class and assign the word to the class where its frequency is largest, making it a feature word of that class.
Entropy formula:

H(w) = -Σ_c P(c|w) · log₂ P(c|w)

where the symbol c denotes a class, w denotes a word, and P(c|w) is the probability of class c given the word w, estimated from the word's per-class frequencies.
After all the feature words of each class have been determined, the lowest-entropy features are selected from each class.
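The formula above can be checked with a few lines of code. The sketch below is illustrative only (the class and method names `WordEntropy` and `entropy` are mine, not part of the article's implementation); it computes H(w) from a word's per-class frequency counts, applying the same kind of smoothing increment that the feature-selection step uses later.

```java
// Illustrative sketch; WordEntropy and entropy() are hypothetical names.
public class WordEntropy {

    // H(w) = -sum_c P(c|w) * log2 P(c|w), where P(c|w) is estimated from
    // the word's per-class counts, each smoothed by `increment`.
    public static double entropy(int[] classFreqs, double increment) {
        double total = 0.0;
        for (int f : classFreqs) total += f + increment;
        double h = 0.0;
        for (int f : classFreqs) {
            double p = (f + increment) / total;
            if (p > 0.0) {  // 0 * log 0 is taken as 0
                h += -p * Math.log(p) / Math.log(2.0);
            }
        }
        return h;
    }

    public static void main(String[] args) {
        // A word seen once in one of three classes, with increment 1.0:
        // probabilities (2/4, 1/4, 1/4), so H(w) = 0.5 + 0.5 + 0.5 = 1.5 bits.
        System.out.println(entropy(new int[]{1, 0, 0}, 1.0));
    }
}
```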
Code Notes
- The method takes the dataset as a file of JSON-formatted lines, with each document represented by its word-segmentation result. For example:
- {"text": "blog data perform database", "cluster": "computer"}
Note: text is the document and cluster is the document's class.
Data Preprocessing
The constructor of the preprocessing class Pretreatment takes eight parameters: dataDir is the path of the data source; removeNonWord controls whether malformed words are removed; removeLowFrequencyWords controls whether rare words are removed; lessNum is the threshold below which a word's occurrence count marks it as low-frequency; removeSubstandardDocuments controls whether malformed documents are removed; minLen and maxLen define a malformed document as one with fewer than minLen or more than maxLen words; and scrDir is the path where the processed data is stored.
This module is not mandatory; it can be skipped or discarded.
/**
 * @author Subson Bigod
 * @email subsontding@gmail.com
 **/
public class Pretreatment {

    private List<String> dataSet = new ArrayList<String>();

    public Pretreatment(
            String dataDir,
            boolean removeNonWord,
            boolean removeLowFrequencyWords,
            int lessNum,
            boolean removeSubstandardDocuments,
            int minLen,
            int maxLen,
            String scrDir)
            throws Exception {
        dataSet = FileUtil.fileToList(dataDir);
        if (removeNonWord) {
            long st = System.currentTimeMillis();
            removeNonWord();
            long et = System.currentTimeMillis();
            System.out.println("Remove non-word takes " +
                    (et - st) / 1000.0 + " seconds.");
        }
        if (removeLowFrequencyWords) {
            long st = System.currentTimeMillis();
            removeLowFrequencyWords(lessNum);
            long et = System.currentTimeMillis();
            System.out.println("Remove low frequency words takes " +
                    (et - st) / 1000.0 + " seconds.");
        }
        if (removeSubstandardDocuments) {
            long st = System.currentTimeMillis();
            removeSubstandardDocuments(minLen, maxLen);
            long et = System.currentTimeMillis();
            System.out.println("Remove substandard documents takes " +
                    (et - st) / 1000.0 + " seconds.");
        }
        FileUtil.listToFile(dataSet, scrDir);
    }

    // Rebuild each JSON line, keeping only the words accepted by StrUtil.
    private void removeNonWord() throws Exception {
        for (int i = 0; i < dataSet.size(); i++) {
            String doc = dataSet.get(i);
            JSONObject obj = new JSONObject(doc);
            String text = obj.getString("text");
            String cluster = obj.getString("cluster");
            StringBuilder sb = new StringBuilder();
            sb.append("{\"text\": \"");
            for (String word : text.split("\\s+")) {
                if (StrUtil.isLetter(word)) continue;
                if (StrUtil.isLetters(word))
                    sb.append(word).append(" ");
            }
            sb.append("\", \"cluster\": \"").append(cluster).append("\"}");
            dataSet.set(i, sb.toString());
        }
    }

    private void removeLowFrequencyWords(int lessNum) throws Exception {
        List<String> LFWords = new ArrayList<String>();
        // DocumentSet fills LFWords with words occurring fewer than lessNum times.
        new DocumentSet(dataSet, lessNum, LFWords);
        for (int i = 0; i < dataSet.size(); i++) {
            String doc = dataSet.get(i);
            JSONObject obj = new JSONObject(doc);
            String text = obj.getString("text");
            String cluster = obj.getString("cluster");
            StringBuilder sb = new StringBuilder();
            sb.append("{\"text\": \"");
            for (String word : text.split("\\s+")) {
                if (LFWords.contains(word)) continue;
                sb.append(word).append(" ");
            }
            sb.append("\", \"cluster\": \"").append(cluster).append("\"}");
            dataSet.set(i, sb.toString());
        }
    }

    private void removeSubstandardDocuments(int minLen, int maxLen)
            throws Exception {
        // Iterate backwards so that removing an element does not skip the
        // element that shifts into its place.
        for (int i = dataSet.size() - 1; i >= 0; i--) {
            String doc = dataSet.get(i);
            JSONObject obj = new JSONObject(doc);
            String text = obj.getString("text");
            int len = text.split("\\s+").length;
            if (len < minLen || len > maxLen) dataSet.remove(i);
        }
    }
}
Feature Selection
The constructor of the feature-selection class FeatureObtain takes two parameters: dataDir is the path of the data source, and increment is a frequency increment added when computing entropy, designed both to avoid an invalid logarithm (the argument of the logarithm can be 0) and to handle the special case below.
For example, suppose several words in the dataset's dictionary have the following frequencies in each class:
| Words | computer | society | math | H(w), inc=0.0 | H(w), inc=0.5 | H(w), inc=1.0 |
| --- | --- | --- | --- | --- | --- | --- |
| blog | 1+inc | 0+inc | 0+inc | NaN | 1.371 | 1.5 |
| data | 10+inc | 0+inc | 0+inc | NaN | 0.513 | 0.7732 |
| perform | 0+inc | 100+inc | 0+inc | NaN | 0.0897 | 0.1576 |
| database | 1000+inc | 0+inc | 0+inc | NaN | 0.0124 | 0.0228 |
As the table shows, when a word occurs in only one class and inc = 0, its entropy is 0 no matter how often it occurs (the NaN entries arise when 0 · log 0 is evaluated directly in floating point). But a word that occurs many times in a single class is clearly more characteristic of that class than one that occurs once, so an increment is introduced to separate such words.
In addition, the class provides several methods for saving other information, such as the word frequencies of every class and all features sorted in ascending order. For brevity, those methods are not listed here; if you are interested, download the complete version at the end of this article.
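The increment's effect can be reproduced directly from the table. The sketch below is illustrative only (`IncrementDemo` is a hypothetical name; the computation follows the entropy formula from the Main Idea section): with inc = 0.5, "data" (10 occurrences in computer) scores about 0.513 while "database" (1000 occurrences) scores about 0.0124, so the more frequent single-class word is ranked as the stronger, lower-entropy feature.

```java
// Illustrative sketch; IncrementDemo is a hypothetical name, not article code.
public class IncrementDemo {

    // Smoothed entropy of a word over its per-class counts (log base 2).
    public static double entropy(int[] classFreqs, double inc) {
        double total = 0.0;
        for (int f : classFreqs) total += f + inc;
        double h = 0.0;
        for (int f : classFreqs) {
            double p = (f + inc) / total;
            if (p > 0.0) h += -p * Math.log(p) / Math.log(2.0);
        }
        return h;
    }

    public static void main(String[] args) {
        double hData = entropy(new int[]{10, 0, 0}, 0.5);       // table row: 0.513
        double hDatabase = entropy(new int[]{1000, 0, 0}, 0.5); // table row: 0.0124
        System.out.printf("data: %.4f, database: %.4f%n", hData, hDatabase);
        // With inc = 0 both values would collapse to 0 (taking 0 * log 0 = 0),
        // and the two words could not be told apart.
    }
}
```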
/**
 * @author Subson Bigod
 * @email subsontding@gmail.com
 **/
public class FeatureObtain {

    private HashMap<String, HashMap<String, Integer>> clusterToWordToFreMapMap;
    private DocumentSet documentSet;
    private List<String> clusters;
    private HashMap<String, Integer> dictToFreMap;
    private List<String> dictionary;
    private HashMap<String, Double> dicToEntropyMap;

    public FeatureObtain(
            String dataDir,
            double increment)
            throws Exception {
        clusterToWordToFreMapMap =
                new HashMap<String, HashMap<String, Integer>>();
        documentSet = new DocumentSet(
                dataDir,
                clusterToWordToFreMapMap);
        clusters = documentSet.getClusters();
        dictToFreMap = documentSet.getDictionaryToFreMap();
        dictionary = CollectionUtil.keysToList(dictToFreMap);
        dicToEntropyMap = new HashMap<String, Double>();
        calculateEntropy(increment);
    }

    public List<String> getClusters() {
        return clusters;
    }

    // Accumulate H(w) = -sum_c P(c|w) * log2 P(c|w), smoothing every
    // per-cluster frequency by `increment`.
    private void calculateEntropy(double increment) throws Exception {
        for (String cluster : clusters) {
            HashMap<String, Integer> wordToFreMap =
                    clusterToWordToFreMapMap.get(cluster);
            for (String dic : dictionary) {
                double entropy = 0.0;
                if (dicToEntropyMap.containsKey(dic)) {
                    entropy = dicToEntropyMap.get(dic);
                }
                double fre;
                if (wordToFreMap.containsKey(dic)) {
                    fre = wordToFreMap.get(dic) + increment;
                } else {
                    fre = increment;
                }
                // One increment is added per cluster, so the denominator
                // grows by clusters.size() * increment in total.
                double allFre = dictToFreMap.get(dic)
                        + clusters.size() * increment;
                entropy += entropy(fre / allFre, 2.0);
                dicToEntropyMap.put(dic, entropy);
            }
        }
    }

    private double entropy(double p_c_w, double base)
            throws Exception {
        if (p_c_w < 0)
            throw new Exception("Log of negative numbers is not defined.");
        if (p_c_w == 0.0) return 0.0;  // use the 0 * log 0 = 0 convention
        return -p_c_w * Math.log(p_c_w) / Math.log(base);
    }

    // Assign each word to the cluster where it is most frequent, then sort
    // each cluster's feature words by entropy.
    public Map<String, List<Map.Entry<String, Double>>> getClusterToFeaturesMap() {
        Map<String, String> dicClusterMap = new HashMap<String, String>();
        Map<String, Integer> dicFreMap = new HashMap<String, Integer>();
        for (String cluster : clusters) {
            HashMap<String, Integer> wordToFreMap =
                    clusterToWordToFreMapMap.get(cluster);
            for (String dic : dictionary) {
                if (wordToFreMap.containsKey(dic)) {
                    if ((!dicFreMap.containsKey(dic)) ||
                            wordToFreMap.get(dic) > dicFreMap.get(dic)) {
                        dicFreMap.put(dic, wordToFreMap.get(dic));
                        dicClusterMap.put(dic, cluster);
                    }
                }
            }
        }
        HashMap<String, HashMap<String, Double>> clusterWordToEntropy =
                new HashMap<String, HashMap<String, Double>>();
        for (String dic : dictionary) {
            if (!dicClusterMap.containsKey(dic)) continue;
            String cluster = dicClusterMap.get(dic);
            double entropy = dicToEntropyMap.get(dic);
            HashMap<String, Double> map;
            if (clusterWordToEntropy.containsKey(cluster)) {
                map = clusterWordToEntropy.get(cluster);
            } else {
                map = new HashMap<String, Double>();
            }
            map.put(dic, entropy);
            clusterWordToEntropy.put(cluster, map);
        }
        HashMap<String, List<Map.Entry<String, Double>>> result =
                new HashMap<String, List<Map.Entry<String, Double>>>();
        for (String cluster : clusters) {
            List<Map.Entry<String, Double>> clusterFeatures =
                    CollectionUtil.sortMapByValue(clusterWordToEntropy.get(cluster));
            result.put(cluster, clusterFeatures);
        }
        return result;
    }
}
Dataset Quality Improvement
The constructor of the quality-improvement class DSQImpro takes five parameters: dataDir is the path of the data source; clusterFeatures holds all the features of each cluster (sorted in ascending order of entropy); influentialFeatureNum is the number of features that take effect; subClusterDocumentLength is the number of documents required for each cluster; and subDataSet receives the resulting quality-improved dataset.
/**
 * @author Subson Bigod
 * @email subsontding@gmail.com
 **/
public class DSQImpro {

    private List<String> dataSet = new ArrayList<String>();

    public DSQImpro(
            String dataDir,
            Map<String, List<String>> clusterFeatures,
            int influentialFeatureNum,
            Map<String, Integer> subClusterDocumentLength,
            ArrayList<String> subDataSet)
            throws Exception {
        dataSet = FileUtil.fileToList(dataDir);
        List<String> clusters = CollectionUtil.keysToList(clusterFeatures);
        // Cap influentialFeatureNum at the smallest per-cluster feature count.
        int minFeatureNum = Integer.MAX_VALUE;
        for (String cluster : clusters) {
            int featureNum = clusterFeatures.get(cluster).size();
            if (minFeatureNum > featureNum) minFeatureNum = featureNum;
        }
        if (influentialFeatureNum > minFeatureNum)
            influentialFeatureNum = minFeatureNum;
        // Search from the strictest setting (all features required) to the
        // loosest, stopping at the first setting that yields enough
        // documents for every cluster.
        boolean isEnd = false;
        List<String> subDataSetMessage = new ArrayList<String>();
        for (int i = influentialFeatureNum; i > 0; i--) {
            if (isEnd) break;
            for (int j = i; j <= influentialFeatureNum; j++) {
                if (isEnd) break;
                int accordTimes = 0;
                subDataSetMessage.clear();
                subDataSet.clear();
                for (String cluster : clusters) {
                    ArrayList<String> subClusterDocuments =
                            getMoreContainFeaturesDocuments(
                                    cluster,
                                    clusterFeatures.get(cluster),
                                    j,
                                    i);
                    if (subClusterDocuments.size() <
                            subClusterDocumentLength.get(cluster)) {
                        break;
                    } else {
                        accordTimes++;
                    }
                    subDataSetMessage.add("Cluster " + cluster +
                            "'s documents number of subDataSet is " +
                            subClusterDocuments.size() + ".");
                    subDataSet.addAll(subClusterDocuments);
                    if (accordTimes == clusters.size()) {
                        isEnd = true;
                        for (String eachSubDataSetMessage : subDataSetMessage) {
                            System.out.println(eachSubDataSetMessage);
                        }
                        System.out.println(
                                "getMoreContainFeaturesDocuments(" + j + "," + i + ").");
                    }
                }
            }
        }
    }

    // Return the documents of `clustering` that contain at least
    // logicAndFeatureNum of the first logicOrFeatureNum features.
    private ArrayList<String> getMoreContainFeaturesDocuments(
            String clustering,
            List<String> clusterFeatures,
            int logicOrFeatureNum,
            int logicAndFeatureNum)
            throws Exception {
        ArrayList<String> subDataSet = new ArrayList<String>();
        for (int i = 0; i < dataSet.size(); i++) {
            String line = dataSet.get(i);
            JSONObject obj = new JSONObject(line);
            String text = obj.getString("text");
            String cluster = obj.getString("cluster");
            if (cluster.equals(clustering)) {
                int containFeaturesNum = 0;
                // Pad with spaces so a feature also matches as a whole word
                // at the start and end of the text.
                String padded = " " + text + " ";
                for (int j = 0; j < logicOrFeatureNum; j++) {
                    if (padded.contains(" " + clusterFeatures.get(j) + " ")) {
                        containFeaturesNum++;
                        if (containFeaturesNum >= logicAndFeatureNum) {
                            subDataSet.add(line);
                            break;
                        }
                    }
                }
            }
        }
        return subDataSet;
    }
}
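The selection rule inside getMoreContainFeaturesDocuments can also be exercised on its own: a document is kept when it contains at least logicAndFeatureNum of a cluster's first logicOrFeatureNum features. The sketch below is a standalone illustration (FeatureFilter and matches are hypothetical names, not article code); like the method above, it pads the text with spaces so whole-word matching succeeds at the boundaries.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the DSQImpro filtering rule: keep a document when it
// contains at least `andNum` of the first `orNum` features.
public class FeatureFilter {

    public static boolean matches(String text, List<String> features,
                                  int orNum, int andNum) {
        String padded = " " + text + " ";  // whole-word containment check
        int hits = 0;
        for (int j = 0; j < orNum && j < features.size(); j++) {
            if (padded.contains(" " + features.get(j) + " ")) {
                hits++;
                if (hits >= andNum) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> features = Arrays.asList("database", "blog", "data");
        // Contains "database" and "blog" among the first 3 features: kept.
        System.out.println(matches("blog data perform database", features, 3, 2)); // true
        // Contains none of the features: dropped.
        System.out.println(matches("society news today", features, 3, 1));         // false
    }
}
```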
Note that DSQImpro expects clusterFeatures as a Map&lt;String, List&lt;String&gt;&gt;, while feature selection packages each cluster's features as a Map&lt;String, List&lt;Map.Entry&lt;String, Double&gt;&gt;&gt;, so a conversion is needed. The subClusterDocumentLength parameter likewise has to be packaged as a Map&lt;String, Integer&gt;. For example:
//change the type of the features of each cluster for DSQImpro's input:
//"Map<String, List<Map.Entry<String, Double>>>" to "Map<String, List<String>>"
Map<String, List<Map.Entry<String, Double>>> clusterFeatures =
        featureObtain.getClusterToFeaturesMap();
Map<String, List<String>> clusterFeatureList =
        new HashMap<String, List<String>>();
for (String cluster : featureObtain.getClusters()) {
    List<Map.Entry<String, Double>> featureToEntropy =
            clusterFeatures.get(cluster);
    List<String> features = new ArrayList<String>();
    for (Entry<String, Double> feature : featureToEntropy) {
        features.add(feature.getKey());
    }
    clusterFeatureList.put(cluster, features);
}
//initialize parameters for DSQImpro's input:
//the Map<String, Integer> holds the required document count for each cluster
Map<String, Integer> subClusterDocumentLength =
        new HashMap<String, Integer>();
for (String cluster : featureObtain.getClusters()) {
    subClusterDocumentLength.put(cluster, 200);
}
Postscript
This quality-improvement method can be used on a single dataset, and equally on several related, corresponding datasets. The author's ability is limited, so corrections and suggestions are most welcome.
Contact: subsontding@gmail.com
The complete Java implementation is available in DSQImpro.