The org.apache.mahout.classifier.sgd package in Mahout

Reposted 2013-12-05 16:04:56

Package org.apache.mahout.classifier.sgd
I. Interface summary
1. Interface Gradient
Provides the ability to inject a gradient into the SGD logistic regression. A typical use is to optimize a ranking score such as AUC instead of a normal loss function.
2. Interface PriorFunction
A prior is used to regularize the learning algorithm. This allows a trade-off between the complexity of the model being learned and the accuracy with which the model fits the training data. There are different definitions of complexity, which can be approximated using different priors. For large sparse systems, such as text classification, an L1 prior is often used because it favors sparse models.
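As a rough illustration of why an L1 prior favors sparse models, the sketch below shrinks each weight toward zero by a fixed amount and clamps it at zero once it would change sign, so small weights vanish entirely. The class name, method name, and step size are hypothetical; this is not Mahout's PriorFunction code.

```java
import java.util.Arrays;

// Hypothetical illustration only; not Mahout's PriorFunction implementation.
final class L1ShrinkSketch {
    // Move the weight toward zero by `step`; stop at zero instead of crossing it.
    static double shrink(double weight, double step) {
        double moved = weight - Math.signum(weight) * step;
        return (moved * weight < 0.0) ? 0.0 : moved;
    }

    public static void main(String[] args) {
        double[] weights = {0.50, -0.03, 0.02, -0.70};
        for (int i = 0; i < weights.length; i++) {
            weights[i] = shrink(weights[i], 0.05);
        }
        // Weights smaller in magnitude than the step are now exactly zero.
        System.out.println(Arrays.toString(weights));
    }
}
```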
3. Interface RecordFactory
A record factory knows how to convert a line of data into fields and then into a vector.
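The two-stage conversion (line to fields, fields to vector) can be pictured with the minimal sketch below. It assumes a comma-separated line in which every field is numeric; the class and method names are hypothetical and are not the RecordFactory interface itself.

```java
// Hypothetical illustration only; not Mahout's RecordFactory.
final class CsvRecordSketch {
    // Stage 1: split one comma-separated line into trimmed fields.
    static String[] toFields(String line) {
        String[] fields = line.split(",");
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].trim();
        }
        return fields;
    }

    // Stage 2: parse every field as a number to form a dense feature vector.
    static double[] toVector(String[] fields) {
        double[] vector = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            vector[i] = Double.parseDouble(fields[i]);
        }
        return vector;
    }

    public static void main(String[] args) {
        double[] v = toVector(toFields("1.5, 2, -3"));
        System.out.println(java.util.Arrays.toString(v));
    }
}
```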
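II. Class summary
1 public abstract class AbstractOnlineLogisticRegression extends AbstractVectorClassifier implements OnlineLearner
Generic definition of a logistic regression classifier that returns probabilities for a feature vector. The classifier uses 1-of-(n-1) coding, in which the 0-th category is not stored explicitly.
Provides the SGD-based algorithm for learning a logistic regression, but omits all annealing of learning rates. Any extension of this abstract class must define the overall and per-term annealing for itself.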
All Implemented Interfaces:
OnlineLearner
Direct Known Subclasses:
OnlineLogisticRegression
1.1 lambda
public AbstractOnlineLogisticRegression lambda(double lambda)
Chainable configuration option.

Parameters:
lambda - New value of lambda, the weighting factor for the prior distribution.
Returns:
This, so other configurations can be chained.
1.2 link
public Vector link(Vector v)
Computes the inverse link function, by default the logistic link function.

Parameters:
v - The output of the linear combination in a GLM. Note that the value of v is disturbed.
Returns:
A version of v with the link function applied.
1.3 link
public double link(double r)
Computes the binomial logistic inverse link function.

Parameters:
r - The value to transform.
Returns:
The inverse logit (logistic function) of r.
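For reference, the binomial logistic inverse link maps a linear score r to 1 / (1 + e^(-r)). The sketch below is a standalone, numerically careful version written for illustration; the class name is hypothetical and the code is not copied from Mahout.

```java
// Hypothetical illustration only; a standalone logistic function.
final class LogisticLinkSketch {
    // Computes 1 / (1 + e^(-r)), choosing the branch that never exponentiates
    // a large positive number, so extreme inputs do not overflow.
    static double link(double r) {
        if (r < 0.0) {
            double s = Math.exp(r);
            return s / (1.0 + s);
        } else {
            double s = Math.exp(-r);
            return 1.0 / (1.0 + s);
        }
    }

    public static void main(String[] args) {
        for (double r : new double[] {-2.0, 0.0, 2.0}) {
            System.out.println(r + " -> " + link(r));
        }
    }
}
```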
1.4 classifyNoLink
public Vector classifyNoLink(Vector instance)
Description copied from class: AbstractVectorClassifier
Classify a vector, but don't apply the inverse link function. For logistic regression and other generalized linear models, this is just the linear part of the classification.

Overrides:
classifyNoLink in class AbstractVectorClassifier
Parameters:
instance - A feature vector to be classified.
Returns:
A vector of scores. If transformed by the link function, these will become probabilities.
1.5 classifyScalarNoLink
public double classifyScalarNoLink(Vector instance)
1.6 classify
public Vector classify(Vector instance)
Returns n-1 probabilities, one for each category but the 0-th. The probability of the 0-th category is 1 - sum(this result).

Specified by:
classify in class AbstractVectorClassifier
Parameters:
instance - A vector of features to be classified.
Returns:
A vector of probabilities, one for each of the first n-1 categories.
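The n-1 coding can be made concrete with a small sketch: category 0 is given an implicit linear score of 0, the remaining n-1 scores are exponentiated and normalized, and the probability of category 0 is recovered as 1 minus the sum of the result. Plain arrays stand in for Mahout's Vector, the names are hypothetical, and no overflow guard is included.

```java
// Hypothetical illustration only; plain arrays instead of Mahout's Vector.
final class MultinomialLinkSketch {
    // Maps n-1 linear scores to n-1 probabilities.
    // Category 0 contributes exp(0) = 1 to the normalizer but is not returned.
    static double[] link(double[] scores) {
        double[] p = new double[scores.length];
        double sum = 1.0; // implicit category 0
        for (int i = 0; i < scores.length; i++) {
            p[i] = Math.exp(scores[i]);
            sum += p[i];
        }
        for (int i = 0; i < p.length; i++) {
            p[i] /= sum;
        }
        return p;
    }

    // Recovers the probability of category 0 from the n-1 stored probabilities.
    static double probabilityOfCategoryZero(double[] p) {
        double total = 0.0;
        for (double value : p) {
            total += value;
        }
        return 1.0 - total;
    }

    public static void main(String[] args) {
        double[] p = link(new double[] {0.5, -1.0});
        System.out.println(java.util.Arrays.toString(p));
        System.out.println(probabilityOfCategoryZero(p));
    }
}
```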
1.7 classifyScalar
public double classifyScalar(Vector instance)
Returns a single scalar probability in the case where we have two categories. Using this method avoids an extra vector allocation as opposed to calling classify() or an extra two vector allocations relative to classifyFull().
Specified by:
classifyScalar in class AbstractVectorClassifier
Parameters:
instance - The vector of features to be classified.
Returns:
The probability of the first of two categories.
Throws:
java.lang.IllegalArgumentException - If the classifier doesn't have two categories.
See Also:
AbstractVectorClassifier.classify(Vector)
1.8 train
public void train(long trackingKey,
                  java.lang.String groupKey,
                  int actual,
                  Vector instance)
Description copied from interface: OnlineLearner
Updates the model using a particular target variable value and a feature vector.
There may be an assumption that, if multiple passes through the training data are necessary, the tracking key for a record will be the same on each pass, that there will be a relatively large number of distinct tracking keys, and that the low-order bits of the tracking keys will not correlate with any of the input variables. This tracking key is used to assign training examples to different test/training splits.

Examples of useful tracking keys include ID numbers for the training records, derived from a database ID for the base table from which the record is derived, or the offset of the original data record in a data file.


Specified by:
train in interface OnlineLearner
Parameters:
trackingKey - The tracking key for this training example.
groupKey - An optional value that allows examples to be grouped in the computation of the update to the model.
actual - The value of the target variable. This value should be in the half-open interval [0..n) where n is the number of target categories.
instance - The feature vector for this example.
1.9 train
public void train(long trackingKey,
                  int actual,
                  Vector instance)
Description copied from interface: OnlineLearner
Updates the model using a particular target variable value and a feature vector.
There may be an assumption that, if multiple passes through the training data are necessary, the tracking key for a record will be the same on each pass, that there will be a relatively large number of distinct tracking keys, and that the low-order bits of the tracking keys will not correlate with any of the input variables. This tracking key is used to assign training examples to different test/training splits.

Examples of useful tracking keys include ID numbers for the training records, derived from a database ID for the base table from which the record is derived, or the offset of the original data record in a data file.
Specified by:
train in interface OnlineLearner
Parameters:
trackingKey - The tracking key for this training example.
actual - The value of the target variable. This value should be in the half-open interval [0..n) where n is the number of target categories.
instance - The feature vector for this example.
1.10 train
public void train(int actual,
                  Vector instance)
Description copied from interface: OnlineLearner
Updates the model using a particular target variable value and a feature vector.
There may be an assumption that, if multiple passes through the training data are necessary, the training examples will be presented in the same order each time. This is because the order of training examples may be used to assign records to different data splits for evaluation by cross-validation. Without this order invariance, a record might be assigned to both training and test splits, and error estimates could be seriously affected.

If re-ordering is necessary, use the alternative API, which allows a tracking key to be added to each training example.


Specified by:
train in interface OnlineLearner
Parameters:
actual - The value of the target variable. This value should be in the half-open interval [0..n) where n is the number of target categories.
instance - The feature vector for this example.
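To make "updates the model using a target value and a feature vector" concrete, here is a minimal sketch of one plain SGD step for the two-category case: the weights move along the feature vector in proportion to the prediction error. It assumes a fixed learning rate and no prior, uses arrays instead of Mahout's Vector, and all names are hypothetical; the annealing that subclasses must supply is deliberately left out.

```java
// Hypothetical illustration only; one un-regularized SGD step for binary
// logistic regression with a fixed learning rate.
final class SgdStepSketch {
    // Probability that the example belongs to category 1.
    static double predict(double[] beta, double[] x) {
        double dot = 0.0;
        for (int i = 0; i < beta.length; i++) {
            dot += beta[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    // Returns updated weights: beta + rate * (actual - predicted) * x.
    // `actual` must be 0 or 1.
    static double[] step(double[] beta, int actual, double[] x, double rate) {
        double error = actual - predict(beta, x);
        double[] updated = beta.clone();
        for (int i = 0; i < updated.length; i++) {
            updated[i] += rate * error * x[i];
        }
        return updated;
    }

    public static void main(String[] args) {
        double[] beta = new double[2];
        double[] x = {1.0, 2.0};
        for (int pass = 0; pass < 5; pass++) {
            beta = step(beta, 1, x, 0.1);
            System.out.println(predict(beta, x));
        }
    }
}
```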

2 AdaptiveLogisticRegression
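Maintains a pool of ordinary OnlineLogisticRegression learners; each member of the pool has a different learning rate.
One point to note is that the pool actually holds CrossFoldLearner objects, each of which contains several OnlineLogisticRegression objects.
These pools allow performance to be estimated even when many passes are made over the data. If you already have good parameters, you may prefer to run a single CrossFoldLearner with those settings.
The fitness measure used here is AUC. Using AUC means that AdaptiveLogisticRegression is best suited to classification problems with a binary target variable.
Non-binary cases can be handled by extending OnlineAuc.
Constructor: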
public AdaptiveLogisticRegression(int numCategories,
                                  int numFeatures,
                                  PriorFunction prior)
Method summary
2.1 train
public void train(int actual,
                  Vector instance)
Description copied from interface: OnlineLearner
Updates the model using a particular target variable value and a feature vector.
There may be an assumption that, if multiple passes through the training data are necessary, the training examples will be presented in the same order each time. This is because the order of training examples may be used to assign records to different data splits for evaluation by cross-validation. Without this order invariance, a record might be assigned to both training and test splits, and error estimates could be seriously affected.

If re-ordering is necessary, use the alternative API, which allows a tracking key to be added to each training example.

Specified by:
train in interface OnlineLearner
Parameters:
actual - The value of the target variable. This value should be in the half-open interval [0..n) where n is the number of target categories.
instance - The feature vector for this example.
2.2 train
public void train(long trackingKey,
                  int actual,
                  Vector instance)
Description copied from interface: OnlineLearner
Updates the model using a particular target variable value and a feature vector.
There may be an assumption that, if multiple passes through the training data are necessary, the tracking key for a record will be the same on each pass, that there will be a relatively large number of distinct tracking keys, and that the low-order bits of the tracking keys will not correlate with any of the input variables. This tracking key is used to assign training examples to different test/training splits.

Examples of useful tracking keys include ID numbers for the training records, derived from a database ID for the base table from which the record is derived, or the offset of the original data record in a data file.


Specified by:
train in interface OnlineLearner
Parameters:
trackingKey - The tracking key for this training example.
actual - The value of the target variable. This value should be in the half-open interval [0..n) where n is the number of target categories.
instance - The feature vector for this example.
2.3 train
public void train(long trackingKey,
                  java.lang.String groupKey,
                  int actual,
                  Vector instance)
Description copied from interface: OnlineLearner
Updates the model using a particular target variable value and a feature vector.
There may be an assumption that, if multiple passes through the training data are necessary, the tracking key for a record will be the same on each pass, that there will be a relatively large number of distinct tracking keys, and that the low-order bits of the tracking keys will not correlate with any of the input variables. This tracking key is used to assign training examples to different test/training splits.

Examples of useful tracking keys include ID numbers for the training records, derived from a database ID for the base table from which the record is derived, or the offset of the original data record in a data file.

Specified by:
train in interface OnlineLearner
Parameters:
trackingKey - The tracking key for this training example.
groupKey - An optional value that allows examples to be grouped in the computation of the update to the model.
actual - The value of the target variable. This value should be in the half-open interval [0..n) where n is the number of target categories.
instance - The feature vector for this example.
2.4 auc
public double auc()
Returns the AUC of the current best member of the population. If no member is best, usually because no training has been done yet, the result is NaN.

Returns:
The AUC of the best member of the population or NaN if we can't figure that out.
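As a reminder of what the AUC value measures, the sketch below computes it exactly for a small batch: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. It returns NaN when either class is missing. This is a batch illustration with hypothetical names, not the online estimate that Mahout maintains during training.

```java
// Hypothetical illustration only; exact batch AUC for binary labels.
final class AucSketch {
    // labels[i] is 0 or 1; scores[i] is the model's score for example i.
    static double auc(int[] labels, double[] scores) {
        double wins = 0.0;
        long pairs = 0;
        for (int i = 0; i < labels.length; i++) {
            if (labels[i] != 1) {
                continue;
            }
            for (int j = 0; j < labels.length; j++) {
                if (labels[j] != 0) {
                    continue;
                }
                pairs++;
                if (scores[i] > scores[j]) {
                    wins += 1.0;
                } else if (scores[i] == scores[j]) {
                    wins += 0.5;
                }
            }
        }
        // No positive/negative pair means the AUC is undefined.
        return pairs == 0 ? Double.NaN : wins / pairs;
    }

    public static void main(String[] args) {
        int[] labels = {0, 0, 1, 1};
        double[] scores = {0.1, 0.4, 0.35, 0.8};
        System.out.println(auc(labels, scores));
    }
}
```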

3 Class CrossFoldLearner
public class CrossFoldLearner extends AbstractVectorClassifier implements OnlineLearner
Does cross-fold validation of log-likelihood and AUC on several online logistic regression models. Each record is passed to all but one of the models for training and to the remaining model for evaluation. In order to maintain proper segregation between the different folds across training data iterations, data should either be passed to this learner in the same order each time the training data is traversed or a tracking key such as the file offset of the training record should be passed with each training example.
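The "all but one model trains, the remaining one evaluates" rule can be sketched by deriving a held-out fold from the tracking key, which is why a stable key (or a stable record order) keeps each record in the same fold on every pass. The modulo assignment and all names below are assumptions made for illustration, not a statement of how CrossFoldLearner is implemented.

```java
// Hypothetical illustration only; fold assignment driven by a tracking key.
final class FoldAssignmentSketch {
    // Index of the single model that evaluates this record instead of training on it.
    // floorMod keeps the result non-negative even for negative keys.
    static int heldOutFold(long trackingKey, int numFolds) {
        return (int) Math.floorMod(trackingKey, (long) numFolds);
    }

    // True if the model at modelIndex should train on the record with this key.
    static boolean trainsOn(int modelIndex, long trackingKey, int numFolds) {
        return modelIndex != heldOutFold(trackingKey, numFolds);
    }

    public static void main(String[] args) {
        int numFolds = 5;
        long trackingKey = 7L; // e.g. a file offset or database ID
        for (int model = 0; model < numFolds; model++) {
            String role = trainsOn(model, trackingKey, numFolds) ? "train" : "evaluate";
            System.out.println("model " + model + ": " + role);
        }
    }
}
```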
