learningToRank-introduction

Learning To Rank 

  1. LTR Introduction http://www.cnblogs.com/bentuwuying/p/6681943.html
    1. The ranking problem
      Training data: how relevance labels are produced
      Features: relevance (e.g. BM25) and importance (e.g. PageRank)
      Evaluation: NDCG, MAP (a small NDCG sketch appears at the end of this item)
    2. Formulation: 
       ListWise Loss Function
    3. Learning To Rank Methods:
      1)  Pointwise: Subset Ranking, McRank, PRank, OC SVM
      2)  Pairwise: Ranking SVM, RankBoost, RankNet, GBRank, IR SVM, LambdaRank, LambdaMART
      3)  Listwise: ListNet, ListMLE, AdaRank, SVM-MAP, SoftRank
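      To make the NDCG metric above concrete, here is a minimal sketch (plain NumPy; mine, not taken from the linked post):

        import numpy as np

        def dcg_at_k(labels, k):
            # graded relevance labels in ranked order; gain 2^rel - 1, log2 position discount
            labels = np.asarray(labels, dtype=float)[:k]
            return np.sum((2 ** labels - 1) / np.log2(np.arange(2, labels.size + 2)))

        def ndcg_at_k(labels, k):
            ideal = dcg_at_k(sorted(labels, reverse=True), k)
            return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0

        # example: a ranking that puts a grade-2 document first, then grade-0, then grade-1
        print(ndcg_at_k([2, 0, 1], k=3))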
  2. LTR Algorithm: RankSVM and IRSVM  http://www.cnblogs.com/bentuwuying/p/6683832.html
    1. RankSVM (pairwise): for each query, every document pair with different relevance labels yields a feature-vector difference, which becomes a positive or negative training sample (see the sketch after this item).
    2. IRSVM: different document pairs are assigned different loss weights.
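    A minimal sketch of the pairwise sample construction described above (hypothetical helper, not from the linked post); a linear SVM trained on these difference vectors gives a Ranking SVM style model:

      import numpy as np

      def pairwise_samples(X, y):
          """X: (n_docs, n_features) for one query; y: graded relevance labels."""
          feats, labels = [], []
          for i in range(len(y)):
              for j in range(len(y)):
                  if y[i] > y[j]:
                      feats.append(X[i] - X[j]); labels.append(+1)   # preferred minus other
                      feats.append(X[j] - X[i]); labels.append(-1)   # and the mirrored pair
          return np.array(feats), np.array(labels)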
  3. LTR Algorithm: GBRank  http://www.cnblogs.com/bentuwuying/p/6684585.html
    1. Loss function: defined over preference pairs (see the sketch after this item).

    2. Functional gradient descent

      For each pair the current model orders incorrectly, the less preferred document's score plus the margin is used as the regression target for the preferred document (and vice versa), and a regression tree is fit to these targets.
    3. Results
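    The pairwise loss referred to above is usually written as a squared hinge over the preference pairs (x_i preferred to y_i), with margin τ > 0:

        R(h) = \frac{1}{2} \sum_{i:\, x_i \succ y_i} \big( \max\{0,\ \tau - (h(x_i) - h(y_i))\} \big)^2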
  4. LTR Algorithm: RankNet, LambdaRank, LambdaMart http://www.cnblogs.com/bentuwuying/p/6690836.html  http://blog.csdn.net/huagong_adu/article/details/40710305
    1. RankNet
      1. The ranking problem is recast as a probability problem: score differences between two documents are mapped to the probability that one should be ranked above the other.
      2. Loss function: the cross entropy between the predicted and the target pair probabilities (see the formulas after this item).
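      In the standard RankNet formulation (Burges et al.), with s_i and s_j the model scores of documents i and j:

        P_{ij} = \frac{1}{1 + e^{-\sigma (s_i - s_j)}}, \qquad
        C_{ij} = -\bar{P}_{ij} \log P_{ij} - (1 - \bar{P}_{ij}) \log (1 - P_{ij})

      where \bar{P}_{ij} is the target probability from the labels (1 if document i is more relevant, 0 if less, 1/2 if equally relevant).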
    2. LambdaRank:
      1. Instead of deriving gradients from an explicit loss, the gradient of the cost with respect to each document's score (the lambda) is specified directly (see the formula after this item).
      2. It can also incorporate other evaluation metrics (NDCG, ERR, ...) by scaling each pair's lambda with the metric change caused by swapping the pair.
      3. The loss function is therefore implicit: it is whatever function has these lambdas as its gradient.
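      For a pair where document i is more relevant than document j, the commonly used form is:

        \lambda_{ij} = \frac{-\sigma}{1 + e^{\sigma (s_i - s_j)}} \, \left| \Delta \mathrm{NDCG}_{ij} \right|

      and each document's lambda is the signed sum of the \lambda_{ij} over all pairs it appears in.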
    3. LambdaMart:  https://liam0205.me/2016/07/10/a-not-so-simple-introduction-to-lambdamart/
      1. Combines the lambdas with MART (multiple additive regression trees): each boosting iteration fits a regression tree to the per-document lambdas and adds it to the ensemble (see the sketch below).
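      A compact single-query sketch of one such boosting round (using scikit-learn's DecisionTreeRegressor as the weak learner; the names and simplifications are mine, not from the linked posts, and the Newton step on the leaf values is omitted):

        import numpy as np
        from sklearn.tree import DecisionTreeRegressor

        def boosting_round(X, y, scores, trees, lr=0.1, sigma=1.0):
            """One LambdaMART-style round for a single query's documents (scores: np.ndarray)."""
            n = len(y)
            lambdas = np.zeros(n)
            ideal = np.sort(y)[::-1]                       # ideal DCG, used to normalise swap deltas
            idcg = np.sum((2.0 ** ideal - 1) / np.log2(np.arange(2, n + 2)))
            order = np.argsort(-scores)                    # current ranking
            rank = np.empty(n, dtype=int)
            rank[order] = np.arange(n)
            for i in range(n):
                for j in range(n):
                    if y[i] <= y[j]:
                        continue                           # only pairs with i more relevant than j
                    gain = 2.0 ** y[i] - 2.0 ** y[j]
                    disc = 1 / np.log2(rank[i] + 2) - 1 / np.log2(rank[j] + 2)
                    delta_ndcg = abs(gain * disc) / idcg if idcg > 0 else 0.0
                    rho = 1.0 / (1.0 + np.exp(sigma * (scores[i] - scores[j])))
                    lambdas[i] += sigma * rho * delta_ndcg  # push the more relevant doc up
                    lambdas[j] -= sigma * rho * delta_ndcg  # and the less relevant one down
            tree = DecisionTreeRegressor(max_leaf_nodes=10).fit(X, lambdas)
            trees.append((lr, tree))
            return scores + lr * tree.predict(X)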

        LTR-TestDataSet

        1. Learning To Rank (LETOR4.0)  https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/?from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fum%2Fbeijing%2Fprojects%2Fletor%2Fletor4.0%2Fevaluation%2Feval-score-4.0.pl.txt  (July 2009)
                 LETOR is a package of benchmark data sets for research on LEarning TO Rank, which contains standard features, relevance judgments, data partitioning, evaluation tools, and several baselines. Version 1.0 was released in April 2007. Version 2.0 was released in Dec. 2007. Version 3.0 was released in Dec. 2008. This version, 4.0, was released in July 2009. Very different from previous versions (V3.0 is an update based on V2.0 and V2.0 is an update based on V1.0), LETOR4.0 is a totally new release. It uses the Gov2 web page collection (~25M pages) and two query sets from Million Query track of TREC 2007 and TREC 2008. We call the two query sets MQ2007 and MQ2008 for short. There are about 1700 queries in MQ2007 with labeled documents and about 800 queries in MQ2008 with labeled documents.
          1. Each row is a query-document pair. The first column is the relevance label of the pair, the second column is the query id, the following columns are features, and the end of the row is a comment about the pair, including the id of the document. The larger the relevance label, the more relevant the query-document pair. A query-document pair is represented by a 46-dimensional feature vector. Here are several example rows from the MQ2007 dataset (a small parser sketch follows them):
             

            2 qid:10032 1:0.056537 2:0.000000 3:0.666667 4:1.000000 5:0.067138 … 45:0.000000 46:0.076923 #docid = GX029-35-5894638 inc = 0.0119881192468859 prob = 0.139842

             
            1 qid:10032 1:0.593640 2:1.000000 3:0.000000 4:0.000000 5:0.600707 … 45:0.500000 46:0.000000 #docid = GX256-43-0740276 inc = 0.0136292023050293 prob = 0.400738
            0 qid:10032 1:0.279152 2:0.000000 3:0.000000 4:0.000000 5:0.279152 … 45:0.250000 46:1.000000 #docid = GX030-77-6315042 inc = 1 prob = 0.341364
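            A minimal parser for one such row (hypothetical helper, shown only to make the format concrete):

              def parse_letor_line(line):
                  body, _, comment = line.partition('#')
                  tokens = body.split()
                  label = int(tokens[0])                 # relevance label
                  qid = tokens[1].split(':')[1]          # query id
                  feats = {int(k): float(v) for k, v in (t.split(':') for t in tokens[2:])}
                  return label, qid, feats, comment.strip()

              label, qid, feats, comment = parse_letor_line(
                  "2 qid:10032 1:0.056537 2:0.000000 3:0.666667 #docid = GX029-35-5894638")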
          2. It can be downloaded here: https://onedrive.live.com/?authkey=%21ACnoZZSZVfHPJd0&id=8FEADC23D838BDA8%21107&cid=8FEADC23D838BDA8
          3. LETOR4.0 contains four kinds of data sets for different training settings, all available from the link above:

            Setting                   Datasets
            Supervised ranking        MQ2007, MQ2008
            Semi-supervised ranking   MQ2007-semi, MQ2008-semi
            Rank aggregation          MQ2007-agg, MQ2008-agg
            Listwise ranking          MQ2007-list, MQ2008-list
        2. Microsoft Learning to Rank Datasets: https://www.microsoft.com/en-us/research/project/mslr/?from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fprojects%2Fmslr%2Fdownload.aspx  June 16, 2010.
          1. The datasets are machine learning data, in which queries and urls are represented by IDs. The datasets consist of feature vectors extracted from query-url pairs along with relevance judgment labels:

            (1) The relevance judgments are obtained from a retired labeling set of a commercial web search engine (Microsoft Bing), which take 5 values from 0 (irrelevant) to 4 (perfectly relevant).

            (2) The features are basically extracted by us, and are those widely used in the research community.

            In the data files, each row corresponds to a query-url pair. The first column is relevance label of the pair, the second column is query id, and the following columns are features. The larger value the relevance label has, the more relevant the query-url pair is. A query-url pair is represented by a 136-dimensional feature vector.

          2. The datasets were released on June 16, 2010. They can be downloaded here:
            Datasets                    Size       MD5
            MSLR-WEB10K     ~ 1.2G       97c5d4e7c171e475c91d7031e4fd8e79
            MSLR-WEB30K     ~ 3.7G       4beae4bee0cd244fc9b2aff355a61555
          3. The full list of the 136 features is given on the web page: https://www.microsoft.com/en-us/research/project/mslr/?from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fprojects%2Fmslr%2Fdownload.aspx

        LTR-code:

        1. RankLib-java:  https://sourceforge.net/p/lemur/wiki/RankLib/
          1. This website contains every released version of RankLib: https://sourceforge.net/projects/lemur/files/lemur/

            It implements the algorithms below:

            1. 0: MART(gradient boosted regression tree)
            2. 1: RankNet
            3. 2: RankBoost
            4. 3: AdaRank
            5. 4: Coordinate Ascent
            6. 5: LambdaMART
            7. 6: ListNet
            8. 7: Random Forests
          2. The file format is as follow: https://sourceforge.net/p/lemur/wiki/RankLib%20File%20Format/
            3 qid:1 1:1 2:1 3:0 4:0.2 5:0 # 1A
            2 qid:1 1:0 2:0 3:1 4:0.1 5:1 # 1B 
            1 qid:1 1:0 2:1 3:0 4:0.4 5:0 # 1C
          3. How to use this code: https://sourceforge.net/p/lemur/wiki/RankLib%20How%20to%20use/

        2. pyltr (Python): https://github.com/jma127/pyltr
          1. pyltr is a Python learning-to-rank toolkit with 

            1. ranking models: LambdaMART

            2. evaluation metrics: NDCG, MAP, ERR, AUC_ROC 

            3. data wrangling helpers:

              1. Query groupers and validators (pyltr.util.group.check_qids, pyltr.util.group.get_groups)

              2. Data loaders (e.g. pyltr.data.letor.read)

          2. The input file format is the same as the LETOR dataset format.
          3. How to use this code: see the Example part of the README (a sketch follows this item).
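            A sketch following that Example section (names such as read_dataset, NDCG, ValidationMonitor and LambdaMART are taken from the pyltr README and should be checked against the installed version; the file paths are placeholders):

              import pyltr

              with open('MQ2008/Fold1/train.txt') as trainfile, \
                   open('MQ2008/Fold1/vali.txt') as valifile, \
                   open('MQ2008/Fold1/test.txt') as testfile:
                  TX, Ty, Tqids, _ = pyltr.data.letor.read_dataset(trainfile)
                  VX, Vy, Vqids, _ = pyltr.data.letor.read_dataset(valifile)
                  EX, Ey, Eqids, _ = pyltr.data.letor.read_dataset(testfile)

              metric = pyltr.metrics.NDCG(k=10)
              # stop adding trees once validation NDCG@10 stops improving
              monitor = pyltr.models.monitors.ValidationMonitor(VX, Vy, Vqids, metric=metric, stop_after=250)
              model = pyltr.models.LambdaMART(metric=metric, n_estimators=1000, learning_rate=0.02, verbose=1)
              model.fit(TX, Ty, Tqids, monitor=monitor)
              print('test NDCG@10:', metric.calc_mean(Eqids, Ey, model.predict(EX)))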
        3. elasticsearch-learning-to-rank: https://github.com/o19s/elasticsearch-learning-to-rank


RankLib-intro

The entry point of the jar is the Evaluator class.

The fit() part:

ciir.umass.edu.learning

1. DataPoint: each query-document pair is represented by a DataPoint object, which contains the label, the id (qid), fvals (the feature values), and a description.

2. RankList: the list object used when training a listwise model. A RankList object contains a List<DataPoint> rl, the set of DataPoints that share the same qid.

3. Ranker: the generic ranker base class; the concrete ranking algorithms extend it.

   Its main methods are: init(), learn(), save(String file), load(String file), and eval(DataPoint p).

4. Ensemble: the model container for the tree-based rankers, used by the following LTR algorithms: MART (GBDT), LambdaMART, and RFRanker (random forests).

    It mainly contains two lists: trees (List<RegressionTree>) and weights (List<Float>). When an Ensemble is saved to a file, its toString() output is written.

   The tree model can be reconstructed from the string read back from that file.

5. LambdaMART: the LambdaMART algorithm. The implemented methods include init(), learn(), eval(DataPoint), toString(), and load(String file).

    init(): gets the sorted feature values, creates a table of candidate thresholds, and computes the feature histogram.

    learn(): fits each regression tree and adds it to the Ensemble, then keeps the number of trees that performs best on the validation data.

6. FeatureHistogram: records all values of each feature so that the candidate threshold values can be derived. This class speeds up decision tree training (see the sketch below).
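    Not RankLib's Java code, just a sketch of the idea (names are mine): pre-sort each feature once, take its distinct values as candidate thresholds, and accumulate per-bin counts and label sums so every candidate split can be scored without rescanning the data.

      import numpy as np

      def feature_histogram(values, labels):
          thresholds = np.unique(values)                 # sorted distinct values = candidate thresholds
          bins = np.searchsorted(thresholds, values)     # bin index of each sample
          count = np.bincount(bins, minlength=len(thresholds))
          label_sum = np.bincount(bins, weights=labels, minlength=len(thresholds))
          # cumulative sums give, for every threshold, the count and label sum of the left partition
          return thresholds, np.cumsum(count), np.cumsum(label_sum)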

7. RegressionTree: the tree's root node is a Split object. The core method is fit().

8. Split: a tree node. A leaf node stores the avgLabel; an internal node stores a feature_id and threshold, together with its left and right Split children (see the sketch below).
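    A Python sketch of this structure (again not RankLib's Java code), including how the Ensemble combines the trees' outputs:

      class Split:
          def __init__(self, feature_id=None, threshold=None, left=None, right=None, avg_label=None):
              self.feature_id, self.threshold = feature_id, threshold
              self.left, self.right = left, right
              self.avg_label = avg_label                 # set only on leaf nodes

          def eval(self, fvals):
              if self.avg_label is not None:             # leaf: return the average label
                  return self.avg_label
              child = self.left if fvals[self.feature_id] <= self.threshold else self.right
              return child.eval(fvals)

      # ensemble score = weighted sum of the individual trees' outputs
      def ensemble_eval(trees, weights, fvals):
          return sum(w * tree.eval(fvals) for tree, w in zip(trees, weights))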

 

The score part: 

ciir.umass.edu.metric

Contains the various listwise evaluation metrics.

 

The util part:

ciir.umass.edu.utilities

Contains the file input/output helpers and the Sorter utility.


