learningToRank-introduction

Learning To Rank 

  1. LTR Introduction http://www.cnblogs.com/bentuwuying/p/6681943.html
    1. The ranking problem
      Training data: how relevance labels are produced
      Features: relevance (e.g. BM25) and importance (e.g. PageRank)
      Evaluation: NDCG, MAP (a small NDCG sketch appears at the end of this item)
    2. Formulation: 
       ListWise Loss Function
    3. Learning To Rank Methods:
      1)  Pointwise: Subset Ranking, McRank, PRank, OC SVM
      2)  Pairwise: Ranking SVM, RankBoost, RankNet, GBRank, IR SVM, LambdaRank, LambdaMART
      3)  Listwise: ListNet, ListMLE, AdaRank, SVM-MAP, SoftRank
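      To make the NDCG metric above concrete, here is a minimal sketch (plain NumPy; mine, not taken from the linked post):

        import numpy as np

        def dcg_at_k(labels, k):
            # graded relevance labels in ranked order; gain 2^rel - 1, log2 position discount
            labels = np.asarray(labels, dtype=float)[:k]
            return np.sum((2 ** labels - 1) / np.log2(np.arange(2, labels.size + 2)))

        def ndcg_at_k(labels, k):
            ideal = dcg_at_k(sorted(labels, reverse=True), k)
            return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0

        # example: a ranking that puts a grade-2 document first, then grade-0, then grade-1
        print(ndcg_at_k([2, 0, 1], k=3))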
  2. LTR Algorithm: RankSVM and IRSVM  http://www.cnblogs.com/bentuwuying/p/6683832.html
    1. RankSVM (pairwise): for each query, every document pair with different relevance labels yields a feature-vector difference, which becomes a positive or negative training sample (see the sketch after this item).
    2. IRSVM: different document pairs are assigned different loss weights.
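    A minimal sketch of the pairwise sample construction described above (hypothetical helper, not from the linked post); a linear SVM trained on these difference vectors gives a Ranking SVM style model:

      import numpy as np

      def pairwise_samples(X, y):
          """X: (n_docs, n_features) for one query; y: graded relevance labels."""
          feats, labels = [], []
          for i in range(len(y)):
              for j in range(len(y)):
                  if y[i] > y[j]:
                      feats.append(X[i] - X[j]); labels.append(+1)   # preferred minus other
                      feats.append(X[j] - X[i]); labels.append(-1)   # and the mirrored pair
          return np.array(feats), np.array(labels)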
  3. LTR Algorithm: GBRank  http://www.cnblogs.com/bentuwuying/p/6684585.html
    1. Loss function: defined over preference pairs (see the sketch after this item).

    2. Functional gradient descent

      For each pair the current model orders incorrectly, the less preferred document's score plus the margin is used as the regression target for the preferred document (and vice versa), and a regression tree is fit to these targets.
    3. Results
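    The pairwise loss referred to above is usually written as a squared hinge over the preference pairs (x_i preferred to y_i), with margin τ > 0:

        R(h) = \frac{1}{2} \sum_{i:\, x_i \succ y_i} \big( \max\{0,\ \tau - (h(x_i) - h(y_i))\} \big)^2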
  4. LTR Algorithm: RankNet, LambdaRank, LambdaMart http://www.cnblogs.com/bentuwuying/p/6690836.html  http://blog.csdn.net/huagong_adu/article/details/40710305
    1. RankNet
      1. The ranking problem is recast as a probability problem: score differences between two documents are mapped to the probability that one should be ranked above the other.
      2. Loss function: the cross entropy between the predicted and the target pair probabilities (see the formulas after this item).
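      In the standard RankNet formulation (Burges et al.), with s_i and s_j the model scores of documents i and j:

        P_{ij} = \frac{1}{1 + e^{-\sigma (s_i - s_j)}}, \qquad
        C_{ij} = -\bar{P}_{ij} \log P_{ij} - (1 - \bar{P}_{ij}) \log (1 - P_{ij})

      where \bar{P}_{ij} is the target probability from the labels (1 if document i is more relevant, 0 if less, 1/2 if equally relevant).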
    2. LambdaRank:
      1. Instead of deriving gradients from an explicit loss, the gradient of the cost with respect to each document's score (the lambda) is specified directly (see the formula after this item).
      2. It can also incorporate other evaluation metrics (NDCG, ERR, ...) by scaling each pair's lambda with the metric change caused by swapping the pair.
      3. The loss function is therefore implicit: it is whatever function has these lambdas as its gradient.
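      For a pair where document i is more relevant than document j, the commonly used form is:

        \lambda_{ij} = \frac{-\sigma}{1 + e^{\sigma (s_i - s_j)}} \, \left| \Delta \mathrm{NDCG}_{ij} \right|

      and each document's lambda is the signed sum of the \lambda_{ij} over all pairs it appears in.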
    3. LambdaMart:  https://liam0205.me/2016/07/10/a-not-so-simple-introduction-to-lambdamart/
      1. Combines the lambdas with MART (multiple additive regression trees): each boosting iteration fits a regression tree to the per-document lambdas and adds it to the ensemble (see the sketch below).
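      A compact single-query sketch of one such boosting round (using scikit-learn's DecisionTreeRegressor as the weak learner; the names and simplifications are mine, not from the linked posts, and the Newton step on the leaf values is omitted):

        import numpy as np
        from sklearn.tree import DecisionTreeRegressor

        def boosting_round(X, y, scores, trees, lr=0.1, sigma=1.0):
            """One LambdaMART-style round for a single query's documents (scores: np.ndarray)."""
            n = len(y)
            lambdas = np.zeros(n)
            ideal = np.sort(y)[::-1]                       # ideal DCG, used to normalise swap deltas
            idcg = np.sum((2.0 ** ideal - 1) / np.log2(np.arange(2, n + 2)))
            order = np.argsort(-scores)                    # current ranking
            rank = np.empty(n, dtype=int)
            rank[order] = np.arange(n)
            for i in range(n):
                for j in range(n):
                    if y[i] <= y[j]:
                        continue                           # only pairs with i more relevant than j
                    gain = 2.0 ** y[i] - 2.0 ** y[j]
                    disc = 1 / np.log2(rank[i] + 2) - 1 / np.log2(rank[j] + 2)
                    delta_ndcg = abs(gain * disc) / idcg if idcg > 0 else 0.0
                    rho = 1.0 / (1.0 + np.exp(sigma * (scores[i] - scores[j])))
                    lambdas[i] += sigma * rho * delta_ndcg  # push the more relevant doc up
                    lambdas[j] -= sigma * rho * delta_ndcg  # and the less relevant one down
            tree = DecisionTreeRegressor(max_leaf_nodes=10).fit(X, lambdas)
            trees.append((lr, tree))
            return scores + lr * tree.predict(X)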

        LTR-TestDataSet

        1. Learning To Rank (LETOR4.0)  https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/?from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fum%2Fbeijing%2Fprojects%2Fletor%2Fletor4.0%2Fevaluation%2Feval-score-4.0.pl.txt  (July 2009)
                 LETOR is a package of benchmark data sets for research on LEarning TO Rank, which contains standard features, relevance judgments, data partitioning, evaluation tools, and several baselines. Version 1.0 was released in April 2007. Version 2.0 was released in Dec. 2007. Version 3.0 was released in Dec. 2008. This version, 4.0, was released in July 2009. Very different from previous versions (V3.0 is an update based on V2.0 and V2.0 is an update based on V1.0), LETOR4.0 is a totally new release. It uses the Gov2 web page collection (~25M pages) and two query sets from Million Query track of TREC 2007 and TREC 2008. We call the two query sets MQ2007 and MQ2008 for short. There are about 1700 queries in MQ2007 with labeled documents and about 800 queries in MQ2008 with labeled documents.
          1. Each row is a query-document pair. The first column is the relevance label of the pair, the second column is the query id, the following columns are features, and the end of the row is a comment about the pair, including the id of the document. The larger the relevance label, the more relevant the query-document pair. A query-document pair is represented by a 46-dimensional feature vector. Here are several example rows from the MQ2007 dataset (a small parser sketch follows them):
             

            2 qid:10032 1:0.056537 2:0.000000 3:0.666667 4:1.000000 5:0.067138 … 45:0.000000 46:0.076923 #docid = GX029-35-5894638 inc = 0.0119881192468859 prob = 0.139842

             
            1 qid:10032 1:0.593640 2:1.000000 3:0.000000 4:0.000000 5:0.600707 … 45:0.500000 46:0.000000 #docid = GX256-43-0740276 inc = 0.0136292023050293 prob = 0.400738
            0 qid:10032 1:0.279152 2:0.000000 3:0.000000 4:0.000000 5:0.279152 … 45:0.250000 46:1.000000 #docid = GX030-77-6315042 inc = 1 prob = 0.341364
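            A minimal parser for one such row (hypothetical helper, shown only to make the format concrete):

              def parse_letor_line(line):
                  body, _, comment = line.partition('#')
                  tokens = body.split()
                  label = int(tokens[0])                 # relevance label
                  qid = tokens[1].split(':')[1]          # query id
                  feats = {int(k): float(v) for k, v in (t.split(':') for t in tokens[2:])}
                  return label, qid, feats, comment.strip()

              label, qid, feats, comment = parse_letor_line(
                  "2 qid:10032 1:0.056537 2:0.000000 3:0.666667 #docid = GX029-35-5894638")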
          2. It can be downloaded here: https://onedrive.live.com/?authkey=%21ACnoZZSZVfHPJd0&id=8FEADC23D838BDA8%21107&cid=8FEADC23D838BDA8
          3. LETOR4.0 contains four kinds of data sets for different training settings, all available from the link above:

            Setting                   Datasets
            Supervised ranking        MQ2007, MQ2008
            Semi-supervised ranking   MQ2007-semi, MQ2008-semi
            Rank aggregation          MQ2007-agg, MQ2008-agg
            Listwise ranking          MQ2007-list, MQ2008-list
        2. Microsoft Learning to Rank Datasets: https://www.microsoft.com/en-us/research/project/mslr/?from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fprojects%2Fmslr%2Fdownload.aspx  June 16, 2010.
          1. The datasets are machine learning data, in which queries and urls are represented by IDs. The datasets consist of feature vectors extracted from query-url pairs along with relevance judgment labels:

            (1) The relevance judgments are obtained from a retired labeling set of a commercial web search engine (Microsoft Bing), which take 5 values from 0 (irrelevant) to 4 (perfectly relevant).

            (2) The features are basically extracted by us, and are those widely used in the research community.

            In the data files, each row corresponds to a query-url pair. The first column is relevance label of the pair, the second column is query id, and the following columns are features. The larger value the relevance label has, the more relevant the query-url pair is. A query-url pair is represented by a 136-dimensional feature vector.

          2. The datasets were released on June 16, 2010. They can be downloaded here:
            Datasets                    Size       MD5
            MSLR-WEB10K     ~ 1.2G       97c5d4e7c171e475c91d7031e4fd8e79
            MSLR-WEB30K     ~ 3.7G       4beae4bee0cd244fc9b2aff355a61555
          3. The full list of the 136 features is given on the web page: https://www.microsoft.com/en-us/research/project/mslr/?from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fprojects%2Fmslr%2Fdownload.aspx

        LTR-code:

        1. RankLib-java:  https://sourceforge.net/p/lemur/wiki/RankLib/
          1. This website contains every released version of RankLib: https://sourceforge.net/projects/lemur/files/lemur/

            It implements the algorithms below:

            1. 0: MART(gradient boosted regression tree)
            2. 1: RankNet
            3. 2: RankBoost
            4. 3: AdaRank
            5. 4: Coordinate Ascent
            6. 5: LambdaMART
            7. 6: ListNet
            8. 7: Random Forests
          2. The file format is as follow: https://sourceforge.net/p/lemur/wiki/RankLib%20File%20Format/
            3 qid:1 1:1 2:1 3:0 4:0.2 5:0 # 1A
            2 qid:1 1:0 2:0 3:1 4:0.1 5:1 # 1B 
            1 qid:1 1:0 2:1 3:0 4:0.4 5:0 # 1C
          3. How to use this code: https://sourceforge.net/p/lemur/wiki/RankLib%20How%20to%20use/

        2. pyltr (Python): https://github.com/jma127/pyltr
          1. pyltr is a Python learning-to-rank toolkit with 

            1. ranking models: LambdaMART

            2. evaluation metrics: NDCG, MAP, ERR, AUC_ROC 

            3. data wrangling helpers:

              1. Query groupers and validators (pyltr.util.group.check_qids, pyltr.util.group.get_groups)

              2. Data loaders (e.g. pyltr.data.letor.read)

          2. The input file format is the same as the LETOR dataset format.
          3. How to use this code: see the Example part of the README (a sketch follows this item).
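            A sketch following that Example section (names such as read_dataset, NDCG, ValidationMonitor and LambdaMART are taken from the pyltr README and should be checked against the installed version; the file paths are placeholders):

              import pyltr

              with open('MQ2008/Fold1/train.txt') as trainfile, \
                   open('MQ2008/Fold1/vali.txt') as valifile, \
                   open('MQ2008/Fold1/test.txt') as testfile:
                  TX, Ty, Tqids, _ = pyltr.data.letor.read_dataset(trainfile)
                  VX, Vy, Vqids, _ = pyltr.data.letor.read_dataset(valifile)
                  EX, Ey, Eqids, _ = pyltr.data.letor.read_dataset(testfile)

              metric = pyltr.metrics.NDCG(k=10)
              # stop adding trees once validation NDCG@10 stops improving
              monitor = pyltr.models.monitors.ValidationMonitor(VX, Vy, Vqids, metric=metric, stop_after=250)
              model = pyltr.models.LambdaMART(metric=metric, n_estimators=1000, learning_rate=0.02, verbose=1)
              model.fit(TX, Ty, Tqids, monitor=monitor)
              print('test NDCG@10:', metric.calc_mean(Eqids, Ey, model.predict(EX)))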
        3. elasticsearch-learning-to-rank: https://github.com/o19s/elasticsearch-learning-to-rank


RankLib-intro

The entry point of the jar is the Evaluator class.

The fit() part:

ciir.umass.edu.learning

1. DataPoint: each query-document pair is represented by a DataPoint object, which contains the label, the id (qid), fvals (the feature values), and a description.

2. RankList: the list object used when training a listwise model. A RankList object contains a List<DataPoint> rl, the set of DataPoints that share the same qid.

3. Ranker: the generic ranker base class; the concrete ranking algorithms extend it.

   Its main methods are: init(), learn(), save(String file), load(String file), and eval(DataPoint p).

4. Ensemble: the model container for the tree-based rankers, used by the following LTR algorithms: MART (GBDT), LambdaMART, and RFRanker (random forests).

    It mainly contains two lists: trees (List<RegressionTree>) and weights (List<Float>). When an Ensemble is saved to a file, its toString() output is written.

   The tree model can be reconstructed from the string read back from that file.

5. LambdaMART: the LambdaMART algorithm. The implemented methods include init(), learn(), eval(DataPoint), toString(), and load(String file).

    init(): gets the sorted feature values, creates a table of candidate thresholds, and computes the feature histogram.

    learn(): fits each regression tree and adds it to the Ensemble, then keeps the number of trees that performs best on the validation data.

6. FeatureHistogram: records all values of each feature so that the candidate threshold values can be derived. This class speeds up decision tree training (see the sketch below).
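    Not RankLib's Java code, just a sketch of the idea (names are mine): pre-sort each feature once, take its distinct values as candidate thresholds, and accumulate per-bin counts and label sums so every candidate split can be scored without rescanning the data.

      import numpy as np

      def feature_histogram(values, labels):
          thresholds = np.unique(values)                 # sorted distinct values = candidate thresholds
          bins = np.searchsorted(thresholds, values)     # bin index of each sample
          count = np.bincount(bins, minlength=len(thresholds))
          label_sum = np.bincount(bins, weights=labels, minlength=len(thresholds))
          # cumulative sums give, for every threshold, the count and label sum of the left partition
          return thresholds, np.cumsum(count), np.cumsum(label_sum)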

7. RegressionTree: the tree's root node is a Split object. The core method is fit().

8. Split: a tree node. A leaf node stores the avgLabel; an internal node stores a feature_id and threshold, together with its left and right Split children (see the sketch below).
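    A Python sketch of this structure (again not RankLib's Java code), including how the Ensemble combines the trees' outputs:

      class Split:
          def __init__(self, feature_id=None, threshold=None, left=None, right=None, avg_label=None):
              self.feature_id, self.threshold = feature_id, threshold
              self.left, self.right = left, right
              self.avg_label = avg_label                 # set only on leaf nodes

          def eval(self, fvals):
              if self.avg_label is not None:             # leaf: return the average label
                  return self.avg_label
              child = self.left if fvals[self.feature_id] <= self.threshold else self.right
              return child.eval(fvals)

      # ensemble score = weighted sum of the individual trees' outputs
      def ensemble_eval(trees, weights, fvals):
          return sum(w * tree.eval(fvals) for tree, w in zip(trees, weights))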

 

The score part: 

ciir.umass.edu.metric

Contains the various listwise evaluation metrics.

 

The util part:

ciir.umass.edu.utilities

Contains the file input/output helpers and the Sorter utility.


