Relevance Retrieval Algorithms

dianlong4020

Original link: http://www.cnblogs.com/mazhimazhi/p/7499445.html

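A job posting for an expert-level ad business developer lists the following requirements:

Job description:

1. Design, develop, and continuously optimize a high-performance ad engine;

2. Develop and continuously optimize the ad billing system and the data analysis system;

3. Work with the ad algorithm team to develop and optimize core serving strategies such as ad recall, ranking, bidding, and frequency/traffic control;

4. Analyze the shortcomings of the existing system, find its current bottlenecks, and improve its performance.

Requirements:

- Bachelor's degree or above in computer science or a related field, with at least one year of experience in online advertising or a related area

- Familiar with distributed systems, data storage, and network communication

- Solid understanding of algorithms and common data structures

- Familiar with the Linux development environment; proficient in C++ or Java

- Knowledge of statistical models; experience with machine learning, data mining, or related technologies is a plus

- Experience developing large-scale (more than ten million online users) AdExchange or DSP systems is a plus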

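Term frequency (TF) of a keyword: for example, if a word occurs 35 times in a document of 1,000 words, then TF = 35/1000 = 0.035.

Inverse document frequency (IDF): for a keyword t, IDF(t) = log(total number of documents / number of documents containing t).

For a query containing m keywords, the similarity between the query and a document can then be expressed as:

TF1*IDF1+TF2*IDF2+...+TFm*IDFm

Or, more simply, as a vector:

doc = (X1,X2,...,Xm)

where Xi = TFi*IDFi for i = 1, ..., m.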
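The formulas above can be sketched in a few lines of Java. This is only an illustration: the class name `TfIdfSketch`, its method names, and the corpus sizes in `main` are hypothetical, and the natural logarithm is assumed because the text does not fix a base.

```java
final class TfIdfSketch {

    /** TF: occurrences of the term divided by the document length in words. */
    static double tf(int occurrences, int docLength) {
        return (double) occurrences / docLength;
    }

    /** IDF: log(total documents / documents containing the term); natural log assumed. */
    static double idf(long totalDocs, long docsWithTerm) {
        return Math.log((double) totalDocs / docsWithTerm);
    }

    /** Similarity of a query and a document: sum of TF_i * IDF_i over the query terms. */
    static double score(double[] tf, double[] idf) {
        double sum = 0.0;
        for (int i = 0; i < tf.length; i++) {
            sum += tf[i] * idf[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double tf1 = tf(35, 1000);           // 0.035, the example from the text
        double idf1 = idf(1_000_000, 1_000); // hypothetical corpus sizes
        System.out.println(score(new double[] {tf1}, new double[] {idf1}));
    }
}
```

A rare keyword (one found in few documents) gets a large IDF, so it contributes more to the sum than a common keyword with the same TF.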

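When used in advertising, this can be viewed as the degree of match, that is, the similarity, between the tags in a request (IP geolocation, user gender, and so on) and an ad's targeting tags. The algorithm most commonly used in contextual targeting and content recommendation products is WAND (Weak AND), which uses two upper bounds to filter out most of the ads that cannot win.

(1) The upper bound of the contribution of a keyword t over all documents

(2) The sum of the upper bounds of all the keywords contained in a given document

The core of the algorithm is the next function, whose pseudocode is as follows: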

Function next(θ)
		  repeat
			    /* Sort the terms in non decreasing order of DocID */
			    sort(terms, posting)
			    
			    /* Find pivot term - the first one with accumulated UB ≥ θ */
			    pivotTerm ← findPivotTerm(terms, θ)
			    
			    if (pivotTerm = null) 
			    	return (NoMoreDocs)
			    			
			    pivot ← posting[pivotTerm].DocID
			    if (pivot = lastID) 
			    	return (NoMoreDocs)
			    			
			    if (pivot ≤ curDoc)
				       /* pivot has already been considered, advance one of the preceding terms */
				       aterm ← pickTerm(terms[0..pivotTerm])
				       // next(n) returns the first DocID in aterm's posting list that satisfies DocID >= n
				       posting[aterm] ← aterm.iterator.next(curDoc+1)
			    else /* pivot > curDoc */
				      if (posting[0].DocID = pivot) // note: this is the doc id of the first term after the sort
				          /* Success, all terms preceding pivotTerm belong to the pivot */
				          curDoc ← pivot
				          return (curDoc, posting)
				      else
				          /* not enough mass yet on pivot, advance one of the preceding terms */
				          aterm ← pickTerm(terms[0..pivotTerm])
				          posting[aterm] ← aterm.iterator.next(pivot)
		  end repeat
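The pivot-selection step of the pseudocode can be looked at in isolation before the full implementation. The sketch below is a simplified, hypothetical helper (the class name `PivotSketch` and the array-based signature are assumptions, not part of the original code): it takes the per-term upper bounds, already ordered by each term's current DocID, and returns the index of the first term at which the accumulated upper bound reaches θ.

```java
final class PivotSketch {

    /**
     * upperBounds[i] is the score upper bound of the i-th term, with the terms
     * already sorted in non-decreasing order of their current DocID.
     * Returns the index of the first term whose accumulated upper bound is at
     * least theta, or -1 when no remaining document can reach the threshold.
     */
    static int findPivotTerm(double[] upperBounds, double theta) {
        double accumulated = 0.0;
        for (int i = 0; i < upperBounds.length; i++) {
            accumulated += upperBounds[i];
            if (accumulated >= theta) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Upper bounds in the order the terms take after the first sort
        // in the sample run later in this article.
        double[] bounds = {1.0, 3.0, 0.5, 2.0, 4.0};
        System.out.println(findPivotTerm(bounds, 3.5)); // prints 1
    }
}
```

Any document with an id smaller than the pivot's current DocID can only appear in the terms before the pivot, whose bounds sum to less than θ, so such documents can be skipped without scoring them.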

The following implements it in Java. First, create the basic data object:

public class TermInfo implements Comparable<TermInfo> {

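	private Integer termNum; // id of this term
	private Double termWeight = 0.0d; // weight (upper bound) of this term
	private Integer[] array; // posting list: ascending ids of the ads (docs) that contain this term
	private int pointer;  // index into array of the element currently being compared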

	public TermInfo(Integer[] array,  Integer termNum,Double termWeight) {
		this.array = array;
		this.termWeight = termWeight;
		this.termNum = termNum;
	}
	
	public Double getTermWeight(){
		return termWeight;
	}

	public TermInfo(Integer[] array, Integer termNum) {
		this(array, termNum,0.0d);
	}

	public void seek() throws Exception {
		if (!isEnd()) {
			pointer++;
		}else{
			throw new ArrayIndexOutOfBoundsException(pointer+1);
		}
		
	}

	public void seek(int num) throws Exception {
		if (!isEnd() && (pointer + num) < array.length) {
			pointer = pointer + num;
		}else{
			throw new ArrayIndexOutOfBoundsException(pointer+num);
		}
		
	}
	
	public boolean seekToGreaterTermNum(int tnum) throws Exception {
		while (!isEnd() && getComparableElement() < tnum) {
			seek();
		}
		return getComparableElement() >= tnum;
	}

	public boolean isEnd() {
		return array.length <= pointer;
	}

	public int getComparableElement() throws Exception {
		if (!isEnd()) {
			return array[pointer];
		}
		return -1;
	}

	@Override
	public int compareTo(TermInfo o) {
		try {
			// Integer.compare avoids the overflow that plain subtraction can cause
			return Integer.compare(this.getComparableElement(), o.getComparableElement());
		} catch (Exception e) {
			e.printStackTrace();
		}
		return -1;
	}

	@Override
	public String toString() {
		
		return " \n TermInfo [termWeight=" + termWeight + ", term=" + termNum + ", array=" + toStringArray(array)
				+ ", pointer=" + pointer + "]\n";
	}
	
	public String toStringArray(Integer[] a) {
		
		if (a == null){
			return "null";
		}
		
		int iMax = a.length - 1;
		if (iMax == -1){
			return "[]";
		}

		StringBuilder b = new StringBuilder();
		b.append('[');
		for (int i = 0;; i++) {
			
			if(i==pointer){
				b.append("("+a[i]+")");
			}else{
				b.append(a[i]);
			}
			
			if (i == iMax){
				return b.append(']').toString();
			}
			b.append(", ");
		}
	}	
	
}

Next, a utility class is needed to operate on TermInfo objects:

import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class TermInfoUtil {
	
	public static int pickTerm(List<TermInfo> inverseList, int index) throws Exception {
		int curAd = -1;
		if (index == 0) {
			if (!inverseList.get(index).isEnd()) {
				inverseList.get(index).seek();
				curAd = inverseList.get(index).getComparableElement();
			} else {
				return -1;
			}
		} else {
			curAd = inverseList.get(index - 1).getComparableElement();
		}
		return curAd;
	}


	public static void seekBatchToGreaterTermNum(List<TermInfo> inverseList,int termNum) throws Exception {
		
		Iterator<TermInfo> iterator = inverseList.iterator();
		while(iterator.hasNext()){
			boolean result = iterator.next().seekToGreaterTermNum(termNum);
			if(!result){
				iterator.remove();
			}
		}
		
		Collections.sort(inverseList);
	}

}

Now some basic data can be prepared to verify the algorithm:

private static final Map<Integer, Double> term_ub = new HashMap<Integer, Double>();
private static final Map<Integer, Integer[]> map = new ConcurrentHashMap<Integer, Integer[]>();

static {

	term_ub.put(1001, 0.5);
	term_ub.put(2001, 1.0);
	term_ub.put(3001, 2.0);
	term_ub.put(4001, 3.0);
	term_ub.put(5001, 4.0);

	map.put(1001, new Integer[] { 1, 3, 26 });
	map.put(2001, new Integer[] { 1, 2, 4, 10, 100 });
	map.put(3001, new Integer[] { 2, 3, 6, 34, 56 });
	map.put(4001, new Integer[] { 1, 4, 5, 23, 70, 200 });
	map.put(5001, new Integer[] { 5, 14, 78 });
}

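Here 1001, 2001, and so on are tag ids, and term_ub stores the weights of these tags. For example, when a request arrives and its region can be identified, different regions can be given different weights.

map stores the inverted index from tag id to ad ids; that is, the ads targeting tag 1001 are ads 1, 3, and 26. In practice this information can be kept in a cache cluster: when a new ad is created with its targeting tags, it is added directly to the inverted index in the cache, and the servers in the cluster then need to refresh their copy of the inverted index periodically.

The API provided by the guava package can be used to refresh a local cache of the online data periodically, as follows: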
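As a rough illustration of how such an inverted index might be maintained when ads are created, here is a minimal in-memory sketch. The class `InvertedIndexSketch` and its methods are hypothetical and stand in for whatever the cache cluster would actually store; a `TreeSet` is used only so that each posting list stays sorted by ad id, which the retrieval code relies on.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

final class InvertedIndexSketch {

    // tag id -> ascending ids of the ads that target this tag
    private final Map<Integer, TreeSet<Integer>> index = new TreeMap<>();

    /** Registers a new ad under each of its targeting tags. */
    void addAd(int adId, int... tagIds) {
        for (int tag : tagIds) {
            index.computeIfAbsent(tag, k -> new TreeSet<>()).add(adId);
        }
    }

    /** Returns the sorted posting list for a tag, or an empty array if the tag is unknown. */
    Integer[] postingList(int tagId) {
        TreeSet<Integer> ads = index.get(tagId);
        return ads == null ? new Integer[0] : ads.toArray(new Integer[0]);
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.addAd(26, 1001);
        idx.addAd(1, 1001, 2001);
        idx.addAd(3, 1001);
        System.out.println(java.util.Arrays.toString(idx.postingList(1001))); // prints [1, 3, 26]
    }
}
```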

LoadingCache<Integer, Integer[]> localCatchedKey = CacheBuilder.newBuilder()
			.maximumSize(3000)
			.expireAfterWrite(10, TimeUnit.SECONDS) 
			.build(new CacheLoader<Integer, Integer[]>() {
				public Integer[] load(Integer key) throws Exception {
					return map.get(key);
				}
			});

The main part of the algorithm implementation follows:

// Assumes an instance field such as `private int currentAdId;` that holds the most
// recently returned ad id; its declaration is not shown in the original listing.
void retrieve(Set<Integer> query, Set<Integer> docIDs) throws Exception {

		// Pruning threshold: only ads whose accumulated upper bound reaches it are fully scored.
		// It is fixed at 3.5 here rather than starting from 0 and growing with the heap.
		double minScore = 3.5d;

		// Pair[score, currentAdId]
		FixSizedPriorityQueue<Pair<Double, Integer>> heap = new FixSizedPriorityQueue<Pair<Double, Integer>>(3);

		List<TermInfo> posting = getData(query);

		while (posting.size() > 0) {

			boolean flag = do_next(posting, minScore);
			if (!flag) {
				break;
			}
			System.out.println("Matching ad id: " + currentAdId);
			double score = full_evaluate(currentAdId);
			// Offer the candidate to the bounded heap, which decides whether to keep it
			heap.add(new Pair<Double, Integer>(score, currentAdId));
			// Updating the threshold from the heap is left disabled in this version:
			// minScore = heap.getTopElement().getValue1();
		}

		// Copy the docs retained in the heap into the result set
		List<Pair<Double, Integer>> it = heap.sortedList();
		for (Pair<Double, Integer> pr : it) {
			docIDs.add(pr.getValue2());
		}
	}

	// Find the next document that should be fully evaluated
	private boolean do_next(List<TermInfo> posting, double minScore) throws Exception {

		// Sort the terms in non-decreasing order of their current doc id
		Collections.sort(posting);
		
		System.out.println(posting.toString());

		// Find the pivot term: the first one at which the accumulated UB reaches the threshold
		double accUB = 0.0d;
		int pivotTerm = 0;
		int numTerm = posting.size();

		for (pivotTerm = 0; pivotTerm < numTerm; pivotTerm++) {
			accUB += posting.get(pivotTerm).getTermWeight();
			if (accUB >= minScore) {
				break;
			}
		}

		// No remaining candidate can win
		if (accUB < minScore || pivotTerm >= numTerm) {
			return false;
		}

		// The candidate doc has already been considered: advance a posting list to look for another.
		// The batch advances below already move every list past currentAdId after a hit,
		// so this branch mostly acts as a safeguard.
		if (posting.get(pivotTerm).getComparableElement() <= currentAdId) {
			int curAd = TermInfoUtil.pickTerm(posting, pivotTerm); // pick a term in [0...pivotTerm]
			if (curAd == -1) {
				return false;
			}
			posting.get(pivotTerm).seekToGreaterTermNum(curAd);

		} else {
			Integer zeroComparableElement = posting.get(0).getComparableElement();
			if (posting.get(pivotTerm).getComparableElement() == zeroComparableElement) {
				// Candidate found: record it, then advance every posting list past it
				currentAdId = zeroComparableElement;
				TermInfoUtil.seekBatchToGreaterTermNum(posting, currentAdId+1);
				return true;
			} else {
				// Not enough mass on the pivot yet: advance the posting lists up to the pivot doc
				Integer chooseSkipElement = posting.get(pivotTerm).getComparableElement();
				TermInfoUtil.seekBatchToGreaterTermNum(posting, chooseSkipElement);
				return do_next(posting, minScore);
			}
		}
		return false;
	}// end do_next
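The listing above calls full_evaluate, whose body the article does not show. One plausible reading, offered purely as an assumption, is that an ad's exact score is the sum of the weights of the query tags whose posting lists contain that ad. The sketch below follows that reading; the class `FullEvaluateSketch`, the method name `fullEvaluate`, and its parameters are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class FullEvaluateSketch {

    /**
     * Assumed scoring rule: add up the weight of every query tag whose
     * (ascending) posting list contains the ad id.
     */
    static double fullEvaluate(int adId, Set<Integer> query,
                               Map<Integer, Integer[]> postings,
                               Map<Integer, Double> weights) {
        double score = 0.0;
        for (Integer term : query) {
            Integer[] list = postings.get(term);
            // binarySearch requires the posting list to be sorted in ascending order
            if (list != null && Arrays.binarySearch(list, Integer.valueOf(adId)) >= 0) {
                score += weights.getOrDefault(term, 0.0);
            }
        }
        return score;
    }

    public static void main(String[] args) {
        // A subset of the sample data from this article
        Map<Integer, Double> weights = new HashMap<>();
        weights.put(1001, 0.5);
        weights.put(2001, 1.0);
        weights.put(4001, 3.0);
        Map<Integer, Integer[]> postings = new HashMap<>();
        postings.put(1001, new Integer[] {1, 3, 26});
        postings.put(2001, new Integer[] {1, 2, 4, 10, 100});
        postings.put(4001, new Integer[] {1, 4, 5, 23, 70, 200});
        Set<Integer> query = new HashSet<>(weights.keySet());
        System.out.println(fullEvaluate(1, query, postings, weights)); // prints 4.5
    }
}
```

Under this rule the exact score of an ad can never exceed the accumulated upper bound of the tags it matches, which is the property that makes the pruning in do_next safe.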

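The relevance retrieval algorithm can quickly rule out a large number of ads that do not qualify, mainly by means of two upper bounds:

(1) The upper bound of the contribution of a keyword t over all documents

(2) The sum of the upper bounds of all the keywords contained in a given document

The candidate ads still need their exact weights computed so that the few highest-weighted ads can be selected from all the candidates. This is done with a fixed-capacity priority queue built on a max-heap. Note that, as written, the class retains the N smallest elements under the element type's natural ordering, so Pair has to order higher scores first for the queue to retain the highest-weighted ads: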

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/*
 *  Fixed-capacity priority queue backed by a max-heap; it retains the N smallest elements (top-N smallest).
 */
public class FixSizedPriorityQueue<E extends Comparable<E>> {
	private PriorityQueue<E> queue;
	private int maxSize; // maximum capacity of the heap

	public FixSizedPriorityQueue(int maxSize) {
		if (maxSize <= 0)
			throw new IllegalArgumentException();
		this.maxSize = maxSize;
		this.queue = new PriorityQueue<E>(maxSize, new Comparator<E>() {
			@Override
			public int compare(E o1, E o2) {
				return o2.compareTo(o1);
			}
		});
	}

	public void add(E e) {
		if (queue.size() < maxSize) { // capacity not reached yet: add directly
			queue.add(e);
		} else { // the queue is full
			E peek = queue.peek();
			if (e.compareTo(peek) < 0) { // compare the new element with the heap top and keep the smaller one
				queue.poll();
				queue.add(e);
			}
		}
	}

	public List<E> sortedList() {
		List<E> list = new ArrayList<>(queue);
		Collections.sort(list); // PriorityQueue iteration order is unspecified, so the elements are sorted at the end
		return list;
	}

	public E getTopElement() {
		List<E> list = this.sortedList();
		if (list.isEmpty()) {
			return null;
		}
		return list.get(0);
	}

}
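For comparison, a common alternative way to keep the K highest-scoring ads is a min-heap of size K, whose root is the weakest retained score and can double as the pruning threshold. The sketch below is not the article's class; `TopKSketch`, `Scored`, and the method names are hypothetical, and the scores in `main` are made-up values.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class TopKSketch {

    static final class Scored {
        final double score;
        final int adId;

        Scored(double score, int adId) {
            this.score = score;
            this.adId = adId;
        }
    }

    private final int k;
    // Min-heap on score: the root is the weakest of the retained entries
    private final PriorityQueue<Scored> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Scored s) -> s.score));

    TopKSketch(int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.k = k;
    }

    /** Keeps the entry only if the heap has room or the score beats the weakest retained one. */
    void offer(double score, int adId) {
        if (heap.size() < k) {
            heap.add(new Scored(score, adId));
        } else if (score > heap.peek().score) {
            heap.poll();
            heap.add(new Scored(score, adId));
        }
    }

    /** Pruning threshold: the k-th best score once k entries are held, otherwise 0. */
    double threshold() {
        return heap.size() < k ? 0.0 : heap.peek().score;
    }

    /** Retained ad ids, highest score first. */
    List<Integer> adIdsBestFirst() {
        List<Scored> sorted = new ArrayList<>(heap);
        sorted.sort(Comparator.comparingDouble((Scored s) -> s.score).reversed());
        List<Integer> ids = new ArrayList<>();
        for (Scored s : sorted) {
            ids.add(s.adId);
        }
        return ids;
    }

    public static void main(String[] args) {
        TopKSketch top = new TopKSketch(2);
        top.offer(1.5, 10); // hypothetical scores and ad ids
        top.offer(3.0, 20);
        top.offer(2.0, 30);
        System.out.println(top.adIdsBestFirst() + " threshold=" + top.threshold()); // prints [20, 30] threshold=2.0
    }
}
```

With this layout, a retrieval loop could feed `threshold()` back into the pivot search, so the bar rises as better ads are found and more posting-list entries get skipped.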

Now an example can be written to test it:

public void test() throws Exception {
	
			// query tags
			Set<Integer> query = new HashSet<Integer>();
			query.add(1001);
			query.add(2001);
			query.add(3001);
			query.add(4001);
			query.add(5001);
			
			// query result
			Set<Integer> result = new HashSet<Integer>();
			
			retrieve(query,result);
	
			System.out.println(result.toString());
}

The output of the run is as follows:

[ 
 TermInfo [termWeight=1.0, term=2001, array=[(1), 2, 4, 10, 100], pointer=0]
,  
 TermInfo [termWeight=3.0, term=4001, array=[(1), 4, 5, 23, 70, 200], pointer=0]
,  
 TermInfo [termWeight=0.5, term=1001, array=[(1), 3, 26], pointer=0]
,  
 TermInfo [termWeight=2.0, term=3001, array=[(2), 3, 6, 34, 56], pointer=0]
,  
 TermInfo [termWeight=4.0, term=5001, array=[(5), 14, 78], pointer=0]
]
Matching ad id: 1
[ 
 TermInfo [termWeight=1.0, term=2001, array=[1, (2), 4, 10, 100], pointer=1]
,  
 TermInfo [termWeight=2.0, term=3001, array=[(2), 3, 6, 34, 56], pointer=0]
,  
 TermInfo [termWeight=0.5, term=1001, array=[1, (3), 26], pointer=1]
,  
 TermInfo [termWeight=3.0, term=4001, array=[1, (4), 5, 23, 70, 200], pointer=1]
,  
 TermInfo [termWeight=4.0, term=5001, array=[(5), 14, 78], pointer=0]
]
[ 
 TermInfo [termWeight=2.0, term=3001, array=[2, (3), 6, 34, 56], pointer=1]
,  
 TermInfo [termWeight=0.5, term=1001, array=[1, (3), 26], pointer=1]
,  
 TermInfo [termWeight=1.0, term=2001, array=[1, 2, (4), 10, 100], pointer=2]
,  
 TermInfo [termWeight=3.0, term=4001, array=[1, (4), 5, 23, 70, 200], pointer=1]
,  
 TermInfo [termWeight=4.0, term=5001, array=[(5), 14, 78], pointer=0]
]
[ 
 TermInfo [termWeight=1.0, term=2001, array=[1, 2, (4), 10, 100], pointer=2]
,  
 TermInfo [termWeight=3.0, term=4001, array=[1, (4), 5, 23, 70, 200], pointer=1]
,  
 TermInfo [termWeight=4.0, term=5001, array=[(5), 14, 78], pointer=0]
,  
 TermInfo [termWeight=2.0, term=3001, array=[2, 3, (6), 34, 56], pointer=2]
,  
 TermInfo [termWeight=0.5, term=1001, array=[1, 3, (26)], pointer=2]
]
Matching ad id: 4
[ 
 TermInfo [termWeight=3.0, term=4001, array=[1, 4, (5), 23, 70, 200], pointer=2]
,  
 TermInfo [termWeight=4.0, term=5001, array=[(5), 14, 78], pointer=0]
,  
 TermInfo [termWeight=2.0, term=3001, array=[2, 3, (6), 34, 56], pointer=2]
,  
 TermInfo [termWeight=1.0, term=2001, array=[1, 2, 4, (10), 100], pointer=3]
,  
 TermInfo [termWeight=0.5, term=1001, array=[1, 3, (26)], pointer=2]
]
Matching ad id: 5
[ 
 TermInfo [termWeight=2.0, term=3001, array=[2, 3, (6), 34, 56], pointer=2]
,  
 TermInfo [termWeight=1.0, term=2001, array=[1, 2, 4, (10), 100], pointer=3]
,  
 TermInfo [termWeight=4.0, term=5001, array=[5, (14), 78], pointer=1]
,  
 TermInfo [termWeight=3.0, term=4001, array=[1, 4, 5, (23), 70, 200], pointer=3]
,  
 TermInfo [termWeight=0.5, term=1001, array=[1, 3, (26)], pointer=2]
]
[ 
 TermInfo [termWeight=4.0, term=5001, array=[5, (14), 78], pointer=1]
,  
 TermInfo [termWeight=3.0, term=4001, array=[1, 4, 5, (23), 70, 200], pointer=3]
,  
 TermInfo [termWeight=0.5, term=1001, array=[1, 3, (26)], pointer=2]
,  
 TermInfo [termWeight=2.0, term=3001, array=[2, 3, 6, (34), 56], pointer=3]
,  
 TermInfo [termWeight=1.0, term=2001, array=[1, 2, 4, 10, (100)], pointer=4]
]
Matching ad id: 14
[ 
 TermInfo [termWeight=3.0, term=4001, array=[1, 4, 5, (23), 70, 200], pointer=3]
,  
 TermInfo [termWeight=0.5, term=1001, array=[1, 3, (26)], pointer=2]
,  
 TermInfo [termWeight=2.0, term=3001, array=[2, 3, 6, (34), 56], pointer=3]
,  
 TermInfo [termWeight=4.0, term=5001, array=[5, 14, (78)], pointer=2]
,  
 TermInfo [termWeight=1.0, term=2001, array=[1, 2, 4, 10, (100)], pointer=4]
]
[ 
 TermInfo [termWeight=0.5, term=1001, array=[1, 3, (26)], pointer=2]
,  
 TermInfo [termWeight=2.0, term=3001, array=[2, 3, 6, (34), 56], pointer=3]
,  
 TermInfo [termWeight=3.0, term=4001, array=[1, 4, 5, 23, (70), 200], pointer=4]
,  
 TermInfo [termWeight=4.0, term=5001, array=[5, 14, (78)], pointer=2]
,  
 TermInfo [termWeight=1.0, term=2001, array=[1, 2, 4, 10, (100)], pointer=4]
]
[ 
 TermInfo [termWeight=3.0, term=4001, array=[1, 4, 5, 23, (70), 200], pointer=4]
,  
 TermInfo [termWeight=4.0, term=5001, array=[5, 14, (78)], pointer=2]
,  
 TermInfo [termWeight=1.0, term=2001, array=[1, 2, 4, 10, (100)], pointer=4]
]
[ 
 TermInfo [termWeight=4.0, term=5001, array=[5, 14, (78)], pointer=2]
,  
 TermInfo [termWeight=1.0, term=2001, array=[1, 2, 4, 10, (100)], pointer=4]
,  
 TermInfo [termWeight=3.0, term=4001, array=[1, 4, 5, 23, 70, (200)], pointer=5]
]
Matching ad id: 78
[ 
 TermInfo [termWeight=1.0, term=2001, array=[1, 2, 4, 10, (100)], pointer=4]
,  
 TermInfo [termWeight=3.0, term=4001, array=[1, 4, 5, 23, 70, (200)], pointer=5]
]
[ 
 TermInfo [termWeight=3.0, term=4001, array=[1, 4, 5, 23, 70, (200)], pointer=5]
]
[1, 4, 5]

References:

（1）http://x-algo.cn/index.php/2016/07/13/812/

（2）http://www.cnblogs.com/daremen/archive/2013/08/29/3289694.html
