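Google, the site that takes its name from googol (10 to the 100th power), uses a wide range of algorithms to evaluate websites, and PageRank is one of them.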
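Google's PageRank measures a site's value by the number and quality of its external and internal links. The idea behind PageRank is that every link to a page is a vote for that page: the more a page is linked to, the more votes it has received from other sites. This is the so-called "link popularity", a measure of how many people are willing to tie their sites to yours. The concept is borrowed from citation counts in academia: the more often a paper is cited by others, the more authoritative it is generally judged to be.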
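To make the "links as votes" idea concrete, here is a minimal sketch of the classic published PageRank iteration with a damping factor of 0.85. The class name `SimplePageRank`, the four-page link graph, and the way pages without outgoing links are handled are all assumptions chosen for illustration. This is not Google's production ranking, and it is unrelated to the toolbar query code later in this post.

```java
import java.util.Arrays;

/** Hypothetical illustration class; not part of the GooglePageRank source in this post. */
final class SimplePageRank {

    /**
     * links[i] lists the pages that page i links to.
     * Returns one score per page; the scores always sum to 1.
     */
    static double[] compute(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start with equal scores

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            // Every page receives a small baseline share regardless of links.
            Arrays.fill(next, (1.0 - damping) / n);

            for (int page = 0; page < n; page++) {
                if (links[page].length == 0) {
                    // Assumption: a page with no outgoing links spreads its vote evenly.
                    double share = damping * rank[page] / n;
                    for (int j = 0; j < n; j++) {
                        next[j] += share;
                    }
                } else {
                    // A page splits its vote equally among the pages it links to.
                    double share = damping * rank[page] / links[page].length;
                    for (int target : links[page]) {
                        next[target] += share;
                    }
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Pages 0, 1 and 2 all link to page 3; page 3 links back to page 0.
        int[][] links = { { 3 }, { 3 }, { 3 }, { 0 } };
        System.out.println(Arrays.toString(compute(links, 0.85, 50)));
    }
}
```

In this small graph, page 3 collects the most votes and ends up with the highest score. Page 0 comes second because its single incoming link is from the highly ranked page 3, which shows that the quality of a link matters as well as the count.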
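Generally speaking, the more original content a site has, the more easily its PageRank rises; without it, raising the score is comparatively hard. The maximum PageRank value is 10. Very few sites reach 10 in Google's evaluation: even counting Google itself, fewer than 40 sites worldwide have achieved the "feat" of PageRank 10. As a rule of thumb, a personal site rated 4 is doing well, and a commercial site rated 6 or above can be considered on track.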
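Plenty of ready-made lookup tools and source code are available online, but relying only on other people's code is not really the programmer's way, so today I reinvented the wheel and wrote my own PageRank lookup in Java. While I was at it, I also added site-link and backlink queries for several common search engines.
The source code follows: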
GooglePageRank.java
- package org.loon.test;
- import java.io.IOException;
- import java.util.Random;
- import java.util.regex.Matcher;
- import java.util.regex.Pattern;
- /**
- * Copyright 2008
- *
- * Licensed under the Apache License, Version 2.0 (the "License"); you may not
- * use this file except in compliance with the License. You may obtain a copy of
- * the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
- * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
- * License for the specific language governing permissions and limitations under
- * the License.
- *
- * @project loonframework
- * @author chenpeng
- * @email:ceponline@yahoo.com.cn
- * @version 0.1
- */
- public class GooglePageRank {
- // Google PageRank server IP addresses (Google has become stingy lately: querying the same server repeatedly gets your IP banned)
- final static String[] GoogleServiceIP = new String[] { "64.233.161.100",
- "64.233.161.101", "64.233.183.91", "64.233.189.44", "66.102.1.103",
- "66.102.9.115", "66.249.89.83", "66.249.91.99", "66.249.93.190" };
- // Magic constant used in Google's checksum
- final static private int GOOGLE_MAGIC = 0xE6359A60;
- // Holder for the three values that are mixed into the CH checksum
- private class CHMix {
- int a;
- int b;
- int c;
- public CHMix() {
- this(0, 0, 0);
- }
- public CHMix(int a, int b, int c) {
- this.a = a;
- this.b = b;
- this.c = c;
- }
- }
- /**
- * Mixes the three values the way Google's CH checksum requires.
- *
- * @param mix the values to mix in place
- */
- private static void mix(final CHMix mix) {
- // The reference mixing step works on unsigned 32-bit values, so every
- // right shift must be the unsigned shift (>>>) in Java; the signed
- // shift (>>) would smear the sign bit whenever a value is negative.
- mix.a -= mix.b;
- mix.a -= mix.c;
- mix.a ^= mix.c >>> 13;
- mix.b -= mix.c;
- mix.b -= mix.a;
- mix.b ^= mix.a << 8;
- mix.c -= mix.a;
- mix.c -= mix.b;
- mix.c ^= mix.b >>> 13;
- mix.a -= mix.b;
- mix.a -= mix.c;
- mix.a ^= mix.c >>> 12;
- mix.b -= mix.c;
- mix.b -= mix.a;
- mix.b ^= mix.a << 16;
- mix.c -= mix.a;
- mix.c -= mix.b;
- mix.c ^= mix.b >>> 5;
- mix.a -= mix.b;
- mix.a -= mix.c;
- mix.a ^= mix.c >>> 3;
- mix.b -= mix.c;
- mix.b -= mix.a;
- mix.b ^= mix.a << 10;
- mix.c -= mix.a;
- mix.c -= mix.b;
- mix.c ^= mix.b >>> 15;
- }
- /**
- * Returns a new CH value mixer.
- *
- * @return a CHMix with all fields set to zero
- */
- public static CHMix getInnerCHMix() {
- return new GooglePageRank().new CHMix();
- }
- /**
- * Computes the GoogleCH for a URL (the identifier Google's database uses for the page).
- *
- * @param url the URL to hash
- * @return the GoogleCH value as a string
- */
- public static String GoogleCH(final String url) {
- // Format the URL in the "info:url" form that Google expects
- String nUrl = String.format("info:%s", new Object[] { url });
- // Convert the formatted URL to a char array
- char[] urls = nUrl.toCharArray();
- // Length of the formatted URL
- int length = urls.length;
- // Obtain a CH value mixer
- CHMix chMix = GooglePageRank.getInnerCHMix();
- // Seed c with Google's magic constant
- chMix.c = GOOGLE_MAGIC;
- // Seed a and b with the initial value Google requires
- chMix.a = chMix.b = 0x9E3779B9;
- int k = 0;
- int len = length;
- while (len >= 12) {
- chMix.a += (int) (urls[k + 0] + (urls[k + 1] << 8)
- + (urls[k + 2] << 16) + (urls[k + 3] << 24));
- chMix.b += (int) (urls[k + 4] + (urls[k + 5] << 8)
- + (urls[k + 6] << 16) + (urls[k + 7] << 24));
- chMix.c += (int) (urls[k + 8] + (urls[k + 9] << 8)
- + (urls[k + 10] << 16) + (urls[k + 11] << 24));
- // Mix the values after consuming this 12-character block
- GooglePageRank.mix(chMix);
- k += 12;
- len -= 12;
- }
- chMix.c += length;
- // Fold in the remaining characters (at most 11)
- switch (len) {
- case 11:
- chMix.c += (int) (urls[k + 10] << 24);
- case 10:
- chMix.c += (int) (urls[k + 9] <<