Overview: the project consists of four modules: a crawl module, a parse module, an index module, and a search module.
Features: crawl job postings from two sites, Zhaopin (智联招聘) and 51job (前程无忧), parse and store them, build a local index, and provide a web UI with a range of search features.
Technologies: heritrix 3.0, hbase-writer 0.9, hbase 0.9, hadoop 0.20.2, HTMLParser 2.0, lucene 3.3, bobo-browse 2.5, struts 2.2, freemarker 2.3, jquery
Development time: 3 weeks.
Development environment: Linux (Fedora 15), Eclipse
Up-front study of Hadoop and Lucene: 2 months.
References: "Hadoop: The Definitive Guide, 2nd Edition", "Lucene in Action, Second Edition", "开发自己的搜索引擎:Lucene+Heritrix(第2版)" (Developing Your Own Search Engine: Lucene + Heritrix, 2nd edition)
I. Crawl module:
1. Technologies: heritrix 3.0 + hbase-writer 0.9 + hbase 0.9
2. Overview: use Heritrix to crawl the job-related pages of 51job and Zhaopin and store them in the HBase table rawjobs.
rawjobs table:
key: Keying.createKey(url)
column families: content, curi
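If the table does not already exist it can be created up front. A minimal sketch using the HBase 0.90 client API, with the table and family names taken from the rawjobs schema above:
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
public class CreateRawJobsTable {
    public static void main(String[] args) throws Exception {
        // assumes hbase-site.xml (or the zkQuorum settings below) is on the classpath
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        if (!admin.tableExists("rawjobs")) {
            HTableDescriptor desc = new HTableDescriptor("rawjobs");
            desc.addFamily(new HColumnDescriptor("content")); // raw page bytes + charset
            desc.addFamily(new HColumnDescriptor("curi"));    // crawl-URI metadata
            admin.createTable(desc);
        }
    }
}
3. Details:
(1) Create a custom DecideRule class: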
package com.qjqiao.modules.deciderules;
import org.archive.modules.CrawlURI;
import org.archive.modules.deciderules.DecideResult;
import org.archive.modules.deciderules.DecideRule;
public class JobsRule extends DecideRule {
private static final long serialVersionUID = 1L;
@Override
protected DecideResult innerDecide(CrawlURI uri) {
String u = uri.getURI();
if (u.startsWith("dns")
|| u.startsWith("DNS")
|| u.endsWith("robots.txt")
// www.zhaopin.com
|| u.contains("zhaopin.com/jobseeker")
|| u.contains("company.zhaopin.com")
|| u.contains("jobs.zhaopin.com")
|| u.contains("search.zhaopin.com/jobs")
|| u.contains("search.zhaopin.com/jobseeker")
// www.51job.com
|| u.contains("search.51job.com")) {
if (!u.contains("research.51job.com")) {
return DecideResult.ACCEPT;
}
}
return DecideResult.REJECT;
}
}
(2) Create a queue assignment policy class:
package org.archive.crawler.frontier;
import org.apache.commons.httpclient.URIException;
import org.archive.modules.CrawlURI;
import org.archive.net.UURI;
public class ELFHashQueueAssignmentPolicy extends
URIAuthorityBasedQueueAssignmentPolicy {
private static final long serialVersionUID = 1L;
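// Classic ELF string hash, reduced modulo 'number' so that URIs are spread
// evenly across a fixed set of frontier queues (50, per getCoreKey below).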
public int ELFHash(String str, int number) {
int hash = 0;
long x = 0l;
char[] array = str.toCharArray();
for (int i = 0; i < array.length; i++) {
hash = (hash << 4) + array[i];
if ((x = (hash & 0xF0000000L)) != 0) {
hash ^= (x >> 24);
hash &= ~x;
}
}
int result = (hash & 0x7FFFFFFF) % number;
return result;
}
@Override
protected String getCoreKey(UURI basis) {
try {
System.out.println(this.ELFHash(basis.getURI().toString(), 50)
+ " |||| ELFHashQueueAssignmentPolicy : " + basis.getURI());
return this.ELFHash(basis.getURI().toString(), 50) + "";
} catch (URIException e) {
e.printStackTrace();
return "0";
}
}
}
(3) Modify the configuration file crawler-beans.xml:
<bean id="simpleOverrides" class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
<property name="properties">
<value>
metadata.operatorContactUrl=http://www.51job.com
metadata.jobName=51job
metadata.description=jobs from 51job.com
</value>
</property>
</bean>
<bean id="metadata" class="org.archive.modules.CrawlMetadata" autowire="byName">
<property name="operatorContactUrl" value="[see override above]"/>
<property name="jobName" value="[see override above]"/>
<property name="description" value="[see override above]"/>
<property name="userAgentTemplate"
value="Mozilla/5.0 (compatible; heritrix/3.1.0 +@OPERATOR_CONTACT_URL@)"/>
</bean>
<bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">
<property name="textSource">
<bean class="org.archive.spring.ConfigFile">
<property name="path" value="seeds.txt" />
</bean>
</property>
<property name='sourceTagSeeds' value='false'/>
</bean>
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
<property name="rules">
<list>
<bean class="org.archive.modules.deciderules.RejectDecideRule">
</bean>
<bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
</bean>
<bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
</bean>
<bean class="org.archive.modules.deciderules.TransclusionDecideRule">
</bean>
<bean class="com.qjqiao.modules.deciderules.JobsRule">
</bean>
<bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
<property name="decision" value="REJECT"/>
<property name="seedsAsSurtPrefixes" value="false"/>
<property name="surtsDumpFile" value="negative-surts.dump" />
</bean>
<bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
</bean>
<bean class="org.archive.modules.deciderules.PathologicalPathDecideRule">
</bean>
<bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
</bean>
<bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule">
</bean>
</list>
</property>
</bean>
<bean id="hbaseParameterSettings" class="org.archive.io.hbase.HBaseParameters">
<property name="contentColumnFamily" value="content"></property>
<property name="contentColumnName" value="raw-data"></property>
<property name="charsetColumnName" value="charset"></property>
<property name="curiColumnFamily" value="curi"></property>
<property name="ipColumnName" value="ip"></property>
<property name="pathFromSeedColumnName" value="path-from-seed"></property>
<property name="isSeedColumnName" value="is-seed"></property>
<property name="viaColumnName" value="via"></property>
<property name="urlColumnName" value="url"></property>
<property name="requestColumnName" value="request"></property>
<!-- Overwrite more options here -->
</bean>
<bean id="hbaseWriterProcessor" class="org.archive.modules.writer.HBaseWriterProcessor">
<property name="zkQuorum" value="localhost">
</property>
<property name="zkClientPort" value="2181">
</property>
<property name="hbaseTable" value="rawjobs">
</property>
<property name="onlyProcessNewRecords" value="false">
</property>
<property name="onlyWriteNewRecords" value="false">
</property>
<property name="hbaseParameters">
<ref bean="hbaseParameterSettings" />
</property>
</bean>
<bean id="dispositionProcessors" class="org.archive.modules.DispositionChain">
<property name="processors">
<list>
<ref bean="hbaseWriterProcessor" />
<ref bean="candidates"/>
<ref bean="disposition"/>
</list>
</property>
</bean>
<bean id="queueAssignmentPolicy"
class="org.archive.crawler.frontier.ELFHashQueueAssignmentPolicy">
</bean>
(4) Seed file seeds.txt:
http://search.51job.com/jobsearch/advance_search.php?lang=c&stype=2
http://www.zhaopin.com/jobseeker/index_industry.html
(5) Sample run:
II. Parse module:
1. Technologies: HTMLParser 2.0 + Hadoop MapReduce + HBase
2. Overview:
The whole parse module runs as one MapReduce job, so pages are processed in parallel.
It reads the crawled pages from the HBase rawjobs table, parses each page with HTMLParser, extracts the job posting, normalizes it (different sites use different value ranges for the same fields), and writes the result to the HBase parsedjobs table.
parsedjobs table:
key: Keying.createKey(url)
column family: job
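(parsedjobs needs only this single column family; it can be created the same way as the rawjobs sketch above, or with create 'parsedjobs', 'job' in the HBase shell.)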
3. Problem encountered: garbled Chinese text.
Cause: different sites, and even different pages on the same site, use different encodings, so the byte stream the crawler fetches varies in encoding from page to page, and hbase-writer writes that byte sequence into HBase as-is. When the parse module reads the page content back, Bytes.toString(byte[]) assumes UTF-8, so any non-UTF-8 page comes out garbled.
Fix: in hbase-writer's HBaseWriter, detect the encoding of the fetched page and write it into the rawjobs table alongside the content:
String contentType = curi.getContentType();
String charset = "utf-8";
if (-1 != contentType.indexOf("charset=")) {
charset = contentType
.substring(contentType.indexOf("charset=") + 8);
} else {
if (curi.getURI().contains("51job.com")) {
charset = "gb2312";
} else {
charset = "utf-8";
}
}
batchPut.add(Bytes.toBytes(getHbaseOptions().getContentColumnFamily()),
Bytes.toBytes(getHbaseOptions().getCharsetColumnName()),
Bytes.toBytes(charset));
When reading, the parse module uses a custom makeString(byte[], String) method that converts the bytes to a string with the matching charset:
String charset=Bytes.toString(result.getValue(Bytes.toBytes("content"), Bytes.toBytes("charset") ));
String rawData = JobsParser.makeString(result.getValue(Bytes.toBytes("content"),
Bytes.toBytes("raw-data")), charset);
public static String makeString(byte[] b, String charset) {
if (b == null) {
return null;
}
if (b.length == 0) {
return "";
}
try {
return new String(b, 0, b.length, charset);
} catch (UnsupportedEncodingException e) {
System.out.println("charset not supported?");
e.printStackTrace();
return null;
}
}
4. Details:
(1) MapReduce job:
package com.jobsearcher.parser.mapreduce;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Base64;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import com.jobsearcher.parser.util.PageParser;
public class JobsParser {
/** Name of this 'program'. */
static final String NAME = "jobsparser";
static final String FROM_TABLENAME = "rawjobs";
static final String TO_TABLENAME = "parsedjobs";
/**
* Mapper.
*/
static class JobsParserMapper extends
TableMapper<ImmutableBytesWritable, Result> {
static int i = 0;
protected void map(ImmutableBytesWritable key, Result value,
Context context) throws IOException, InterruptedException {
context.write(key, value);
}
}
/*
* Reducer.
*/
static class JobsParserReducer
extends
TableReducer<ImmutableBytesWritable, Result, ImmutableBytesWritable> {
@Override
protected void reduce(ImmutableBytesWritable row,
Iterable<Result> results, Context context) throws IOException,
InterruptedException {
for (Result result : results) {
Put put = new Put(result.getRow());
String url = Bytes.toString(result.getValue(
Bytes.toBytes("curi"), Bytes.toBytes("url")));
String charset = Bytes.toString(result.getValue(
Bytes.toBytes("content"), Bytes.toBytes("charset")));
String rawData = JobsParser.makeString(
result.getValue(Bytes.toBytes("content"),
Bytes.toBytes("raw-data")), charset);
if (null != url && null != rawData) {
Map<String, String> job = PageParser.parse(url, rawData);
if (null != job) {
byte[] jobFamily = Bytes.toBytes("job");
for (Map.Entry<String, String> kv : job.entrySet()) {
put.add(jobFamily, Bytes.toBytes(kv.getKey()),
Bytes.toBytes(kv.getValue()));
}
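// Estimate the posting time as the average of the crawl time (the HBase
// cell timestamp of curi:url) and the publish date parsed from the page.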
long timestamp = (result
.getColumn(Bytes.toBytes("curi"),
Bytes.toBytes("url")).get(0)
.getTimestamp() + Long.parseLong(job
.get("job_time"))) / 2;
put.add(jobFamily, Bytes.toBytes("timestamp"),
Bytes.toBytes(timestamp));
context.write(row, put);
}
}
}
}
}
public static Job createSubmittableJob(Configuration conf)
throws IOException {
Job job = new Job(conf, NAME + "_" + FROM_TABLENAME + "_"
+ TO_TABLENAME);
job.setJarByClass(JobsParser.class);
Scan scan = new Scan();
TableMapReduceUtil.initTableMapperJob(FROM_TABLENAME, scan,
JobsParserMapper.class, ImmutableBytesWritable.class,
Result.class, job);
TableMapReduceUtil.addDependencyJars(job);
TableMapReduceUtil.initTableReducerJob(TO_TABLENAME,
JobsParserReducer.class, job);
return job;
}
public static void main(String[] args) throws Exception {
Configuration conf = HBaseConfiguration.create();
Job job = createSubmittableJob(conf);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
public static String makeString(byte[] b, String charset) {
if (b == null) {
return null;
}
if (b.length == 0) {
return "";
}
try {
return new String(b, 0, b.length, charset);
} catch (UnsupportedEncodingException e) {
System.out.println("charset not supported?");
e.printStackTrace();
return null;
}
}
}
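With the classes packed into a jar, the parse job is submitted like any other MapReduce job. A sketch, assuming a hypothetical jobsearcher.jar and that the HBase jars are placed on the Hadoop classpath first:
export HADOOP_CLASSPATH=`hbase classpath`
hadoop jar jobsearcher.jar com.jobsearcher.parser.mapreduce.JobsParser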
(2) Parser classes:
package com.jobsearcher.parser.util;
import java.util.Map;
public class PageParser {
public static Map<String, String> parse(String url, String page) {
if (url.contains("51job.com")) {
return PageParser51job.parse(url,page);
} else if (url.contains("zhaopin.com")) {
return PageParserZhaopin.parse(url,page);
} else {
System.out.println("unexpected url : " + url);
return null;
}
}
}
51job page parser class PageParser51job:
package com.jobsearcher.parser.util;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.HashMap;
import java.util.Map;
import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.CssSelectorNodeFilter;
import org.htmlparser.filters.HasChildFilter;
import org.htmlparser.filters.StringFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.visitors.TextExtractingVisitor;
public class PageParser51job {
public static Map<String, String> parse(String url, String page) {
Map<String, String> job = new HashMap<String, String>();
if (!url.matches("http://search.51job.com/job/.*html?")) {
System.out.println("abandoned : " + url);
return null;
}
Parser parser = Parser.createParser(page, "UTF-8");
NodeFilter jobTitleFilter = new CssSelectorNodeFilter(
"div.s_txt_jobs table.jobs_1 td.sr_bt");
NodeFilter companyNameFilter = new CssSelectorNodeFilter(
"div.s_txt_jobs table.jobs_1 table td");
NodeFilter companyPropsFilter = new AndFilter(
new CssSelectorNodeFilter("div.s_txt_jobs table.jobs_1 td"),
new HasChildFilter(new StringFilter("公司行业")));
NodeFilter jobPropsNameFilter = new CssSelectorNodeFilter(
"div.s_txt_jobs div.jobs_com div.grayline div table.jobs_1 td.txt_1");
NodeFilter jobPropsValueFilter = new CssSelectorNodeFilter(
"div.s_txt_jobs div.jobs_com div.grayline div table.jobs_1 td.txt_2");
NodeFilter jobDescriptionFilter = new CssSelectorNodeFilter(
"div.s_txt_jobs div.jobs_com div.grayline div table.jobs_1 td div");
NodeFilter jobCategoryFilter = new AndFilter(new CssSelectorNodeFilter(
"div.s_txt_jobs div.jobs_com div.grayline div table.jobs_1 td"),new HasChildFilter(new StringFilter("职位职能")));
NodeFilter companyDescriptionFilter = new AndFilter(
new CssSelectorNodeFilter(
"div.s_txt_jobs div.jobs_com div.grayline div.jobs_txt"),
new HasChildFilter(new TagNameFilter("p")));
try {
NodeList nodeList = parser.parse(jobTitleFilter);
Node job_title_node = nodeList.elementAt(0);
job.put("job_title", job_title_node.toPlainTextString().trim());
parser.reset();
nodeList = parser.parse(companyNameFilter);
Node company_name_node = nodeList.elementAt(0);
String rawName = company_name_node.toPlainTextString();
String cname = rawName.substring(
0,
(-1 == rawName.indexOf("&")) ? rawName.length() : rawName
.indexOf("&"));
job.put("company_name", cname.trim());
parser.reset();
nodeList = parser.parse(companyPropsFilter);
Node company_props_node = nodeList.elementAt(0);
Parser company_props_parser = new Parser(
company_props_node.toHtml());
TextExtractingVisitor visitor = new TextExtractingVisitor();
company_props_parser.visitAllNodesWith(visitor);
String company_props = visitor.getExtractedText();
int s_industry = company_props.indexOf("公司行业:");
int s_type = company_props.indexOf("公司性质:");
int s_scale = company_props.indexOf("公司规模:");
int len = company_props.length();
if (-1 != s_industry) {
job.put("company_industry",
company_props.substring(s_industry + 5,
-1 != s_type ? s_type : (-1 != s_scale? s_scale: len)).trim());
}
if (-1 != s_type) {
job.put("company_type",
adaptCType(company_props.substring(
s_type + 5,
-1 != s_scale ? s_scale : len).trim()));
}
if (-1 != s_scale) {
job.put("company_scale",
company_props.substring(
s_scale + 5).trim());
}
parser.reset();
NodeList jobPropsNameNodeList = parser.parse(jobPropsNameFilter);
parser.reset();
NodeList jobPropsValueNodeList = parser.parse(jobPropsValueFilter);
Node jobPropsNameNode;
Node jobPropsValueNode;
for (int i = 0; i < jobPropsNameNodeList.size(); i++) {
jobPropsNameNode = jobPropsNameNodeList.elementAt(i);
jobPropsValueNode = jobPropsValueNodeList.elementAt(i);
String name = jobPropsNameNode.toPlainTextString().trim();
if (name.contains("发布日期")) {
job.put("job_time", adaptDate(jobPropsValueNode.toPlainTextString()
.trim()) + "");
}
if (name.contains("工作地点")) {
job.put("job_address", jobPropsValueNode
.toPlainTextString().trim());
}
if (name.contains("招聘人数")) {
job.put("job_count", jobPropsValueNode.toPlainTextString()
.trim());
}
if (name.contains("工作年限")) {
job.put("job_experience", jobPropsValueNode
.toPlainTextString().trim());
}
if (name.contains("学") && name.contains("历")) {
job.put("job_education", jobPropsValueNode
.toPlainTextString().trim());
}
if (name.contains("语言要求")) {
job.put("job_language", jobPropsValueNode
.toPlainTextString().trim());
}
if (name.contains("薪水范围")) {
job.put("job_salary", jobPropsValueNode.toPlainTextString()
.trim());
}
}
parser.reset();
nodeList = parser.parse(jobDescriptionFilter);
Node job_desc_node = nodeList.elementAt(0);
job.put("job_description", job_desc_node.getChildren().toHtml()
.trim());
parser.reset();
nodeList = parser.parse(companyDescriptionFilter);
Node company_desc_node = nodeList.elementAt(0);
String rawComDesc = company_desc_node.getChildren().toHtml();
job.put("company_description",
rawComDesc.replaceAll("<a.*>.*</a>", "").trim());
parser.reset();
nodeList = parser.parse(jobCategoryFilter);
Node job_category_node = nodeList.elementAt(0);
if(null != job_category_node){
String rawJobc = job_category_node.toPlainTextString().replaceAll("\u00a0", " "); // normalize non-breaking spaces
job.put("job_category",
rawJobc.substring(rawJobc.indexOf("职位职能:") + 5).trim());
}
} catch (Exception e) {
System.out.println("abandoned : " + url);
e.printStackTrace();
return null;
}
job.put("from", "前程无忧");
job.put("url", url);
System.out.println("parsed : " + url);
return job;
}
private static SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd");
private static long adaptDate(String date) throws ParseException{
return df.parse(date).getTime();
}
private static String adaptCType(String ctype){
if(ctype.contains("合资")){
return "合资";
}
if(ctype.contains("民营")){
return "民营";
}
if(ctype.contains("国企")){
return "国企";
}
if(ctype.contains("外资")){
return "外商独资";
}
if(ctype.contains("代表处")){
return "外企代表处";
}
if(ctype.contains("机关")){
return "国家机关";
}
if(ctype.contains("事业单位")){
return "事业单位";
}
else{
return "其他";
}
}
}
Zhaopin page parser class PageParserZhaopin:
package com.jobsearcher.parser.util;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.HashMap;
import java.util.Map;
import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.CssSelectorNodeFilter;
import org.htmlparser.filters.HasChildFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.visitors.TextExtractingVisitor;
public class PageParserZhaopin {
public static Map<String, String> parse(String url, String page) {
Map<String, String> job = new HashMap<String, String>();
if (!url.matches("http://jobs.zhaopin.com/.*html?")) {
System.out.println("not a valid url,abandoned : " + url);
return null;
}
Parser parser = Parser.createParser(page, "UTF-8");
NodeFilter jobTitleFilter = new CssSelectorNodeFilter(
"#positionTitle h1");
NodeFilter companyPropsFilter = new CssSelectorNodeFilter(
"#zpcontent > table.companyInfoTab");
NodeFilter jobPropsFilter = new CssSelectorNodeFilter(
"#zpcontent table.jobInfoTab table.jobInfoItems");
NodeFilter jobDesFilter = new CssSelectorNodeFilter(
"#zpcontent table.jobInfoTab div.jobDes div div");
NodeFilter companyDesFilter = new AndFilter(new CssSelectorNodeFilter(
"#zpcontent table.black12 td"), new HasChildFilter(
new AndFilter(new TagNameFilter("p"), new HasChildFilter(
new TagNameFilter("br")))));
try {
NodeList nodeList = parser.parse(jobTitleFilter);
Node job_title_node = nodeList.elementAt(0);
job.put("job_title", job_title_node.toPlainTextString().trim());
// company props
parser.reset();
nodeList = parser.parse(companyPropsFilter);
Node cp_node = nodeList.elementAt(0);
Parser cp_parser = new Parser(cp_node.toHtml());
TextExtractingVisitor visitor = new TextExtractingVisitor();
cp_parser.visitAllNodesWith(visitor);
String cp = visitor.getExtractedText();
int s_industry = cp.indexOf("公司行业:");
int s_type = cp.indexOf("公司类型:");
int s_scale = cp.indexOf("公司规模:");
int len = cp.length();
job.put("company_name", cp.substring(0, s_industry).trim());
if (-1 != s_industry) {
job.put("company_industry",
cp.substring(
s_industry + 5,
-1 != s_type ? s_type
: (-1 != s_scale ? s_scale : len))
.trim());
}
if (-1 != s_type) {
job.put("company_type",
cp.substring(s_type + 5, -1 != s_scale ? s_scale : len)
.trim());
}
if (-1 != s_scale) {
job.put("company_scale", cp.substring(s_scale + 5).trim());
}
// job props
parser.reset();
nodeList = parser.parse(jobPropsFilter);
Node jp_node = nodeList.elementAt(0);
Parser jp_parser = new Parser(jp_node.toHtml());
TextExtractingVisitor visitor2 = new TextExtractingVisitor();
jp_parser.visitAllNodesWith(visitor2);
String jp = visitor2.getExtractedText();
int sj[] = new int[10];
int sj_category = sj[0] = jp.indexOf("职位类别");
int sj_addr = sj[1] = jp.indexOf("工作地点");
int sj_time = sj[2] = jp.indexOf("发布日期");
int sj_experience = sj[3] = jp.indexOf("工作经验");
int sj_education = sj[4] = jp.indexOf("最低学历");
int sj_manage = sj[5] = jp.indexOf("管理经验");
int sj_type = sj[6] = jp.indexOf("工作性质");
int sj_count = sj[7] = jp.indexOf("招聘人数");
int sj_salary = sj[8] = jp.indexOf("职位月薪");
int jlen = sj[9] = jp.length();
if (-1 != sj_category) {
job.put("job_category",
jp.substring(sj_category + 5,
sjNext(sj,1)).trim());
}
if (-1 != sj_addr) {
job.put("job_address",
jp.substring(sj_addr + 5,
sjNext(sj,2)).trim()
.split(" ")[0]);
}
if (-1 != sj_time) {
job.put("job_time",
adaptDate(jp.substring(sj_time + 5,
sjNext(sj,3))
.trim()) + "");
}
if (-1 != sj_experience) {
job.put("job_experience",
adaptExp(jp.substring(sj_experience + 5,
sjNext(sj,4))
.trim()));
}
if (-1 != sj_education) {
job.put("job_education",
adaptEdu(jp.substring(sj_education + 5,
sjNext(sj,5)).trim()));
}
if (-1 != sj_manage) {
job.put("job_manage",
jp.substring(sj_manage + 5,
sjNext(sj,6)).trim());
}
if (-1 != sj_type) {
job.put("job_type",
jp.substring(sj_type + 5,
sjNext(sj,7)).trim());
}
if (-1 != sj_count) {
job.put("job_count",
jp.substring(sj_count + 5,
sjNext(sj,8)).trim());
}
if (-1 != sj_salary) {
job.put("job_salary", jp.substring(sj_salary + 5).trim());
}
// job description
parser.reset();
nodeList = parser.parse(jobDesFilter);
String jobDesc = "";
NodeList nl;
for (int i = 0; i < nodeList.size(); i++) {
nl = nodeList.elementAt(i).getChildren();
if (null != nl) {
jobDesc += nl.toHtml();
}
}
job.put("job_description", jobDesc.trim());
// company desc
parser.reset();
nodeList = parser.parse(companyDesFilter);
job.put("company_description", nodeList.elementAt(0).getChildren()
.toHtml().trim());
} catch (Exception e) {
System.out.println("abandoned : " + url);
e.printStackTrace();
return null;
}
job.put("from", "智联招聘");
job.put("url", url);
System.out.println("parsed : " + url);
return job;
}
private static int sjNext(int[] a, int index) {
for(int i=index; i < a.length; i++ ){
if(-1 != a[i])
return a[i];
}
return -1;
}
private static String adaptExp(String exp){
if(exp.contains("1")){
return "一年以上";
}
else if(exp.contains("2")){
return "二年以上";
}
else if(exp.contains("3")){
return "三年以上";
}
else if(exp.contains("4")){
return "四年以上";
}
else if(exp.contains("5")){
return "五年以上";
}
else if(exp.contains("6")){
return "六年以上";
}
else if(exp.contains("7")){
return "七年以上";
}
else if(exp.contains("8")){
return "八年以上";
}
else if(exp.contains("9")){
return "九年以上";
}
else if(exp.contains("10")){
return "十年以上";
}
else{
return "不限";
}
}
private static String adaptEdu(String edu){
if(edu.contains("初中")){
return "初中";
}
if(edu.contains("高中")){
return "高中";
}
if(edu.contains("中专")){
return "中专";
}
if(edu.contains("中技")){
return "中技";
}
if(edu.contains("大专")){
return "大专";
}
if(edu.contains("本科")){
return "本科";
}
if(edu.contains("硕士")){
return "硕士";
}
if(edu.contains("博士")){
return "博士";
}
else {
return "其他";
}
}
private static SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd");
private static long adaptDate(String date) throws ParseException{
return df.parse(date).getTime();
}
}
III. Index module:
1. Technologies: lucene 3.3 + hbase 0.9
2. Overview: read the parsed job postings from the HBase parsedjobs table and build a Lucene index, using the Paoding (庖丁解牛) analyzer for Chinese word segmentation.
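To illustrate the segmentation, a minimal sketch of running PaodingAnalyzer by hand; it assumes the Paoding dictionaries are configured (e.g. via the paoding.dic.home system property), and the sample text and output are hypothetical:
import java.io.StringReader;
import net.paoding.analysis.analyzer.PaodingAnalyzer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
public class PaodingDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new PaodingAnalyzer();
        TokenStream ts = analyzer.tokenStream("job_title",
                new StringReader("高级Java软件工程师"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString()); // e.g. 高级 / java / 软件 / 工程师
        }
    }
}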
3. Details:
package com.jobsearcher.indexer;
import java.io.File;
import java.io.IOException;
import net.paoding.analysis.analyzer.PaodingAnalyzer;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
public class JobsIndexer {
private static final String INDEX_DIR = "/home/qingjie/projects/jobsearcher/index";
public static void main(String... args) throws IOException {
System.out.println("indexing");
int count = 0;
Analyzer analyzer = new PaodingAnalyzer();
IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_33,
analyzer);
Directory dir = FSDirectory.open(new File(INDEX_DIR));
IndexWriter writer = new IndexWriter(dir, conf);
HTable t = new HTable(HBaseConfiguration.create(),
Bytes.toBytes("parsedjobs"));
Scan scan = new Scan();
for (Result r : t.getScanner(scan)) {
String row = Bytes.toString(r.getRow());
String from = Bytes.toString(r.getValue(Bytes.toBytes("job"),
Bytes.toBytes("from")));
String url = Bytes.toString(r.getValue(Bytes.toBytes("job"),
Bytes.toBytes("url")));
String job_title = Bytes.toString(r.getValue(Bytes.toBytes("job"),
Bytes.toBytes("job_title")));
String company_name = Bytes.toString(r.getValue(
Bytes.toBytes("job"), Bytes.toBytes("company_name")));
String company_industry = Bytes.toString(r.getValue(
Bytes.toBytes("job"), Bytes.toBytes("company_industry")));
String company_type = Bytes.toString(r.getValue(
Bytes.toBytes("job"), Bytes.toBytes("company_type")));
String job_time = Bytes.toString(r.getValue(Bytes.toBytes("job"),
Bytes.toBytes("job_time")));
String job_address = Bytes.toString(r.getValue(
Bytes.toBytes("job"), Bytes.toBytes("job_address")));
String job_experience = Bytes.toString(r.getValue(
Bytes.toBytes("job"), Bytes.toBytes("job_experience")));
String job_education = Bytes.toString(r.getValue(
Bytes.toBytes("job"), Bytes.toBytes("job_education")));
long timestamp = Bytes.toLong(r.getValue(Bytes.toBytes("job"),
Bytes.toBytes("timestamp")));
String job_description = Bytes.toString(r.getValue(
Bytes.toBytes("job"), Bytes.toBytes("job_description")));
// index
Document doc = new Document();
doc.add(new Field("row", row, Field.Store.YES,
Field.Index.NOT_ANALYZED));
doc.add(new Field("from", from, Field.Store.YES,
Field.Index.NOT_ANALYZED));
doc.add(new Field("job_title", job_title, Field.Store.YES,
Field.Index.ANALYZED,
Field.TermVector.WITH_POSITIONS_OFFSETS));
doc.add(new Field("company_name", company_name, Field.Store.YES,
Field.Index.ANALYZED,
Field.TermVector.WITH_POSITIONS_OFFSETS));
doc.add(new Field("url", url, Field.Store.YES,
Field.Index.NO));
if (null != company_industry) {
doc.add(new Field("company_industry", company_industry,
Field.Store.YES, Field.Index.NOT_ANALYZED));
}
if (null != company_type) {
doc.add(new Field("company_type", company_type,
Field.Store.YES, Field.Index.NOT_ANALYZED));
}
doc.add(new Field("job_time",job_time,Field.Store.YES,Field.Index.NOT_ANALYZED));
doc.add(new NumericField("timestamp",Field.Store.YES,true).setLongValue(timestamp));
if (null != job_address) {
doc.add(new Field("job_address", job_address, Field.Store.YES,
Field.Index.NOT_ANALYZED));
}
if (null != job_experience) {
doc.add(new Field("job_experience", job_experience,
Field.Store.YES, Field.Index.NOT_ANALYZED));
}
if (null != job_education) {
doc.add(new Field("job_education", job_education,
Field.Store.YES, Field.Index.NOT_ANALYZED));
}
if (null != job_description) {
doc.add(new Field("job_description", job_description,
Field.Store.YES, Field.Index.ANALYZED));
}
writer.updateDocument(new Term("row", row), doc);
}
writer.close();
System.out.println("index finished!");
}
}
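The indexer is a plain main() and can be launched directly. A sketch (the jar name, classpath, and dictionary path are placeholders; Paoding locates its dictionaries via the paoding.dic.home system property or its properties file):
java -Dpaoding.dic.home=/path/to/paoding/dic -cp "jobsearcher.jar:lib/*" com.jobsearcher.indexer.JobsIndexer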
IV. Search module:
1. Technologies: struts 2.2 + lucene 3.3 + bobo-browse 2.5 + freemarker 2.3 + hbase 0.9 + jquery
2. Overview: provide the web UI and the search features. The web UI is built with Struts2, FreeMarker, and jQuery; search is implemented with Lucene, with Bobo-Browse providing faceted (grouped) search.
3. Details:
(1) Struts2 action base class BaseAction:
package com.jobsearcher.searcher.action.base;
import com.jobsearcher.exception.JobsearcherException;
import com.jobsearcher.searcher.service.SearcherService;
import com.opensymphony.xwork2.ActionSupport;
public class BaseAction extends ActionSupport{
private static final long serialVersionUID = 1L;
protected SearcherService searcherService = SearcherService.getInstance();
public BaseAction() throws JobsearcherException{
}
}
(2) Search action:
package com.jobsearcher.searcher.action;
import java.text.NumberFormat;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.Date;
import java.util.List;
import java.util.Map;
import com.browseengine.bobo.api.BrowseResult;
import com.browseengine.bobo.api.FacetAccessible;
import com.jobsearcher.exception.JobsearcherException;
import com.jobsearcher.searcher.action.base.BaseAction;
public class SearchAction extends BaseAction {
public SearchAction() throws JobsearcherException {
super();
}
private static final long serialVersionUID = 1L;
private int startPage = 1;
private int pageSize = 10;
private String keyword = "";
private String job_time = ""; // number of days back: 0 1 3 7 14 30 60
private String city = "";
private String from = ""; // source site: "智联招聘" (Zhaopin) or "前程无忧" (51job)
private String company_name = "";
private String company_industry = "";
private String company_type = "";
private String job_experience = "";
private String job_education = "";
private Map<String, FacetAccessible> facetMap;
private int totalDocs;
private String timeCost = "";
private List<Map<String, String>> jobs = new ArrayList<Map<String, String>>();
private long now;
private long today;
private long l3day;
private long l1week;
private long l2week;
private long l1month;
private long l2month;
private String job_time_name = "";
public String execute() throws JobsearcherException {
long begin = System.nanoTime();
BrowseResult result = searcherService.search(makeQuery(), startPage,
pageSize, jobs);
facetMap = result.getFacetMap();
totalDocs = result.getNumHits();
long end = System.nanoTime();
NumberFormat format = NumberFormat.getInstance();
format.setMaximumFractionDigits(3);
timeCost = format.format((end * 1.0 - begin) / 1000000000.0);
// job_time
Calendar n = Calendar.getInstance();
n.setTime(new Date());
now = n.getTimeInMillis();
n.set(Calendar.HOUR_OF_DAY, 0);
today = n.getTimeInMillis();
l3day = today - 3l * 24 * 3600 * 1000;
l1week = today - 7l * 24 * 3600 * 1000;
l2week = today - 14l * 24 * 3600 * 1000;
l1month = today - 30l * 24 * 3600 * 1000;
l2month = today - 60l * 24 * 3600 * 1000;
return SUCCESS;
}
/*
* 0:tokenized
* 1:not tokenized
*/
private String[] makeQuery() {
String[] query = {"",""};
//tokenized
if (null != keyword && !keyword.equals("")) {
query[0] = query[0] + "+job_title:(" + keyword + ")";
}
if (null != job_time && !job_time.equals("")) {
query[0] = query[0] + "+job_time:(" + job_time + ")";
}
//not tokenized
if (null != city && !city.equals("")) {
query[1] = query[1] + "+job_address:(\"" + city + "\")";
}
if (null != from && !from.equals("")) {
query[1] = query[1] + "+from:(\"" + from + "\")";
}
if (null != company_name && !company_name.equals("")) {
query[1] = query[1] + "+company_name:(\"" + company_name + "\")";
}
if (null != company_industry && !company_industry.equals("")) {
query[1] = query[1] + "+company_industry:(\"" + company_industry + "\")";
}
if (null != company_type && !company_type.equals("")) {
query[1] = query[1] + "+company_type:(\"" + company_type + "\")";
}
if (null != job_experience && !job_experience.equals("")) {
query[1] = query[1] + "+job_experience:(\"" + job_experience + "\")";
}
if (null != job_education && !job_education.equals("")) {
query[1] = query[1] + "+job_education:(\"" + job_education + "\")";
}
return query;
}
public String getKeyword() {
return keyword;
}
public void setKeyword(String keyword) {
this.keyword = keyword;
}
// getters and setters omitted...
}
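For example, a request with keyword=java and city=上海 produces query[0] = +job_title:(java) and query[1] = +job_address:("上海"); the SearcherService below parses the tokenized part with the Paoding-backed QueryParser and the exact-match part with a KeywordAnalyzer parser, and ANDs the two into a single BooleanQuery.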
(3) Job detail view action:
package com.jobsearcher.searcher.action;
import java.util.HashMap;
import java.util.Map;
import com.jobsearcher.exception.JobsearcherException;
import com.jobsearcher.searcher.action.base.BaseAction;
public class ViewAction extends BaseAction{
public ViewAction() throws JobsearcherException {
super();
}
private static final long serialVersionUID = 1L;
private String row;
Map<String,String> job = new HashMap<String,String>();
public String execute() throws JobsearcherException {
job = searcherService.getJobByRow(row);
return SUCCESS;
}
public String getRow() {
return row;
}
public void setRow(String row) {
this.row = row;
}
public Map<String, String> getJob() {
return job;
}
public void setJob(Map<String, String> job) {
this.job = job;
}
}
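For completeness, a hypothetical struts.xml wiring for the two actions above; the package name, action names, and FreeMarker template paths are assumptions, not taken from the project:
<struts>
  <package name="jobsearcher" extends="struts-default">
    <action name="search" class="com.jobsearcher.searcher.action.SearchAction">
      <result name="success" type="freemarker">/WEB-INF/ftl/search.ftl</result>
    </action>
    <action name="view" class="com.jobsearcher.searcher.action.ViewAction">
      <result name="success" type="freemarker">/WEB-INF/ftl/view.ftl</result>
    </action>
  </package>
</struts>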
(4) Service class (singleton pattern):
package com.jobsearcher.searcher.service;
import java.io.File;
import java.io.IOException;
import java.net.URLEncoder;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Calendar;
import java.util.Comparator;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import net.paoding.analysis.analyzer.PaodingAnalyzer;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.SimpleSpanFragmenter;
import org.apache.lucene.search.highlight.TokenSources;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import com.browseengine.bobo.api.BoboBrowser;
import com.browseengine.bobo.api.BoboIndexReader;
import com.browseengine.bobo.api.Browsable;
import com.browseengine.bobo.api.BrowseException;
import com.browseengine.bobo.api.BrowseFacet;
import com.browseengine.bobo.api.BrowseHit;
import com.browseengine.bobo.api.BrowseRequest;
import com.browseengine.bobo.api.BrowseResult;
import com.browseengine.bobo.api.ComparatorFactory;
import com.browseengine.bobo.api.FacetSpec;
import com.browseengine.bobo.api.FacetSpec.FacetSortSpec;
import com.browseengine.bobo.api.FieldValueAccessor;
import com.browseengine.bobo.facets.FacetHandler;
import com.browseengine.bobo.facets.impl.RangeFacetHandler;
import com.browseengine.bobo.facets.impl.SimpleFacetHandler;
import com.jobsearcher.exception.JobsearcherException;
public class SearcherService {
private static final String INDEX_DIR = "/home/qingjie/projects/jobsearcher/index";
private static Directory dir = null;
private static String PARSED_JOBS_HBASE = "parsedjobs";
private static SearcherService searcherService = null;
private static IndexSearcher searcher = null;
private static IndexReader reader = null;
private static Analyzer analyzer = new PaodingAnalyzer();
private static final Map<String, Integer> sort = new HashMap<String, Integer>();
static {
// job_experience
sort.put("在读学生", 100);
sort.put("应届毕业生", 101);
sort.put("一年以上", 102);
sort.put("二年以上", 103);
sort.put("三年以上", 104);
sort.put("四年以上", 105);
sort.put("五年以上", 106);
sort.put("六年以上", 107);
sort.put("七年以上", 108);
sort.put("八年以上", 109);
sort.put("九年以上", 110);
sort.put("十年以上", 111);
sort.put("不限", 112);
// job_education
sort.put("初中", 30);
sort.put("高中", 31);
sort.put("中技", 32);
sort.put("中专", 33);
sort.put("大专", 34);
sort.put("本科", 35);
sort.put("硕士", 36);
sort.put("博士", 37);
sort.put("其他", 38);
}
private SearcherService() throws IOException {
dir = FSDirectory.open(new File(INDEX_DIR));
}
public static synchronized SearcherService getInstance()
throws JobsearcherException {
if (null == searcherService) {
try {
searcherService = new SearcherService();
} catch (IOException e) {
e.printStackTrace();
throw new JobsearcherException("创建索引目录过程出错!");
}
}
return searcherService;
}
public synchronized IndexSearcher getIndexSearcher()
throws JobsearcherException {
if (null == searcher) {
try {
searcher = new IndexSearcher(dir);
} catch (CorruptIndexException e) {
e.printStackTrace();
throw new JobsearcherException("索引目录已经损坏!");
} catch (IOException e) {
e.printStackTrace();
throw new JobsearcherException("打开索引目录过程出错!");
}
}
return searcher;
}
public synchronized IndexReader getIndexReader()
throws JobsearcherException {
if (null == reader) {
try {
reader = IndexReader.open(dir, true);
} catch (CorruptIndexException e) {
e.printStackTrace();
throw new JobsearcherException("索引目录已经损坏!");
} catch (IOException e) {
e.printStackTrace();
throw new JobsearcherException("打开索引目录过程出错!");
}
}
return reader;
}
public List<Map<String, String>> getNewJobs(int num)
throws JobsearcherException {
List<Map<String, String>> jobs = new ArrayList<Map<String, String>>();
Query q = NumericRangeQuery.newLongRange("timestamp", 0l,
new Date().getTime(), true, true);
IndexSearcher s = getIndexSearcher();
TopDocs hits;
try {
hits = s.search(q, num, new Sort(new SortField("timestamp",
SortField.LONG, true))); // reverse=true: newest postings first
for (ScoreDoc scoreDoc : hits.scoreDocs) {
Document doc = s.doc(scoreDoc.doc);
Map<String, String> job = new HashMap<String, String>();
job.put("job_title", doc.get("job_title"));
job.put("row", URLEncoder.encode(doc.get("row"), "utf-8"));
jobs.add(job);
}
} catch (Exception e) {
e.printStackTrace();
throw new JobsearcherException("搜索最新工作过程出错!");
}
return jobs;
}
public BrowseResult search(String[] query, int startPage, int pageSize,
List<Map<String, String>> jobs) throws JobsearcherException {
SimpleFacetHandler companyIndustryHandler = new SimpleFacetHandler(
"company_industry");
SimpleFacetHandler companyTypeHandler = new SimpleFacetHandler(
"company_type");
SimpleFacetHandler jobExperienceHandler = new SimpleFacetHandler(
"job_experience");
SimpleFacetHandler jobEducationHandler = new SimpleFacetHandler(
"job_education");
Calendar n = Calendar.getInstance();
n.setTime(new Date());
long now = n.getTimeInMillis();
n.set(Calendar.HOUR_OF_DAY, 0);
long today = n.getTimeInMillis();
long l3day = today - 3l * 24 * 3600 * 1000;
long l1week = today - 7l * 24 * 3600 * 1000;
long l2week = today - 14l * 24 * 3600 * 1000;
long l1month = today - 30l * 24 * 3600 * 1000;
long l2month = today - 60l * 24 * 3600 * 1000;
RangeFacetHandler jobTimeHandler = new RangeFacetHandler("job_time",
"job_time", Arrays.asList(timeRange(today, now),
timeRange(l3day, now), timeRange(l1week, now),
timeRange(l2week, now), timeRange(l1month, now),
timeRange(l2month, now)));
List<FacetHandler<?>> handlerList = Arrays
.asList(new FacetHandler<?>[] { companyIndustryHandler,
companyTypeHandler, jobTimeHandler,
jobExperienceHandler, jobEducationHandler });
try {
BoboIndexReader boboReader = BoboIndexReader.getInstance(
getIndexReader(), handlerList);
BrowseRequest br = new BrowseRequest();
br.setCount(pageSize);
br.setOffset((startPage - 1) * pageSize);
QueryParser parser0 = new QueryParser(Version.LUCENE_33,
"job_title", analyzer);
QueryParser parser1 = new QueryParser(Version.LUCENE_33,
"job_title", new KeywordAnalyzer());
BooleanQuery q = new BooleanQuery();
if(null != query[0] && !query[0].equals("")){
Query q0 = parser0.parse(query[0]);
q.add(q0, Occur.MUST);
}
if(null != query[1] && !query[1].equals("")){
Query q1 = parser1.parse(query[1]);
q.add(q1, Occur.MUST);
}
br.setQuery(q);
FacetSpec generalSpec = new FacetSpec();
generalSpec.setOrderBy(FacetSortSpec.OrderHitsDesc);
generalSpec.setMaxCount(10);
FacetSpec valueOrderSpec = new FacetSpec();
valueOrderSpec.setMaxCount(10);
valueOrderSpec.setOrderBy(FacetSortSpec.OrderByCustom);
valueOrderSpec.setCustomComparatorFactory(new ComparatorFactory() {
@Override
public Comparator<Integer> newComparator(
FieldValueAccessor fieldValueAccessor, int[] counts) {
return new Comparator<Integer>() {
public int compare(Integer o1, Integer o2) {
return o2 - o1;
}
};
}
@Override
public Comparator<BrowseFacet> newComparator() {
return new Comparator<BrowseFacet>() {
public int compare(BrowseFacet o1, BrowseFacet o2) {
return 0 - o1.getValue().compareTo(o2.getValue());
}
};
}
});
FacetSpec customSortSpec = new FacetSpec();
customSortSpec.setMaxCount(10);
customSortSpec.setOrderBy(FacetSortSpec.OrderByCustom);
customSortSpec.setCustomComparatorFactory(new ComparatorFactory() {
@Override
public Comparator<Integer> newComparator(
FieldValueAccessor fieldValueAccessor, int[] counts) {
return new Comparator<Integer>() {
public int compare(Integer o1, Integer o2) {
return o2 - o1;
}
};
}
@Override
public Comparator<BrowseFacet> newComparator() {
return new Comparator<BrowseFacet>() {
public int compare(BrowseFacet o1, BrowseFacet o2) {
return sort.get(o1.getValue()).compareTo(
sort.get(o2.getValue()));
}
};
}
});
br.setFacetSpec("company_industry", generalSpec);
br.setFacetSpec("company_type", generalSpec);
br.setFacetSpec("job_time", valueOrderSpec);
br.setFacetSpec("job_experience", customSortSpec);
br.setFacetSpec("job_education", customSortSpec);
SortField timeSort = new SortField("job_time", SortField.LONG);
br.setSort(new SortField[] { timeSort });
Browsable browser = new BoboBrowser(boboReader);
BrowseResult result = browser.browse(br);
// highlight jobs
QueryScorer jobTitleScorer = new QueryScorer(q, "job_title");
QueryScorer jobDesScorer = new QueryScorer(q, "job_description");
SimpleHTMLFormatter formatter = new SimpleHTMLFormatter(
"<span class=\"highlight\">", "</span>");
Highlighter jobTitleHighlighter = new Highlighter(formatter,
jobTitleScorer);
Highlighter jobDesHighlighter = new Highlighter(formatter,
jobDesScorer);
jobTitleHighlighter.setTextFragmenter(new SimpleSpanFragmenter(
jobTitleScorer));
jobDesHighlighter.setTextFragmenter(new SimpleSpanFragmenter(
jobDesScorer));
SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
for (BrowseHit browseHit : result.getHits()) {
Map<String, String> job = new HashMap<String, String>();
Document doc = getIndexSearcher().doc(browseHit.getDocid());
job.put("job_address", doc.get("job_address"));
job.put("company_name", doc.get("company_name"));
job.put("from", doc.get("from"));
job.put("row", URLEncoder.encode(doc.get("row"), "utf-8"));
job.put("job_time", format.format(new Date(Long.parseLong(doc.get("job_time")))));
String job_title = doc.get("job_title");
TokenStream stream = TokenSources.getAnyTokenStream(
getIndexReader(), browseHit.getDocid(), "job_title",
analyzer);
String jobTitleFragment = jobTitleHighlighter.getBestFragment(
stream, job_title);
if (null != jobTitleFragment && !jobTitleFragment.equals("")) {
job.put("job_title", jobTitleFragment);
} else {
job.put("job_title", job_title);
}
String job_description = doc.get("job_description");
TokenStream stream2 = TokenSources.getAnyTokenStream(
getIndexReader(), browseHit.getDocid(),
"job_description", analyzer);
String jobDescriptionFragment = jobDesHighlighter
.getBestFragment(stream2, job_description);
String desc;
if (null != jobDescriptionFragment && !jobDescriptionFragment.equals("")) {
desc = jobDescriptionFragment;
} else {
desc = job_description;
}
if(desc.length() > 100){
desc = desc.substring(0,100) + "...";
}
job.put("job_description", desc);
jobs.add(job);
}
return result;
} catch (IOException e) {
e.printStackTrace();
throw new JobsearcherException("搜索过程中出错!");
} catch (org.apache.lucene.queryParser.ParseException e) {
e.printStackTrace();
throw new JobsearcherException("查询解析过程中出错!");
} catch (BrowseException e) {
e.printStackTrace();
throw new JobsearcherException("创建分类搜索结果过程中出错!");
} catch (InvalidTokenOffsetsException e) {
e.printStackTrace();
throw new JobsearcherException("获取搜索结果过程中出错!");
}
}
// utils
private String timeRange(long from, long to) {
return "[" + from + " TO " + to + "]";
}
public Map<String,String> getJobByRow(String row) throws JobsearcherException{
Map<String,String> job = new HashMap<String,String>();
try {
HTable t = new HTable(HBaseConfiguration.create(),
Bytes.toBytes(PARSED_JOBS_HBASE));
Get get = new Get(Bytes.toBytes(row));
Result r = t.get(get);
String url = Bytes.toString(r.getValue(Bytes.toBytes("job"),
Bytes.toBytes("url")));
String from = Bytes.toString(r.getValue(Bytes.toBytes("job"),
Bytes.toBytes("from")));
String job_title = Bytes.toString(r.getValue(Bytes.toBytes("job"),
Bytes.toBytes("job_title")));
String job_category = Bytes.toString(r.getValue(Bytes.toBytes("job"),
Bytes.toBytes("job_category")));
String company_name = Bytes.toString(r.getValue(Bytes.toBytes("job"),
Bytes.toBytes("company_name")));
String company_industry = Bytes.toString(r.getValue(Bytes.toBytes("job"),
Bytes.toBytes("company_industry")));
String company_type = Bytes.toString(r.getValue(Bytes.toBytes("job"),
Bytes.toBytes("company_type")));
String company_scale = Bytes.toString(r.getValue(Bytes.toBytes("job"),
Bytes.toBytes("company_scale")));
String job_time = Bytes.toString(r.getValue(Bytes.toBytes("job"),
Bytes.toBytes("job_time")));
String job_address = Bytes.toString(r.getValue(Bytes.toBytes("job"),
Bytes.toBytes("job_address")));
String job_count = Bytes.toString(r.getValue(Bytes.toBytes("job"),
Bytes.toBytes("job_count")));
String job_experience = Bytes.toString(r.getValue(Bytes.toBytes("job"),
Bytes.toBytes("job_experience")));
String job_education = Bytes.toString(r.getValue(Bytes.toBytes("job"),
Bytes.toBytes("job_education")));
String job_language = Bytes.toString(r.getValue(Bytes.toBytes("job"),
Bytes.toBytes("job_language")));
String job_salary = Bytes.toString(r.getValue(Bytes.toBytes("job"),
Bytes.toBytes("job_salary")));
String job_description = Bytes.toString(r.getValue(Bytes.toBytes("job"),
Bytes.toBytes("job_description")));
String company_description = Bytes.toString(r.getValue(Bytes.toBytes("job"),
Bytes.toBytes("company_description")));
job.put("url", url);
job.put("from", from);
job.put("job_title", job_title);
job.put("job_category", job_category);
job.put("company_name", company_name);
job.put("company_industry", company_industry);
job.put("company_type", company_type);
job.put("company_scale", company_scale);
SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
job.put("job_time", format.format(new Date(Long.parseLong(job_time))));
job.put("job_address", job_address);
job.put("job_count", job_count);
job.put("job_experience", job_experience);
job.put("job_education", job_education);
job.put("job_language", job_language);
job.put("job_salary", job_salary);
job.put("job_description", job_description);
job.put("company_description", company_description);
return job;
} catch (IOException e) {
e.printStackTrace();
throw new JobsearcherException("连接hbase过程出错!");
}
}
}
4. UI:
1. Home page:
2. Search results page:
3. Job detail page: