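Preface
This feature grew out of a requirement to build, inside our in-house platform, a component equivalent to DingTalk's organization picker. DingTalk's native component supports mixed search across pinyin, Chinese characters, and English letters, which makes it very efficient.
The requirement looks roughly like this: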
# user input
input: h鹏|hp|hup|hupeng|hupen|胡鹏|hpe ... and every other remaining combination of these inputs
returns: 胡鹏
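Because the resources available to the project were very limited, the feature was ultimately designed to run in service memory by default, without any third-party search component. Both the data source and the search algorithm can be extended through interfaces.
It is a simple full-text search over mixed Chinese, English, and pinyin input, built on the idea of an inverted index. It involves no weighting or scoring; it simply searches the user data.
Implementation: github: LocalCacheBasedSearchAlgorithm
The [Patterns] series aims to document a basic implementation pattern for each feature; just look it up when you need it.
I. Design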
1.1 Inverted Index
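Anyone who has used ES (Elasticsearch) will know that its core search logic is built on an inverted index. This article does not go into the theory; the feature only implements the basic idea of an inverted index, in the simplest form possible.
1.1.1 Basic principle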
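As a rough illustration of the idea, the sketch below maps each term to the ids of the documents that contain it. The class name `TinyInvertedIndex` and its methods are hypothetical and are not taken from the project; this is only a minimal sketch of the principle, not how ES implements it.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Minimal inverted index (hypothetical helper): term -> ids of the documents containing it. */
final class TinyInvertedIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Registers a document under every one of its terms. */
    void add(String docId, Collection<String> terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(docId);
        }
    }

    /** Returns the ids of all documents indexed under the term, in insertion order. */
    Set<String> lookup(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }
}
```

For instance, after adding 胡鹏 with the terms hu, peng, h, p and 胡浩 with the terms hu, hao, h, a lookup of "hu" returns both names, while a lookup of "peng" returns only 胡鹏.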
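1.2 Applying it to mixed pinyin search
1.2.1 Document processing
On the data side, Chinese characters are converted to pinyin, and the full pinyin and the initial letter of each character are each used as index keys pointing to the "documents", forming an inverted index + list structure. This roughly resembles how ES stores and analyzes a document.
1.2.2 User input processing
On the user side, Chinese characters in the input are replaced with pinyin in input order (polyphonic readings included). English letters are looked up, in order, as pinyin initials, and an optimization is attempted on the letters that follow. Finally the polyphonic alternatives are expanded into a Cartesian product of candidate strings, and that result set is matched against the documents with a like-style operation.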
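The expansion described above can be sketched as follows. `QueryExpansion`, `hanziReadings`, and `initialToFull` are hypothetical names: the two maps stand in for a pinyin library and for the initial-to-full-pinyin mapping built from the data. The sketch leaves out the optimization of follow-up letters and only shows the per-character lookup and the Cartesian product.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class QueryExpansion {
    /**
     * Expands raw input into candidate pinyin strings.
     * hanziReadings and initialToFull are assumed lookup tables, not project APIs.
     */
    static Set<String> expand(String input,
                              Map<Character, List<String>> hanziReadings,
                              Map<String, List<String>> initialToFull) {
        Set<String> results = new LinkedHashSet<>();
        results.add(""); // seed so the first position has a prefix to extend
        for (char c : input.toLowerCase().toCharArray()) {
            String self = String.valueOf(c);
            // Chinese character: its readings; letter: the full pinyin it may abbreviate;
            // anything unknown: the character itself.
            List<String> candidates = hanziReadings.containsKey(c)
                    ? hanziReadings.get(c)
                    : initialToFull.getOrDefault(self, List.of(self));
            Set<String> next = new LinkedHashSet<>();
            for (String prefix : results) {
                for (String candidate : candidates) {
                    next.add(prefix + candidate);
                }
            }
            results = next;
        }
        return results;
    }
}
```

With `hanziReadings = {鹏: [peng]}` and `initialToFull = {h: [hu, hao]}`, the input `h鹏` expands to `hupeng` and `haopeng`. The number of candidates is the product of the per-position alternatives, so long inputs made up of common initials can grow quickly.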
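II. Implementation
- Create the index and the documents, and the relationship between them
- Process user input and run the query
Here the capability is applied to a DingTalk-style organization picker, so the code examples below all use DingTalk personnel data as the data source.
2.1 Building the index and documents
2.1.1 Converting Chinese characters to pinyin
Search accuracy depends entirely on how accurately the pinyin library resolves Chinese characters.
This example uses pinyin4j. A small number of polyphonic characters are resolved incorrectly; if that matters, you can plug in another pinyin library through a fully custom implementation. The pinyin parsing bundled with hutool is not a good fit, and its polyphone support is not strong enough either.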
gradle
dependency("com.belerweb:pinyin4j:2.5.1")
maven
<dependency>
<groupId>com.belerweb</groupId>
<artifactId>pinyin4j</artifactId>
<version>2.5.1</version>
</dependency>
Pinyin conversion interface design
/**
* @author hp 2023/4/28
*/
public interface PinyinConverter {
List<PinyinModel> chineseToPinyin(String chinese);
boolean isChineseLetter(char c);
@Data
@AllArgsConstructor
class PinyinModel {
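@FieldDesc("The name after its Chinese characters have been replaced")
private String name;
@FieldDesc("Pinyin of the replaced characters: LinkedHashSet")
private Set<String> pinyin;
@FieldDesc("Pinyin initials of the replaced characters")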
private Set<String> shortPinyin;
}
}
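Default implementation of the interface
- possibilities(): expands polyphonic characters. By default it is not implemented as a Cartesian product; the code comments explain the details (open to further iteration).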
/**
* @author hp
*/
@Slf4j
public class DefaultPinyinConverter implements PinyinConverter {
@Override
public List<PinyinModel> chineseToPinyin(String chinese) {
try {
return pinyin(chinese);
} catch (BadHanyuPinyinOutputFormatCombination e) {
log.error("pinyin4j Exception", e);
throw new RuntimeException(e);
}
}
@MethodDesc("Whether the character is Chinese")
@Override
public boolean isChineseLetter(char c) {
Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
return block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
|| block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
|| block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
|| block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_C
|| block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_D
|| block == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
|| block == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS_SUPPLEMENT;
}
@MethodDesc("pinyin4j can only handle pure Chinese; resolves most of the possible results of converting a name to pinyin")
private List<PinyinModel> pinyin(String chinese) throws BadHanyuPinyinOutputFormatCombination {
List<PinyinModel> allPossibilities = Lists.newArrayList(new PinyinModel(chinese, Sets.newLinkedHashSet(), Sets.newLinkedHashSet()));
final HanyuPinyinOutputFormat format = new HanyuPinyinOutputFormat();
format.setToneType(HanyuPinyinToneType.WITHOUT_TONE);
for (char c : chinese.toCharArray()) {
Set<String> pinyinSet;
if (isChineseLetter(c)) {
final String[] pinyin = PinyinHelper.toHanyuPinyinStringArray(c, format);
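// when pinyin4j has no reading for the character, keep the character itself so the whole name is not dropped
pinyinSet = pinyin == null ? Sets.newLinkedHashSet(Collections.singletonList(String.valueOf(c))) : Sets.newLinkedHashSet(Arrays.asList(pinyin));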
} else {
pinyinSet = Sets.newLinkedHashSet();
pinyinSet.add(String.valueOf(c));
}
allPossibilities = possibilities(String.valueOf(c), allPossibilities, pinyinSet);
}
return allPossibilities;
}
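@MethodDesc("All readings of a polyphonic character. Runs of polyphonic characters are not disambiguated: libraries like this rely on an external parser whose polyphone results are not always accurate, so this is left unhandled for now")
private List<PinyinModel> possibilities(String letter, List<PinyinModel> list, Set<String> pinyin) {
if (CollUtil.isEmpty(pinyin)) {
// no reading available: keep the models (and the raw character) unchanged
return list;
}
// Only the first reading is substituted into the name; the full set of combinations is not expanded.
// e.g. 的 = (de/di);
// 的的 = dedi/dede/dide/didi, but this blunt replacement only ever produces dede
final String first = pinyin.iterator().next();
// mutate each model once; returning it once per reading would fill the list with duplicate references
list.forEach(i -> {
i.setName(i.getName().replace(letter, first));
pinyin.forEach(j -> {
i.getPinyin().add(j);
i.getShortPinyin().add(String.valueOf(j.charAt(0)));
});
});
return list;
}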
}
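The comment in possibilities() notes that the combinations for names such as 的的 are not fully expanded. One way the full expansion could look is sketched below. `PolyphoneExpansion`, `Reading`, and the `readingsOf` lookup are hypothetical names; in the real converter the lookup would presumably wrap the pinyin library. Unlike the converter above, the sketch builds new values instead of mutating shared models, and it keeps the initials as an ordered string so repeated initials are not collapsed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class PolyphoneExpansion {
    /** One reading of a whole name: full pinyin plus initials, both in character order. */
    static final class Reading {
        final String full;
        final String initials;

        Reading(String full, String initials) {
            this.full = full;
            this.initials = initials;
        }
    }

    /** readingsOf is an assumed lookup: character -> its non-empty pinyin readings (or null). */
    static List<Reading> expand(String name, Function<Character, List<String>> readingsOf) {
        List<Reading> results = new ArrayList<>();
        results.add(new Reading("", ""));
        for (char c : name.toCharArray()) {
            List<String> readings = readingsOf.apply(c);
            if (readings == null || readings.isEmpty()) {
                readings = List.of(String.valueOf(c)); // unknown characters are kept as-is
            }
            List<Reading> next = new ArrayList<>();
            for (Reading prefix : results) {
                for (String r : readings) {
                    next.add(new Reading(prefix.full + r, prefix.initials + r.charAt(0)));
                }
            }
            results = next;
        }
        return results;
    }
}
```

With a lookup that returns de and di for 的, expanding 的的 yields dede, dedi, dide, and didi, each with the initials dd. The trade-off is size: the result grows with the product of the number of readings per character.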
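2.1.2 Document design
Document model
To suit the front-end component, the document is designed as a tree node. The key properties for mixed search are as follows (builder omitted):
- nameToPinyin: a simple record of one direct pinyin conversion of the name
- pinyin: the full-pinyin readings of every character, polyphonic alternatives included
- shortPinyin: the pinyin initials extracted from the pinyin property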
/**
* @author hp
*/
@Data
public class DingTalkPickerNode {
@Expose(deserialize = false, serialize = false)
private static final String RESIGNED_SUFFIX = "[已离职]";
@Expose(deserialize = false, serialize = false)
private static final Gson JSON;
static {
JSON = new GsonBuilder()
.disableHtmlEscaping()
.create();
}
@JsonSerialize(using = DingTalkPickerNodeSourceSerializer.class)
@FieldDesc("Data source")
private DingTalkPickerNodeSource source;
@FieldDesc("Data id")
private String id;
@FieldDesc("Parent id")
private String parentId;
@FieldDesc("System user id")
private String systemUserId;
@FieldDesc("Name")
private String name;
@FieldDesc("Avatar")
private String avatar;
@FieldDesc("Avatar style; highest priority, applied directly in the tag")
private String avatarStyle;
@FieldDesc("Whether to display")
private boolean show = true;
@JsonSerialize(using = DingTalkPickerNodeTypeSerializer.class)
@FieldDesc("Node type")
private DingTalkPickerNodeType type;
@FieldDesc("Job title, etc.")
private String title;
@FieldDesc("Whether selected")
private boolean selected = false;
@FieldDesc("Whether selectable")
private boolean selectable = true;
@FieldDesc("Whether the children of this node can be multi-selected")
private boolean multiSelectable = true;
@FieldDesc("Whether it can be found by search")
private boolean searchable = true;
@FieldDesc("Whether the user node has resigned")
private boolean resigned = false;
@FieldDesc("Child nodes")
private List<DingTalkPickerNode> children;
@JsonIgnore
@FieldDesc("name converted from Chinese to pinyin")
private String nameToPinyin;
@JsonIgnore
@FieldDesc("Full pinyin")
private Set<String> pinyin;
@JsonIgnore
@FieldDesc("Pinyin initials")
private Set<String> shortPinyin;
public static DingTalkPickerNode defaultRoot(DingTalkPickerNodeSource source){
final DingTalkPickerNode root = new DingTalkPickerNode();
root.setSource(source);
return root;
}
public DingTalkPickerNode() {
this.children = Lists.newArrayList();
}
private DingTalkPickerNode(
DingTalkPickerNodeSource source,
String id,
String name,
String avatar,
DingTalkPickerNodeType type,
String title
) {
this.source = source;
this.id = id;
this.name = name;
if (StrUtil.isEmpty(avatar)) {
this.avatar = type.getAvatar();
} else {
this.avatar = avatar;
}
this.type = type;
this.title = title;
this.children = Lists.newArrayList();
}
public boolean isTypeOf(DingTalkPickerNodeType type) {
return Objects.equals(type, getType());
}
public boolean isSourceOf(DingTalkPickerNodeSource source){
return Objects.equals(source, getSource());
}
public void setChildren(List<DingTalkPickerNode> children) {
if (CollUtil.isNotEmpty(children)) {
this.children.addAll(children);
}
}
public DingTalkPickerNode clearChildren() {
this.children = Lists.newArrayList();
return this;
}
public boolean hasChildren() {
return CollUtil.isNotEmpty(children);
}
@JsonIgnore
public boolean hasUsers() {
return hasChildren() && children.stream().anyMatch(i -> i.getType().equals(DingTalkPickerNodeType.USER));
}
@JsonIgnore
public boolean hasDepartments() {
return hasChildren() && children.stream().anyMatch(i -> i.getType().equals(DingTalkPickerNodeType.DEPT));
}
@JsonIgnore
public boolean isOrgNode() {
return Objects.equals(getType(), DingTalkPickerNodeType.ORG);
}
@JsonIgnore
public boolean isDeptNode() {
return Objects.equals(getType(), DingTalkPickerNodeType.DEPT);
}
@JsonIgnore
public boolean isUserNode() {
return Objects.equals(getType(), DingTalkPickerNodeType.USER);
}
public void resigned() {
Preconditions.checkArgument(isUserNode());
setResigned(true);
setSelectable(false);
setSearchable(false);
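setName(
// plain replace: the regex "[\\[已离职\\]]" is a character class and would strip any single 已/离/职/[/] from the name
getName().replace(RESIGNED_SUFFIX, StrUtil.EMPTY)
.replace("已离职", StrUtil.EMPTY)
+ RESIGNED_SUFFIX
);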
}
public void employed() {
Preconditions.checkArgument(isUserNode());
setResigned(false);
setSelectable(true);
setSearchable(true);
if (StrUtil.isNotEmpty(getName())) {
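setName(
// plain replace: the regex "[\\[已离职\\]]" is a character class and would strip any single 已/离/职/[/] from the name
getName().replace(RESIGNED_SUFFIX, StrUtil.EMPTY)
.replace("已离职", StrUtil.EMPTY)
);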
}
}
public static String serialize(DingTalkPickerNode root) {
Preconditions.checkArgument(Objects.nonNull(root));
return JSON.toJson(root);
}
public static DingTalkPickerNode deserialize(String jsonString) {
Preconditions.checkArgument(StrUtil.isNotEmpty(jsonString));
Preconditions.checkArgument(JSONUtil.isTypeJSON(jsonString));
return JSON.fromJson(jsonString, DingTalkPickerNode.class);
}
}
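2.1.3 Building the index
- load(): loads the data from the data source and builds the index-to-document relationships
- DingTalkPickerNodeSource: just a data-source category (DingTalk source, custom source)
Interface design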
public interface SearchAlgorithm {
default void load(List<DingTalkPickerNode> data) {
}
List<DingTalkPickerNode> search(DingTalkPickerNodeSource source, String input);
}
In-memory implementation
/**
* @author hp
*/
@Slf4j
@RequiredArgsConstructor
public class LocalCacheBasedSearchAlgorithm implements SearchAlgorithm {
@FieldDesc("User nodes with their names parsed into pinyin; ids already corrected; used for queries")
private final Map<DingTalkPickerNodeSource, List<DingTalkPickerNode>> modelsWithPinyin = Maps.newHashMap();
@FieldDesc("Maps each initial to the full pinyin it may stand for, as parsed from the user nodes; used to expand initials")
private final Map<DingTalkPickerNodeSource, Map<String, Set<String>>> shortPinyinKeyedMapping = Maps.newHashMap();
private final PinyinConverter pinyinConverter;
@Override
public void load(List<DingTalkPickerNode> data) {
modelsWithPinyin.clear();
shortPinyinKeyedMapping.clear();
buildLocalCache(data);
}
private List<String> shortToFull(DingTalkPickerNodeSource source, String keyword) {
final Map<String, Set<String>> sourceMap = Maps.newHashMap();
if (Objects.isNull(source)) {
shortPinyinKeyedMapping.values().forEach(sourceMap::putAll);
} else {
if (!shortPinyinKeyedMapping.containsKey(source)) {
return Collections.emptyList();
}
sourceMap.putAll(shortPinyinKeyedMapping.get(source));
}
final List<Character> chars = Chars.asList(keyword.toCharArray());
final List<Set<String>> collect = chars.stream()
.map(c -> {
final String lowerCase = c.toString().toLowerCase();
if (pinyinConverter.isChineseLetter(c)) {
final List<PinyinConverter.PinyinModel> pinyin = pinyinConverter.chineseToPinyin(c.toString());
return pinyin.stream().map(PinyinConverter.PinyinModel::getName).collect(Collectors.toCollection(Sets::newLinkedHashSet));
} else if (Character.isAlphabetic(c)) {
final Set<String> defaultSet = Sets.newLinkedHashSet();
defaultSet.add(lowerCase);
return sourceMap.getOrDefault(lowerCase, defaultSet);
} else {
final Set<String> defaultSet = Sets.newLinkedHashSet();
defaultSet.add(lowerCase);
return defaultSet;
}
})
.toList();
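// copy first: for single-character input, collect.get(0) can be the Set stored in shortPinyinKeyedMapping, and set.add(keyword) below would pollute the index
Set<String> set = Sets.newLinkedHashSet(collect.get(0));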
for (int i = 1; i < collect.size(); i++) {
final Set<String> strings = collect.get(i);
Set<String> finalSet = set;
set = strings.stream()
.map(s ->
// concatenate into full pinyin
finalSet.stream().map(ss -> ss + s).collect(Collectors.toCollection(Sets::newLinkedHashSet))
)
.flatMap(Collection::stream)
.collect(Collectors.toSet());
}
set.add(keyword);
return Lists.newArrayList(set);
}
private void buildLocalCache(List<DingTalkPickerNode> nodes) {
if (CollUtil.isEmpty(nodes)) {
return;
}
nodes.forEach(node -> {
if (!node.isUserNode()) {
buildLocalCache(node.getChildren());
} else {
final DingTalkPickerNodeSource source = node.getSource();
final List<DingTalkPickerNode> pinyinNodes = buildPinyinNodes(node);
modelsWithPinyin.compute(source, (k, v) -> {
if (Objects.isNull(v)) {
return Lists.newArrayList(pinyinNodes);
} else {
v.addAll(pinyinNodes);
return v;
}
});
buildPinyinMapping(source, pinyinNodes);
}
});
}
private List<DingTalkPickerNode> buildPinyinNodes(DingTalkPickerNode node) {
return pinyinConverter.chineseToPinyin(node.getName())
.stream()
.map(pinyin -> {
final DingTalkPickerNode copy = DingTalkPickerMapper.INSTANCE.copy(node);
// Set pinyin.
copy.setNameToPinyin(pinyin.getName());
copy.setPinyin(pinyin.getPinyin());
copy.setShortPinyin(pinyin.getShortPinyin());
return copy;
})
.toList();
}
private void buildPinyinMapping(DingTalkPickerNodeSource source, List<DingTalkPickerNode> pinyinNodes) {
final Map<String, Set<String>> mapping = shortPinyinKeyedMapping.computeIfAbsent(source, k -> Maps.newHashMap());
pinyinNodes.forEach(pinyinNode -> {
// pinyin and shortPinyin are de-duplicated Sets, so their positions do not line up
// (e.g. "hu, hao, an" vs "h, a"); derive the initial from each full pinyin instead of pairing by index
for (String full : pinyinNode.getPinyin()) {
if (full.isEmpty()) {
continue;
}
mapping.computeIfAbsent(full.substring(0, 1), k -> Sets.newLinkedHashSet()).add(full);
}
});
}
}
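2.2 Processing user input and querying
2.2.1 Basic processing logic
Keep the Chinese characters in input order, convert them to pinyin, and expand polyphonic readings. Treat every English letter as an abbreviation and expand it to full pinyin through the shortPinyinKeyedMapping mapping.
This expands the user input into a set of search candidates; loop over that set and match it against the full-pinyin index => document mapping.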
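The matching loop can be sketched roughly as below. `CandidateMatcher` and `Entry` are hypothetical names used only for illustration, and the three conditions (name contains the candidate, full pinyin starts with it, initials start with it) are an assumption about what "matching" means here.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

final class CandidateMatcher {
    /** A searchable entry (hypothetical): display name, full pinyin, and pinyin initials. */
    static final class Entry {
        final String name;
        final String fullPinyin;
        final String initials;

        Entry(String name, String fullPinyin, String initials) {
            this.name = name;
            this.fullPinyin = fullPinyin;
            this.initials = initials;
        }
    }

    /** Returns the names of all entries matched by at least one candidate, in entry order. */
    static List<String> search(Collection<String> candidates, List<Entry> entries) {
        List<String> hits = new ArrayList<>();
        for (Entry entry : entries) {
            for (String candidate : candidates) {
                if (entry.name.contains(candidate)
                        || entry.fullPinyin.startsWith(candidate)
                        || entry.initials.startsWith(candidate)) {
                    hits.add(entry.name);
                    break; // one matching candidate is enough for this entry
                }
            }
        }
        return hits;
    }
}
```

The scan is linear in entries times candidates. That may well be acceptable for an in-memory organization directory, though that depends on the data size.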
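2.2.2 Search design
Search interface design
- search(): the search entry point
- Parameter DingTalkPickerNodeSource source: exists so that the client can search downward from a given level of the tree; without it the implementation would be even simpler.
- shortToFull(): expands the user input (open to further iteration)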
public interface SearchAlgorithm {
default void load(List<DingTalkPickerNode> data) {
}
List<DingTalkPickerNode> search(DingTalkPickerNodeSource source, String input);
}
In-memory search implementation
package com.hp.dingtalk.userpicker.infrastructure.search;
import cn.hutool.core.collection.CollUtil;
import com.google.common.collect.Lists;
import com.google.common.collect.Maps;
import com.google.common.collect.Sets;
import com.google.common.primitives.Chars;
import com.hp.common.base.annotations.FieldDesc;
import com.hp.common.base.annotations.MethodDesc;
import com.hp.dingtalk.userpicker.context.TranslationHolder;
import com.hp.dingtalk.userpicker.domain.DingTalkPickerNode;
import com.hp.dingtalk.userpicker.domain.DingTalkPickerNodeSource;
import com.hp.dingtalk.userpicker.domain.mapper.DingTalkPickerMapper;
import com.hp.dingtalk.userpicker.infrastructure.pinyin.PinyinConverter;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
/**
* @author hp
*/
@Slf4j
@RequiredArgsConstructor
public class LocalCacheBasedSearchAlgorithm implements SearchAlgorithm {
@FieldDesc("User nodes with their names parsed into pinyin; ids already corrected; used for queries")
private final Map<DingTalkPickerNodeSource, List<DingTalkPickerNode>> modelsWithPinyin = Maps.newHashMap();
@FieldDesc("Maps each initial to the full pinyin it may stand for, as parsed from the user nodes; used to expand initials")
private final Map<DingTalkPickerNodeSource, Map<String, Set<String>>> shortPinyinKeyedMapping = Maps.newHashMap();
private final PinyinConverter pinyinConverter;
@Override
public void load(List<DingTalkPickerNode> data) {
modelsWithPinyin.clear();
shortPinyinKeyedMapping.clear();
buildLocalCache(data);
}
@Override
public List<DingTalkPickerNode> search(DingTalkPickerNodeSource source, String keywords) {
//pre-matching
final List<String> allPossibilities = shortToFull(source, keywords);
if (CollUtil.isEmpty(allPossibilities)) {
return Collections.emptyList();
}
//matching
List<DingTalkPickerNode> sourceNodes;
if (Objects.isNull(source)) {
sourceNodes = modelsWithPinyin.values()
.stream()
.flatMap(Collection::stream)
.collect(Collectors.toList());
} else {
if (!modelsWithPinyin.containsKey(source)) {
return Collections.emptyList();
}
sourceNodes = modelsWithPinyin.get(source);
}
final List<DingTalkPickerNode> data = sourceNodes
.stream()
.filter(DingTalkPickerNode::isSearchable)
.filter(i -> match(allPossibilities, i))
.toList();
//removing duplications
return data.stream()
.collect(Collectors.groupingBy(DingTalkPickerNode::getName))
.values()
.stream()
.map(list -> list.stream()
.filter(i -> i.isSourceOf(DingTalkPickerNodeSource.DING_TALK))
.findAny()
.orElse(list.get(0)
)
)
.toList();
}
@MethodDesc("Whether the node matches any of the keywords")
private boolean match(List<String> allPossibilities, DingTalkPickerNode model) {
for (String lowerCasedKeywords : allPossibilities) {
if (
model.getName().startsWith(lowerCasedKeywords) ||
model.getName().contains(lowerCasedKeywords) ||
String.join("", model.getShortPinyin()).startsWith(lowerCasedKeywords) ||
model.getNameToPinyin().startsWith(lowerCasedKeywords)
) {
return true;
}
}
return false;
}
private List<String> shortToFull(DingTalkPickerNodeSource source, String keyword) {
final Map<String, Set<String>> sourceMap = Maps.newHashMap();
if (Objects.isNull(source)) {
shortPinyinKeyedMapping.values().forEach(sourceMap::putAll);
} else {
if (!shortPinyinKeyedMapping.containsKey(source)) {
return Collections.emptyList();
}
sourceMap.putAll(shortPinyinKeyedMapping.get(source));
}
// input char array
final List<Character> chars = Chars.asList(keyword.toCharArray());
// translate
AtomicInteger index = new AtomicInteger(0);
final List<TranslationHolder> translationHolders = chars.stream()
.map(c -> {
final String lowerCase = c.toString().toLowerCase();
final TranslationHolder translationHolder = new TranslationHolder(lowerCase, index.getAndIncrement());
if (pinyinConverter.isChineseLetter(c)) {
final List<PinyinConverter.PinyinModel> pinyin = pinyinConverter.chineseToPinyin(c.toString());
final Set<String> translations = pinyin.stream().map(PinyinConverter.PinyinModel::getName).collect(Collectors.toCollection(Sets::newLinkedHashSet));
translationHolder.setTranslations(translations);
return translationHolder;
} else if (Character.isAlphabetic(c)) {
final Set<String> translations = sourceMap.getOrDefault(lowerCase, null);
translationHolder.setTranslations(translations);
return translationHolder;
} else {
return translationHolder;
}
})
.toList();
final List<TranslationHolder> optimized = TranslationHolder.Optimizer.optimize(translationHolders);
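// copy first so that _1st.add(keyword) below never mutates a Set that may be shared with shortPinyinKeyedMapping
Set<String> _1st = Sets.newLinkedHashSet(optimized.get(0).getTranslations());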
for (int i = 1; i < optimized.size(); i++) {
final Set<String> translations = optimized.get(i).getTranslations();
_1st = _1st.stream()
.flatMap(s1 -> translations.stream().map(s2 -> s1 + s2))
.collect(Collectors.toSet());
}
_1st.add(keyword);
return Lists.newArrayList(_1st);
}
}
III. Results
@Test
public void test_search() {
final List<String> inputs = Lists.newArrayList(
"cx", "chx", "chex",
"chexi", "chexia", "chexiao",
"chenx", "chenxi", "chenxia", "chenxiao",
"陈晓", "陈x", "陈xi", "陈xia", "陈xiao",
"c晓", "ch晓", "che晓", "chen晓"
);
inputs.forEach(input->{
final List<DingTalkPickerNode> nodes = dingTalkPickerRepository.findByNameLike(null, input);
Assertions.assertThat(nodes)
.isNotEmpty()
.extracting(DingTalkPickerNode::getName)
.contains("陈晓");
});
}
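Summary
By implementing the inverted-index idea in a simple way, the feature delivers full-text search for user input that mixes pinyin and Chinese. A simple implementation like this still leaves plenty of room for optimization, especially in the search algorithm. Applied to the search of a DingTalk-style organization picker, it feels essentially the same as DingTalk's native behavior.
If you are interested, the complete project code is available in the repository: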