Published Papers

My research interest is "how to make data usable" for data consumers. That involves: 
designing novel data usability modules in the front office, 
proposing efficient evaluation algorithms in the mid office, and 
providing efficient storage and indexing schemes in the back office.

To make more data truly usable, keyword search is one important direction in which I make continuous efforts, and my ultimate goal is a one-size-fits-all general framework that provides usability support for all of these heterogeneous data. Some selected publications are listed below: [1-8] are part of my work on semi-structured data, [9] addresses structured data with a focus on the data quality problem, and [10-12] concern social network data.

Note: A categorization of publications for each of the following research interests can be found at PUBLICATIONS.


Keyword search over spatial & textual data

- a general support for various types of fuzzy type-ahead spatial keyword query
- one-size-fits-all index design for various types and degrees of relaxation
- maximization of query result reuse at different granularities

Geo-textual data are generated in abundance. Recent studies have focused on processing spatial keyword queries, which retrieve objects that match certain keywords within a spatial region. To ensure effective retrieval, various extensions have been proposed, including tolerating errors in keyword matching and auto-completion via prefix matching. Our goal is to devise a unifying strategy for processing different variants of the spatial keyword query. We adopt the auto-completion paradigm, which issues the initial query as a prefix matching query. If there are few matching results, other variants are performed as forms of relaxation that reuse the processing done in the earlier phase. The types of relaxation include spatial region expansion and exact/approximate prefix/substring matching. Moreover, since the auto-completion paradigm allows appending characters after the initial query, we study how the query processing done for the initial query and its relaxations can be reused in such cases. Compared to existing works, which process variants of the spatial keyword query as new queries over different indexes, we offer a more compelling route to efficient and effective spatial keyword search. Extensive experiments substantiate our claims.
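The relaxation cascade described above can be sketched as follows. This is a hypothetical illustration, not the actual index or algorithm from our papers: the object list, region representation and result threshold are invented for the example, and a real system would use a spatial-textual index and result reuse rather than repeated linear scans.

```python
# Hypothetical sketch of the relaxation cascade: start with exact prefix
# matching inside a spatial region, then relax step by step (region
# expansion, then approximate prefix matching) until enough results appear.

def edit_distance(a, b):
    # classic dynamic-programming edit distance
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def in_region(pt, region):
    (x, y), ((x1, y1), (x2, y2)) = pt, region
    return x1 <= x <= x2 and y1 <= y <= y2

def expand(region, delta):
    # spatial relaxation: grow the query rectangle on every side
    (x1, y1), (x2, y2) = region
    return ((x1 - delta, y1 - delta), (x2 + delta, y2 + delta))

def search(objects, prefix, region, min_results=2):
    # Phase 1: exact prefix matching inside the original region.
    hits = [o for o in objects
            if in_region(o["loc"], region) and o["text"].startswith(prefix)]
    if len(hits) >= min_results:
        return hits
    # Phase 2: spatial relaxation -- expand the query region.
    wider = expand(region, 5)
    hits = [o for o in objects
            if in_region(o["loc"], wider) and o["text"].startswith(prefix)]
    if len(hits) >= min_results:
        return hits
    # Phase 3: approximate prefix matching (one edit error allowed).
    return [o for o in objects
            if in_region(o["loc"], wider)
            and edit_distance(o["text"][:len(prefix)], prefix) <= 1]

objects = [
    {"text": "cafe",   "loc": (1, 1)},
    {"text": "caffe",  "loc": (2, 2)},
    {"text": "camera", "loc": (20, 20)},
]
print(search(objects, "cafe", ((0, 0), (3, 3))))
```

With this toy data, the exact phase finds only "cafe", so the query relaxes until the misspelled "caffe" is also admitted.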

Social Networks (Search, Analytics and Management)
Vision: a general database to manage, search and analyze social network data
- Database Design and Implementation
     - data modeling
     - query optimization
     - concurrency control
- In-database Search
     - Personalized + RealTime Search: search within your own "circles" and get the latest results published just one second ago

- In-database Social Network Analysis (SNA)
     - link prediction
This challenge requires a model for social network data, and graphs are the obvious candidate, but: sonSchema -- a social network is not a graph! [10]
A graph is a static, syntactic model that does not capture the dynamics and semantics of a social network; this is evident from: sonLP -- social network link prediction by principal component regression. [11]

We have designed a general search engine framework that is able to provide Personalized + Realtime Search in social networks [13].

Our ambition is to build a sonSchema-based open-source system that replaces MySQL as the default database management system for social network data: sonSQL -- an extensible relational DBMS for social network start-ups. [12]


Effective Keyword Search over (semi-)structured data: XML DB & Relational DB
- Search intention identification: search target & search constraints
- Data-driven query suggestion/refinement
- Enabling data exploration through visualizations of query results that are as easy to use as Google Maps

[1,2,3] studied how to provide an effective search engine for keyword search over XML data.
[1] is the first work that explicitly points out the essential difference between web search and database search. In particular, keyword search over databases brings several unique challenges: (1) The target that a user query intends to search for is usually unknown or implicit, whereas in web search the target is, without doubt, unstructured documents. (2) The keyword ambiguity problem: a keyword can appear either as part of the metadata (e.g., a table/attribute name in an RDBMS, or a tag name in an XML database) or as data in the database; a keyword can appear in the text values of different metadata parts and carry different meanings; and a keyword can appear as metadata in different parts of the database (e.g., in an RDBMS, different tables may contain attributes with the same name but different meanings; in an XML database, nodes with the same tag name may have different meanings). Keyword ambiguity further prevents the search engine from identifying the constraints through which a user query intends to search. (3) The structural information in the database has to be taken into account when devising the result matching semantics and the result ranking scheme.

[1] is also the first work that identifies and addresses the above three challenges, and [3] is a follow-up that further enhances the framework. In particular, we designed heuristic-based solutions to find the promising search target, and further embedded the structural information of XML data into the design of an IR-style result scoring method. I have been investigating how to integrate DB and IR techniques in a seamless way to enable effective keyword query processing over databases, and this is recognized as a promising direction. Another important impact: our approach is general in that it can handle both semi-structured and unstructured data, so the whole framework can easily reduce to the web search scenario and be incorporated into existing web search engines seamlessly. [1] was nominated as one of the 8 best papers of ICDE 2009 and invited for an extended version published in the TKDE special issue "Best Papers of ICDE 2009".

As we can see from [1,3], the approach works under the assumption that the database contains results matching the user's query. But what about the case when users do not get their expected results? In particular, if what they search for is unavailable in the database, the system returns an empty result or, worse, erroneous mismatched results. We call this the MisMatch problem. As a complement to the search engine built in [1], [2] solves the MisMatch problem in the context of XML keyword search. Our solution is based on two novel concepts that we introduce: Target Node Type and Distinguishability. Using these concepts, we develop a low-cost post-processing algorithm on the results of query evaluation that detects the MisMatch problem and generates helpful suggestions to users. Our approach has three noteworthy features: (1) for queries with the MisMatch problem, it generates an explanation, suggested queries and their sample results as output, helping users judge whether the MisMatch problem is solved without reading all query results; (2) it is portable, as it works with any LCA-based matching semantics and is orthogonal to the choice of result retrieval method; (3) it is lightweight, occupying only a very small proportion of the whole query evaluation time. Extensive experiments on three real datasets verify the effectiveness, efficiency and scalability of our approach. A search engine called XClear has been built and is available at http://xclear.comp.nus.edu.sg.

Efficient Database Keyword Search
[4,5,6] represent my continuous efforts to address the efficiency issue in building a keyword search engine, i.e., how to efficiently retrieve all matching results. In the context of XML keyword search, smallest lowest common ancestor (SLCA) and exclusive lowest common ancestor (ELCA) are two widely adopted matching semantics, aiming to find the smallest subtrees of the XML data tree that contain all keywords. Most existing optimizations for result retrieval try to skip visits to keyword match nodes (along the keyword inverted lists) that contribute to an already-found SLCA/ELCA node. Our optimization in [5,6] takes a different route: we assign each node a unique ID, namely its pre-order number, and propose a new kind of inverted index in which, for each keyword ki, the inverted list consists of all nodes that contain ki in their subtrees. As a result, SLCA and ELCA computation can be cast as a variant of the set intersection problem; our solution works for both SLCA and ELCA, and it outperforms existing works by about two orders of magnitude. Subsequently, as a system work, we need to materialize the final results (in the form of subtrees rooted at the above SLCA/ELCA nodes), so we propose an efficient approach [6] that integrates seamlessly with the aforementioned SLCA/ELCA node retrieval methods; it received a best paper nomination at DASFAA 2012.
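To give a flavor of the idea, here is a toy illustration (an assumed sketch, not the IDList index or the algorithms in [5,6]) of casting SLCA computation as set intersection: each keyword's inverted list holds every node whose subtree contains that keyword, so the nodes containing all keywords are exactly the intersection of the lists, and the SLCA nodes are those candidates with no proper descendant also in the intersection. The tree and helper functions are invented for the example.

```python
# toy XML tree: node id -> (parent id, keywords directly at this node)
nodes = {
    0: (None, set()),        # root
    1: (0, set()),           # first <paper>
    2: (1, {"xml"}),         #   <title>
    3: (1, {"bao"}),         #   <author>
    4: (0, set()),           # second <paper>
    5: (4, {"xml"}),         #   <title>
    6: (4, {"ling"}),        #   <author>
}

def descendants_or_self(n):
    # transitive closure over parent pointers (fine for a toy tree)
    out, changed = {n}, True
    while changed:
        changed = False
        for m, (p, _) in nodes.items():
            if p in out and m not in out:
                out.add(m)
                changed = True
    return out

# subtree-containment inverted lists: node a is in inverted[k] iff some
# node in a's subtree (including a itself) contains keyword k
inverted = {}
for n, (_, kws) in nodes.items():
    for k in kws:
        for a in [m for m in nodes if n in descendants_or_self(m)]:
            inverted.setdefault(k, set()).add(a)

def slca(keywords):
    # intersection: nodes whose subtree contains *all* the keywords
    cand = set.intersection(*(inverted[k] for k in keywords))
    # SLCA = candidates with no proper descendant that is also a candidate
    return sorted(n for n in cand
                  if not any(d in cand for d in descendants_or_self(n) - {n}))

print(slca(["xml", "bao"]))   # the first <paper> subtree, node 1
```

Real implementations replace the naive closures with sorted lists of pre-order ids and subtree intervals so that the intersection and descendant checks run in near-linear time.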

Efficient Structured Query Processing over (semi-)structured Data
- index design
- labeling scheme
- non-answer and duplicate-answer pruning

Besides keyword queries, we also have a thorough study of how to efficiently process structured queries over XML data, e.g., XPath queries; [7] is a good summary of our continuous efforts on this topic.

Both keyword query processing and structured query processing rely on one core component, the labeling scheme for XML data, because without it we have no way to determine the parent-child and ancestor-descendant relationships between nodes in the XML data. We observe that existing dynamic labeling schemes, however, often sacrifice query performance and introduce additional labeling cost to facilitate arbitrary updates, even when the documents actually seldom get updated. Since the line between static and dynamic XML data is often blurred in practice, we believe it is important to design a labeling scheme that is compact and efficient regardless of whether the documents are frequently updated or not. In [8], we propose a novel labeling scheme called DDE (for Dynamic DEwey) which is tailored for both static and dynamic XML documents. For static documents, the labels of DDE are the same as those of Dewey, which yields compact size and high query performance. When updates take place, DDE can completely avoid re-labeling, and its label quality is the most resilient to the number and order of insertions compared to existing approaches. In addition, we introduce Compact DDE (CDDE), which is designed to optimize the performance of DDE for insertions. Both DDE and CDDE can be incorporated into existing systems and applications based on the Dewey labeling scheme with minimal effort.
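For readers unfamiliar with Dewey labels, the following minimal sketch shows what such a labeling scheme buys you (this is plain Dewey, not the DDE or CDDE scheme itself): structural relationships between nodes reduce to simple comparisons on label components. Note that plain Dewey, unlike DDE, may require re-labeling when nodes are inserted.

```python
# Dewey labels as integer tuples: a node's label is its parent's label
# extended by the node's position among its siblings.

def is_ancestor(a, b):
    # a is an ancestor of b iff a's label is a proper prefix of b's
    return len(a) < len(b) and b[:len(a)] == a

def is_parent(a, b):
    # parent-child is the ancestor test restricted to one extra component
    return len(a) + 1 == len(b) and b[:len(a)] == a

def doc_order(a, b):
    # pre-order document order is just lexicographic order on labels
    return -1 if a < b else (1 if a > b else 0)

root      = (1,)
chapter   = (1, 2)
section   = (1, 2, 3)
paragraph = (1, 2, 3, 1)

assert is_ancestor(root, paragraph)
assert is_parent(section, paragraph)
assert not is_ancestor(chapter, (1, 3))   # (1,3) is a sibling subtree
assert doc_order(chapter, section) == -1  # chapter precedes section
```

The catch, as discussed above, is that inserting a node between two siblings labeled (1, 2) and (1, 3) forces re-labeling under plain Dewey; avoiding that re-labeling is exactly what DDE is designed for.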


Meta-data management over database
 
- provenance data: storage, management and usability tracking

[9] studies the data quality problem, which is important for extracting and tracking unexpected results when working with heterogeneous or uncertain data. Provenance information is vital in many application areas, as it helps explain data lineage and derivation. Understanding the provenance of data has become exceedingly important due to the large number of sources, the diversity of formats and the sheer volume of data that current business and scientific applications have to deal with. For example, in scientific computing, as in many other areas, provenance is vital for establishing trust in and the correctness of results; in database environments it can help update views, explain unexpected results, and assist with data integration; in databases with uncertainty, it can be used to track correlations between probabilistic variables.

My research goal in this direction is to develop provenance data management frameworks that can be integrated into the database management system. In particular, my concerns are: (1) the expressiveness of the designed provenance semantics, which should explain not only which source data a result comes from, but also how the result is derived from that source data; (2) novel data structures that minimize the storage cost of provenance data; (3) efficient support for evaluating provenance tracking queries. 
As a start, [9] focuses on designing a framework for storing fine-grained provenance data for results derived via database queries. While storage space is of little concern when dealing with high-level provenance, the requirements of storing fine-grained provenance data can be significant, with the size of the provenance data often exceeding the size of the actual data. One way to deal with this is to compute provenance data only when requested, rather than storing it, but a drawback is that without good inverse functions this can be expensive, and it may require intermediate query results to be stored. Therefore, we first propose a provenance tree data structure that matches the query structure and thereby makes it possible to avoid redundant storage of information about the derivation process. We then investigate two approaches for reducing storage costs. The first provides a means of optimizing the selection of query tree nodes where provenance information should be stored. The second exploits logical query rewriting, in particular join reordering, which can be done after the query has already been computed and thus does not impede the efficiency of query execution. The optimization algorithms run in time polynomial in the query size and linear in the size of the provenance information, enabling provenance tracking and optimization without incurring large overheads. Moreover, I built a relational query engine from scratch, supporting the ASPJ operations required for executing SQL-style queries as well as provenance data construction during query execution.
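As a toy illustration of fine-grained provenance propagation (an assumed sketch, not the framework in [9]): each derived tuple carries the set of source tuple ids it was computed from, and the select and join operators propagate these annotations alongside the data.

```python
# source relations, each tuple tagged with a unique provenance id
R = [({"a": 1, "b": 10}, {"r1"}), ({"a": 2, "b": 20}, {"r2"})]
S = [({"b": 10, "c": "x"}, {"s1"}), ({"b": 30, "c": "y"}, {"s2"})]

def select(rel, pred):
    # selection preserves each surviving tuple's provenance unchanged
    return [(t, prov) for t, prov in rel if pred(t)]

def join(rel1, rel2, attr):
    # a joined tuple's provenance is the union of both inputs' provenance
    return [({**t1, **t2}, p1 | p2)
            for t1, p1 in rel1 for t2, p2 in rel2
            if t1[attr] == t2[attr]]

result = join(select(R, lambda t: t["a"] >= 1), S, "b")
for t, prov in result:
    print(t, "derived from", sorted(prov))
# only r1 and s1 agree on b, so the single output tuple traces to {r1, s1}
```

Storing such annotations eagerly for every operator is exactly what blows up storage for fine-grained provenance; the provenance tree in [9] chooses which operator nodes actually need to materialize them.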

References
[1] Z.F. Bao, T.W. Ling, B. Chen and J.H. Lu. Effective XML Keyword Search With Relevance Oriented Ranking. ICDE 2009, full paper. (Best Papers Award)

[2] Y. Zeng, Z.F. Bao, H.V. Jagadish, G.L. Li and T.W. Ling. Breaking out the Mismatch Trap. ICDE 2014, full paper.

[3] Z.F. Bao, J.H. Lu, T.W. Ling and B. Chen. Towards an Effective XML Keyword Search. TKDE 2010.

[4] J.F. Zhou, Z.F. Bao, Z.Y. Chen and T.W. Ling. Fast Result Enumeration for Keyword Queries on XML Data. DASFAA 2012, full paper. (One of the Five Best Papers)

[5] J.F. Zhou, Z.F. Bao, W. Wang, J.J. Zhao and X.F. Meng. Efficient Query Processing for XML Keyword Queries based on the IDList Index. VLDB Journal, accepted in 2013.

[6] J.F. Zhou, Z.F. Bao, W. Wang, T.W. Ling, Z.Y. Chen, X.D. Lin and J.F. Guo. Fast SLCA and ELCA Computation for XML Keyword Queries based on Set Intersection. ICDE 2012, full paper.

[7] J.H. Lu, T.W. Ling, Z.F. Bao and C. Wang. Extended Tree Pattern Matching: Theories and Algorithms. TKDE 2010, regular paper.

[8] L. Xu, T.W. Ling, H.Y. Wu and Z.F. Bao. DDE: From Dewey to a Fully Dynamic XML Labeling. SIGMOD 2009, full paper.

[9] Z.F. Bao, H. Koehler, X.F. Zhou and S. Sadiq. Efficient Provenance Storage For Relational Queries. CIKM 2012, full paper.

[10] Z.F. Bao, Y.C. Tay and J.B. Zhou. sonSchema: A Conceptual Schema for Social Networks. ER 2013: 197-211

[11] Z.F. Bao, Y. Zeng and Y.C. Tay. sonLP: social network link prediction by principal component regression. ASONAM 2013: 364-371.

[12] Z.F. Bao, J.B. Zhou and Y.C. Tay. sonSQL: An Extensible Relational DBMS for Social Network Start-Ups. ER 2013: 495-498.

[13] Y.C. Li, Z.F. Bao, G.L. Li and K.L. Tan. Real Time Personalized Search over Social Networks. ICDE 2015: 639-650.
