Web Knowledge Extraction for Search Task Simplification

Latest recommended article published on 2021-05-30 14:50:18

冯颉

Tags: search engine, structured information, web page data extraction, query intent, knowledge base

BACKGROUND

Search engines provide a valuable tool to users seeking information on the web. Traditional search engines provide a means for a user to enter a search query, and a display to provide the search results to the user. For example, a user may enter a search query into a query text input box, and click a search button or other control to request execution of the search query. The search engine may then provide a list of various web sites that are the results of the search, indicated by Uniform Resource Locators (URLs) or other identifying information. Unfortunately, search result lists may be lengthy and/or noisy, making it difficult for a user to find desired information.

SUMMARY

Techniques are described for generating structured information from semi-structured web pages, and retrieving the structured information in response to a user query that indicates a query intent. The structured information is automatically extracted offline from semi-structured web pages that may be noisy and/or complex, through the use of an auto wrapper solution that is noise tolerant, and scalable to deal with large amounts of data. Extraction of the structured information includes transforming the web page data into lists of tag path text items based on the document object model (DOM) of each page, and determining tag path text occurrence vectors and tag path text position vectors from the DOM trees. These vectors are employed to determine root templates and detail templates for the web pages. Structured information is generated in tabular form based on the root and detail templates. The structured information is stored in a knowledge base or other data repository and provided in response to a user search query with a user intent. Offline extraction and storage of structured information in a knowledge base enables the information to be provided more readily in response to online user search queries.

Extraction of structured information may also include a pre-processing stage in which one or more clusters of pages are determined for the input web pages, based on measured similarities between the pages. The clusters may be determined based on similar elements in the tag path text data of the pages. A minimum size threshold may be applied to the clusters, such that clusters below a threshold number of pages are removed and not used in subsequent processing.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same reference numbers in different figures indicate similar or identical items.

FIG. 1 depicts example structured information that may be presented to a user in response to an example comparison query.

FIG. 2 is a schematic diagram depicting an example environment in which embodiments may operate.

FIG. 3 is a diagram of an example computing device, which may be deployed as part of the example environment of FIG. 2 according to embodiments.

FIG. 4 depicts a flow diagram of an illustrative process for determining structured information based on semi-structured web pages, according to embodiments.

FIG. 5 depicts a flow diagram of an illustrative process for web page clustering, according to embodiments.

FIG. 6 depicts a flow diagram of an illustrative process for providing structured information based on a determined intent for a user query, according to embodiments.

FIGS. 7A-7H depict example collections of web site data in various stages of processing for determining structured information, according to embodiments.

DETAILED DESCRIPTION

Overview

Embodiments described herein provide for the automatic extraction of structured information from semi-structured web pages. Extraction of structured information may be performed offline by one or more server devices. In some embodiments, such server devices are dedicated to a task of web knowledge extraction to provide structured information. In other embodiments, the extraction of structured information occurs on devices that also perform other functions, such as providing a web search engine. After extraction, the structured information may be stored in a knowledge base or other data storage mechanism, and retrieved in response to user queries.

In some embodiments, structured information includes lists, tables, graphs, or other digests of information, presented in a format such that a user may more readily find useful information. An example of structured information is depicted in FIG. 1. This example shows structured information 100 that may be presented in response to a user search query seeking information to compare two professional athletes, “Bob ABC” and “Sam XYZ.” In this example, the user may have entered the query “Bob ABC vs. Sam XYZ.” The example structured information 100 includes an image 102 and biographical information 104 associated with Bob ABC, and an image 106 and biographical information 108 associated with Sam XYZ. The example structured information 100 further includes a tabular comparison 110 of season statistics for each of the two players. Such structured information presented in a summarized or digest format enables the user to see at a glance the desired comparison information, and may free the user from a time-consuming search through multiple web sites in a search results list.

In some embodiments, extraction of structured information includes transforming the document object model (DOM) tree of one or more web pages to form a list of tag path text items for each page. As used herein, tag path text data refers to the full path from the root of the DOM tree to a tag, coupled with the text data associated with that tag. The tag path text data items may then be employed to determine tag path text occurrence vectors and tag path text position vectors for data items in the tag path text data for the pages. The tag path text occurrence vectors are used to determine a root template that includes those data items that are present in more than a certain threshold number of the pages, and that occur once in those pages where they are present. The root template is then used to determine data blocks in the web page data, and detail templates are determined recursively through analysis of the tag path text position vectors. The structured information is extracted from the root template and detail templates, and then stored in a knowledge base.
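The root template selection rule described above can be illustrated with a minimal sketch. It assumes the occurrence vectors are already available as a dictionary mapping each tag path text item to its per-page counts; the function name `root_template` and the parameter `min_pages` are hypothetical, and the threshold value is left to the caller because the text does not specify one.

```python
def root_template(vectors, min_pages):
    """Select candidate root template items (illustrative sketch only).

    vectors:   dict mapping a tag path text item to its occurrence
               vector [f_1, ..., f_n], one count per page in the cluster.
    min_pages: assumed threshold; an item must be present in MORE than
               this many pages to qualify.
    """
    selected = []
    for item, freqs in vectors.items():
        # Counts for the pages where the item is actually present.
        present = [f for f in freqs if f > 0]
        # Present in enough pages, and exactly once wherever present.
        if len(present) > min_pages and all(f == 1 for f in present):
            selected.append(item)
    return selected


# Made-up occurrence vectors over a cluster of three pages.
example = {
    "a": [1, 1, 1],  # once in every page: selected
    "b": [1, 0, 0],  # present in too few pages: rejected
    "c": [2, 1, 1],  # repeated within a page: rejected
    "d": [1, 1, 0],  # once in two pages: selected when min_pages is 1
}
print(sorted(root_template(example, min_pages=1)))
```

Items that repeat within a page, such as "c" here, are excluded from the root template; per the description above, repeated content is handled through the data blocks and detail templates.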

Some embodiments include a pre-processing phase of clustering the one or more web pages based on determined similarities between at least some of the web pages. This clustering may measure similarities in the tag path text data of the pages, and determine one or more clusters of pages. In some embodiments, clusters smaller than a minimum number of pages may be removed and not employed in further processing.
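As a rough illustration of this pre-processing phase, the sketch below shows one possible similarity measure and the minimum-size filter. Jaccard similarity over sets of tag path text items is an assumption made here for illustration; the text does not name a specific measure. The function names `jaccard` and `drop_small_clusters` and the parameter `min_size` are likewise hypothetical.

```python
def jaccard(items_a, items_b):
    """Jaccard similarity of two pages' tag path text items.

    This measure is an assumed stand-in; any similarity over tag path
    text data could be substituted.
    """
    a, b = set(items_a), set(items_b)
    union = a | b
    # Two empty pages are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def drop_small_clusters(clusters, min_size):
    """Keep only clusters containing at least `min_size` pages."""
    return [cluster for cluster in clusters if len(cluster) >= min_size]


# Made-up clusters of page identifiers.
clusters = [["p1", "p2", "p3"], ["p4"]]
print(drop_small_clusters(clusters, min_size=2))
print(jaccard({"x", "y", "z"}, {"y", "z", "w"}))
```

Filtering after clustering keeps the later template steps from being driven by a handful of atypical pages.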

Example Environment

FIG. 2 shows an example environment 200 in which embodiments operate. As shown, the various devices of environment 200 communicate with one another via one or more networks 202 that may include any type of networks that enable such communication. For example, networks 202 may include public networks such as the Internet, private networks such as an institutional and/or personal intranet, or some combination of private and public networks. Networks 202 may also include any type of wired and/or wireless network, including but not limited to local area networks (LANs), wide area networks (WANs), Wi-Fi, WiMax, and mobile communications networks (e.g. 3G, 4G, and so forth). Networks 202 may utilize communications protocols, including packet-based and/or datagram-based protocols such as internet protocol (IP), transmission control protocol (TCP), user datagram protocol (UDP), or other types of protocols. Moreover, networks 202 may also include a number of devices that facilitate network communications and/or form a hardware basis for the networks, such as switches, routers, gateways, access points, firewalls, base stations, repeaters, backbone devices, and the like.

Environment 200 further includes one or more client device(s) 204 associated with web user(s). Client device(s) 204 may include any type of computing device that a web user may employ to send and receive information over networks 202. For example, client device(s) 204 may include, but are not limited to, desktop computers, laptop computers, tablet computers, wearable computers, media players, automotive computers, mobile computing devices, smart phones, personal data assistants (PDAs), game consoles, mobile gaming devices, set-top boxes, and the like. Client device(s) 204 generally include one or more applications that enable a user to send and receive information over the web and/or internet, including but not limited to web browsers, e-mail client applications, chat or instant messaging (IM) clients, and other applications. Such applications may include functionality for interacting with a search engine. For example, a browser or other application installed on a client device may enable the user to interact with a search engine through a user interface.

As shown, environment 200 may further include one or more web server device(s) 206. Briefly stated, web server device(s) 206 include computing devices that are configured to serve content or provide services to users over network(s) 202. Such content and services include, but are not limited to, hosted static and/or dynamic web pages, social network services, e-mail services, chat services, games, multimedia, and any other type of content, service or information provided over networks 202.

In some embodiments, web server device(s) 206 may collect and/or store information related to online user behavior as users interact with web content and/or services. For example, web server device(s) 206 may collect and store data for search queries specified by users using a search engine to search for content on networks 202. Moreover, web server device(s) 206 may also collect and store data related to web pages that the user has viewed or interacted with, the web pages identified using an IP address, uniform resource locator (URL), uniform resource identifier (URI), or other identifying information. This stored data may include web browsing history, cached web content, cookies, and the like.

In some embodiments, users may be given the option to opt out of having their online user behavior data collected, in accordance with a data privacy policy implemented on one or more of web server device(s) 206, or on some other device. Such opting out allows the user to specify that no online user behavior data is collected regarding the user, or that only a subset of the behavior data is collected for the user. In some embodiments, a user preference to opt out may be stored on a web server device, or indicated through information saved on the user's client device (e.g. through a cookie or other means). Moreover, some embodiments may support an opt-in privacy model, in which online user behavior data for a user is not collected unless the user explicitly consents.

As further shown in FIG. 2, environment 200 may include one or more search server device(s) 208. Search server device(s) 208, as well as the other types of server devices shown in FIG. 2, are described in greater detail herein with regard to FIG. 3. Search server device(s) 208 may be configured (e.g., with a search engine) to receive and execute web search queries entered by users and provide search results. In some embodiments, search server device(s) 208 perform a process for query intent determination such as that described with regard to FIG. 6.

Environment 200 may also include one or more knowledge extraction server device(s) 210 that extract structured knowledge from semi-structured web pages, as described further with regard to FIG. 4. Such structured knowledge may be stored by knowledge extraction server device(s) 210 in a knowledge base or other data storage device. Such a knowledge base may be incorporated into (e.g., local to) the knowledge extraction server device(s) 210, or may be external such as in data storage device(s) 212. The structured information generated by knowledge extraction server device(s) 210 may then be provided to search server device(s) 208 to enable the structured information to be provided to users requesting searches. In some embodiments, knowledge extraction server devices 210 may also perform a web page clustering process such as that described with regard to FIG. 5. In other embodiments, the clustering process may be performed by one or more separate devices.

Environment 200 may further include one or more data storage devices 212, configured to store data related to the various operations described herein. Such storage devices may be incorporated into one or more of the servers depicted, or may be external storage devices separate from but in communication with one or more of the servers. In some embodiments, data storage device(s) 212 may include a knowledge base to store structured information extracted from semi-structured web pages by knowledge extraction server device(s) 210.

In some embodiments, one or more of the server devices depicted in FIG. 2 may include multiple computing devices arranged in a cluster, server farm, cloud computing service, or other distributed or non-distributed groupings of computing devices to share workload. Such groups of servers may be load balanced or otherwise managed to provide more efficient operations. Moreover, although various computing devices of environment 200 are described as clients or servers, each device may operate in either capacity to perform operations related to various embodiments. Thus, the description of a device as client or server is provided for illustrative purposes, and does not limit the scope of activities that may be performed by any particular device. Moreover, in some embodiments the functionality of devices depicted in FIG. 2 may be combined, e.g. such that the operations of knowledge extraction server device(s) 210 and search server device(s) 208 may occur on a single device or cluster of devices.

Example Computing Device Architecture

FIG. 3 depicts a block diagram for an example computing device architecture for various devices depicted in FIG. 2. As shown, computing device 300 includes processing unit 302. Processing unit 302 may encompass multiple processing units, and may be implemented as hardware, software, or some combination thereof. Processing unit 302 may include one or more processors. As used herein, processor refers to a hardware component. Processing unit 302 may include computer-executable, processor-executable, and/or machine-executable instructions written in any suitable programming language to perform various functions described herein. In some embodiments, processing unit 302 may further include one or more graphics processing units (GPUs).

Computing device 300 further includes a system memory 304, which may include volatile memory such as random access memory (RAM), static random access memory (SRAM), dynamic random access memory (DRAM), and the like. System memory 304 may further include non-volatile memory such as read only memory (ROM), flash memory, and the like. System memory 304 may also include cache memory. As shown, system memory 304 includes one or more operating systems 306, and one or more executable components 310, including components, programs, applications, and/or processes, that are loadable and executable by processing unit 302. System memory 304 may further store program/component data 308 that is generated and/or employed by executable components 310 and/or operating system(s) 306 during their execution.

Executable components 310 include one or more of various components to implement functionality described herein, on one or more of the servers depicted in FIG. 2. For example, executable components 310 may include a search engine 312 operable to receive search queries from users and perform web searches based on those queries. Search engine 312 may also provide a user interface that allows the user to input a query and view search results. Executable components 310 may also include a clustering component 314 that is configured to perform various tasks related to web page clustering, as described further herein.

In some embodiments, executable components 310 may include a web knowledge extraction component 316. This component may be present, for example, where computing device 300 represents knowledge extraction server device(s) 210. Web knowledge extraction component 316 may be configured to perform various tasks related to the extraction of structured information from semi-structured web pages, as described herein. Executable components 310 may also include a query intent analysis component 318, to perform tasks related to user query intent determination, as described below with reference to FIG. 6. Executable components 310 may further include other components 320.

As shown in FIG. 3, computing device 300 may also include removable storage 330 and/or non-removable storage 332, including but not limited to magnetic disk storage, optical disk storage, tape storage, and the like. Disk drives and associated computer-readable media may provide non-volatile storage of computer readable instructions, data structures, program modules, and other data for operation of computing device 300.

In general, computer-readable media includes computer storage media and communications media.

Computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, and other data. Computer storage media includes, but is not limited to, RAM, ROM, electrically erasable programmable read-only memory (EEPROM), SRAM, DRAM, flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disks (DVDs) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device.

In contrast, communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transmission mechanism. As defined herein, computer storage media does not include communication media.

Computing device 300 may include input device(s) 334, including but not limited to a keyboard, a mouse, a pen, a voice input device, a touch input device, and the like. Computing device 300 may further include output device(s) 336 including but not limited to a display, a printer, audio speakers, and the like. Computing device 300 may further include communications connection(s) 338 that allow computing device 300 to communicate with other computing devices 340, including client devices, server devices, data storage devices, or other computing devices available over network(s) 202.

Example Processes

FIGS. 4-6 depict flowcharts showing example processes in accordance with various embodiments. The operations of these processes are illustrated in individual blocks and summarized with reference to those blocks. The processes are illustrated as logical flow graphs, each operation of which may represent a set of operations that can be implemented in hardware, software, or a combination thereof. In the context of software, the operations represent computer-executable instructions stored on one or more computer-readable storage media. Execution of the computer-executable instructions by one or more processors enables the one or more processors to perform the recited operations. Generally, computer-executable instructions include routines, programs, objects, modules, components, data structures, and the like that perform particular functions or implement particular abstract data types. The order in which the operations are described in the flow diagrams is not intended to be construed as a limitation, and any number of the described operations can be divided into sub-operations, combined in any order, and/or executed in parallel to implement the processes.

FIG. 4 depicts an example process 400 for extracting structured information from one or more semi-structured web pages. In some embodiments, process 400 executes on one or more of the devices shown in FIG. 2, such as knowledge extraction server device(s) 210, and may be executed by one or more of the executable components 310, such as web knowledge extraction component 316. One or more semi-structured web pages 402 are clustered at 404 to form one or more groups of web pages based on similarities between the web pages. In some embodiments clustering may proceed using a K-means type clustering algorithm, as described below with regard to FIG. 5. However, other clustering algorithms may be employed to cluster the web pages. In some embodiments the semi-structured web pages 402 are a set of pages that compose a web site or more than one web site. The pages are semi-structured in that they include both data and metadata that provides at least some structure for the data. Such metadata may be in the form of a markup language such as Hypertext Markup Language (HTML), Extensible Markup Language (XML), or the like.

At 406 the web page data for the one or more clusters is transformed to tag path text occurrence vectors and tag path text position vectors. As used herein, a tag path is a path from the root node to a text node in the DOM tree of a web page. The tag path text is the combination of a tag path and its text node. FIG. 7A depicts a DOM tree 702 of an example web page “Page 1,” and FIG. 7B shows a list of tag path text items derived from the DOM tree 702. In FIG. 7B, the first column 704 is a list of tag paths, and the second column 706 is a list of text nodes corresponding to each tag path. For example, in line 01 of FIG. 7B, the entry in column 704 is the tag path corresponding to text node “Archipelago 1.14,” and the tag path text for this line is that tag path followed by “Archipelago 1.14.” FIG. 7C shows a list of tag paths 708 and corresponding text nodes 710 for another example web page, “Page 2.” In some embodiments, use of a DOM tree description, tag path text data, and/or other digest of web site data enables scalability and the processing of large amounts of web site data.
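The transformation from a page to a list of tag path text items might be sketched as follows. This is a minimal illustration built on Python's standard `html.parser` module, not the implementation described in the figures; the class name `TagPathTextParser`, the function `tag_path_texts`, the rendering of a tag path as concatenated `<tag>` tokens, and the sample HTML are all assumptions.

```python
from html.parser import HTMLParser

# Void elements never receive a closing tag, so they are kept off the path.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathTextParser(HTMLParser):
    """Collects (tag path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, root first
        self.items = []  # (tag_path, text) tuples in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; unclosed inner tags in
        # noisy HTML are discarded along the way.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "".join(f"<{t}>" for t in self.stack)
            self.items.append((path, text))


def tag_path_texts(html):
    """Return the list of (tag path, text) items for one page."""
    parser = TagPathTextParser()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical page fragment; it does not reproduce the markup of FIG. 7A.
page = "<html><body><div><span>Price:</span><br><span>$2.99</span></div></body></html>"
for path, text in tag_path_texts(page):
    print(path, text)
```

Flattening each page into such a list discards attributes and layout, leaving only each text node and the chain of elements above it.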

In some embodiments, a tag path text occurrence vector is a vector of the occurrences of a tag path text item across the pages of a cluster. The tag path text occurrence vector \(V_{tpt}\) for each tag path text may be expressed mathematically as shown in Equation (1):

\[ V_{tpt} = [f_1, f_2, \ldots, f_n] \tag{1} \]

where the length \(n\) of \(V_{tpt}\) is the number of input web pages in the cluster, and \(f_i\) is the occurrence frequency of the tag path text in the \(i\)-th page. For example, based on the example tag path text data items in FIGS. 7B and 7C, the tag path text occurrence vector for the tag path text with text node “Price:” is (1, 1), given that this tag path text occurs once in Page 1 and once in Page 2. The tag path text occurrence vector for the tag path text with text node “$2.99” is (1, 0), given that this particular tag path text occurs once in Page 1, and zero times in Page 2. In some embodiments, a tag path text occurrence vector is calculated for each unique tag path text item in the cluster.
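Equation (1) can be made concrete with a short sketch that counts each tag path text item per page. It assumes each page has already been reduced to a list of (tag path, text) pairs; the function name `occurrence_vectors`, the tag path strings, and the Page 2 price are hypothetical placeholders rather than values from FIGS. 7B and 7C.

```python
from collections import Counter


def occurrence_vectors(pages):
    """Build one occurrence vector per unique tag path text item.

    pages: list of n pages, each a list of (tag_path, text) items.
    Returns a dict mapping each item to [f_1, ..., f_n], where f_i is
    the number of times the item occurs in the i-th page.
    """
    counts = [Counter(page) for page in pages]
    # Every unique item seen anywhere in the cluster.
    all_items = set().union(*counts)
    # A Counter returns 0 for items missing from a page.
    return {item: [c[item] for c in counts] for item in all_items}


# Two hypothetical pages that share a label but differ in price.
page1 = [("<div><span>", "Price:"), ("<div><span>", "$2.99")]
page2 = [("<div><span>", "Price:"), ("<div><span>", "$3.49")]

vectors = occurrence_vectors([page1, page2])
print(vectors[("<div><span>", "Price:")])
print(vectors[("<div><span>", "$2.99")])
```

In this toy cluster the shared label yields a vector with no zeros, while the page-specific price yields a vector with a zero, mirroring the (1, 1) and (1, 0) examples above.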

In some embodiments, a tag path text position vector is a vector of the positions where a tag path text occurs in each page. The tag path text position vector \(P_{tpt}\) for each tag path text may be expressed mathematically as shown in Equation (2):

\[ P_{tpt} = [p_1, p_2, \ldots, p_n] \tag{2} \]

where the length \(n\) of \(P_{tpt}\) is the number of input web pages in the cluster, and \(p_i\) is the set of positions where the tag path text occurs in the \(i\)-th page. For example, based on the example web pages in FIGS. 7B and 7C, the tag path text position vector for “
