Thesis Proposal: A Comprehensive Evaluation Report Management System for High School Students


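I. Significance, current state, and development trends of the research (literature review)

1. Research significance

Comprehensive quality evaluation is an inevitable choice for China as it advances quality-oriented education and deepens curriculum and teaching reform. Since the new round of examination and enrollment reform began in 2014, research on comprehensive quality evaluation has produced positive results, but it still shows a degree of superficiality, insularity, and bias. Future work should deepen theoretical analysis and strengthen in-depth research on comprehensive quality evaluation; follow current issues closely and strengthen research on hot topics in the field; and balance regional differences to make comprehensive quality evaluation fairer.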

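2. Current state of research

Comprehensive quality evaluation in senior high school implements the Party's education policy, carries out quality-oriented education in full, deepens curriculum reform, changes the way talent is cultivated, and promotes students' all-round development. It is meant to give the comprehensive quality evaluation of high school students an important role both in raising student quality and in university admissions. It reflects government policy and meets the needs of deepening quality-oriented education, so the comprehensive quality evaluation system is a specific product of China's national conditions. Abroad, student evaluation has no explicit concept of "comprehensive quality evaluation". However, when the world's major countries and regions evaluate high school students, they look beyond academic achievement tests to overall quality, in forms and with content similar to the Chinese understanding of comprehensive quality. For this reason, this section reviews only domestic research.

Reference [1] systematically summarizes the methods and principles of trajectory representation in trajectory data mining, compares them in detail, and gives the corresponding application scenarios. In reference [2], to make the management of unstructured personal information more natural and efficient on mobile devices, the authors propose a system named "Ruby" that supports retrieval based on handwritten tags and the editing of unstructured notes. In reference [3], to understand regional water resource indicators accurately and to help generate emergency plans after incidents, the authors apply artificial neural networks to develop a regional water environment information management system that provides decision support for water resource monitoring and management. Reference [4] starts from users' multiple QoS constraints and improves the ant colony algorithm for a wide-area information management system; the pheromone is updated by a total utility evaluation function, which achieves resource sharing and better satisfies user requirements. Reference [5] surveys graph anomaly detection broadly, compares the strengths and weaknesses of the algorithms, and summarizes the core techniques and application scenarios. Reference [6] summarizes the principles and techniques of deep-learning-based program comprehension, a direction with considerable research value and promise. Reference [7] analyzes and mines tens of millions of user passwords from Chinese websites from many angles and derives more than ten rules that help recover user passwords, calling attention to password security. Reference [8], in the context of data management systems, proposes an adaptive secondary indexing mechanism for cloud environments that removes the index-creation time required before querying and greatly improves query efficiency. Building on the premise that high-utility sequential patterns can be mined directly by pattern growth, reference [9] proposes a new high-utility sequential pattern mining algorithm, HUSP-FP, that greatly reduces time and space complexity. Reference [10] addresses inaccurate attack scenario construction caused by missing and redundant alerts and proposes a construction method based on a causal knowledge network: real alert data is mined to characterize causal relationships quantitatively, and the resulting network is then used to classify attack scenarios. The method combines the strengths of expert knowledge and data mining and improves the accuracy of attack scenario construction. Reference [11], against the background of doctor-patient disputes, unevenly distributed medical resources, and doubts about the reliability of medical information, presents a general medical information management and service framework with mechanisms for medical information management, service quality evaluation, personalized service recommendation, and medical information sharing, which to some extent speeds up the adoption of modern medical service models. Reference [12] introduces high-utility itemsets on the basis of frequent itemsets and designs a text classification model in which texts mined for necessity and strong association serve as features fed into a neural network; a comparison against 5 baseline algorithms on 6 datasets gives good results.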

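3. Development trends

At present, the complete quality evaluation archives of secondary school students are mostly managed on paper or in hand-maintained electronic files. This hinders the systematic and detailed collection of the archives and does not help students and parents follow how a student develops over time. It has gradually become a bottleneck for the comprehensive management of student quality evaluation.

The new features of modern network technology are well suited to solving this problem. Many regions in China are trying to build archive management platforms for comprehensive student quality evaluation, but these differ considerably in form and content from the actual needs of Anhui Province. Because of these regional differences, this work studies the design and implementation of a complete local archive management system for student quality evaluation from the perspective of local policies and practices.

Comprehensive quality evaluation of students is long-term, continuous work, and it is expected to show the following trends. Evaluation will be based on an increasingly deep understanding of learning and cognitive models, and should place more weight on predictive capability and on deep integration with information technology. Evaluation is rooted in teaching, and its results provide a reference for teaching. Therefore, given the characteristics of quality-oriented education worldwide, a comprehensive quality evaluation system should offer simple, low-cost, and effective recording and evaluation. This is the direction in which the construction and management of such systems will continue to be optimized.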

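II. Main design (research) content

Based on the practical needs of teaching and of school administration, this design takes the complete management of student quality evaluation archives as its core. It involves the academic affairs office, class teachers, the student affairs office, counselors, the school doctor, and all students, and serves educational administration by managing each student's comprehensive quality evaluation archive efficiently throughout their time at school. To provide better interactivity and flexibility and to remove constraints of time and place, the system adopts the popular and mature three-tier B/S (browser/server) architecture and is designed and implemented with IntelliJ IDEA and a MySQL database. The main contents of the design are:

1. Based on the requirements of a comprehensive evaluation report management system for high school students and the current state of information technology, study and analyze in detail the system's business processes, functional requirements, and system functions, and design and implement the whole system with attention to security and performance.

2. Following the basic theory of requirements analysis and system design, combine the three-tier B/S architecture with MySQL database management, apply software engineering principles, and use object-oriented analysis and design to organize the construction of the system modules, so that the system better follows the principle of practical product design.

3. Implement each functional module of the system on the three-tier B/S architecture and, through coding and testing, meet the school's requirements for the comprehensive evaluation report management system and the design expectations.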
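The three-tier split described above can be illustrated with a minimal sketch. It is only an illustration under stated assumptions, not the system's actual design: Python's built-in `sqlite3` stands in for MySQL so that the example runs without a server, and the table layout, the names `EvaluationRepository`, `EvaluationService`, and `render_report`, and the evaluation dimensions are all hypothetical.

```python
import sqlite3
from collections import defaultdict
from typing import Dict, List, Tuple


class EvaluationRepository:
    """Data tier (hypothetical schema; sqlite3 stands in for MySQL)."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS evaluation ("
            "student_id TEXT, term TEXT, dimension TEXT, score REAL)"
        )

    def add(self, student_id: str, term: str, dimension: str, score: float) -> None:
        # Parameterized query: values are never interpolated into the SQL text.
        self.conn.execute(
            "INSERT INTO evaluation VALUES (?, ?, ?, ?)",
            (student_id, term, dimension, score),
        )

    def records_for(self, student_id: str) -> List[Tuple[str, str, float]]:
        cur = self.conn.execute(
            "SELECT term, dimension, score FROM evaluation "
            "WHERE student_id = ? ORDER BY term, dimension",
            (student_id,),
        )
        return cur.fetchall()


class EvaluationService:
    """Business tier: averages a student's scores per dimension."""

    def __init__(self, repo: EvaluationRepository) -> None:
        self.repo = repo

    def summary(self, student_id: str) -> Dict[str, float]:
        by_dimension = defaultdict(list)
        for _term, dimension, score in self.repo.records_for(student_id):
            by_dimension[dimension].append(score)
        return {d: sum(v) / len(v) for d, v in by_dimension.items()}


def render_report(student_id: str, summary: Dict[str, float]) -> str:
    """Presentation tier: plain text here; a browser page in a B/S system."""
    lines = [f"Evaluation report for {student_id}"]
    for dimension in sorted(summary):
        lines.append(f"  {dimension}: {summary[dimension]:.1f}")
    return "\n".join(lines)


# Usage with made-up records.
repo = EvaluationRepository(sqlite3.connect(":memory:"))
repo.add("s001", "2021-1", "academic", 80)
repo.add("s001", "2021-2", "academic", 90)
repo.add("s001", "2021-1", "physical", 70)
service = EvaluationService(repo)
print(render_report("s001", service.summary("s001")))
```

Keeping SQL inside the repository class means that the service and presentation code would not need to change if the storage were switched from the stand-in database to MySQL.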

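III. Research approach and work plan (including key points, difficulties, and the methods to be used)

1. Research approach

This graduation project adopts a three-tier B/S architecture and the MySQL database management system, combined with JavaScript, HTML5, CSS3, and cloud computing technology, to develop a comprehensive evaluation report management system for high school students that ultimately produces a set of authentic and complete evaluations. With the platform, teachers can discover and develop students' potential in many respects, understand their developmental needs, and take a positive educational attitude toward every student; students can come to know themselves, build self-confidence, and grow healthily through self-discipline; parents can follow their child's growth in a timely way and work together with the school. The system reflects the process of students' study and life at school and makes the management of quality evaluation archives more efficient, standardized, and scientific. It helps students better understand their strengths and weaknesses, helps parents make individualized development plans for their children, and promotes the overall quality of school education, in service of the goal of educating and developing students.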

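2. Work plan

Dates (month.day) | Week | Tasks
2.18-2.24 | 1 | Read the graduation project regulations; collect material for the proposal report; study the relevant development technologies to build a solid foundation for the development work.
2.25-3.3 | 2 | Translate the English reference.
3.4-3.10 | 3 | Write the proposal report.
3.11-3.17 | 4 | Prepare the slides for the proposal defense and give the defense.
3.18-3.24 | 5 | Study the technologies the system relies on (B/S architecture, MySQL, JavaScript, HTML5, CSS3), become proficient with the development tools, and gain a thorough understanding of the evaluation report management system.
3.25-3.31 | 6 | Survey the literature and learn the background of the research topic.
4.1-4.7 | 7 | Carry out the system design.
4.8-4.14 | 8 | Begin writing parts of the thesis.
4.15-4.21 | 9 | Prepare for the mid-term review.
4.22-4.28 | 10 | Refine the design of the evaluation report management system with the help of videos and books.
4.29-5.5 | 11 | Write the code.
5.6-5.12 | 12 | Refine the back-end code and polish the front-end interface.
5.13-5.19 | 13 | With the system essentially complete, enter the evaluation data, test the system, and fix the problems found.
5.20-5.26 | 14 | Write the thesis sections on system design and implementation; improve the earlier sections.
5.27-6.2 | 15 | Under the supervisor's guidance, revise, finalize, and print the thesis; submit the thesis and the finished system.
6.3-6.9 | 16 | Prepare for the defense, package the system, and print and submit the documents.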

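IV. Main references read (no fewer than 10, of which no fewer than 7 journal articles; a certain number should be in a foreign language; attach at least one cited foreign-language reference of 3 or more pages together with its translation)

[1] Cao Hanlin, Tang Haina, Wang Fei, Xu Yongjun. Research progress on trajectory representation learning techniques [J]. Journal of Software, 2021, 32(05): 1461-1479. (in Chinese)

[2] Chen Mingxuan, Jiang Yingying, Tian Feng, Dai Guozhong. Ruby: a personal information management system based on mobile devices [J]. Journal of Computer-Aided Design & Computer Graphics, 2010, 22(09): 1475-1482. (in Chinese)

[3] Cui Lei, Zhao Xuan, Wang Ben. Development and application of a regional water environment information management system [J]. Journal of Tsinghua University (Science and Technology), 2008(03): 440-444. (in Chinese)

[4] Li Gang, Wu Zhijun. Task scheduling algorithm for a wide-area information management system based on multiple QoS constraints [J]. Journal on Communications, 2019, 40(07): 27-37. (in Chinese)

[5] Li Zhong, Jin Xiaolong, Zhuang Chuanzhi, Sun Zhi. Survey of research on graph-oriented anomaly detection [J]. Journal of Software, 2021, 32(01): 167-193. (in Chinese)

[6] Liu Fang, Li Ge, Hu Xing, Jin Zhi. Research progress on program comprehension based on deep learning [J]. Journal of Computer Research and Development, 2019, 56(08): 1605-1620. (in Chinese)

[7] Liu Gongshen, Qiu Weidong, Meng Kui, Li Jianhua. Password vulnerability assessment and recovery based on real data mining [J]. Chinese Journal of Computers, 2016, 39(03): 454-467. (in Chinese)

[8] Mou Yanchao, Su Hanchen, Cheng Xu, Li Hongyan, Wang Tengjiao. ASIC: an adaptive secondary indexing mechanism for cloud data management [J]. Journal of Computer Research and Development, 2013, 50(S1): 352-360. (in Chinese)

[9] Tang Huijun, Wang Le, Fan Chengli. High-utility sequential pattern mining algorithm based on pattern growth [J]. Acta Automatica Sinica, 2021, 47(04): 943-954. (in Chinese)

[10] Wang Shuo, Tang Guangming, Wang Jianhua, Sun Yifeng, Kou Guang. Attack scenario construction method based on a causal knowledge network [J]. Journal of Computer Research and Development, 2018, 55(12): 2620-2636. (in Chinese)

[11] Wu Xindong, Ye Mingquan, Hu Donghui, Wu Gongqing, Hu Xuegang, Wang Hao. Key technologies and challenges of pervasive medical information management and services [J]. Chinese Journal of Computers, 2012, 35(05): 827-845. (in Chinese)

[12] Wu Yujia, Li Jing, Song Chengfang, Chang Jun. Text classification method based on a high-utility neural network [J]. Acta Electronica Sinica, 2020, 48(02): 279-284. (in Chinese)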

[13]Wookey Lee,Justin JongSu Song,Charles Cheolgi Lee,Tae-Chang Jo,James J. H. Lee. Graph threshold algorithm[J]. The Journal of Supercomputing,2021(prepublish).

[14]Sun Xianwen,Xu Ruzhi,Wu Longfei,Guan Zhitao. A differentially private distributed data mining scheme with high efficiency for edge computing[J]. Journal of Cloud Computing,2021,10(1).

[15]Chunmei Yuan,Yikun Yang,Yang Liu. Sports decision-making model based on data mining and neural network[J]. Neural Computing and Applications,2020(prepublish).

Foreign-language reference

Graph threshold algorithm

Wookey Lee,Justin JongSu Song,Charles Cheolgi Lee,Tae-Chang Jo,James J. H. Lee

Abstract Recently, more and more information sources are connected together and form complex graphs that can be exploited not only as structured and semi-structured data such as RDB, XML, RDF, or NoSQL, but also as many kinds of unstructured data such as the web, bioinformatics, genometrics, patents, social media, knowledge graphs, IoT, and hidden graphs from deep learning results. State-of-the-art studies have suggested methods of presenting data as hyper-graphs in search queries and finding search results in subgraphs or graph embeddings rather than as a list of individual node results. We study the problem of retrieving top-k graph results with query relevance; that is, given a set of query keywords Q on a graph G, we aim to find a subgraph g of G such that g is highly related to Q and closely linked under g. To consider the relevant graph results and the connectivity simultaneously, we present an effective algorithm, the graph threshold algorithm (GTA), based on the threshold algorithm (TA), which works efficiently on non-graph structures. We show that GTA avoids unnecessary searches under given objective functions, and prove the existence of an upper bound on the size of the subgraph for top-k results, called hopmax, which makes it efficient to find the combined results. Finally, we conduct performance studies on real and synthetic graphs, which demonstrate that our algorithm significantly outperforms conventional approaches with respect to time complexity and cost consumption.

Keywords: Keyword graph, Top-k graph query, Graph TA, GTA

  1. Introduction

Graph-based research has recently attracted wide attention from the information retrieval community, the data processing and bioinformatics societies, the artificial intelligence discipline, the operations research community, and more. Graphs are not only a general format for independent item searches; they also have the advantage of supporting theoretical breakthroughs. First, once the query and the relevant results can be represented by a graph, there is an opportunity to apply the large body of theory inherited from earlier mathematical achievements in graph theory. For example, a web page or a social media item can be represented by a node and the hyperlink between them by an edge, which corresponds exactly to a graph or digraph. In that case, many graph-theoretic tools developed so far can be applied, such as centralities, degrees and diameters, trees and graph theorems, connectivity and diversity, cycles and cuts, matrix calculations, network flow dynamics, sensitivity and duality, etc.

Secondly, if the query results are represented as linked or structured results, they can be more inclusive in their semantic representation and can broaden the target applications. This can enable much richer user interfaces than the conventional document listing interfaces. The representation may thus be enhanced from a simple enumeration of texts to higher-level graphical user interfaces such as Virtual Reality, eXtreme Reality, or Mixed Reality, so that many more kinds of targets can be included, such as bio-molecular structures, medical organisms, chemical reactions, architectural and topological applications, chronological relationships, micro and macro targets, etc. Additionally, for mobile applications with limited screen size and personalization, such representations may help optimize the content and address privacy considerations for mobile users.

Another reason why the graph-based approach is inevitable is that the size of big data increases exponentially. The explosive growth of big data also increases the connectivity among data items, so the graph approach can be advantageous for processing, storing, and retrieving information effectively. For example, there are attacks on PageRank scores that add huge numbers of fake nodes and edges, which can also be analyzed and guarded against by Search Engine Optimization approaches. This phenomenon emphasizes the importance of graph search methodologies that reflect complex structures and big data. Additionally, heterogeneous networks make the problem more difficult, since they include not only different types of objects, so that the links between nodes are not homogeneous, but also a combination of different types of data intermixed from different data sources. Note that conventional approaches have so far considered only the query and a single suitable result.

The main contributions of this work are as follows:

  1. We propose a novel algorithm for the hyper node problem that incorporates the node as well as the graph as a whole top-k graph search target.
  2. We exclude the unnecessary operations by theoretically deriving the size of the graph based on the graph threshold.
  3. We prove GTA to be superior to the existing research methods experimentally on the real-world data.

  2. Related works

Related research is classified into three directions according to the search method and the results. The first is the traditional top-k query processing methods, which efficiently find k individual objects for a set of query terms; this line of work has continued and will continue as new data types appear. The second is the keyword-based graph search methods. For example, given the query keywords Qset = {"painkiller", "PHR", "Pubmed"}, the result of top-k query processing will be a set of single nodes containing all the query keywords. The top-k query processing method works for finding a single list, but it does not provide structural information (graph results). Another reported limitation is that as the number of keywords increases, the results inevitably suffer from either scarcity or duplication.

The third is top-k graph search, which can be sub-classified by query type, search result type, and data type. Many subgraph matching algorithms can be applied to graph-structured data regardless of whether it is unstructured (e.g., Web, Doc), semi-structured (e.g., XML, RDF), or structured (e.g., RDBMS). However, a limitation of these methods is the burden of knowing the structure of the source and constructing an accurate query. The success of keyword search stems from what it does not require: a specialized query language or knowledge of the underlying structure of the data. In reality, it is difficult for users to create structured queries when the schema or the graph topology is unknown.

Our motivation is to consider the graph topology together with the TA approach. TA is excellent on sorted results, but the number of node combinations from the graph and the design of the hyper-graph topology are challenging issues. We suggest a hyper-graph notion and a relevant ranking measure, which can open a new opportunity to incorporate the information retrieval perspective in addition to the data science point of view. One more thing to mention is that the size of the input generally increases exponentially on graphs, which pushes our problem toward applying a deep learning network; this will be our future work.
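The single-node behaviour described for the Qset example can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the node names, the keyword sets, the relevance scores, and the helper names `nodes_containing_all` and `top_k_single_nodes` are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def nodes_containing_all(node_keywords: Dict[str, Set[str]], query: Iterable[str]) -> List[str]:
    """Return the nodes whose keyword set contains every query keyword."""
    wanted = {w.lower() for w in query}
    return [
        node
        for node, keywords in node_keywords.items()
        if wanted <= {w.lower() for w in keywords}
    ]


def top_k_single_nodes(
    node_keywords: Dict[str, Set[str]],
    relevance: Dict[str, float],
    query: Iterable[str],
    k: int,
) -> List[Tuple[str, float]]:
    """Rank the matching single nodes by a (hypothetical) relevance score."""
    matches = nodes_containing_all(node_keywords, query)
    ranked = sorted(matches, key=lambda n: relevance[n], reverse=True)
    return [(n, relevance[n]) for n in ranked[:k]]


# Made-up data: n2 and n4 together cover the query, but neither is returned,
# because single-node top-k search ignores how nodes are linked.
qset = ["painkiller", "PHR", "Pubmed"]
node_keywords = {
    "n1": {"painkiller", "PHR", "Pubmed"},
    "n2": {"painkiller", "PHR"},
    "n3": {"Pubmed", "PHR", "painkiller", "aspirin"},
    "n4": {"Pubmed"},
}
relevance = {"n1": 0.6, "n2": 0.9, "n3": 0.8, "n4": 0.4}
print(top_k_single_nodes(node_keywords, relevance, qset, k=2))
```

In this toy data the highest-scoring node, n2, is dropped because it lacks one keyword, which is the kind of scarcity the paragraph above attributes to growing keyword sets.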

  3. Naïve graph threshold algorithm

In this section, we examine why the threshold algorithm (TA) and its naïve extension are not appropriate for the graph environment. TA can find the optimal solution without reading data whose score falls below a threshold value obtained from a monotonic function. The preliminary procedure for TA is summarized as follows. Table 1 lists the frequently used notations in this paper.

Table 1 The notations with the description

Notation | Description
m | Number of keyword relevance lists
Li | Ranked list for the i-th attribute
o | Object to be scored
F() | Scoring (ranking) function
$\overline{F}()$ | Upper bound score
pi(o) | The value of scoring predicate pi applied to o
$\overline{p}_i$ | Score of the last object seen under sorted access
T | Threshold value
Ak | Top-k answer set

Algorithm 1 NaïveGTA: Naïve Graph Threshold Algorithm

Require: Li, m, o

Ensure: Ak

  T ← 0, result set Ak ← ∅

  for all m do                                                  ▷ Step 1

     Sort Li for each m

     Find pi(o) in Li

  Compute F(o) = F(p1, …, pm)

  while k > |Ak| do

     for all $\overline{p}_i \in L_i$ do                         ▷ Step 2

        T ← $F(\overline{p}_1, \ldots, \overline{p}_m)$

     Ak ← SortedSet({(o, F(o)) | o ∈ Ak})                        ▷ Step 3

  return Ak

Step 1. Do sorted access in parallel to each of the m sorted lists Li. As a new object o is seen under the sorted access in some list, do random access to the other lists to find pi(o) in every other list Li. Predicate pi determines the order of objects in Li. Then compute the score F(o) = F(p1, … , pm) of object o. If this score is among the k highest scores seen so far, then remember object o and its score F(o) (ties are broken arbitrarily, so that only k objects and their scores are remembered at any time).

Step 2. For each list Li, let $\overline{p}_i$ be the score of the last object seen under sorted access. Define the threshold value T to be $F(\overline{p}_1, \ldots, \overline{p}_m)$. As soon as at least k objects have been seen whose scores are at least equal to T, then halt.

Step 3. Let Ak be a set containing the k objects that have been seen with the highest scores. The output is then the sorted set {(o, F(o)) | o ∈ Ak}.

  4. Graph threshold algorithm (GTA)

Table 2 summarizes the notations used in the GTA algorithm.

The biggest problem with NaïveGTA is that, as mentioned in Sect.4, the algorithm can be executed only after finding Gh, which includes all hop node combinations. There are two main factors that can reduce the computational complexity of the graph top-k query problem.

Table 2 Notations used in the GTA algorithm

Notation | Description
T | Threshold value
Ak | Top-k results set
k | Number of top-k
s(v) | Cost function at node v
Li | All attribute values Li (1 ≤ i ≤ m, in descending order)
Ntop | Top cost node when GTA is processing the current iteration
Nvisited | Visited nodes (random access already done)
Hcandi | Hop nodes that are connected to unvisited nodes
Ngate | Single nodes that are connected to the current top nodes in Nvisited

The first is to make Gh as small as possible, and the second is to derive the hop node combinations as fast as possible. GTA addresses the former by generating hop nodes as the algorithm proceeds step by step, which reduces Gh, and addresses the latter by improving the hop node combining operation, which remedies the disadvantages of NaïveGTA. The algorithm is described below step by step.

Step 1. There is a graph G. Each node attribute will be stored as an ordered list Li with m attributes. And find a current top node-set (hereinafter referred to as Ntop) in each list (L1, … , Lm) while sequentially searching for Li sorted in parallel. This is to store the objects to be random access in the current iteration. If all the top node-set is saved, delete all the nodes stored in Ntop in Li. Then, like TA, random access is performed to find the value of all remaining attributes of Ntop. When the random access is completed, it is saved in the hop graph set Gh, where Gh stores both single node and hop node.

Step 2. Call the MakeHopNode function, which is a sub-procedure to create the hop graph. We store the gate node-set (Ngate) using Ntop. Then, find the hop node h using Ngate and Nvisited and put it in Gh. If h is less than or equal to k minus the number of single nodes included in top-k, it is put in Gh (Theorem 3).

Step 3. Modify the threshold T value through Ntop as in the case of TA.

Step 4. Put the value of Gh greater than T into the result set Ak.

Step 5. If the number of results in Ak is larger than k, the algorithm terminates. Otherwise, go back to Step 1 and repeat the iteration.

Step 6. If there is a candidate hop node-set (a non-permanent node and a connected hop node) that can be included in Ak after the result Ak is derived, it can be included in the result set Ak if both of the following conditions are satisfied. First, the score of the hop node must be greater than T. Second, the maximum length of the hop node must be less than or equal to k minus the number of single nodes (Theorem 3).
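The two admission conditions in Step 6 can be written as a small predicate. This is only a sketch under assumptions: the type `HopCandidate`, the function names, and the sample values are hypothetical, the "length" of a hop node is taken to be the number of nodes it combines, and how a hop node's score is computed is left outside the sketch because the excerpt does not define it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class HopCandidate:
    """A candidate hop node: the combined nodes and an already computed score."""
    nodes: Tuple[str, ...]
    score: float


def is_admissible(c: HopCandidate, threshold: float, k: int, n_single: int) -> bool:
    """Step 6: score must exceed T, and length must be at most k - n_single."""
    return c.score > threshold and len(c.nodes) <= k - n_single


def admit_candidates(
    candidates: List[HopCandidate], threshold: float, k: int, n_single: int
) -> List[HopCandidate]:
    """Keep only the candidates that satisfy both Step 6 conditions."""
    return [c for c in candidates if is_admissible(c, threshold, k, n_single)]


# Made-up candidates with k = 4 and one single node already in the result.
candidates = [
    HopCandidate(("v1", "v2"), 7.5),              # passes both conditions
    HopCandidate(("v3", "v4", "v5", "v6"), 9.0),  # too long: 4 > 4 - 1
    HopCandidate(("v7", "v8"), 5.0),              # score not greater than T
]
kept = admit_candidates(candidates, threshold=6.0, k=4, n_single=1)
print([c.nodes for c in kept])
```

The length bound is what lets the search skip large node combinations without scoring them.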

  5. Experiments

We have designed and performed a comprehensive set of experiments to evaluate the search performance of GTA. TA, NaïveGTA, and Star were used as comparison algorithms with GTA. DBLife (3365 nodes, 19,050 arcs) was used as the dataset. Experiments were carried out on the following top-k, number of query size m, graph density, dataset distribution difference, and hop size. Table 3 represents the queries of the experiments.

Table 3 The example of queries

Query | Keywords | Query | Keywords
Q1 | Conference integration | Q6 | XML search
Q2 | Relational search | Q7 | Database tuning
Q3 | Turkey fuzziness | Q8 | Dataspaces information
Q4 | Berkeley dataspaces | Q9 | Indexing ranking
Q5 | Beijing integration | Q10 | Romance educator

(a) execution time with top-k (|h| = 2, m = 3)

(b) execution time with queries (m = 2, |h| = 2)

Fig. 1  Comparison of algorithms execution time according to top-k change and each query

The first experiment measured algorithm performance for increasing k with the number of hops (|h|) fixed at 3. Figure 1a shows the average execution time over the queries as top-k changes. Although there is no significant difference in execution time when |h| is fixed at 3, STAR's execution time grows because its matrix size increases with top-k. The TA algorithm is not much affected by top-k, because it determines the result according to the query size m rather than the number of data items n. The next experiment measured the execution time for each query, taken as the time to find the top-100 results. Similar to the top-k results, the graph search algorithms took about 2–3 times longer than the TA algorithm, and among them GTA showed the best performance.

The second experiment measured algorithm performance for an increasing number of edges with |h| fixed at 2. In Fig. 2a, the execution time of NaïveGTA increases in proportion to the number of edges when making hop nodes. GTA and TA were not significantly affected by the increase in edges.

This experiment compares search speed as the hop size changes. In Fig. 2b, TA takes the same time in all cases because it has no hops and so is unaffected by hop size. For NaïveGTA, the hop node generation time increases greatly by hop size 4, and the execution time rises sharply from hop size 2. GTA is not significantly affected by the number of hops and performs even better.

This experiment compares execution time across data distributions. Figure 3 shows the results for arbitrary data of 500, 1500, and 3000 nodes generated with three data types (uniform, correlated, anti-correlated).

(a)execution time with graph density

(b)execution time with hop size

Fig. 2 Comparison of algorithms execution time according to number of edges and hop size

Fig. 3 Comparison of three data set (Uniform/Correlated/Anti-Correlated)

(a)Comparison of average number of hops (avg.|h|) by number of edge with top-k Change

(b)Comparison of the average number of hops (avg.|h|)by number of nodes according to top-k change

Fig. 4 Comparison of the average hop size according to number of edges and nodes

The point of interest is that performance is worst on uniform data and best on anti-correlated data. Considering anti-correlated data in two dimensions, we can see that it is concentrated around a line of the form y = −x, near the threshold. With this kind of data, the TA result can be derived quickly compared to other data types because the threshold drops quickly without many iterations.

This experiment compares how the number of hop nodes changes with top-k. We varied two variables (nodes and edges). In Fig. 4a, we generated four graphs with 50, 100, 200, and 400 edges on 50 nodes and measured how the average hop count changes for each graph. With 50 nodes and 50 edges (a node-to-edge ratio of 1:1), the average number of hops is 1.5, and it is not significantly affected by top-k. With 100 edges, the node-to-edge ratio is 1:2, but the hop count still did not respond to top-k. Therefore, the number of hops changes with top-k mainly in dense graphs. In Fig. 4b, we fixed 100 edges and varied the number of nodes from 100 to 3000. The results show that the average number of hops stays between 1 and 3 and is not significantly affected by the number of nodes. They also show that the actual average number of hops is significantly smaller than Max Hop. This is because the number of hop nodes is reduced each time a singleton node is added to top-k. Also, as the number of arcs increases, the probability of generating a good result set is high even with a small number of hops, so |h| is much smaller than Max(k). Figure 5 shows the change of max-hop as the dimension m increases. A random graph is generated and the run is repeated 5 times to obtain MaxHop on the graph. When the dimension is less than 10, h increases with the dimension. In this experiment, we can observe that h is not always smaller than top-k, and that the sensitivity to dimension does not increase much after hop 11. That is, as the number of arcs increases, the number of hops increases only slightly on average as the dimension increases.

(a)Comparison of the average number of hops (avg.|h|) by number of edge with top-k Change

(b)Comparison of average number of hops (avg.| h|) by number of nodes according to top-k change

Fig. 5 Comparison of the average hop size according to number of edges and nodes

In Fig. 6a, we repeat the top-k experiment with |h| changed to 4. TA was not influenced by the number of hops and showed good results, and GTA was also not significantly influenced. STAR was somewhat slower than GTA but generally stable. For NaïveGTA, however, the hop node generation time increases, rising greatly at top-2 and exponentially from top-3 onward. Figure 6b shows the performance when k increases significantly (top-200 to top-1000). In general, all the algorithms except NaïveGTA showed good results even up to top-1000. NaïveGTA's execution time increased so greatly that it was excluded from the experimental results.

(a)execution time according to top-k change (h=4)

(b)execution time according to top-k

Fig. 6 Comparison of the execution time according to top-k with hop size 4

  6. Conclusion

We present two algorithms, NaïveGTA and GTA, for solving the top-k graph search problem, and compare them with existing algorithms with respect to k, dimension, and arc/node changes. The top-k graph search problem is bounded by the combination of the number of arcs, where all the hops of the arcs should be enumerated. To solve this problem, we prove that h < max(d, k), so that the top-k graph search problem can be determined by d and k, and the threshold for the maximum hop is quickly decided by |h|. This method not only covers all the hop combinations but also allows subgraph enumeration without reading all the combinations, which is the main difference between NaïveGTA and GTA. We devised a novel algorithm, GTA, that can effectively solve the top-k graph search problem for the general graph environment, which generalizes isolated node problems, even with a high ratio of arcs, as in complete graphs, and with various dimensions. It can be applied to graph-structured data regardless of whether the data is unstructured (e.g., Web, Doc), semi-structured (e.g., XML, RDF), or structured (e.g., RDBMS), so that various fields that need top-k results considering relationships can easily be covered, such as social networks, patents, citation graphs, web SEO and fake references, etc. In the future, as the size of the input combinations grows with huge graph embeddings, we will tackle the issue with deep and shallow learning networks.

