Machine Learning CPython library 'VKF'

The previous article described a web server for the Machine Learning system 'VKF' based on Lattice Theory. This paper is an attempt to explain the details of using the CPython library directly. We reproduce working sessions of experiments on the 'Mushroom' and 'Wine Quality' datasets from the UCI Machine Learning repository. The structure of the input files is discussed too.

Introduction

In this paper a new version of a Machine Learning system based on Lattice Theory is presented. The main advantage of this version is a Python interface to efficient algorithms coded in C++.

Nowadays Machine Learning procedures (such as convolutional networks, random forests, and support vector machines) reach a very high level of accuracy, exceeding humans in speech, video, and image recognition. However, they cannot provide arguments to support the correctness of their decisions.

On the other hand, symbolic approaches to Machine Learning (inductive logic programming, cover learning with integer programming) have provably high computational complexity and in practice are not applicable to samples of moderate size.

The proposed approach uses probabilistic algorithms to avoid these drawbacks. The 'VKF method' system applies a modern technique of algebraic Lattice Theory (Formal Concept Analysis) and Probability Theory (especially Markov chains). However, you do not need to understand advanced mathematics to apply the system to intelligent data analysis. The author created the CPython library (vkf.cp38-win32.pyd for OS Windows and vkf.cpython-38-x86_64-linux-gnu.so for Linux) to provide access to the procedures through a Python language interface.

There exists a parent approach, the JSM method of automatic hypothesis generation, introduced in the 1980s by Prof. V.K. Finn. It has since been transformed, in the words of its creator, into a method of automated support of scientific investigations. It uses argumentation logics, checks the stability of generated hypotheses on a sequence of extended samples, and combines the prediction results of different JSM reasoning strategies.

Unfortunately, the author and his colleagues have discovered and investigated some theoretical shortcomings of the JSM method:

  1. The number of hypotheses can be exponentially large with respect to the size of the input data (the training sample) in the worst case (see the sketch after this list).
  2. The problem of detecting large hypotheses is computationally (NP-)hard.
  3. Overfitting is unavoidable and appears in practice.
  4. There are 'phantom' similarities between training examples, where each such parent example has an alternative hypothesis about the cause of the target property.
  5. The probability of the appearance of a stable similarity tends to zero as the number of extensions of the training samples increases.
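
A minimal self-contained sketch of the first shortcoming, using a standard Formal Concept Analysis example (illustrative Python only, not part of the 'vkf' library): for the so-called contranominal scale, the context in which object i has every attribute except attribute i, every subset of objects is the extent of a formal concept, so the number of concepts (candidate similarities) equals 2**n.

from itertools import combinations

def intent(objs, context, n_attrs):
    # attributes shared by all objects in objs
    attrs = set(range(n_attrs))
    for o in objs:
        attrs &= context[o]
    return frozenset(attrs)

def extent(attrs, context, n_objs):
    # objects possessing every attribute in attrs
    return frozenset(o for o in range(n_objs) if attrs <= context[o])

def count_concepts(n):
    # contranominal scale: object i has every attribute except i
    context = [set(range(n)) - {i} for i in range(n)]
    extents = set()
    for k in range(n + 1):
        for objs in combinations(range(n), k):
            extents.add(extent(intent(objs, context, n), context, n))
    return len(extents)

for n in (2, 4, 8):
    print(n, count_concepts(n))  # prints 4, 16, 256, i.e. 2**n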

These results of the author form chapter 2 of his Dr.Science dissertation. The last item was discovered by the author recently but, in his opinion, puts an end to the extended-samples approach.

Finally, the JSM community does not provide the source code of its systems. Moreover, the programming languages used (Fortran and C#) prevent others from using this approach. The single freely distributed version of a 'JSM solver' (written in C++ by Tatyana A. Volkova, robofreak) known to the author contains an annoying error that sometimes leads to a crash.

Initially, the author planned to share the code of the encoder developed for the 'VKF method' system with the JSM community. Therefore, he put all the algorithms that are applicable to both JSM and VKF systems into a separate library (vkfencoder.cp38-win32.pyd under OS Windows or vkfencoder.cpython-38-x86_64-linux-gnu.so under Linux). Unfortunately, these algorithms were not used by the JSM community. The VKF library contains the same algorithms (for example, the vkf.VCA class), but based on filling in the tables directly through the web interface. Here we will use the vkfencoder library.

1 Working with discrete attribute data

We assume that the reader has the MariaDB server and MariaDB Connector/C installed (the default IP address is 127.0.0.1:3306, and the default user 'root' has the password 'toor'). We begin with the installation of the 'vkfencoder' and 'vkf' libraries into the virtual environment 'demo' and with the creation of the empty database 'mushroom' under MariaDB.

krrguest@amd2700vii:~/src$ python3 -m venv demo 
krrguest@amd2700vii:~/src$ source demo/bin/activate 
(demo) krrguest@amd2700vii:~/src$ cd vkfencoder
(demo) krrguest@amd2700vii:~/src/vkfencoder$ python3 ./setup.py build
(demo) krrguest@amd2700vii:~/src/vkfencoder$ python3 ./setup.py install
(demo) krrguest@amd2700vii:~/src/vkfencoder$ cd ../vkf
(demo) krrguest@amd2700vii:~/src/vkf$ python3 ./setup.py build
(demo) krrguest@amd2700vii:~/src/vkf$ python3 ./setup.py install
(demo) krrguest@amd2700vii:~/src/vkf$ cd ../demo
(demo) krrguest@amd2700vii:~/src/demo$ mysql -u root -p
MariaDB [(none)]> CREATE DATABASE IF NOT EXISTS mushroom;
MariaDB [(none)]> exit;

After that the 'mushroom' database exists, the directory ~/src/demo/lib/python3.8/site-packages/vkfencoder-1.0.3-py3.8-linux-x86_64.egg/ contains the library file vkfencoder.cpython-38-x86_64-linux-gnu.so, and the directory ~/src/demo/lib/python3.8/site-packages/vkf-2.0.1-py3.8-linux-x86_64.egg/ contains the library file vkf.cpython-38-x86_64-linux-gnu.so.

Now we start Python 3 and run a VKF experiment with the 'Mushroom' dataset. We assume that the directory ~/src/demo/files/ contains 3 files (mushrooms.xml, MUSHROOMS.train, and MUSHROOMS.rest). The structure of these files will be explained in the next chapter. The first file, mushrooms.xml, determines the structure of the lower semilattices on the values of each attribute describing some feature of mushrooms. The second and third files are (approximately equal) parts of the original file agaricus-lepiota.data, a digitized form of the book «The Audubon Society Field Guide to North American Mushrooms», published in 1981 in New York. The names 'encoder', 'lattices', 'trains', and 'tests' that appear below are the names of the tables in the 'mushroom' database for the encodings, the cover relations on values, the training sample, and the test sample, respectively.

(demo) krrguest@amd2700vii:~/src/demo$ python3
>>> import vkfencoder
>>> xml = vkfencoder.XMLImport('./files/mushrooms.xml', 'encoder', 'lattices', 'mushroom', '127.0.0.1', 'root', 'toor')
>>> trn = vkfencoder.DataImport('./files/MUSHROOMS.train', 'e', 'encoder', 'trains', 'mushroom', '127.0.0.1', 'root', 'toor') 
>>> tst = vkfencoder.DataImport('./files/MUSHROOMS.rest', 'e', 'encoder', 'tests', 'mushroom', '127.0.0.1', 'root', 'toor')

The 'e' in the last two lines corresponds to the edibility of the mushroom (the form in which the target attribute value is presented).

Note that there exists a vkfencoder.XMLExport class that saves the information from the two tables 'encoder' and 'lattices' in an xml file which, after changes are made, can be processed again by the vkfencoder.XMLImport class to replace the original encoding.

Now we run the VKF experiment: create the encoder, load the previously computed hypotheses (if any), compute the given number (100) of additional hypotheses with the specified number (4) of threads, and save them in the 'vkfhyps' table.

>>> import vkf
>>> enc = vkf.Encoder('encoder', 'mushroom', '127.0.0.1', 'root', 'toor')
>>> ind = vkf.Induction()
>>> ind.load_discrete_hypotheses(enc, 'trains', 'vkfhyps', 'mushroom', '127.0.0.1', 'root', 'toor') 
>>> ind.add_hypotheses(100, 4) 
>>> ind.save_discrete_hypotheses(enc, 'vkfhyps', 'mushroom', '127.0.0.1', 'root', 'toor')

It is possible to get a Python list of all non-trivial pairs (attribute_name, attribute_value) for the VKF hypothesis with index ndx:

>>> ind.show_discrete_hypothesis(enc, ndx)

One of the hypotheses for accepting a mushroom as edible has the form

[('gill_attachment', 'free'), ('gill_spacing', 'close'), ('gill_size', 'broad'), ('stalk_shape', 'enlarging'), ('stalk_surface_below_ring', 'scaly'), ('veil_type', 'partial'), ('veil_color', 'white'), ('ring_number', 'one'), ('ring_type', 'pendant')]
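
Such a list is easy to use from Python. Below is a minimal sketch (not part of the 'vkf' API, and simplified by treating similarity of discrete values as plain equality, ignoring the semilattice order): in JSM/VKF-style reasoning a hypothesis votes for an example when it is contained in the example's description.

def hypothesis_fires(hypothesis, example):
    # hypothesis: list of (attribute_name, attribute_value) pairs
    # example: dict {attribute_name: attribute_value}
    return all(example.get(attr) == val for attr, val in hypothesis)

example = {'gill_attachment': 'free', 'gill_spacing': 'close',
           'gill_size': 'broad', 'stalk_shape': 'enlarging',
           'stalk_surface_below_ring': 'scaly', 'veil_type': 'partial',
           'veil_color': 'white', 'ring_number': 'one',
           'ring_type': 'pendant', 'cap_color': 'brown'}
print(hypothesis_fires([('gill_size', 'broad'), ('ring_type', 'pendant')],
                       example))  # True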

Due to the probabilistic nature of the VKF method, this particular hypothesis may not be generated; however, the author has proved that with a sufficiently large number of generated VKF hypotheses a very similar hypothesis will arise that predicts the target property of important test examples almost as well. For more details see chapter 4 of the author's Dr.Science thesis.

The last step is prediction: create a test sample to estimate the quality of the generated hypotheses and simultaneously predict the target property of its elements:

>>> tes = vkf.TestSample(enc, ind, 'tests', 'mushroom', '127.0.0.1', 'root', 'toor')
>>> tes.correct_positive_cases()
>>> tes.correct_negative_cases()
>>> exit()

2 Data structures

2.1 Lattice structures on discrete attribute data

The class vkfencoder.XMLImport parses an xml file (describing the order between attribute values) and creates and fills in two tables: 'encoder' (which converts each value into a bit string) and 'lattices' (which stores the relations between the values of each attribute).

The structure of the input file must be similar to the following:

<?xml version="1.0"?>
<document name="mushrooms_db">
  <attribute name="cap_color">
    <vertices>
      <node string="null" char='_'></node>
      <node string="brown" char='n'></node>
      <node string="buff" char='b'></node>
      <node string="cinnamon" char='c'></node>
      <node string="gray" char='g'></node>
      <node string="green" char='r'></node>
      <node string="pink" char='p'></node>
      <node string="purple" char='u'></node>
      <node string="red" char='e'></node>
      <node string="white" char='w'></node>
      <node string="yellow" char='y'></node>
    </vertices>
    <edges>
      <arc source="brown" target="null"></arc>
      <arc source="buff" target="brown"></arc>
      <arc source="buff" target="yellow"></arc>
      <arc source="cinnamon" target="brown"></arc>
      <arc source="cinnamon" target="red"></arc>
      <arc source="gray" target="null"></arc>
      <arc source="green" target="null"></arc>
      <arc source="pink" target="red"></arc>
      <arc source="pink" target="white"></arc>
      <arc source="purple" target="red"></arc>
      <arc source="red" target="null"></arc>
      <arc source="white" target="null"></arc>
      <arc source="yellow" target="null"></arc>
    </edges>
  </attribute>
</document>

The above example corresponds to the ordering between the values of the third attribute 'cap_color' (cap colour) of the 'mushroom' dataset from the Machine Learning repository (University of California, Irvine). Each <attribute> element represents the lattice structure on the values of the corresponding attribute; in our example it corresponds to the discrete 'cap_color' attribute. The <vertices> element lists all attribute values. We added the value 'null' to indicate a trivial (absent) similarity. The other values correspond to their descriptions in the accompanying file 'agaricus-lepiota.names'.

The order between the values is represented by the contents of the <edges> element. Each <arc> describes a "more specific/more general" relation between a pair of attribute values. For example, <arc source="pink" target="red"> means that the pink cap of a mushroom is more specific than a red one.

The author proved a theorem about the correctness of the representation generated by the vkfencoder.XMLImport class and stored in the 'encoder' table. His algorithm and the proof of the correctness theorem use a modern part of Lattice Theory known as Formal Concept Analysis. For details, the reader is again referred to the author's dissertation (Chapter 1).
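
A toy sketch of the idea behind such an order-respecting encoding (illustrative only; this is not the library's actual algorithm): represent every value by the set of its generalizations in the 'cap_color' order from mushrooms.xml above, so that intersecting two such sets yields the common generalizations, i.e. the similarity of two values. In the real encoder these sets become bit strings and the intersection becomes a bitwise AND.

arcs = {  # 'cap_color' order: value -> its direct generalizations
    'brown': ['null'], 'buff': ['brown', 'yellow'],
    'cinnamon': ['brown', 'red'], 'gray': ['null'], 'green': ['null'],
    'pink': ['red', 'white'], 'purple': ['red'], 'red': ['null'],
    'white': ['null'], 'yellow': ['null'], 'null': [],
}

def up_set(value):
    # all generalizations of a value, including the value itself
    result, stack = {value}, [value]
    while stack:
        for parent in arcs[stack.pop()]:
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result

# similarity of 'pink' and 'purple' = their common generalizations
print(up_set('pink') & up_set('purple'))  # {'red', 'null'}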

2.2 Sample file structure for discrete attributes

First of all, the reader has two possibilities: either implement their own loader of training and test examples into the database, or use the class vkfencoder.DataImport from the library. In the second case, the reader should take into account that the target attribute must be in the first position and consist of a single character (for example, '+'/'-', '1'/'0' or 'e'/'p').

The training sample must be a CSV file (with comma-separated values) that describes training examples and counter-examples (examples that do not have the target property).

The structure of the input file should be similar to the following:

e,f,f,g,t,n,f,c,b,w,t,b,s,s,p,w,p,w,o,p,k,v,d
p,x,s,p,f,c,f,c,n,u,e,b,s,s,w,w,p,w,o,p,n,s,d
p,f,y,g,f,f,f,c,b,p,e,b,k,k,b,n,p,w,o,l,h,v,g
p,x,y,y,f,f,f,c,b,p,e,b,k,k,n,n,p,w,o,l,h,v,p
e,x,y,b,t,n,f,c,b,e,e,?,s,s,e,w,p,w,t,e,w,c,w
p,f,f,g,f,f,f,c,b,g,e,b,k,k,n,p,p,w,o,l,h,y,g
p,x,f,g,f,f,f,c,b,p,e,b,k,k,p,n,p,w,o,l,h,y,g
p,x,f,y,f,f,f,c,b,h,e,b,k,k,n,b,p,w,o,l,h,y,g
p,f,f,y,f,f,f,c,b,h,e,b,k,k,p,p,p,w,o,l,h,y,g
p,x,y,g,f,f,f,c,b,h,e,b,k,k,p,p,p,w,o,l,h,v,d
p,x,f,g,f,c,f,c,n,u,e,b,s,s,w,w,p,w,o,p,n,v,d
p,x,y,g,f,f,f,c,b,h,e,b,k,k,p,b,p,w,o,l,h,v,g
e,f,y,g,t,n,f,c,b,n,t,b,s,s,p,g,p,w,o,p,k,y,d
e,f,f,e,t,n,f,c,b,p,t,b,s,s,p,p,p,w,o,p,k,v,d
p,f,y,g,f,f,f,c,b,p,e,b,k,k,p,b,p,w,o,l,h,y,p

Each line describes a single training example. The first position contains 'e' or 'p' depending on the edibility/poisonousness of the corresponding mushroom. The other positions (separated by commas) contain either letters (short names) or strings (long names of the corresponding attribute values). For example, the fourth position corresponds to the 'cap_color' attribute, where 'g', 'p', 'y', 'b', and 'e' represent 'gray', 'pink', 'yellow', 'buff', and 'red', respectively. A question mark in the middle of the table indicates a missing value. The values of other (non-target) attributes can also be strings; it is important to note that when saving to the database, spaces in their names will be replaced with '_' characters.
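
For the first option above (a hand-written loader), a minimal sketch might look as follows. It is only an illustration under stated assumptions: the target table must already exist with one column per position, and the 'pymysql' package stands in for whatever MariaDB connector the reader prefers.

import csv
import pymysql

def load_examples(csv_path, table, database):
    # one row per example: the target sign first, then the attribute values
    conn = pymysql.connect(host='127.0.0.1', user='root',
                           password='toor', database=database)
    try:
        with conn.cursor() as cur, open(csv_path, newline='') as src:
            for row in csv.reader(src):
                # spaces in long value names become '_', as noted above
                values = [v.replace(' ', '_') for v in row]
                placeholders = ','.join(['%s'] * len(values))
                cur.execute(f"INSERT INTO {table} VALUES ({placeholders})",
                            values)
        conn.commit()
    finally:
        conn.close()

# load_examples('./files/MUSHROOMS.train', 'trains', 'mushroom')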

The file with test examples has the same form. However, the real sign of an example is not known to the VKF system; the system's prediction is compared with it to estimate the quality of the generated hypotheses and their predictive power.

2.3 Sample file structure for continuous attributes

Again, two options are possible: either the reader can create their own loader of training and test samples into the database, or they can use the vkfencoder.DataLoad class from the library. In the second case, the reader should take into account that the target attribute must be in the first position and consist of a natural number (for example, 0, 7, 15).

Training samples should be saved as a csv file (with values separated by some separator) that describes training examples and counter-examples (examples that do not have the target property).

The structure of the input file should be similar to the following:

"quality";"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol"
5;7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4
5;7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8
5;7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8
6;11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8
5;7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4
5;7.4;0.66;0;1.8;0.075;13;40;0.9978;3.51;0.56;9.4
5;7.9;0.6;0.06;1.6;0.069;15;59;0.9964;3.3;0.46;9.4
7;7.3;0.65;0;1.2;0.065;15;21;0.9946;3.39;0.47;10
7;7.8;0.58;0.02;2;0.073;9;18;0.9968;3.36;0.57;9.5

Above are the first few lines of the 'vv_red.csv' file, generated from the 'winequality-red.csv' file of the UCI MLR Wine Quality dataset (on the quality of Portuguese red wines). It is obtained by moving the target attribute from the last column to the first one. Note that when saving to the database, spaces in attribute names will be replaced with '_' characters.
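
This transformation is straightforward to script; a minimal sketch (assuming the two file names above and ';' as the separator):

import csv

# move the target column 'quality' from the last position to the first
with open('winequality-red.csv', newline='') as src, \
     open('vv_red.csv', 'w', newline='') as dst:
    writer = csv.writer(dst, delimiter=';')
    for row in csv.reader(src, delimiter=';'):
        writer.writerow([row[-1]] + row[:-1])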

3 VKF experiment with continuous attribute data

We will demonstrate the work on the Wine Quality dataset from the UCI ML repository. Let's start by creating an empty MariaDB database named 'red_wine'.

(demo) krrguest@amd2700vii:~/src/demo$ mysql -u root -p
MariaDB [(none)]> CREATE DATABASE IF NOT EXISTS red_wine;
MariaDB [(none)]> exit;

As a result, the empty 'red_wine' database will appear.

Then we run Python 3 and carry out a VKF experiment on the Wine Quality dataset. We assume that the directory ~/src/demo/files/ contains the file vv_red.csv. The structure of this file was described in the previous section. The names 'verges', 'complex', 'trains', and 'tests' correspond to the tables in the 'red_wine' database for the thresholds (both on source and on regression-computed attributes), the coefficients of significant logistic regressions between pairs of attributes, and the training and test examples, respectively.

(demo) krrguest@amd2700vii:~/src/demo$ python3
>>> import vkfencoder
>>> nam = vkfencoder.DataLoad('./files/vv_red.csv', 7, 'verges', 'trains', 'red_wine', ';', '127.0.0.1', 'root', 'toor')

The second argument sets a threshold (7 in our case); when it is exceeded, the example is declared positive. The argument ';' specifies the separator (the default one is the comma ',').
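
In other words, the loader binarizes the integer target. A one-line illustration of this rule (a sketch; the strict inequality follows from "when it is exceeded" above):

def target_sign(quality, threshold=7):
    # positive iff the target strictly exceeds the threshold
    return int(quality > threshold)

print([target_sign(q) for q in (5, 6, 7, 8)])  # [0, 0, 0, 1]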

As indicated in the author's previous note (habr.com/en/post/509480), the procedure for continuous attributes differs radically from the case of discrete features.

At first, we compute logistic regressions using the vkf.Join class; their coefficients are saved in the 'complex' table.

>>> import vkf
>>> join = vkf.Join('trains', 'red_wine', '127.0.0.1', 'root', 'toor')
>>> join.compute_save('complex', 'red_wine', '127.0.0.1', 'root', 'toor')
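
The internals of vkf.Join are not public, but a rough sketch of what "logistic regressions between pairs of attributes" could look like is given below (illustrative only; scikit-learn's LogisticRegression stands in for the library's own fitting code).

from itertools import combinations
from sklearn.linear_model import LogisticRegression

def pairwise_logistic(X, y):
    # X: (n_samples, n_features) array of continuous attributes, y: 0/1 target
    models = {}
    for i, j in combinations(range(X.shape[1]), 2):
        clf = LogisticRegression().fit(X[:, [i, j]], y)
        # keep (bias, w_i, w_j) for the attribute pair (i, j)
        models[(i, j)] = (clf.intercept_[0], clf.coef_[0][0], clf.coef_[0][1])
    return models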

Now, using information theory, we compute thresholds by means of the vkf.Generator class; together with the maxima and minima they are saved in the 'verges' table.

>>> gen = vkf.Generator('complex', 'trains', 'verges', 'red_wine', 7, '127.0.0.1', 'root', 'toor')

The fifth argument specifies the number of thresholds on the source attributes. By default (and for a value equal to 0), it is computed as the logarithm of the number of features. Regression attributes get a single threshold.
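
The exact rule used by vkf.Generator is not spelled out here, but a standard information-theoretic way to place a single threshold is sketched below (an assumption-laden illustration, not the library's code): choose the cut on a continuous attribute that maximizes the information gain of the binary target.

import numpy as np

def entropy(y):
    # Shannon entropy of a 0/1 label array
    p = np.bincount(y, minlength=2) / len(y)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def best_threshold(x, y):
    # x: continuous attribute values, y: 0/1 targets of the same length
    order = np.argsort(x)
    x, y = x[order], y[order]
    base, best_cut, best_gain = entropy(y), None, 0.0
    for k in range(1, len(x)):
        if x[k - 1] == x[k]:
            continue  # a cut must separate distinct values
        split = (k * entropy(y[:k]) + (len(x) - k) * entropy(y[k:])) / len(x)
        if base - split > best_gain:
            best_cut, best_gain = (x[k - 1] + x[k]) / 2, base - split
    return best_cut, best_gain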

Now we are ready for the VKF experiment: create the encoders, load previously computed hypotheses (if any), compute the given number (300) of additional hypotheses through the specified number (8) of threads, and save them in the 'vkfhyps' table.

>>> qual = vkf.Qualifier('verges', 'red_wine', '127.0.0.1', 'root', 'toor')
>>> beg = vkf.Beget('complex', 'red_wine', '127.0.0.1', 'root', 'toor')
>>> ind = vkf.Induction()
>>> ind.load_continuous_hypotheses(qual, beg, 'trains', 'vkfhyps', 'red_wine', '127.0.0.1', 'root', 'toor') 
>>> ind.add_hypotheses(300, 8) 
>>> ind.save_continuous_hypotheses(qual, 'vkfhyps', 'red_wine', '127.0.0.1', 'root', 'toor')

It is possible to obtain a Python list of triples (attribute_name, (low_bound, high_bound)) for the VKF hypothesis with index ndx:

>>> ind.show_continuous_hypothesis(qual, ndx)

The last step is prediction: create a test sample to estimate the quality of the generated hypotheses and simultaneously predict the target property of its elements:

>>> tes = vkf.TestSample(qual, ind, beg, 'trains', 'red_wine', '127.0.0.1', 'root', 'toor')
>>> tes.correct_positive_cases()
>>> tes.correct_negative_cases()
>>> exit()

Since we only have one file, here we used the training sample as the test examples.

4 Remark on file uploading in Python

The author uses the aiofiles library to upload files (such as 'mushrooms.xml') to the web server. Here is a snippet of code that might be useful:

import aiofiles
import os
import vkfencoder
from sanic import response  # assumption: the Sanic framework, whose request/response API this handler matches

class Settings:
    UPLOAD_DIR = './files'  # assumption: upload folder used by the web server

async def write_file(path, body):
    # asynchronously write the uploaded bytes to disk;
    # 'async with' closes the file, no explicit close() is needed
    async with aiofiles.open(path, 'wb') as file_write:
        await file_write.write(body)

async def xml_upload(request):
    #determine names of tables
    encoder_table_name = request.form.get('encoder_table_name')
    lattices_table_name = request.form.get('lattices_table_name')
    #Create upload folder if doesn't exist
    if not os.path.exists(Settings.UPLOAD_DIR):
        os.makedirs(Settings.UPLOAD_DIR)
    #Ensure a file was sent
    upload_file = request.files.get('file_name')
    if not upload_file:
        return response.redirect("/?error=no_file")
    #write the file to disk and redirect back to main
    short_name = upload_file.name.split('/')[-1]
    full_path = f"{Settings.UPLOAD_DIR}/{short_name}"
    try:
        await write_file(full_path, upload_file.body)
    except Exception as exc:
        return response.redirect('/?error='+ str(exc))
    return response.redirect('/ask_data')
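
After a successful upload, the saved file can be handed to the encoder exactly as in section 1. A sketch continuing the handler above (the table names come from the submitted form; the database name and credentials mirror the earlier session):

# inside xml_upload, after write_file(full_path, ...) succeeds:
xml = vkfencoder.XMLImport(full_path, encoder_table_name,
                           lattices_table_name, 'mushroom',
                           '127.0.0.1', 'root', 'toor')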

Conclusion

The library vkf.cpython-38-x86_64-linux-gnu.so contains a lot of hidden classes and algorithms. They use advanced mathematical concepts such as the 'close-by-one' operation, lazy computations, coupled Markov chains, transient states of Markov chains, the total variation metric, and others.

In practice, experiments with datasets from the Machine Learning repository (University of California, Irvine) have proven the applicability of the C++ program 'VKF Method' to medium-sized data (the Adult array contains 32560 training and 16280 test examples).

The 'VKF Method' system outperformed the CLIP3 system (cover learning with integer programming, v.3) in terms of prediction accuracy on the SPECT array, where CLIP3 is a program created by the authors of the SPECT data. On the mushrooms array (randomly divided into training and test samples), the 'VKF Method' system reached 100% accuracy (both with respect to the poisonousness and with respect to the edibility of mushrooms). The program was also applied to the Adult array to generate more than one million hypotheses (without failures).

The sources of the CPython library 'vkf' are under review on savannah.nongnu.org. The code of the auxiliary library 'vkfencoder' can be retrieved from Bitbucket.

The author would like to thank his colleagues and students (D.A. Anokhin, E.D. Baranova, I.V. Nikulin, E.Y. Sidorova, L.A. Yakimova) for their support, useful discussions, and joint research. However, as usual, the author is solely responsible for any errors.

Translated from: https://habr.com/en/post/510120/
