A Brief Introduction to Data Mining Terms

accuracy


Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied to models, accuracy refers to the degree of fit between the model and the data. This measures how error-free the model's predictions are. Since accuracy does not include cost information, it is possible for a less accurate model to be more cost-effective. Also see precision.


activation function


A function used by a node in a neural net to transform input data from any domain of values into a finite range of values. The original idea was to approximate the way neurons fired, and the activation function took on the value 0 until the input became large and the value jumped to 1. The discontinuity of this 0-or-1 function caused mathematical problems, and sigmoid-shaped functions (e.g., the logistic function) are now used.
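The jump from the 0-or-1 rule to a sigmoid can be sketched in a few lines of Python (the function names here are illustrative, not from any particular library):

```python
import math

def step(x):
    """The original 0-or-1 activation: discontinuous at 0."""
    return 1.0 if x > 0 else 0.0

def logistic(x):
    """A smooth sigmoid replacement: maps any real x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# The logistic function approaches 0 for large negative inputs and 1
# for large positive inputs, without the jump of the step function.
print(round(logistic(-6), 3), round(logistic(0), 3), round(logistic(6), 3))
```

For strongly negative inputs the logistic output is near 0, at zero it is exactly 0.5, and for strongly positive inputs it is near 1 — the same overall shape as the step, but differentiable everywhere.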


antecedent


When an association between two variables is defined, the first item (or left-hand side) is called the antecedent. For example, in the relationship "When a prospector buys a pick, he buys a shovel 14% of the time," "buys a pick" is the antecedent.


API


An application program interface. When a software system features an API, it provides a means by which programs written outside of the system can interface with the system to perform additional functions. For example, a data mining software system may have an API which permits user-written programs to perform such tasks as extract data, perform additional statistical analysis, create specialized charts, generate a model, or make a prediction from a model.


associations


An association algorithm creates rules that describe how often events have occurred together. For example, "When prospectors buy picks, they also buy shovels 14% of the time." Such relationships are typically expressed with a confidence interval.


backpropagation


A training method used to calculate the weights in a neural net from the data.


bias


In a neural network, bias refers to the constant terms in the model. (Note that bias has a different meaning to most data analysts.) Also see precision.


binning


A data preparation activity that converts continuous data to discrete data by replacing a value from a continuous range with a bin identifier, where each bin represents a range of values. For example, age could be converted to bins such as 20 or under, 21-40, 41-65 and over 65.
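The age example can be sketched as a small binning function (the bin labels and boundaries follow the example above; the function name is made up for illustration):

```python
def bin_age(age):
    """Replace a continuous age with a bin identifier.
    Bins follow the glossary's example: 20 or under, 21-40, 41-65, over 65."""
    if age <= 20:
        return "20 or under"
    elif age <= 40:
        return "21-40"
    elif age <= 65:
        return "41-65"
    return "over 65"

print([bin_age(a) for a in (18, 35, 50, 70)])
```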


bootstrapping


Training data sets are created by re-sampling with replacement from the original training set, so data records may occur more than once. In other words, this method treats a sample as if it were the entire population. Usually, final estimates are obtained by taking the average of the estimates from each of the bootstrap test sets.
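A minimal sketch of bootstrapping the mean, assuming plain Python lists and the standard-library random module (the function name and sample count are illustrative):

```python
import random

def bootstrap_means(data, n_samples, seed=0):
    """Resample with replacement (records may repeat) and average the
    statistic -- here the mean -- over the bootstrap samples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_samples):
        # Each bootstrap sample has the same size as the original data.
        sample = [rng.choice(data) for _ in data]
        estimates.append(sum(sample) / len(sample))
    return sum(estimates) / len(estimates)

print(bootstrap_means([1, 2, 3, 4, 5], 200))
```

The averaged estimate hovers near the original sample mean (3.0 here); the spread of the individual bootstrap estimates is what gives a sense of the estimator's variability.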


CART


Classification And Regression Trees. CART is a method of splitting the independent variables into small groups and fitting a constant function to the small data sets. In categorical trees, the constant function is one that takes on a finite small set of values (e.g., Y or N, low or medium or high). In regression trees, the mean value of the response is fit to small connected data sets.



categorical data


Categorical data fits into a small number of discrete categories (as opposed to continuous). Categorical data is either non-ordered (nominal) such as gender or city, or ordered (ordinal) such as high, medium, or low temperatures.


CHAID


An algorithm for fitting categorical trees. It relies on the chi-squared statistic to split the data into small connected data sets.


chi-squared


A statistic that assesses how well a model fits the data. In data mining, it is most commonly used to find homogeneous subsets for fitting categorical trees, as in CHAID.


classification


Refers to the data mining problem of attempting to predict the category of categorical data by building a model based on some predictor variables.


classification tree


A decision tree that places categorical variables into classes.


cleaning (cleansing)


Refers to a step in preparing data for a data mining activity. Obvious data errors are detected and corrected (e.g., improbable dates) and missing data is replaced.


clustering


Clustering algorithms find groups of items that are similar. For example, clustering could be used by an insurance company to group customers according to income, age, types of policies purchased and prior claims experience. It divides a data set so that records with similar content are in the same group, and groups are as different as possible from each other. Since the categories are unspecified, this is sometimes referred to as unsupervised learning.


confidence


Confidence of rule "B given A" is a measure of how much more likely it is that B occurs when A has occurred. It is expressed as a percentage, with 100% meaning B always occurs if A has occurred. Statisticians refer to this as the conditional probability of B given A. When used with association rules, the term confidence is observational rather than predictive. (Statisticians also use this term in an unrelated way. There are ways to estimate an interval and the probability that the interval contains the true value of a parameter is called the interval confidence. So a 95% confidence interval for the mean has a probability of .95 of covering the true value of the mean.)
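The conditional-probability reading of confidence can be sketched over a toy transaction list (the items and counts below are invented for illustration):

```python
# Each transaction is the set of items bought together in one purchase.
transactions = [
    {"pick", "shovel"}, {"pick"}, {"pick", "rope"},
    {"shovel"}, {"pick"}, {"lamp"},
]

def confidence(transactions, a, b):
    """P(B | A): among transactions containing A, the fraction that also
    contain B. Multiplied by 100 this is the rule's confidence percentage."""
    with_a = [t for t in transactions if a in t]
    return sum(1 for t in with_a if b in t) / len(with_a)

# Of the four transactions containing a pick, one also contains a shovel.
print(confidence(transactions, "pick", "shovel"))
```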


confusion matrix


A confusion matrix shows the counts of the actual versus predicted class values. It shows not only how well the model predicts, but also presents the details needed to see exactly where things may have gone wrong.
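A confusion matrix for a two-class problem can be tallied directly from paired actual/predicted labels; this sketch uses collections.Counter and invented labels:

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs; off-diagonal cells are the errors."""
    return Counter(zip(actual, predicted))

actual    = ["Y", "Y", "N", "N", "Y"]
predicted = ["Y", "N", "N", "N", "Y"]
cm = confusion_matrix(actual, predicted)
# Diagonal cells (Y,Y) and (N,N) are correct; (Y,N) and (N,Y) are mistakes.
print(cm[("Y", "Y")], cm[("Y", "N")], cm[("N", "N")], cm[("N", "Y")])
```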


consequent


When an association between two variables is defined, the second item (or right-hand side) is called the consequent. For example, in the relationship "When a prospector buys a pick, he buys a shovel 14% of the time," "buys a shovel" is the consequent.


continuous


Continuous data can have any value in an interval of real numbers. That is, the value does not have to be an integer. Continuous is the opposite of discrete or categorical.


cross validation


A method of estimating the accuracy of a classification or regression model. The data set is divided into several parts, with each part in turn used to test a model fitted to the remaining parts.
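The divide-and-rotate scheme can be sketched as a k-fold index splitter (a simplified sketch: it deals indices round-robin rather than shuffling first, as real implementations usually do):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k parts; each part serves once as the
    test set while the remaining parts train the model."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Five rotations over ten records: every record is tested exactly once.
for train, test in k_fold_indices(10, 5):
    print(train, test)
```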


data


Values collected through record keeping or by polling, observing, or measuring, typically organized for analysis or decision making. More simply, data is facts, transactions and figures.


data format


Data items can exist in many formats such as text, integer and floating-point
decimal. Data format refers to the form of the data in the database.


data mining


An information extraction activity whose goal is to discover hidden facts contained in databases. Using a combination of machine learning, statistical analysis, modeling techniques and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results. Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions, and credit risk analysis.


data mining method


Procedures and algorithms designed to analyze the data in databases.


DBMS


Database management systems.


decision tree


A tree-like way of representing a collection of hierarchical rules that lead to a class or value.


deduction


Deduction infers information that is a logical consequence of the data.


degree of fit


A measure of how closely the model fits the training data. A common measure is
r-square.


dependent variable


The dependent variables (outputs or responses) of a model are the variables predicted by the equation or rules of the model using the independent variables (inputs or predictors).


deployment


After the model is trained and validated, it is used to analyze new data and make predictions. This use of the model is called deployment.


dimension


Each attribute of a case or occurrence in the data being mined. Stored as a field in a flat file record or a column of a relational database table.


discrete


A data item that has a finite set of values. Discrete is the opposite of continuous.


discriminant analysis


A statistical method based on maximum likelihood for determining boundaries that separate the data into categories.


entropy


A way to measure variability other than the variance statistic. Some decision
trees split the data into groups based on minimum entropy.
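Shannon entropy, the usual choice for such splits, can be computed from a list of class labels (a sketch; base-2 logarithm assumed, so the result is in bits):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits: 0 for a pure group, higher for mixed ones.
    Decision trees pick the split whose child groups have lowest entropy."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(entropy(["Y", "Y", "Y", "Y"]))   # a pure group
print(entropy(["Y", "Y", "N", "N"]))   # a maximally mixed two-class group
```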


exploratory analysis


Looking at data to discover relationships not previously detected. Exploratory analysis tools typically assist the user in creating tables and graphical displays.


external data


Data not collected by the organization, such as data available from a reference book, a government source or a proprietary database.



feed-forward


A neural net in which the signals only flow in one direction, from the inputs
to the outputs.


fuzzy logic


Fuzzy logic is applied to fuzzy sets where membership in a fuzzy set is a probability, not necessarily 0 or 1. Non-fuzzy logic manipulates outcomes that are either true or false. Fuzzy logic needs to be able to manipulate degrees of "maybe" in addition to true and false.


genetic algorithms

A computer-based method of generating and testing combinations of possible input parameters to find the optimal output. It uses processes based on natural evolution concepts such as genetic combination, mutation and natural selection.


GUI

Graphical User Interface.




hidden nodes

The nodes in the hidden layers in a neural net. Unlike input and output nodes, the number of hidden nodes is not predetermined. The accuracy of the resulting model is affected by the number of hidden nodes. Since the number of hidden nodes directly affects the number of parameters in the model, a neural net needs a sufficient number of hidden nodes to enable it to properly model the underlying behavior. On the other hand, a net with too many hidden nodes will overfit the data. Some neural net products include algorithms that search over a number of alternative neural nets by varying the number of hidden nodes, in the end choosing the model that gets the best results without overfitting.



independent variable

The independent variables (inputs or predictors) of a model are the variables used in the equation or rules of the model to predict the output (dependent) variable.


induction

A technique that infers generalizations from the information in the data.


interaction

Two independent variables interact when changes in the value of one change the
effect on the dependent variable of the other.


internal data

Data collected by an organization such as operating and customer data.



k-nearest neighbor

A classification method that classifies a point by calculating the distances between the point and points in the training data set. Then it assigns the point to the class that is most common among its k-nearest neighbors (where k is an integer).
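A minimal sketch of the method, assuming numeric feature tuples and Euclidean distance (the training points and labels are invented for illustration):

```python
from collections import Counter

def knn_classify(point, training, k=3):
    """training: list of (features, label) pairs. Classify `point` by
    majority vote among the k nearest training points (Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(training, key=lambda rec: dist(point, rec[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
            ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
print(knn_classify((1, 1), training))  # all three nearest neighbors are "A"
```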


Kohonen feature map

A type of neural network that uses unsupervised learning to find patterns in data. In data mining it is employed for cluster analysis.



layer

Nodes in a neural net are usually grouped into layers, with each layer described as input, output or hidden. There are as many input nodes as there are input (independent) variables and as many output nodes as there are output (dependent) variables. Typically, there are one or two hidden layers.


leaf

A node not further split -- the terminal grouping -- in a classification or decision tree.


learning

Training models (estimating their parameters) based on existing data.


least squares

The most common method of training (estimating) the weights (parameters) of a model by choosing the weights that minimize the sum of the squared deviation of the predicted values of the model from the observed values of the data.
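For a straight line y = a + b*x, the least-squares weights have a closed form; this sketch derives them from toy data (the function name is illustrative):

```python
def fit_line(xs, ys):
    """Least-squares estimates for y = a + b*x: the closed-form solution
    that minimizes the sum of squared deviations of predictions from data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # exactly y = 1 + 2x, so the fit recovers (1, 2)
print(fit_line(xs, ys))
```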


left-hand side

When an association between two variables is defined, the first item is called the left-hand side (or antecedent). For example, in the relationship "When a prospector buys a pick, he buys a shovel 14% of the time", "buys a pick" is the left-hand side.


logistic regression (logistic discriminant analysis)

A generalization of linear regression. It is used for predicting a binary variable (with values such as yes/no or 0/1). An example of its use is modeling the odds that a borrower will default on a loan based on the borrower's

MARS

Multivariate Adaptive Regression Splines. MARS is a generalization of a decision tree.


maximum likelihood

Another training or estimation method. The maximum likelihood estimate of a parameter is the value of a parameter that maximizes the probability that the data came from the population defined by the parameter.


mean

The arithmetic average value of a collection of numeric data.


median

The value in the middle of a collection of ordered data. In other words, the value with the same number of items above and below it.


missing data

Data values can be missing because they were not measured, not answered, were unknown or were lost. Data mining methods vary in the way they treat missing values. Typically, they ignore the missing values, or omit any records containing missing values, or replace missing values with the mode or mean, or infer missing values from existing values.


mode

The most common value in a data set. If more than one value occurs the same number of times, the data is multi-modal.


model

An important function of data mining is the production of a model. A model can be descriptive or predictive. A descriptive model helps in understanding underlying processes or behavior. For example, an association model describes consumer behavior. A predictive model is an equation or set of rules that makes it possible to predict an unseen or unmeasured value (the dependent variable or output) from other, known values (independent variables or input). The form of the equation or rules is suggested by mining data collected from the process under study. Some training or estimation technique is used to estimate the parameters of the equation or rules.


MPP

Massively parallel processing, a computer configuration that is able to use hundreds or thousands of CPUs simultaneously. In MPP each node may be a single CPU or a collection of SMP CPUs. An MPP collection of SMP nodes is sometimes called an SMP cluster. Each node has its own copy of the operating system, memory, and disk storage, and there is a data or process exchange mechanism so that each computer can work on a different part of a problem. Software must be written specifically to take advantage of this architecture.


neural network

A complex nonlinear modeling technique based on a model of a human neuron. A neural net is used to predict outputs (dependent variables) from a set of inputs (independent variables) by taking linear combinations of the inputs and then making nonlinear transformations of the linear combinations using an activation function. It can be shown theoretically that such combinations and transformations can approximate virtually any type of response function. Thus, neural nets use large numbers of parameters to approximate any model. Neural nets are often applied to predict future outcome based on prior experience. For example, a neural net application could be used to predict who will respond to a direct mailing.


node

A decision point in a classification (i.e., decision) tree. Also, a point in a neural net that combines input from other nodes and produces an output through application of an activation function.


noise

The difference between a model and its predictions. Sometimes data is referred
to as noisy when it contains errors such as many missing or incorrect values
or when there are extraneous columns.


non-applicable data

Missing values that would be logically impossible (e.g., pregnant males) or are obviously not relevant.


normalize

A collection of numeric data is normalized by subtracting the minimum value from all values and dividing by the range of the data. This yields data with a similarly shaped histogram but with all values between 0 and 1. It is useful to do this for all inputs into neural nets and also for inputs into other regression models. (Also see standardize.)
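The subtract-minimum, divide-by-range recipe in a few lines (this sketch assumes the data is not constant, so the range is nonzero):

```python
def normalize(values):
    """Min-max scaling: subtract the minimum and divide by the range,
    so every value lands in [0, 1] with the histogram shape preserved."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(normalize([2, 4, 6, 10]))  # [0.0, 0.25, 0.5, 1.0]
```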


OLAP

On-Line Analytical Processing tools give the user the capability to perform multi-dimensional analysis of the data.


optimization criterion

A positive function of the difference between predictions and data. Estimates are chosen so as to optimize the function or criterion. Least squares and maximum likelihood are examples.


outliers

Technically, outliers are data items that did not (or are thought not to have) come from the assumed population of data -- for example, a non-numeric value when you are expecting only numeric values. A more casual usage refers to data items that fall outside the boundaries that enclose most other data items in the data set.


overfitting

A tendency of some modeling techniques to assign importance to random variations in the data by declaring them important patterns.


overlay

Data not collected by the organization, such as data from a proprietary database, that is combined with the organization's own data.



parallel processing

Several computers or CPUs linked together so that each can be computing simultaneously.


pattern

Analysts and statisticians spend much of their time looking for patterns in data. A pattern can be a relationship between two variables. Data mining techniques include automatic pattern discovery that makes it possible to detect complicated non-linear relationships in data. Patterns are not the same as causality.


precision

The precision of an estimate of a parameter in a model is a measure of how variable the estimate would be over other similar data sets. A very precise estimate would be one that did not vary much over different data sets. Precision does not measure accuracy. Accuracy is a measure of how close the estimate is to the real value of the parameter. Accuracy is measured by the average distance over different data sets of the estimate from the real value. Estimates can be accurate but not precise, or precise but not accurate. A precise but inaccurate estimate is usually biased, with the bias equal to the average distance from the real value of the parameter.


predictability

Some data mining vendors use predictability of associations or sequences to mean the same as confidence.


prevalence

The measure of how often the collection of items in an association occur together as a percentage of all the transactions. For example, "In 2% of the purchases at the hardware store, both a pick and a shovel were bought."


pruning

Eliminating lower level splits or entire sub-trees in a decision tree. This term is also used to describe algorithms that adjust the topology of a neural net by removing (i.e., pruning) hidden nodes.


range

The range of the data is the difference between the maximum value and the minimum value. Alternatively, range can include the minimum and maximum, as in "The value ranges from 2 to 8."


RDBMS

Relational Database Management System.


regression tree

A decision tree that predicts values of continuous variables.


resubstitution error

The estimate of error based on the differences between the predicted values of
a trained model and the observed values in the training set.


right-hand side

When an association between two variables is defined, the second item is called the right-hand side (or consequent). For example, in the relationship "When a prospector buys a pick, he buys a shovel 14% of the time," "buys a shovel" is the right-hand side.


r-squared

A number between 0 and 1 that measures how well a model fits its training data. One is a perfect fit; zero implies the model has no predictive ability. It is computed as the square of the correlation between the predicted and observed values, where the correlation is the covariance divided by the product of the standard deviations of the predicted and observed values.
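Read as the squared correlation between predicted and observed values, r-squared can be computed as follows (a sketch using population standard deviations; the function name is illustrative):

```python
def r_squared(observed, predicted):
    """Square of the correlation between observed and predicted values:
    covariance over the product of standard deviations, then squared."""
    n = len(observed)
    mo = sum(observed) / n
    mp = sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted)) / n
    so = (sum((o - mo) ** 2 for o in observed) / n) ** 0.5
    sp = (sum((p - mp) ** 2 for p in predicted) / n) ** 0.5
    return (cov / (so * sp)) ** 2

print(r_squared([1, 2, 3], [1, 2, 3]))  # a perfect fit
```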




sampling

Creating a subset of data from the whole. Random sampling attempts to represent the whole by choosing the sample through a random mechanism.


sensitivity analysis

Varying the parameters of a model to assess the change in its output.


sequence discovery

The same as association, except that the time sequence of events is also considered. For example, "Twenty percent of the people who buy a VCR buy a camcorder within four months."


significance

A probability measure of how strongly the data support a certain result (usually of a statistical test). If the significance of a result is said to be .05, it means that there is only a .05 probability that the result could have happened by chance alone. Very low significance (less than .05) is usually taken as evidence that the data mining model should be accepted since events with very low probability seldom occur. So if the estimate of a parameter in a model showed a significance of .01 that would be evidence that the parameter must be in the model.


SMP

Symmetric multi-processing is a computer configuration where many CPUs share a common operating system, main memory and disks. They can work on different parts of a problem at the same time.


standardize

A collection of numeric data is standardized by subtracting a measure of central location (such as the mean or median) and by dividing by some measure of spread (such as the standard deviation, interquartile range or range). This yields data with a similarly shaped histogram with values centered around 0. It is sometimes useful to do this with inputs into neural nets and also inputs into other regression models. (Also see normalize.)
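A sketch of standardizing with the mean and population standard deviation -- one of the location/spread pairings the entry lists (the function name is illustrative):

```python
def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation,
    giving values centered around 0 with unit spread."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

# Symmetric input gives a symmetric result centered on zero.
print(standardize([2, 4, 6, 8]))
```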


supervised learning

The collection of techniques where analysis uses a well-defined (known) dependent variable. All regression and classification techniques are supervised.


support

The measure of how often the collection of items in an association occur together as a percentage of all the transactions. For example, "In 2% of the purchases at the hardware store, both a pick and a shovel were bought."

test data

A data set independent of the training data set, used to fine-tune the estimates of the model parameters (i.e., weights).


test error

The estimate of error based on the difference between the predictions of a model on a test data set and the observed values in the test data set when the test data set was not used to train the model.


time series

A series of measurements taken at consecutive points in time. Data mining products which handle time series incorporate time-related operators such as moving average. (Also see windowing.)


time series model

A model that forecasts future values of a time series based on past values. The model form and training of the model usually take into consideration the correlation between values as a function of their separation in time.


topology

For a neural net, topology refers to the number of layers and the number of nodes in each layer.


training

Another term for estimating a model's parameters based on the data set at hand.


training data

A data set used to estimate or train a model.


transformation

A re-expression of the data such as aggregating it, normalizing it, changing its unit of measure, or taking the logarithm of each data item.


unsupervised learning

This term refers to the collection of techniques where groupings of the data are defined without the use of a dependent variable. Cluster analysis is an example.




validation

The process of testing the models with a data set different from the training
data set.


variance

The most commonly used statistical measure of dispersion. The first step is to square the deviations of a data item from its average value. Then the average of the squared deviations is calculated to obtain an overall measure of variability.


visualization

Visualization tools graphically display data to facilitate better understanding of its meaning. Graphical capabilities range from simple scatter plots to complex multi-dimensional representations.


windowing

Used when training a model with time series data. A window is the period of time used for each training case. For example, if we have weekly stock price data that covers fifty weeks, and we set the window to five weeks, then the first training case uses weeks one through five and compares its prediction to week six. The second case uses weeks two through six to predict week seven, and so on.
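The fifty-weeks example scales down to a small helper that pairs each window with the value the model should predict (the function name and toy prices are illustrative):

```python
def windows(series, width):
    """Slide a fixed-width window over the series; each training case is
    (window, next value) -- the value the model is asked to predict."""
    return [(series[i:i + width], series[i + width])
            for i in range(len(series) - width)]

prices = [10, 11, 12, 13, 14, 15, 16]  # e.g. seven weeks of weekly prices
for inputs, target in windows(prices, 5):
    print(inputs, "->", target)
```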

Reposted from: https://www.cnblogs.com/sciencefy/archive/2005/01/05/87052.html
