Learning a Traditional Classifier from Nontraditional Input
Let x be an example and let y ∈ {0, 1} be its binary label. Let s = 1 if the example x is labeled, and s = 0 if x is unlabeled. Only positive examples are labeled, so y = 1 is certain when s = 1; when s = 0, either y = 1 or y = 0 may be true.
A nontraditional training set consists of unlabeled examples <x, s=0> and labeled examples <x, s=1>; again, only positive examples are labeled.
Two scenarios:
- Single-training-set scenario: training data are drawn randomly from p(x, y, s), but for each tuple <x, y, s> that is drawn, only <x, s> is recorded.
- Case-control scenario: two training sets (the labeled and the unlabeled examples) are drawn independently from p(x, y, s); all x are recorded.
Goal: learn a function f(x) that approximates p(y=1 | x) as closely as possible.
Assumption: the labeled positive examples are chosen completely at random from all positive examples. That is, if y = 1, the probability that a positive example is labeled is the same constant c = p(s=1 | y=1) regardless of x.
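To make the setup concrete, here is a minimal sketch (not from the paper) that simulates the single-training-set scenario under this assumption; the synthetic distribution and the value c = 0.3 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10_000
c = 0.3  # p(s=1 | y=1): the constant labeling frequency (our choice)

# Draw (x, y) from a simple synthetic distribution:
# p(y=1) = 0.5, positives ~ N(+1, 1), negatives ~ N(-1, 1).
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=2.0 * y - 1.0, scale=1.0)

# "Selected completely at random": each positive is labeled with
# probability c, independently of x; negatives are never labeled.
s = (y == 1) & (rng.random(n) < c)

# The nontraditional training set records only (x, s); y is hidden.
```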
So a training set is a random sample from the distribution p(x, y, s). Such a training set consists of two subsets:
- labeled (s=1)
- unlabeled (s=0)
A training algorithm applied to this nontraditional training set, treating s as the label, will yield a function g(x) such that g(x) = p(s=1 | x).
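In other words, g is an ordinary probabilistic classifier trained to separate labeled from unlabeled examples. A sketch with scikit-learn (logistic regression is one possible choice of model), continuing the simulation above:

```python
from sklearn.linear_model import LogisticRegression

X = x.reshape(-1, 1)

# Train a standard classifier with s as the target; its predicted
# probability estimates p(s=1 | x), not p(y=1 | x).
clf = LogisticRegression().fit(X, s)

def g(X):
    return clf.predict_proba(X)[:, 1]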
Lemma 1:
Prove that

p(y=1 \mid x) = \frac{p(s=1 \mid x)}{p(s=1 \mid y=1)}
Proof:

p(s=1 \mid x) = p(y=1, s=1 \mid x) = p(y=1 \mid x) \, p(s=1 \mid y=1, x) = p(y=1 \mid x) \, p(s=1 \mid y=1)

The first equality holds because s = 1 implies y = 1 (only positive examples are labeled). The last equality uses the assumption: given y = 1, labeling is independent of x, so p(s=1 \mid y=1, x) = p(s=1 \mid y=1). Dividing both sides by p(s=1 \mid y=1) = c gives the result, i.e. f(x) = g(x) / c.
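Continuing the sketch, the lemma reduces recovering f to estimating the single constant c = p(s=1 | y=1). One common estimator in this setting averages g over labeled examples, which are known positives; in practice this is done on a held-out validation split rather than the training data.

```python
# By Lemma 1, g(x) = c * p(y=1 | x), so on labeled examples whose
# posterior p(y=1 | x) is close to 1, g(x) is close to c.
c_hat = g(X[s]).mean()

def f(X):
    # f(x) = p(y=1 | x) = g(x) / c  (Lemma 1)
    return g(X) / c_hat
```

When c_hat is underestimated, f(x) can slightly exceed 1; clipping the output at 1 is a common practical safeguard.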
Note that f is an increasing function of g. This means that if the classifier is only used to rank examples x according to the chance that they belong to class y = 1, then g can be used directly in place of f.
It is impossible to have g(x) > p(s=1 | y=1), since otherwise f(x) = g(x)/c would exceed 1. This is reasonable because the "positive" (labeled) and "negative" (unlabeled) training sets for g are samples from overlapping regions in x space.
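Both remarks can be checked on the simulation: dividing by the positive constant c_hat does not change the ordering of scores, and the fitted g should not substantially exceed the estimated c.

```python
scores_g = g(X)
scores_f = f(X)

# f = g / c_hat is strictly increasing in g, so the rankings agree.
assert (np.argsort(scores_g) == np.argsort(scores_f)).all()

print(scores_g.max(), c_hat)  # max of g stays near or below c_hat
```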