Learning
ID3 definitions

entropy: measure of how much an attribute matches the conclusion
values match conclusions 1:1 -- low entropy (high information content, low uncertainty)
eg. a:1, b:2, c:3, d:4
same value maps to many different conclusions -- high entropy (low information content, high uncertainty)
eg. a:1, a:2, a:3, a:4
Information or entropy: mathematical measurement of an entity that provides the answer to a question, or certainty about an outcome, eg. to describe whether a coin will be heads or tails
for a two-outcome event, a value between 0 and 1, measured in bits
1 bit = enough information to answer a yes/no question of a fair, random event (because log2 (2) = 1)
4 events require log2(4) = 2 bits etc.
Note: log2(X) = ln(X) / ln(2)
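These log2 relationships are easy to sanity-check in Python (a minimal sketch; the helper name `bits` is my own):

```python
import math

# Bits of information needed to decide among n equally likely outcomes.
def bits(n_outcomes: int) -> float:
    return math.log2(n_outcomes)

print(bits(2))  # fair coin: 1.0 bit
print(bits(4))  # four events: 2.0 bits
# log2 via natural logs, as in the note above:
print(round(math.log(8) / math.log(2), 6))  # 3.0
```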
ID3 definitions

Information content: average entropy of the different events, weighted by the probabilities of those events: -Pr(E) log2 Pr(E)
Formula: IC = - Pr(E1) log2 Pr(E1) - Pr(E2) log2 Pr(E2) - ... - Pr(Ek) log2 Pr(Ek)
when attribute A has events E1, ..., Ek
eg. fair coin: IC(0.5, 0.5) = -0.5 log2 0.5 - 0.5 log2 0.5 = 1 bit
eg. weighted heads:
IC(0.01, 0.99) = -0.01 log2 0.01 - 0.99 log2 0.99 = 0.08 bits
eg. always heads: IC(0, 1) = 0 bits
Note: if an event has probability 0, omit its term from the equation (log 0 is undefined);
instead, treat its contribution as 0.
Information gain: difference between the original information content and the new information content after attribute A is selected
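The IC formula and the zero-probability convention above can be sketched as follows (function name `ic` is my own):

```python
import math

def ic(probs):
    """Information content in bits: -sum p*log2(p), skipping p = 0 terms
    (log 0 is undefined; the convention above treats such terms as 0)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(round(ic([0.5, 0.5]), 2))    # fair coin: 1.0 bit
print(round(ic([0.01, 0.99]), 2))  # weighted coin: 0.08 bits
print(ic([0.0, 1.0]))              # always heads: 0.0 bits
```

Information gain is then just `ic` of the table before a test minus the weighted `ic` after it.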
ID3 Algorithm

minimal trees are useful when the example table is completely adequate for the observations of interest, and user input is certain
for uncertain input, additional tests are required for support
ID3: induction algorithm that tries to find a small tree efficiently
not guaranteed to be minimal, but it is generally small
1. Given the example set C, find an attribute A that gives the highest information content
--> this is the most discriminatory test for distinguishing the data
--> it yields the highest information gain: ideally, if we have X entropy before, then after A we have 0 entropy (ie. perfect decision strategy!)
2. Add it as the next node in tree
3. Partition the examples into subtables, and recurse (fills in the remainder of the tree)
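Step 1 (pick the attribute whose test minimizes the remaining information content, i.e. maximizes gain) can be sketched roughly as below; the function names and the tiny four-example table are illustrative, not from the text:

```python
import math
from collections import Counter, defaultdict

def ic(probs):
    # entropy in bits; zero-probability terms are skipped by convention
    return sum(-p * math.log2(p) for p in probs if p > 0)

def b(examples, attr):
    """B(C, A): information content of the class remaining after splitting
    on attr, weighted by the probability of each attribute value."""
    groups = defaultdict(list)
    for row in examples:
        groups[row[attr]].append(row["class"])
    n = len(examples)
    return sum(len(g) / n * ic([c / len(g) for c in Counter(g).values()])
               for g in groups.values())

def best_attribute(examples, attrs):
    # lowest B(C, A) == highest information gain
    return min(attrs, key=lambda a: b(examples, a))

# hypothetical four-example table: attr1 predicts the class perfectly
table = [
    {"attr1": "a", "attr2": "c", "class": "x"},
    {"attr1": "b", "attr2": "c", "class": "y"},
    {"attr1": "a", "attr2": "c", "class": "x"},
    {"attr1": "a", "attr2": "c", "class": "x"},
]
print(best_attribute(table, ["attr1", "attr2"]))  # attr1
```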
ID3 algorithm

consider p positive and n negative examples
probability of p's: p+ = p / (p + n)
probability of n's: p- = n / (p + n)
compute the above from the example set
information content for each attribute value:
IC(Vi) = - (p+) log2 (p+) - (p-) log2 (p-)
where Vi is value of attribute A
information content for table after attribute A is used as test:
B(C, A) = sum_i : [ (prob value of A is Vi) * IC(Ci) ]
for attribute A, value Vi (i=1,...,#values),
subset of examples Ci corresp. to each Vi
We wish to select an attribute which maximizes the information gain at that node in the tree. Repeat the following for all attributes in the example set:
compute p+, p- for each value of attribute
compute IC for each attribute, value pair
compute overall info gain B(C, A) for that attribute
--> select attribute A which maximizes information gain, ie. yields low information content after it is applied --> lowest B(C,A) value.

ID3: entropy extremes

  attr. 1   attr. 2   value
  a         c         x
  b         c         y
  a         c         x
  a         c         x

IC(C) = -(3/4) log2 (3/4) - (1/4) log2 (1/4)
      = -(.75)(-.415) - (.25)(-2) = .311 + .5 = .811
attr 1:
IC(a) = -1 log2 1 - 0 log2 0 = 0
IC(b) = -1 log2 1 -0 log2 0 = 0
B(C, attr1) = Pr(a)*IC(a) + Pr(b)*IC(b) = 0 + 0 = 0
Gain: IC(C) - B(C, attr1) = 0.811 - 0 = 0.811
--> maximum gain! All values precisely predicted using attribute 1.
attr 2:
IC(c) = -(3/4) log2 (3/4) - (1/4) log2 (1/4) = 0.811
B(C, attr2) = Pr(c)*IC(c) = 1 * 0.811 = 0.811
Gain: IC(C) - B(C, attr2) = 0.811 - 0.811 = 0
--> minimum gain; no information gained at all from using attribute 2.
Note: we can simply select the attribute yielding minimum information content B for the table; computing the gain is redundant (doesn't help us).

ID3: example (from Durkin, pp.496-498)
IC(C) = -(4/8) log2 (4/8) - (4/8) log2 (4/8) = 1 (initial info content of all examples)
(a) test "wind":
IC(North) = -3/5 log2 (3/5) - 2/5 log2 (2/5) = .971
IC(South) = -1/3 log2 (1/3) - 2/3 log2 (2/3) = .918
B(C, "wind") = 5/8 * 0.971 + 3/8 * 0.918 = 0.951
gain = IC(C) - B(C, "wind") = 1 - 0.951 = .049
(b) test "sky":
"clear": all are negative (IC = 0)
IC(cloudy) = -4/5 log2(4/5) - 1/5 log2 (1/5) = .722
B(C, "sky") = 3/8 * 0 + 5/8 * .722 = .451
gain = IC(C) - B(C, "sky") = 1 - .451 = .549
(c) barometer: gain is .156
therefore sky gives highest info gain, and is selected.
Algorithm partitions example set for each new subcategory, and recurses
Note: we're simply finding the attribute that yields the smallest information content for the remaining table after it is applied

Example (cont)

Must now apply ID3 to remaining examples that have differing result values --> new layers of decision tree
(a) barometer:
IC(rising) = - 1 log2 1 - 0 log2 0 = 0
IC(steady) = - 1 log2 1 - 0 log2 0 = 0
IC(falling) = - 1/2 log2 1/2 - 1/2 log2 1/2 = 1
B(C, barometer) = (2/5)*0 + (1/5)*0 + (2/5)*1 = .4
(b) wind:
IC(south) = - 1/2 log2 1/2 - 1/2 log2 1/2 = 1
IC(north) = - 1 log2 1 - 0 = 0
B(C, wind) = (2/5)*1 + (3/5)*0 = .4
--> choose either
note that you'll need both attributes together to classify the remaining table
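The numbers in the Durkin example can be checked from the (positive, negative) counts alone; a sketch, with the counts inferred from the fractions on the slides and the values recomputed:

```python
import math

def ic2(p, n):
    """IC of a subtable with p positive and n negative examples."""
    t = p + n
    return sum(-q * math.log2(q) for q in (p / t, n / t) if q > 0)

def b(partitions):
    """B(C, A) from (pos, neg) counts for each value of attribute A."""
    total = sum(p + n for p, n in partitions)
    return sum((p + n) / total * ic2(p, n) for p, n in partitions)

# Root: 8 examples, 4 positive / 4 negative, so IC(C) = 1 bit.
gain_wind = 1 - b([(3, 2), (1, 2)])  # North, South
gain_sky = 1 - b([(0, 3), (4, 1)])   # clear, cloudy
print(round(gain_wind, 3), round(gain_sky, 3))  # 0.049 0.549

# Second level (5 remaining cloudy examples): barometer and wind tie.
print(round(b([(2, 0), (1, 0), (1, 1)]), 2),  # barometer: rising, steady, falling
      round(b([(1, 1), (3, 0)]), 2))          # wind: south, north
```

Note the second-level ties come out at B = 0.4 when recomputed this way.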
Example

If we choose "barometer", then the remaining table is:
5 Cloudy, Falling, North +
7 Cloudy, Falling, South -
Should be obvious that "wind" is the only possibility now.
Example: final tree
ID3
ID3 generalizes to multivalued classifications (not just plus and minus): the information content expression extends to multiple categories...
IC(a, b, c) = -pa log2 pa - pb log2 pb - pc log2 pc
Note: the final tree can have leaves with no data, meaning that the example set does not cover that combination of tests
can presume that such a leaf is impossible wrt the examples
otherwise, implies that example set is missing information
--> must assume that examples are complete and correct; this is the responsibility of the knowledge engineer!
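The multivalued IC above is the same entropy sum taken over more categories; a quick check (function name my own):

```python
import math

def ic(probs):
    # IC(pa, pb, pc, ...) = -pa log2 pa - pb log2 pb - pc log2 pc - ...
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(round(ic([1/3, 1/3, 1/3]), 3))    # 3 equal classes: log2(3) ~ 1.585
print(round(ic([0.5, 0.25, 0.25]), 2))  # 1.5 bits
```

Note that with more than two categories the value can exceed 1 bit.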
Learning
Possible ID3 problems
clashing examples: need more data attributes, or must correct knowledge
continuous values (floating point): must create ranges,
eg. 0 < x < 5
noise: if compressing a database, noise can unduly influence decision tree
trees can be too large: production rules are therefore large too
break up table and create hierarchy of tables (structured induction)
flat rules: only one or more conclusions, no intermediate rules
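For the continuous-values problem, a simple hand-chosen binning can turn a floating-point attribute into range labels like the "0 < x < 5" example above; a sketch (boundary values illustrative, and real systems typically derive boundaries from the data):

```python
def discretize(x, boundaries):
    """Map a continuous value to a range label, e.g. with boundaries
    [0, 5]: '<0', '0-5', '>=5'."""
    for i, bound in enumerate(boundaries):
        if x < bound:
            return f"<{bound}" if i == 0 else f"{boundaries[i-1]}-{bound}"
    return f">={boundaries[-1]}"

print(discretize(3.2, [0, 5]))   # 0-5
print(discretize(-1.0, [0, 5]))  # <0
print(discretize(7.5, [0, 5]))   # >=5
```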
ID3 enhancements
If you have a large example set, can use a subset ("window") of examples
then need to verify that the resulting decision tree is valid, in case that window wasn't inclusive or had noise
knowledge bases must be 100% correct!
Data mining: finding trends in large databases
ID3 is one tool used there
don't care about 100% correctness
rather, a good random sample of database may yield enough useful information
can also apply to entire database, in an attempt to categorize information into useful classes and trends
C4.5 is the successor of ID3; it uses more advanced heuristics.
Learning: Genetic algorithms
Another way to create small decision trees from examples: genetic programming
1. Create a population of random decision trees
2. Repeat Until a suitably correct and small tree is found:
a. Rate all trees based on: (i) size, (ii) how many examples they cover, (iii) how many examples they miss
--> fitness score
b. Create a new population:
(i) mate trees using Crossover: swap subtrees between parents
(ii) mutate trees using Mutation: random change to tree
This will search the space of decision trees for a correct and small sized tree
Much slower than ID3, but possibly better results
Remember: finding the smallest tree for an example set is NP-complete

Learning comments
B. Ross Cosc 4P79
4P79 overheads
Machine learning is an area of AI concerned with the automatic learning of knowledge
some ways that machine learning can be used in expert systems:
1. increase efficiency of inference engine and knowledge base processing
2. testing the knowledge base
3. use learning principles to acquire knowledge itself
4. ???
Most learning techniques exploit heuristics: problem-specific information which makes the search for a solution more efficient
Without heuristics, typical learning problems either take too long to execute effectively, or produce results which are too large & general to be useful
1. Increase inference engine efficiency

20-80 principle: 20% of rules in KB account for 80% of diagnoses
these 20% rules should take precedence in order to make execution faster
otherwise, roughly half of the KB needs to be looked at for every diagnosis, which is a waste of time for most (80%) problems
However, it is also possible that the set of rules most often used can vary according to who, where, and how the expert system is used
One way to fix this: keep a record of the number of times a high-level rule was successful in making a diagnosis
eg. record(rule12, 102).
    record(rule6, 25). etc.
Save this information in a file, and reload it every session.
Use these rule stats to determine the order in which diagnoses are to be executed
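The record-keeping scheme above can be sketched in Python (the rule names and counts mirror the record/2 facts on the slide; a JSON string stands in for the saved file):

```python
import json
from collections import Counter

# Per-rule success statistics: record(rule12, 102) / record(rule6, 25)
# become entries in a Counter.
stats = Counter({"rule12": 102, "rule6": 25})
stats["rule12"] += 1  # rule12 just made another successful diagnosis

# Persist between sessions (here serialized to a string rather than a file):
saved = json.dumps(stats)
reloaded = Counter(json.loads(saved))

# Try rules in order of past success:
order = [rule for rule, _ in reloaded.most_common()]
print(order)  # ['rule12', 'rule6']
```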
1. ordering rules: p.118 Schnupp
Another possibility is to ask some preliminary questions to determine general high-level information, and then order the high-level inference accordingly
(p.112-113 Schnupp)
2. Knowledge acquisition

(i) Learning rules from examples
user inputs typical examples (decision table) for a given rule domain; or, can process a database to automatically generate production rules
system constructs a rule or set of rules (decision tree) for this example set
can then generate production rules from this tree
trivial to make a comprehensive tree, but more involved to make a minimal one
inductive inference: learning technique which constructs a general rule from specific examples (compare with mathematical induction)
popular algorithm: Quinlan's ID3, used in shells such as VP-Expert, ExpertEase, RuleMaster, and others
Decision table: table of attribute values, with one or more conclusions
- each row is an example or true instance of attribute values -> conclusion
Convenient for classification & identification problems
Could create production rules directly from table: one rule per row
induction: tries to generalize information in the table, disregarding superfluous information, and yielding an efficient smaller decision tree
- results in "smarter" system
- good example of machine learning: computer tries to generalize and abstract from examples
Given a table, can look at conclusions, and see if particular attributes have any effect on them. If not, then disregard those attributes when deriving that conclusion.
There exist one or more "minimal" sets of tests for a table; however, finding this minimal set can be intractable in general
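The "one rule per row" idea above can be sketched as a naive generator (table contents hypothetical); induction improves on exactly this by generalizing away superfluous attributes:

```python
def rules_from_table(rows, attrs):
    """One IF-THEN production per example row -- no generalization."""
    return [
        "IF " + " AND ".join(f"{a} = {row[a]}" for a in attrs)
        + f" THEN conclusion = {row['conclusion']}"
        for row in rows
    ]

# hypothetical two-row decision table:
table = [
    {"sky": "cloudy", "wind": "north", "conclusion": "+"},
    {"sky": "clear", "wind": "south", "conclusion": "-"},
]
for rule in rules_from_table(table, ["sky", "wind"]):
    print(rule)
```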
Inductive Inference: p.129-132

ID3 algorithm: p. 134-5
Inductive inference is a convenient way of obtaining productions automatically from databases or user examples.
Good for classification & diagnosis problems
Assumes that the domain is deterministic: that particular premises lead to only one conclusion, not multiple ones
Need to have: good data, no noise, no clashes
- assumes that the entire universe of interest is encapsulated in the example set or database
- clashes probably mean you need to identify more attributes
ID3 doesn't assure the minimal tree, but its entropy measure is a heuristic that often generates a small tree
Note that a minimal rule is not necessarily desirable: when you discard attributes, you discard information which might be relevant, especially later when the system is being upgraded
Not desirable to use this technique on huge tables with many attributes. Better approach is to modularise data hierarchically.