Title: | Hyperrelations in version space |
Authors: |
Hui Wang,
School of Information and Software Engineering ,
University of Ulster
Ivo Düntsch , Günther Gediga , Dept of Computer Science , Brock University, St Catherines, Ontario, L2S 3A1, Canada Andrzej Skowron , Institute of Mathematics; University of Warsaw (Equal authorship implied) |
Status: | International Journal of Approximate Reasoning (2004), to appear. |
Abstract: |
A version space is a set of all hypotheses consistent with a given set of training examples,
delimited by the specific boundary and the general boundary.
In existing studies (Mitchell, 1977, 1982, Hirsh 1994), a hypothesis is a conjunction
of attribute-value pairs, which is shown to have limited expressive power (Mitchell, 1997).
In a more expressive hypothesis space, e.g. disjunction of conjunction of
attribute-value pairs, a general version space becomes uninteresting unless some restriction
(inductive bias) is imposed (Mitchell, 1997).
In this paper we investigate version space in a hypothesis space where a hypothesis is a hyperrelation, in effect a disjunction of conjunctions of disjunctions of attribute-value pairs. Such a hypothesis space is more expressive than the conjunction of attribute-value pairs and the disjunction of conjunction of attribute-value pairs. However, given a dataset, we focus our attention only on those hypotheses which are consistent with given data, and are maximal in the sense that the elements in a hypothesis can not be merged further. Such a hypothesis is called an E-set for the given data, and the set of all E--sets is the new version space which is delimited by the least E-set (specific boundary) and the greatest E-set (general boundary). Based on this version space we propose three classification rules for use in different situations. The first two are based on E-sets, and the third one is based on "degraded" E-sets called weak hypotheses, where the maximality constraint is relaxed. We present an algorithm to calculate E--sets, which, however, is computationally expensive in the worst case. We also present an efficient algorithm to calculate weak hypotheses. The third rule is evaluated using public datasets; the results compare well with the C5.0 decision tree classifier. |
View technical report version |