Hyperrelations in version space

Title:	Hyperrelations in version space
Authors:	Hui Wang, School of Information and Software Engineering , University of Ulster Ivo Düntsch , Günther Gediga , Dept of Computer Science , Brock University, St Catherines, Ontario, L2S 3A1, Canada Andrzej Skowron , Institute of Mathematics; University of Warsaw (Equal authorship implied)
Status:	International Journal of Approximate Reasoning (2004), to appear.
Abstract:	A version space is a set of all hypotheses consistent with a given set of training examples, delimited by the specific boundary and the general boundary. In existing studies (Mitchell, 1977, 1982, Hirsh 1994), a hypothesis is a conjunction of attribute-value pairs, which is shown to have limited expressive power (Mitchell, 1997). In a more expressive hypothesis space, e.g. disjunction of conjunction of attribute-value pairs, a general version space becomes uninteresting unless some restriction (inductive bias) is imposed (Mitchell, 1997). In this paper we investigate version space in a hypothesis space where a hypothesis is a hyperrelation, in effect a disjunction of conjunctions of disjunctions of attribute-value pairs. Such a hypothesis space is more expressive than the conjunction of attribute-value pairs and the disjunction of conjunction of attribute-value pairs. However, given a dataset, we focus our attention only on those hypotheses which are consistent with given data, and are maximal in the sense that the elements in a hypothesis can not be merged further. Such a hypothesis is called an E-set for the given data, and the set of all E--sets is the new version space which is delimited by the least E-set (specific boundary) and the greatest E-set (general boundary). Based on this version space we propose three classification rules for use in different situations. The first two are based on E-sets, and the third one is based on "degraded" E-sets called weak hypotheses, where the maximality constraint is relaxed. We present an algorithm to calculate E--sets, which, however, is computationally expensive in the worst case. We also present an efficient algorithm to calculate weak hypotheses. The third rule is evaluated using public datasets; the results compare well with the C5.0 decision tree classifier.

View technical report version