UU
HOME cs.uu.nl


Use of prior knowledge in learning from data

In learning from data one can imagine two extreme situations with respect to the availability of prior knowledge. The first is that no prior knowledge whatsoever is available, the second is that the relationship sought is known with certainty up to a limited number of parameters. Both extremes are unlikely to occur in practice.

Data mining is often associated with the situation where little prior knowledge is available and an extensive search over possible models is performed. Of course one has to have some ideas, for how else does one decide which explanatory variables to include in the model? But often the algorithm is able to select the relevant variables from a large collection of variables (like in stepwise regression) and furthermore flexible functions are used, i.e. few assumptions are made concerning the functional form. Even though data mining is often applied to domains where little theory is available, in some cases useful prior knowledge is available, and one would like the learning algorithm to make use of it one way or the other.

One type of prior knowledge that is often available in applications concerns the sign of a relation between the dependent and explanatory variables. For example: economic theory would state that people tend to buy less of a product if its price increases (all else equal), so price elasticity of demand should be negative. We also say that the relation between price and demand is antitone: when the one increases the other decreases (when an increase in the one variable tends to lead to an increase in the other, we say the relationship is isotone). The strength of this relationship and the precise functional form are however not always dictated by economic theory. The usual assumption that such relationships are linear are mostly imposed for mathematical convenience. Our work in this area has focussed on the use of monotonicity constraints for two kinds of models: classification trees and Bayesian networks.


The use of monotonicity constraints in classification trees

For classification problems with ordinal attributes very often the class attribute should increase with each or some of the explanatory attributes. These are called classification problems with monotonicity constraints.

Monotonicity can be an important model requirement with a view toward explaining and justifying decisions, such as acceptance/rejection decisions. Consider for example a university admission procedure where candidate a scores at least as good on all admission criteria as candidate b, but a is refused whereas b is admitted. Such a nonmonotone admission rule would clearly be unacceptable. Similar considerations apply to selection procedures for applicants for e.g. a job or a loan.

Standard classification tree algorithms such as CART or C4.5 are not guaranteed to produce monotone trees, even if the data set is completely monotone. We have looked at pruning based methods to build monotone classification trees from monotone as well as nonmonotone data sets. We developed a number of fixing methods, that make a non-monotone tree monotone by additional pruning steps. These fixing methods can be combined with existing pruning techniques to obtain a sequence of monotone trees. The performance of our algorithms has been evaluated through experimental studies on artificial as well as real life data sets. The conclusion of those studies is that the monotone trees have a slightly better predictive performance and are considerably smaller than trees constructed by the standard algorithms.

Some papers I have (co-)written on this subject:


Monotonicity constraints in Bayesian networks

Bayesian networks are widely accepted as powerful tools for representing uncertain knowledge in decision-support systems. A Bayesian network in essence is a model of a probability distribution; it consists of an acyclic directed graph that captures the qualitative dependence structure of the distribution and a numerical part that specifies the conditional probability distributions of each variable given its parents.

For constructing a Bayesian network, often knowledge is acquired from experts in its domain of application. Experience shows that domain experts can quite easily and reliably specify the qualitative structure of the network, but have more problems in coming up with the probabilities for its numerical part. If data from every-day problem solving in the domain is available, therefore, one would like to use these data for estimating the required probabilities. In many cases, unfortunately, the available data sample is quite small, which may give rise to inaccurate estimates. The inaccuracies involved may in turn lead to a reasoning behaviour of the resulting network that runs counter to the qualitative knowledge of the experts in the domain.

We argue that expert knowledge about the qualitative influences between the variables in a Bayesian network can be used to improve the probability estimates obtained from small data samples. The problem of learning probabilities under the order constraints that result from such influences, is a special case of isotonic regression. Building upon this property, we present an estimator that is guaranteerd to produce estimates that satisfy the order constraints that have been specified by the experts. The resulting network as a consequence is less likely to exhibit counterintuitive reasoning behaviour and is more likely to be accepted than a network with unconstrained estimates.

Our experimental results for a small Bayesian network in the medical domain show that taking prior knowledge of influences into account leads to an improved fit of the true distribution, especially when only a small sample of data is available.

Some papers I have (co-)written on this subject: