Having studied a non-symbolic approach to machine learning (Artificial Neural Networks), we return to a logical approach, namely Inductive Logic Programming (ILP). As the name suggests, the representation scheme used in this approach is logic programs, which we covered in lecture 6. As a quick overview, one search strategy for ILP systems is to invert rules of deduction and therefore induce hypotheses which may solve the learning problem.
In order to understand ILP, we will define a context for ILP, and use this to state the machine learning problem being addressed. Following this, we will look at the search operators in ILP, in particular the notion of inverting resolution in order to generate hypotheses. We will consider how the search is undertaken and run through a session with the Progol ILP system. We end by looking at some of the applications of Inductive Logic Programming.
The development of Inductive Logic Programming has been heavily formal (mathematical) in nature, because the major people in the field believe that this is the only way to progress and to show progress. It means that we have to (re-)introduce some notation, and we will use this to formally specify the machine learning problem faced by ILP programs. To do this, we first need to refresh and re-represent our knowledge about logic programs, and define background, example and hypothesis logic programs. Following this, we will specify some prior conditions on the knowledge base that must be met before an agent attempts a learning task. We will also specify some posterior conditions on the learned hypothesis, in such a way that, given a problem satisfying the prior conditions, if our learning agent finds a hypothesis which satisfies the posterior conditions, it will have solved the learning task.
Logic programs are a subset of first order logic. A logic program contains a set of Horn clauses, which are implication conjectures where there is a conjunction of literals which imply a single literal. Hence a logic program consists of implications which look like this example:
∀ X, Y, Z (b1(X,Y) ∧ b2(X) ∧ ... ∧ bn(X,Y,Z) → h(X,Y))
Remember also that, in Prolog, we turn the implication around so that the arrow points from the body to the head, and the head of the clause comes first. We also assume universal quantification over all variables, so the ∀ can be removed. Hence we can write Horn clauses like this:
h(X,Y) ← b1(X,Y) ∧ b2(X) ∧ ... ∧ bn(X,Y,Z)
and everybody understands what we are saying. We will also adopt the convention of writing a conjunction of literals with a capital letter, and a single literal with a lower case letter. Hence, if we were interested in the first literal in the body of the above Horn clause, but not interested in the others, then we could write:
h(X,Y) ← b1, B
We see that the conjunction of literals b2(X) ∧ ... ∧ bn(X,Y,Z) has been replaced by B, and we have used a comma instead of a ∧ sign.
Also, we need to specify when one logic program can be deduced from another. We use the entailment sign ⊨ to denote this. If logic program L1 can be proved to be true using logic program L2, we write: L2 ⊨ L1. We use the symbol ⊭ to denote that one logic program does not entail another. It is important to understand that if L2 ⊭ L1, this does not mean that L2 entails that L1 is false, only that L2 cannot be used to prove that L1 is true.
Note also that, because we have restricted our representation language to logic programs, we can use a Prolog interpreter to prove the entailment of one logic program from another. As a final notation, it is important to remember that a logic program can contain just one Horn clause, and that the Horn clause could have no body, in which case the head of the clause is a known fact about the domain.
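To make the entailment test concrete, here is a toy sketch in Python for ground (variable-free) Horn programs only; the representation (a head literal plus a frozenset body) is illustrative and my own, not from any ILP system. A real system would call a full Prolog interpreter with unification.

```python
# Toy entailment check for ground (variable-free) Horn programs.
# A clause is (head, frozenset(body)); a fact has an empty body.
def entails(program, goal):
    """Return True if the program proves the goal by forward chaining."""
    known = set()
    changed = True
    while changed:
        changed = False
        for head, body in program:
            if head not in known and all(b in known for b in body):
                known.add(head)
                changed = True
    return goal in known

# A two-clause program: pin_in_hat is a fact, and pain <- pin_in_hat.
B = [
    ("pin_in_hat", frozenset()),
    ("pain", frozenset({"pin_in_hat"})),
]
print(entails(B, "pain"))      # True: B |= pain
print(entails(B, "no_pain"))   # False: B does not entail no_pain
```

Note that failure to prove `no_pain` does not mean the program entails its negation, matching the remark about ⊭ above.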
We will start off with three logic programs. Firstly, we will have the logic program representing a set of positive examples for the concept we wish to be learned, and we denote the set of examples E+. Secondly, we will have a set of negative examples for the concept we wish to be learned, labelled E-. Thirdly, we will have a set of Horn clauses which provide background concepts, and we denote this logic program B. We will denote the logic program representing the learned hypothesis H.
Normally, E+ and E- will be ground facts, i.e., Horn clauses with no body. In this case, we can prove whether an example follows from the hypothesis, as all of these are still logic programs. When an example (positive or negative) is proved to be true using a hypothesis H, we say that H (taken along with B) explains the example.
Firstly, we must make sure that our problem has a solution. If one of the negative examples can be proved to be true from the background information alone, then clearly any hypothesis we find will not be able to compensate for this, and the problem is not satisfiable. Hence, we need to check the prior satisfiability of the problem:
∀ e ∈ E- (B ⊭ e).
Any learning problem which breaks the prior satisfiability condition has inconsistent data, so the user should be made aware of this. Note that this condition does not mean that B entails that any negative example is false, so it is certainly possible to find a hypothesis which, along with B entails a negative example.
In addition to checking whether we will be able to find a solution to the problem, we also have to check that the problem isn't solved already by the background information. That is, if the problem satisfies the prior satisfiability condition, and each positive example is entailed by the background information, then the background logic program B would itself perfectly solve the problem. Hence, we need to check that at least one positive example cannot be explained by the background information B. We call this condition the prior necessity condition:
∃ e ∈ E+ (B ⊭ e).
Given a problem which satisfies the prior conditions, we define here two properties that the hypothesis learned by our agent, H, will satisfy if it solves the concept learning problem. Firstly, H should satisfy the posterior satisfiability condition that, taken together with the background logic program, it does not entail any negative example:
∀ e ∈ E- ((B ∧ H) ⊭ e).
Also, we must check that all the positive examples are entailed if we take the background program in conjunction with the hypothesis. This is called the posterior sufficiency condition:
∀ e ∈ E+ ((B ∧ H) ⊨ e).
It should be obvious that any hypothesis satisfying the two posterior conditions will be a perfect solution to the learning problem.
Given the above context for ILP, we can state the learning problem as follows: we are given a set of positive and a set of negative examples represented as logic programs E+ and E- respectively, and some background clauses making up a logic program B. These logic programs satisfy the two prior conditions. Then the learning problem is to find a logic program, H, such that H, B, E+ and E- satisfy the posterior conditions.
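Assuming some entailment test entails(program, example) is available (in practice, a Prolog interpreter), the four conditions can be sketched as simple checks. The function names below are illustrative, not from any ILP system, and conjoining two logic programs is modelled as list concatenation.

```python
# Sketch of the ILP prior/posterior conditions, parameterised by an
# entails(program, example) test. B and H are logic programs (lists);
# B + H models their conjunction.
def prior_satisfiability(entails, B, E_neg):
    # For all e in E-: B does not entail e.
    return all(not entails(B, e) for e in E_neg)

def prior_necessity(entails, B, E_pos):
    # There exists e in E+ which B does not entail.
    return any(not entails(B, e) for e in E_pos)

def posterior_satisfiability(entails, B, H, E_neg):
    # For all e in E-: (B and H) does not entail e.
    return all(not entails(B + H, e) for e in E_neg)

def posterior_sufficiency(entails, B, H, E_pos):
    # For all e in E+: (B and H) entails e.
    return all(entails(B + H, e) for e in E_pos)
```

A hypothesis passing both posterior checks is, by definition, a perfect solution to the stated learning problem.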
Suppose a particular person alternates the two hats they own every day. Every time they wear hat X, they get a pain in the head. Every time they wear hat Y, they get no pain. They know only one thing about headaches: that if they ever get a pin stuck in their hat, they will get a pain in the head. Given the repetition of the pain, they induce the statement that hat X has a pin in it, and they have a look under the hat. They see a pin, so they can then use Modus Ponens to prove that the pin was causing (some of) the pain.
Note that their assumption that hat X had a pin in it was unsound - there could be a number of reasons for the pain in the head. However, given their single rule about headaches, they knew that the only way they would be able to prove a reason for the pain was true was if there was a pin in the hat. Rationalising this formally, the person was aware of the Modus Ponens rule of deduction:
P → Q    P
──────────
    Q

They also knew the axiom P → Q (i.e., pin → pain) and they knew from observation that Q (pain) was true. They then induced P as a hypothesis under the knowledge that, if true, they could use the above rule to prove Q.
Suppose another scenario occurs: the same person is getting a pain in the head when they wear hat X. They look under the hat for no apparent reason, and see it has a pin in it. They then induce the rule that hats with pins in cause pain. In this case, the person has induced the P → Q part of Modus Ponens for the same reason as before: they want to prove that the pain is caused by the pin. Again, this was an unsound assumption - there could have been many other rules or other facts about the hat from which the person could have used Modus Ponens to determine the cause of pain.
In both cases, the person has inverted the Modus Ponens rule of inference so that, if they induce a hypothesis in this way, then they will have the peace of mind that, if the hypothesis is true, then they will be able to prove that the theorem statement follows (that they are getting pain). Given this, we see that Modus Ponens is not an ideal choice, because using Modus Ponens alone is not a complete inference method: there are theorems in first order logic which cannot be proved using Modus Ponens alone.
Luckily, we know an inference rule which is complete: resolution. Therefore, it makes sense to work out how to invert this rule. As with Modus Ponens, there are multiple ways to invert the rule, and we look at four possibilities. The first two induce a single hypothesis in such a way that, once the new hypothesis has been induced, a single resolution step is required to carry out the verification (proof). The second two are such that three new hypotheses are induced and it takes two resolution steps to carry out the verification.
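Before looking at the inverted operators, it may help to see a single (forward) resolution step. The sketch below is propositional and uses my own representation: a clause is a set of literals, with negation marked by a leading "-", so the Horn clause p ← q, B corresponds to the set {p, -q} together with the negated literals of B.

```python
# Sketch of one propositional resolution step. Clauses are sets of
# literals; "-x" is the negation of "x".
def resolve(c1, c2):
    """Resolve c1 and c2 on the first complementary literal pair found,
    returning the resolvent, or None if no pair exists."""
    for lit in c1:
        comp = lit[1:] if lit.startswith("-") else "-" + lit
        if comp in c2:
            return (c1 - {lit}) | (c2 - {comp})
    return None

# (p <- q, b) is {p, -q, -b}; resolving with (q <- a), i.e. {q, -a},
# resolves q away and leaves (p <- a, b):
print(sorted(resolve({"p", "-q", "-b"}, {"q", "-a"})))   # ['-a', '-b', 'p']
```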
If we have the clauses q ← A and p ← A, B, then we can use the absorption rule as depicted below. Note that we write the given clauses above the line and the induced clauses below the line. Remember also that a capital letter represents a conjunction of literals, whereas a lower case letter represents a single literal.
q ← A    p ← A, B
─────────────────
q ← A    p ← q, B
We see that a new hypothesis has been introduced (the one on the right below the line). This is similar to the one on the right above the line, but the conjunction A has been replaced by the single literal q. The reason we have written the first clause again is to highlight which clauses below the line are needed in a resolution step to prove the clauses we started with above the line. We can demonstrate that our known clauses follow from our hypothesised clause by carrying out this resolution step. Instead of writing it upside down to demonstrate the deduction, which might be confusing, we write a "v" diagram as follows:
q ← A     p ← q, B
    \      /
    p ← A, B
Here, the q literal has been resolved to leave the fact that we wanted our hypotheses to prove, as required. We say that absorption is a V-operator. As an exercise, translate this into CNF and convince yourself that resolution does indeed work here.
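Treating clauses propositionally, absorption can be sketched as a purely syntactic operation. This is a sketch under my own representation (a head plus a frozenset body); it is not Progol code.

```python
# Sketch of the absorption V-operator on clauses (head, frozenset(body)).
# Given q <- A and p <- A ∪ B, induce p <- {q} ∪ B.
def absorption(c1, c2):
    q, A = c1
    p, body2 = c2
    if not A <= body2:      # A must occur inside the second clause's body
        return None
    B = body2 - A
    return (p, frozenset({q}) | B)

c1 = ("is_president", frozenset({"lives_in_whitehouse", "is_politician"}))
c2 = ("is_male", frozenset({"lives_in_whitehouse", "is_politician",
                            "is_a_republican"}))
head, body = absorption(c1, c2)
print(head, "<-", sorted(body))   # is_male <- ['is_a_republican', 'is_president']
```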
As an example, suppose we performed the following piece of (inductive) reasoning. We know that anyone who lives in the White House and is a politician is the president. We also know that anybody who is a republican politician living in the White House is male, and we want to explain this. If we make the following substitution into the absorption rule:
p = is_male(X)
q = is_president(X)
A = lives_in_whitehouse(X) ∧ is_politician(X)
B = is_a_republican(X)
Then we get this induction:
is_male(X) ← lives_in_whitehouse(X), is_politician(X), is_a_republican(X)
─────────────────────────────────────────────────────────────────────────
is_male(X) ← is_president(X), is_a_republican(X)
This means that, in response to the question: "why are all republican politicians living in the White House male?", we have answered: "all republican presidents are male and any politician living in the White House is the president, therefore, all republican politicians living in the White House are male." Hence, we have used inductive inference to hypothesise that republican presidents are male. In this case, the induced rule is correct.
If we have the clauses p ← A, B and p ← A, q, we could hypothesise that these are explained by the clause q ← B. That is, we could use the following way to invert resolution:
p ← A, B    p ← A, q
────────────────────
q ← B    p ← A, q
and we again show that the starting clauses can be deduced from the hypothesis with a "v" diagram:
p ← A, q     q ← B
    \       /
    p ← A, B
In this case, the q literal has been resolved away. As with absorption, we say that identification is a V-operator. Using the same substitution as above, we get a slightly different story: suppose we know that, as before, anybody who is a republican politician living in the White House is male. Suppose further we know that politician presidents who live in the White House are male, and we want to explain this. Then we can use identification to induce this rule:
is_president(X) ← is_a_republican(X)

Since not all republicans are president, even though this induced clause explains the observations, it is false.
Suppose we start with two known clauses which have the same head, p, and bodies which both contain a subset of conjoined literals, A:
p ← A, B
p ← A, C
In this case, we can induce three additional hypotheses as follows:
p ← A, B    p ← A, C
────────────────────
q ← B    p ← A, q    q ← C
Note that we have introduced a new symbol: q, which was not previously defined. q is a new predicate, and hence this is a process called predicate invention. Few machine learning techniques actually introduce new concepts not defined as part of the background knowledge in such a general way. For instance, in an illustrative (toy) example in the ILP literature, a predicate for insertion in a list is discovered while the ILP system learns the concept of insertion sort.
To prove that the given clauses follow from the hypotheses, we can portray the two resolution steps required in a "w" diagram as follows:
q ← B    p ← A, q    q ← C
    \    /      \    /
    p ← A, B    p ← A, C
We see that in both resolution steps, the invented predicate q is resolved away. We say that Intra-construction is a W-operator.
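A propositional sketch of intra-construction, in the same illustrative representation as before, shows the predicate invention step explicitly: the new predicate q is simply a fresh name not appearing in the given clauses.

```python
import itertools

_fresh = itertools.count()  # supplies fresh names for invented predicates

def intra_construction(c1, c2):
    """W-operator sketch: from p <- A∪B and p <- A∪C induce
    q <- B, p <- A∪{q} and q <- C, with q a freshly invented predicate."""
    p1, body1 = c1
    p2, body2 = c2
    assert p1 == p2, "clauses must share the same head"
    A = body1 & body2                    # the common body literals
    B, C = body1 - A, body2 - A
    q = f"invented_{next(_fresh)}"       # predicate invention
    return [(q, B), (p1, A | {q}), (q, C)]

for head, body in intra_construction(("p", frozenset({"a1", "a2", "b"})),
                                     ("p", frozenset({"a1", "a2", "c"}))):
    print(head, "<-", sorted(body))
```

Resolving the invented predicate away in two steps, as in the "w" diagram, recovers the two given clauses.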
If we start with two clauses with different heads, but for which the bodies both contain a subset of literals, A:
p ← A, B
q ← A, C
then we can induce three hypotheses as follows:
p ← A, B    q ← A, C
────────────────────
p ← r, B    r ← A    q ← r, C
In this case, predicate invention has again occurred to introduce new predicate r.
As usual, to prove that the given clauses follow from the hypotheses, we can portray the two resolution steps required in the following "w" diagram:
p ← r, B    r ← A    q ← r, C
    \       /   \       /
    p ← A, B    q ← A, C
We see that, as before, the invented predicate has been resolved away in both steps. As with Intra-construction, Inter-construction is a W-operator.
As with the previous two learning methods we've looked at - in particular artificial neural networks - within the overall framework, there is a huge amount of variation in the way in which the learning is carried out. The same is true for ILP - under the general context given in section 14.1, it seems that almost every possibility for generating a suitable logic program has been tried. We look at the tip of this iceberg here.
In Muggleton and De Raedt's comprehensive study of ILP theory and methods (Journal of Logic Programming, 1994), they define induction as the inverse of deduction and introduce the following definitions:
A hypothesis G is more general than a hypothesis S if and only if G ⊨ S. S is also said to be more specific than G.
A deductive inference rule, r, maps a conjunction of clauses G onto a conjunction of clauses S such that G ⊨ S. r is called a specialisation rule.
An inductive inference rule, r, maps a conjunction of clauses S onto a conjunction of clauses G such that G ⊨ S. r is called a generalisation rule.
Under this formalism, the deductive rules of inference we have looked at in lecture 7 will be thought of as specialisation rules, and the inductive rules of inference derived from inverting resolution above are thought of as generalisation rules. One big contrast between different ILP systems is the way in which they search: either from the general to specific or from the specific to general.
One class of systems work from the specific to general, and start from the examples and background knowledge. That is, they start with the most specific hypotheses, which can be obtained by grounding variables and conjoining terms (see Progol example). They then repeatedly apply generalisation (inductive) rules so that they generate a hypothesis which entails more and more positive examples. Of course, they also take care to make sure the hypothesis does not imply negative examples. Another class of systems work from the general to the specific. These start with the empty clause as the hypothesis, which entails everything. They repeatedly apply specialisation (deductive) rules so that they generate a hypothesis which entails fewer and fewer negative examples. Of course, these systems take care to make sure the hypothesis continues to imply all the positive examples.
Because we can test whether each hypothesis explains (entails) a particular example, we can associate with each hypothesis the set of positive examples it explains and, similarly, the set of negative examples it explains. These sets relate to generality as described above: if a hypothesis G is more general than a hypothesis S, then the examples explained by S will be a subset of those explained by G.
We will assume the following generic search strategy for an ILP system: (i) a set of current hypotheses, QH, is maintained; (ii) at each step in the search, a hypothesis H is taken from QH and some inference rules are applied to it in order to generate new hypotheses, which are then added to the set (we say that H has been expanded); (iii) this continues until a termination criterion is met.
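The generic strategy can be sketched as an agenda loop in which the scoring function, the expansion operators and the success test are all parameters. Everything named here is illustrative; real systems use a Bayesian score, as discussed below.

```python
# Sketch of the generic ILP search loop: keep a scored agenda QH of
# hypotheses, always expand the best-scoring one, prune zero scores.
def ilp_search(initial, expand, score, solved, max_steps=1000):
    QH = [h for h in initial if score(h) > 0]
    for _ in range(max_steps):
        if not QH:
            return None
        QH.sort(key=score, reverse=True)     # best hypothesis first
        H = QH.pop(0)
        if solved(H):
            return H
        for H2 in expand(H):                 # apply inference rules to H
            if score(H2) > 0:                # prune zero-probability hypotheses
                QH.append(H2)
    return None

# Toy run: "hypotheses" are integers, expanding n yields n+1 and n+2,
# every hypothesis scores 1, and we seek the hypothesis 10.
print(ilp_search([0], lambda n: [n + 1, n + 2],
                 lambda n: 1, lambda n: n == 10))   # 10
```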
This leaves many questions unanswered. Looking first at the question of which hypothesis to expand at a particular stage, ILP systems associate a label with each hypothesis generated which expresses a probability of the hypothesis holding given that the background knowledge and examples are true. Then, hypotheses with a higher probability are expanded before those with a lower probability, and hypotheses with zero probability are pruned from the set QH entirely. This probability calculation is derived using Bayesian mathematics and we do not go into the derivation here. However, we hint at two aspects of the calculation in the paragraphs below.
In specific to general ILP systems, the inference rules are inductive, so each operator takes a hypothesis and generalises it. As mentioned above, this means that the hypothesis generated will explain more examples than the original hypothesis. As the search gradually makes hypotheses more general, there will come a stage when a newly formed hypothesis H is general enough to explain a negative example, e-. This should therefore score zero for the probability calculation because it cannot possibly hold given the background and examples being true. Furthermore, because the operators only generalise, there is no way by which H can be fixed to not explain e-, so pruning it from QH because of the zero probability score is a good decision.
A similar situation occurs in general to specific ILP systems, where the inference rules are deductive, hence they specialise. At some stage, a hypothesis will become so specialised that it fails to explain all the positive examples. In this case, a similar pruning operation can be imposed because further specialisation will not rectify the situation. Note that in practice, to compensate for noisy data, there is more flexibility built into the systems. In particular, the posterior conditions which specify the problem can be relaxed, and hypotheses which explain only a small number of negative examples may not be pruned immediately.
We can see how the examples could be used to choose between two non-pruned hypotheses: if performing a specific to general search, then the number of positive examples explained by a hypothesis can be taken as a value to sort the hypotheses with (more positive examples explained being better). Similarly, if performing a general to specific search, then the number of negatives still explained by a hypothesis can be taken as a value to sort the hypotheses with (fewer negatives being better).
This may, however, be a very crude measure, because many hypotheses might score the same, especially if there is a small number of examples. When all things are equal, an ILP system may employ a sophisticated version of Occam's razor, and choose between two equal scoring hypotheses according to some function derived from Algorithmic Complexity theory or some similar theory.
Another way to reduce the search space is to be more specific about how the predicates in the hypothesis Horn clauses will look. One approach to this is to limit the number of existentially quantified variables allowed in the learned clauses. Another approach is to more explicitly specify what the learned hypotheses will look like, in terms of restrictions on both their head and their body. In particular, in some ILP systems it is possible to specify, for each predicate allowed in the body of a hypothesis, whether each of its arguments is to be a ground term, a new variable or a variable appearing in the head. Such fine-tuning of the background information can reduce search times dramatically.
Progol is a state of the art ILP system developed by Stephen Muggleton (a Professor here at Imperial). There are 5 versions of Progol available for download here: http://www.doc.ic.ac.uk/~shm/Software/. Note that each version performs differently, so the results for the same data set may differ between systems. One of the example files supplied is the "animals" toy dataset which we have used for illustrative purposes previously. Our purpose here is to look at this input file and look at the output produced by Progol, and highlight the most important parts of both.
The purpose of this machine learning exercise is to learn a set of rules for the taxonomic classification of animals into one of the following classes: mammal, fish, reptile and bird. The rules will be defined in terms of a set of attributes that are given as background concepts. These attributes include physical properties such as the number of legs (0,1,2,4) and the covering of the animal (fur, feathers, none, scales); and other attributes such as the habitat(s) the animal lives in (land, air, water) and whether the animal produces eggs, etc. 16 animals are supplied and 7 attributes to derive rules from are also given.
At the top of the input file, there is a set of mode declarations, which are language restrictions as described above. The first mode declaration dictates what the head of the hypothesis logic program clauses will look like. The mode declaration is written thus:

:- modeh(1,class(+animal,#class))?

This stipulates that the head of the hypothesis clauses will have predicate name class, that the first argument will take in a given animal name (specified by the +), and that the second argument will return a ground instance (specified by the #) of type class. Hence this means that the learned predicate will take in the name of an animal and return a class, as required.
The other mode declarations stipulate the format of the predicates in the body of the clauses. For instance, this mode declaration:

:- modeb(1,has_eggs(+animal))?

specifies that the predicate has_eggs/1 will be available for the body of the hypothesis clauses (although it isn't necessary to include it). Also, this mode declaration:
:- modeb(1,has_legs(+animal, #nat))?
stipulates that the predicate has_legs/2 can be put in the body of the hypothesis clauses. It will appear with an instantiated natural number for the second argument.
Following the mode declarations, there is information about the types of ground variables in the background information:
animal(dog).  animal(dolphin).  ...  animal(penguin).
class(mammal).  class(fish).  class(reptile).  class(bird).
covering(hair).  covering(none).  covering(scales).  covering(feathers).
habitat(land).  habitat(water).  habitat(air).  habitat(caves).
Following the type information comes the background concepts, instantiated for the animals available:
has_covering(dog, hair).  ...  has_covering(crocodile, scales).  etc.
has_legs(dog, 4).  ...  has_legs(penguin, 2).  etc.
has_milk(dog).  ...  has_milk(platypus).  etc.
homeothermic(dog).  ...  homeothermic(penguin).  etc.
habitat(dog, land).  ...  habitat(penguin, water).  etc.
has_eggs(platypus).  ...  has_eggs(eagle).  etc.
has_gills(trout).  ...  has_gills(eel).  etc.
Finally, the input file has the positive and negative examples:
% Positives:
class(lizard, reptile).
class(trout, fish).
class(bat, mammal).
etc.

% Negatives:
:- class(trout, mammal).
:- class(herring, mammal).
:- class(platypus, reptile).
etc.
In the output file, we get some indication of how the search progressed. First of all, Progol starts with the example class(lizard,reptile). It then uses the background predicate values for lizard to come up with this most specific hypothesis for an animal being a reptile:
class(A, reptile) :- has_covering(A, scales), has_legs(A, 4), has_eggs(A), habitat(A, land).
Progol then tries various generalisations of this hypothesis such as:
class(A, reptile) :- has_covering(A, scales).
class(A, reptile) :- has_eggs(A), has_legs(A, 4).
From the 12 generalisations found, it chooses this one as the best:
class(A, reptile) :- has_covering(A, scales), has_legs(A, 4).
The generation of the most specific hypothesis followed by the generalisation routine continues in order to find rules for the animals classed as mammal, fish and bird. At the end of the file, the chosen rules are summarised thus:
class(A,reptile) :- has_covering(A,scales), has_legs(A,4).
class(A,mammal) :- homeothermic(A), has_milk(A).
class(A,fish) :- has_legs(A,0), has_eggs(A).
class(A,reptile) :- has_covering(A,scales), habitat(A,land).
class(A,bird) :- has_covering(A,feathers).
Note that the hypothesis is designed to be used in this order, so that there are, for instance, two tries at checking whether an animal is a reptile. Note that this hypothesis scores 100% for predictive accuracy over the training set. Furthermore, the rules are easily understood.
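To see why the ordering matters, the learned rules can be applied like a decision list: try each rule in turn and return the class of the first whose body is satisfied. The sketch below transcribes the five rules by hand into Python; the animal descriptions are illustrative fragments, not the full Progol dataset.

```python
# Sketch: the learned rules applied in order, decision-list style.
# An animal is a set of attribute tokens like ("covering", "scales").
RULES = [
    ("reptile", lambda a: ("covering", "scales") in a and ("legs", 4) in a),
    ("mammal",  lambda a: "homeothermic" in a and "milk" in a),
    ("fish",    lambda a: ("legs", 0) in a and "eggs" in a),
    ("reptile", lambda a: ("covering", "scales") in a and ("habitat", "land") in a),
    ("bird",    lambda a: ("covering", "feathers") in a),
]

def classify(attributes):
    for cls, body in RULES:
        if body(attributes):      # first rule whose body holds fires
            return cls
    return None

trout = {("covering", "scales"), ("legs", 0), "eggs", ("habitat", "water")}
bat   = {("covering", "fur"), ("legs", 2), "homeothermic", "milk"}
print(classify(trout), classify(bat))   # fish mammal
```

Note that the trout has scales but zero legs, so the first reptile rule does not fire and the fish rule classifies it correctly, illustrating the role of the ordering.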
As noted above, the output from ILP systems is quite easy to understand, because it is in the form of logic programs. This is a distinct advantage of ILP over other learning approaches. In particular, this makes ILP a good choice for scientific applications, where it is often more important that the hypotheses discovered by the system can be understood by the scientist user, than the predictive accuracy of the hypotheses. Below are some applications which give a flavour of the real world tasks to which ILP has been applied.
Finite element methods are used by engineers to analyse stresses in physical structures. These methods depend on modelling the structures with finite meshes at particular resolutions. Considerable expertise is required to choose the correct resolution to avoid unnecessary computational overheads (too fine a mesh), but to reduce approximation errors (not fine enough). ILP systems were used to determine rules for the mesh resolution of edges in the structure in terms of certain properties of the structure being modelled. The Golem, LINUS, FOIL and CLAUDIEN systems were used and produced novel, understandable rules which achieved around 78% predictive accuracy.
As lamented in previous lectures, drug companies lose millions of pounds by developing drugs which eventually turn out to be toxic to humans. In one of a number of experiments with example drugs split into toxic and non-toxic sets, Progol was used to determine why certain chemicals were mutagenic (a property related to carcinogenicity) and others were not. It produced rules which achieved 89% predictive accuracy over a set of 188 compounds.
In another data set of 42 chemicals chosen because they were "regression unfriendly", Progol produced a single rule with 88% predictive accuracy. Because of the understandability of its results, the rule could easily be interpreted as a new structural predictor, namely the substructure given above, where the ring represents an aromatic carbon ring. Note the generality of this result: atoms Y and Z can be any atoms.
The structure of a protein determines to some extent its function, and given that most biological processes are influenced by protein functions, the prediction of protein structure is a very important problem. The structures have macro properties containing many atoms which are known as secondary structures. These properties are shapes such as sheets, helices and hair-pins. The application of the Golem ILP system to this prediction task produced better results than any contemporary learning approach. Also, in experiments to qualitatively model the structure-activity relationship of a series of drugs, while Golem didn't produce better accuracy than linear regression, it did provide insight into the stereochemistry of the system on each occasion.
Programs written in certain languages can be formally proved to be correct (i.e., perform as per their specifications). To do so, it is necessary to find suitable conditions that always hold at given points in the program. One particular case is to find suitable conditions within loops, and these conditions are called loop invariants. An ILP system was used to generate such loop invariants and did so successfully and straightforwardly. The induction of an invariant for a parallel program was also demonstrated.