We have looked at the automation of intelligent tasks using deductive reasoning, where new information follows necessarily from old. We now look at the use of inductive reasoning, where new information is generalised from old.
Learning in humans is composed of many things. Firstly, certain things must be memorised, like dates, places, formulae and so on. Also, we must undertake comprehension tasks, so that we can extract the most important notions from books, web pages, lectures, television programmes, etc. Finally, we must be able to learn from examples. For instance, if a teacher told us that the following are square numbers:
1, 4, 9, 16
and then pointed out that:
1 = 1 * 1
4 = 2 * 2
9 = 3 * 3
16 = 4 * 4
we could probably generalise the notion of square numbers from being 1, 4, 9 or 16 to being any number which can be written as the multiplication of another number by itself. That is, we have learned the definition of square numbers without being explicitly taught it.
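This learned definition can be checked mechanically. A minimal sketch in Python (the `is_square` name is ours, purely for illustration):

```python
import math

def is_square(n):
    """Generalised concept learned from the examples 1, 4, 9, 16:
    n is square if it equals m * m for some whole number m."""
    m = math.isqrt(n)
    return m * m == n

# The concept covers the original examples and unseen numbers alike.
print([n for n in range(1, 30) if is_square(n)])  # [1, 4, 9, 16, 25]
```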
As with many areas in AI, machine learning has become fairly specialised. In this case, learning from examples dominates the field, partially because of the applications it affords. If a learning agent can look at examples of share prices and learn a reason why some shares fall during the first financial quarter, this is of great commercial advantage. If another agent can learn reasons why certain chemicals are toxic and others are not, this is of great scientific value. If yet another agent can learn what a tank looks like given just photographs of tanks, this is of great military advantage.
Machine learning, as with automated reasoning, aims to use reasoning to find new, relevant information given some background knowledge. This information may then be used towards completing a more complex intelligent task. In the same way that the deductive process was harnessed for the particular task of proving theorems, machine learning is harnessed for particular tasks. One task is categorisation:
Of course, if an agent has learned a way of correctly categorising examples, then this can be used to predict the category of unseen examples. By stating prediction tasks as categorisation tasks, machine learning agents can be powerful tools for making predictions. Hence another task is:
Machine learning application domains include:
If you have a classification or prediction task you wish to program an agent to perform, then it is likely that a machine learning algorithm will be of use. In this case, you will need to correctly specify a learning problem in terms of the categorisation required, the examples available, and the background information about those examples. You will also have to worry about the errors in your data.
We will be automating the process whereby an agent learns from examples. The agent will learn a way of categorising examples into categories, and this may, in turn, be used for prediction tasks. One of the most important specifications of a machine learning problem is therefore the examples. These will depend on the particular problem: for instance, in medical diagnosis applications, the example set will be the patients. In handwriting recognition, the examples may be graphics files derived from scanning in hand-written letters, e.g., 8 by 8 matrices of black and white pixels such as this:
Often the problem faced by a learning agent is expressed as a concept learning problem, whereby the agent must learn a concept which achieves the correct categorisation. The concept will therefore be a function which takes an example as input and outputs the correct categorisation of that example.
Usually, to get a machine learning agent to learn a concept, you will have to supply both positive and negative examples of the concept. Positive examples - often called just positives - are pairs (E, C) where E is an example from the data set and C is the correct categorisation of E. Negative examples - often called just negatives - are pairs (E, X) where E is an example from the data set and X is an incorrect categorisation of E.
For instance, if our learning problem was to learn the categorisation of animals into one of four classes: bird, fish, mammal and reptile, then we would supply some positives such as:
positive(cat, mammal). positive(dog, mammal). positive(trout, fish). positive(eagle, bird). positive(crocodile, reptile).
and some negatives such as:
negative(condor, fish). negative(mouse, bird). negative(lizard, mammal). negative(platypus, bird). negative(human, reptile).
If a machine learning approach can work without negative examples of the concept, then we say it can learn from positives only.
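One way to picture these definitions is to encode the examples as (example, category) pairs and test whether a candidate concept is consistent with them. A minimal sketch, where the consistency test and the hand-built category assignments are our own assumptions for illustration:

```python
# Hypothetical encoding of the animal examples as (example, category) pairs.
positives = [("cat", "mammal"), ("dog", "mammal"), ("trout", "fish"),
             ("eagle", "bird"), ("crocodile", "reptile")]
negatives = [("condor", "fish"), ("mouse", "bird"), ("lizard", "mammal"),
             ("platypus", "bird"), ("human", "reptile")]

def consistent(concept):
    """A concept (a function from example to category) is consistent with
    the data if it gives the stated category for every positive and avoids
    the stated (incorrect) category for every negative."""
    return (all(concept(e) == c for e, c in positives) and
            all(concept(e) != x for e, x in negatives))

# An illustrative hand-built concept (the extra categories are assumptions):
categories = dict(positives)
categories.update({"condor": "bird", "mouse": "mammal", "lizard": "reptile",
                   "platypus": "mammal", "human": "mammal"})
print(consistent(lambda e: categories[e]))  # True
```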
The background information (knowledge) associated with a machine learning problem is the set of concepts which we hope to see utilised in the concept learned to achieve categorisations. The background knowledge describes the examples supplied. For instance, in the animals problem, it is possible to learn a way to categorise animals correctly using concepts such as the habitat in which animals live (land, air, water), the number of legs they have, whether they produce eggs, etc.
Some background knowledge will be necessary in order to specify the examples. For instance, to be able to use the examples in the handwriting problem, we need to describe them in terms of their pixels. Other knowledge may not be necessary to describe the examples, but useful in finding solutions. Other knowledge still may be surplus to requirements and not appear in solutions.
For instance, in some domains, such as handwriting recognition, it is expected that something like a neural network (see later) approach will work with only the pixels as background information. In other domains such as chemistry, some methods may work with only the background information required to describe molecules (atoms, bonds between atoms, etc.), but it is generally believed that more high level concepts will enhance the effectiveness of learned solutions. This high level information could be chemical qualities (solubility, latent heat, etc.) and substructure information (benzene rings, hydroxyl groups, etc).
Also, background information can include relationships between background concepts. These relationships are analogous to the axioms supplied to automated reasoning programs, and they may help a machine learning program to search the space more efficiently.
There are many toy problems in Artificial Intelligence, such as the Michalski Train problem described below. Toy problems are artificial and often contrived to be difficult for various techniques. Programming agents to overcome certain problems and solve such toy problems leads to advances in the understanding and effectiveness of AI techniques, and toy problems can be used to compare competitive methods.
The Michalski train problem was invented by Ryszard Michalski around 20 years ago. It is simply stated - find a concept which explains why the five trains on the left are travelling eastbound and the ones on the right are travelling westbound. The solution must involve the concepts on view: size, number, position, contents of carriages, etc. It is left as an exercise to find the solution... This is a standard problem which has been used to test and demonstrate many machine learning techniques. In particular, Inductive Logic Programming implementations (which we discuss later) use this example for demonstrative purposes.
However, it is worth remembering that in machine learning, we often have to deal with data taken from the real world, and real world data contains errors. Errors come in various flavours, including: (i) incorrect categorisation of examples, such as saying that a platypus is a bird - it does lay eggs - when it is actually a mammal; (ii) wrong values for background information about the examples, such as low quality scanning of hand-written letters leading to a pixel which was not written on being scanned as black instead of white; (iii) missing data, for instance, examples for which we do not know the value of all the background concepts; and (iv) repeated data for examples, possibly containing inconsistencies.
Certain learning algorithms are fairly robust with respect to errors in the data, and other algorithms perform badly in the presence of noisy data. For instance, the learning algorithm FIND-S we discuss later is not very robust to noisy data. Hence it is important to assess the amount of error you expect to be in your data before you choose the machine learning techniques to try.
Writing a machine learning algorithm comprises three things: how to represent the solutions, how to search the space of solutions for a set of solutions which perform well and how to choose from this set of best solutions.
The solution(s) to machine learning tasks are often called hypotheses, because they can be expressed as a hypothesis that the observed positives and negatives for a categorisation are explained by the learned concept. The hypotheses have to be represented in some representation scheme, and, as usual with AI tasks, this will have a big effect on many aspects of the learning methods. We will look at a number of ways to represent solutions and the associated methods for agents to learn using them.
It is important to bear in mind that a solution to a machine learning problem will be judged in worth along (at least) these three axes: (i) accuracy: as discussed below, we use statistical techniques to determine how accurate solutions are, in terms of the likelihood that they will correctly predict the category of new (unseen) examples (ii) comprehensibility: in some cases, it is highly desirable to be able to understand the meaning of the hypotheses (iii) utility: there may be other criteria for the solution which override the accuracy and comprehensibility, e.g., in biological domains, when drugs are predicted by machine learning techniques, it is imperative that the drugs can actually be synthesised.
Each learning task will be better suited by one or more representation schemes. For example, to some extent, it is not important exactly how a learned hypothesis predicts stock market movements, as long as it is accurate. Hence, in this case, it is perfectly acceptable to use so called black box representations and techniques, such as neural nets. These methods often yield high predictive accuracy, but provide hypotheses which are difficult to understand (they are black boxes, and one cannot look inside). In scientific domains, training a neural network to perform prediction tasks may well be more effective in terms of predictive accuracy than using another approach. However, the other approach may yield an answer which is more understandable and from which more science will flow.
For some methods, such as training neural networks, it's not all that useful to think of the method as searching, as it is really just performing calculations. In other techniques, such as Inductive Logic Programming, however, search certainly takes place, and we can think about the specifications of a search problem as we have done in game playing and automated reasoning. One important consideration in machine learning is whether the algorithm will search for more general or more specific solutions first. More general solutions might be advantageous because the user may be able to instantiate variables as they see fit, and general solutions may offer a range of possibilities for the task at hand. However, more specific solutions, which specify more precisely a property of the target concept, might also be advantageous.
Certain learning techniques learn a range of hypotheses as solutions to the problem at hand. These hypotheses usually range over two axes: their generality and their predictive accuracy over the set of examples supplied. We have mentioned that machine learning algorithms are applied to predicting the categorisation of unseen examples, and this is also how learning techniques are evaluated and compared. Hence, a machine learning algorithm must choose a single hypothesis, so that this hypothesis can be used to predict the category of an unseen example.
The overriding force in machine learning assessment is the predictive accuracy of learned hypotheses over unseen examples. The best bet for predictive accuracy over unseen examples is to choose the hypothesis which achieves the best accuracy over the seen examples, unless it overfits the data (as explained later). Hence, the set of hypotheses to choose from is usually narrowed down straight away to those which achieve the best accuracy when used to predict the categorisation of the examples given to the learning process. Within this set, there are various possibilities for choosing the candidate to use for the prediction.
Often, Occam's Razor is called into effect: the simplicity of the hypotheses is evaluated and the simplest one is chosen. However, it is worth noting that there are other reasons why a particular hypothesis may be more useful than another, as discussed further below.
To start our exploration of machine learning techniques, we shall look at a very simple method which searches through hypotheses from the most specific to the most general. This is the FIND-S (find specific) method, due to Tom Mitchell (the author of the standard machine learning text). To use this method, we need to choose a structure for the solutions to the machine learning task. Once we have chosen a structure, our learning agent first finds all the possible hypotheses which are as specific as possible. How we measure the specificity of a hypothesis depends on the representation scheme used. In first order logic, for instance, a more general hypothesis will have more variables in the logic sentence describing it than less general hypotheses.
The FIND-S method takes the first positive example P_{1} and finds all the most specific hypotheses which are true of that example. It then takes each hypothesis H and sees whether H is true of the second positive example P_{2}. If so, that is fine. If not, then more general hypotheses, H' are searched for so that each H' is true of both P_{1} and P_{2}. Each H' is then added to the list of possible hypotheses. To construct more general hypotheses, each old hypothesis is taken and the least general generalisations are constructed. These are such that there is no more specific hypothesis which is also true of the two positives. Note that, if we are using a logic representation of the hypotheses, then all that is required to find the least general generalisation is to keep changing ground terms into variables until we arrive at a hypothesis which is true of P_{2}. Because the more specific hypothesis that we used to generalise from was true of P_{1}, then the generalised hypothesis must also be true of P_{1}.
Once this process has been exhausted for the second positive, the FIND-S method takes the enlarged set of hypotheses and does the same generalisation routine using the third positive example. This continues until all the positives have been used. Of course, it is then necessary to start the whole process with a different first positive.
Only once it has found all the possible hypotheses, ranging from the most specific to the most general, does the FIND-S method check how good the hypotheses are at the particular learning task. For each hypothesis, it checks how many examples are correctly categorised as positive or negative, and the hypotheses learned by this method are those which achieve highest predictive accuracy on the examples given to it. Note that because this method looks for the least general generalisations, it is guaranteed to find the most specific solutions to the problem.
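For hypotheses represented as fixed-length tuples of attribute values, the two core operations of FIND-S - checking whether a hypothesis is true of an example, and forming a least general generalisation by turning mismatching ground values into "don't care" symbols - might be sketched as follows (the function names are ours):

```python
def lgg(h1, h2):
    """Least general generalisation of two attribute tuples: keep positions
    where they agree and generalise disagreements to the '?' symbol."""
    return tuple(a if a == b else "?" for a, b in zip(h1, h2))

def covers(hypothesis, example):
    """A hypothesis is true of an example if every non-'?' position matches."""
    return all(h == "?" or h == e for h, e in zip(hypothesis, example))

print(lgg(("h", "c", "n"), ("h", "c", "o")))     # ('h', 'c', '?')
print(covers(("h", "c", "?"), ("h", "c", "n")))  # True
```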
This method is best demonstrated by example. Suppose we have a bioinformatics application in the area of predictive toxicology, as described in the box below.
Drug companies make their money out of developing drugs which can cure, vaccinate against and alleviate the suffering from certain illnesses. For each disease, they have many leads: drugs which may turn out to be useful against the disease. Unfortunately, after some development, it becomes obvious - sometimes as late as human trials - that some of the leads are toxic to humans. Of course, this usually means that the drug will be abandoned and the money spent on developing the drug will have been wasted. Therefore, it is highly advantageous for drug companies to determine, at any stage of development, whether a given new drug will turn out to be toxic: they want to predict toxicity. Because they have examples of drugs in a family similar to one which is under investigation, they can use the drugs which turned out to be toxic as positives, and those which didn't as negatives, and this problem can be stated as a machine learning problem. This is one of the places where Artificial Intelligence overlaps with biology/chemistry, and forms part of the rapidly growing area of bioinformatics.
Suppose further that we have been given 7 drugs, 4 of which are known to be toxic and 3 of which are known to be non-toxic, as drawn below:
[Figure: the seven molecules, with the four positives (toxic) drawn on the left and the three negatives (non-toxic) on the right]
The chemists think that the toxicity might be caused by a substructure of the molecules consisting of three atoms joined by two bonds, for example: c-c-h or c-c-n. Structure, rather than actual chemicals, sometimes plays a more important part in the activity of drugs, so the chemists also suggest that we look for generalisations, i.e., substructures where some of the chemicals are not known, for example: c-?-n or c-?-c.
Hence, to solve this problem, we can use a FIND-S method where the solutions are simply triples of letters < A, B, C > where A, B and C are taken from the set of chemical letters {c, h, n, o, ?}. We include the ? so that we can find more general solutions. For instance, the solution < c, ?, n > means that the agent has learned that the toxic chemicals have a substructure consisting of a carbon atom bonded to something which is in turn bonded to a nitrogen atom, and that the non-toxic chemicals do not. This isn't, of course, a good solution, because it is true of only 2 out of 4 positives and is also true of 1 of the 3 negatives.
To design our search strategy, we start with the simple fact that any concept learned will be true of at least one positive (toxic) drug. If we look at P1, then there are only two triples of atoms in the molecule (if we do not allow a triple to be written backwards):
< h, c, n > and < c, n, o >
We now see whether these substructures are also found in P2, and generalise them if not. Firstly, the structure < h, c, n > is not a substructure of P2. So, we will need to generalise it, and to do this, we should introduce one variable only, in such a way that the generalised structure is found in P2. By generalising only one variable, we will find only the least general generalisations. In this case, only the following generalised substructure is true of P2:

< h, c, ? >
If we now look at < c, n, o > then it is also not found in P2 but it can be generalised to:
< c, ?, o >
which is true of P2. Our set of candidate hypotheses now contains these four: < h, c, n >, < c, n, o >, < h, c, ? > and < c, ?, o >. We now turn to P3 to generalise these further, which gives us these nine possible hypotheses:
< h, c, n >,
< ?, c, n >,
< h, c, ? >,
< h, ?, ? >,
< ?, c, ? >,
< c, n, o >,
< c, ?, o >,
< c, ?, ? > and
< ?, ?, o >
Using P4 to generalise these, we do not get any more possible hypotheses. We now need to check the accuracy of these hypotheses and choose the best. The table below scores the hypotheses in terms of their predictive accuracy over the given examples:
Hypothesis | Solution | Positives true for | Negatives true for | Accuracy
-----------|----------|--------------------|--------------------|----------
1. | < h, c, n > | P1 | N2 | 3/7 = 43%
2. | < c, n, o > | P1 | | 4/7 = 57%
3. | < h, c, ? > | P1, P2, P3 | N1, N2 | 4/7 = 57%
4. | < c, ?, o > | P1, P2, P3 | | 6/7 = 86%
5. | < ?, c, n > | P1, P3, P4 | N1, N2 | 4/7 = 57%
6. | < h, ?, ? > | P1, P2, P3 | N1, N2, N3 | 3/7 = 43%
7. | < ?, c, ? > | P1, P2, P3, P4 | N1, N2, N3 | 4/7 = 57%
8. | < c, ?, ? > | P1, P2, P3, P4 | N1, N2, N3 | 4/7 = 57%
9. | < ?, ?, o > | P1, P2, P3 | N1, N3 | 4/7 = 57%
Hence, the best hypothesis learned by this method is number 4. This hypothesis states that the toxic substances have a submolecule consisting of a carbon atom joined to some other atom, which is in turn joined to an oxygen atom. This correctly predicts the toxicity of 6 out of 7 of the given examples, so it scores 86% for predictive accuracy over the given examples.
Note that, to finish the FIND-S method, the whole procedure would have to be repeated using P2 as the first positive, and generalising using P1, P3 and P4. Once the possible solutions had been collected using this, then the procedure would be repeated again using P3 as the first positive, and so on.
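Since the molecules are only given as drawings, we cannot run the substructure matching itself here, but the final accuracy calculation can be reproduced from the coverage recorded in the table above. A sketch:

```python
# Coverage of each candidate hypothesis, read off the worked example's table:
# (positives the substructure occurs in, negatives it occurs in).
coverage = {
    ("h", "c", "n"): ({"P1"}, {"N2"}),
    ("c", "n", "o"): ({"P1"}, set()),
    ("h", "c", "?"): ({"P1", "P2", "P3"}, {"N1", "N2"}),
    ("c", "?", "o"): ({"P1", "P2", "P3"}, set()),
    ("?", "c", "n"): ({"P1", "P3", "P4"}, {"N1", "N2"}),
    ("h", "?", "?"): ({"P1", "P2", "P3"}, {"N1", "N2", "N3"}),
    ("?", "c", "?"): ({"P1", "P2", "P3", "P4"}, {"N1", "N2", "N3"}),
    ("c", "?", "?"): ({"P1", "P2", "P3", "P4"}, {"N1", "N2", "N3"}),
    ("?", "?", "o"): ({"P1", "P2", "P3"}, {"N1", "N3"}),
}

def accuracy(pos_true, neg_true, n_pos=4, n_neg=3):
    # Correctly categorised = positives the hypothesis is true of,
    # plus negatives it is (rightly) not true of.
    return (len(pos_true) + (n_neg - len(neg_true))) / (n_pos + n_neg)

best = max(coverage, key=lambda h: accuracy(*coverage[h]))
print(best, round(accuracy(*coverage[best]), 2))  # ('c', '?', 'o') 0.86
```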
Disclaimer: Please note that this is an entirely fabricated example. The chemists among you will have no doubt noticed that some of the drugs are not even valid chemicals and the existence of the learned substructure has nothing to do with toxicity.
Many machine learning problems have binary categorisations, where the question is to learn a way of sorting unseen examples into one of only two categories, known as the positive and negative categories. [Note that this is not to be confused with supplying positive and negative examples]. Suppose an agent has learned a method to perform a binary categorisation. Suppose further that it is given an example which it categorises as positive using its learned method. In this case, if the example should have been categorised as negative, then we say this is a false positive: the learned method has falsely categorised the example as positive. Similarly, if the method categorises an example as negative, but this is incorrect, this is a false negative.
In some cases, having a false positive may not be as disastrous as having a false negative or vice versa. For instance, machine learning techniques are used to diagnose whether patients have a particular illness, given their symptoms as background information. Here, it may be the case that the doctors don't mind false positives as much as false negatives, because a false negative means that someone with the disease has been incorrectly diagnosed as not having the disease, which is perhaps more worrying than a false positive. Of course, if the medicine used to treat patients had severe side-effects (or was very expensive), then it is possible that the doctors may prefer false negatives to false positives.
To calculate the predictive accuracy of a particular hypothesis over the set of examples supplied, we simply have to calculate the percentage of examples which are correctly classified as either positive or negative. Suppose we are given 100 positives and 110 negatives and our learning agent learns a hypothesis which correctly categorises 95 positives and 98 negatives. We can therefore calculate that, given any of the examples, positive or negative, the hypothesis has a 92% chance of correctly categorising it. This is because:
(95 + 98)/(100 + 110) = 0.919 (3 d.p).
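The calculation generalises to any counts of correctly categorised positives and negatives; a one-line sketch:

```python
def predictive_accuracy(pos_correct, neg_correct, num_pos, num_neg):
    """Fraction of the supplied examples categorised correctly."""
    return (pos_correct + neg_correct) / (num_pos + num_neg)

# The worked example: 95 of 100 positives and 98 of 110 negatives correct.
print(round(predictive_accuracy(95, 98, 100, 110), 3))  # 0.919
```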
It is very important to remember, however, that this gives us only a weak indication of how likely the hypothesis is of correctly categorising an example it has not seen before. To see this point, think how easy it would be to program an agent to find a hypothesis which correctly classifies all the examples for a particular learning problem. Suppose it was given positives P1, P2, ..., Pn then a "good" hypothesis it could learn would be something like:
A is positive if A is P1 or A is P2 or ... or A is Pn
A is negative otherwise.
This would score 100% in terms of predictive accuracy over the set of examples given. Imagine, however, how badly this hypothesis would perform when used to predict whether a new example was positive or negative: it would always predict that the new example is negative.
A standard machine learning technique is to separate the set of examples into a training set and a test set. The training set is used in order to produce hypotheses, and the test set - which is never seen during the hypothesis forming stage - is used to test the accuracy of the hypothesis in predicting the categorisation of unseen examples. In this way, we can have more confidence that the learned hypothesis will be of use to us when we have a genuinely new example for which we do not actually know the categorisation.
There are various ways in which to separate the data into training and test sets, and established ways by which to use the two sets to assess the effectiveness of a machine learning technique. In particular, we use n-fold cross validation to test the predictive accuracy of machine learning methods over unseen examples. To do this, we randomly partition the set of examples into n equal-sized subsets. A partition of a set is a collection of disjoint subsets whose union is the whole set: every element belongs to exactly one subset.
For each set in the partition, we hold back that set as the test set, and use the examples in the other n-1 subsets to train our learning agent. Once the learning agent has learned a hypothesis to explain the categorisation into positives and negatives over the training set, we determine the percentage of examples in the test set which are correctly categorised by the hypothesis. Each set is held back in turn, and the predictive accuracy over the test set of the learned hypothesis is tested. To produce a final calculation of the n-fold cross validation predictive accuracy, an average over all the percentages is taken.
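This procedure might be sketched as follows, where `learn` and `accuracy` stand for whatever learning method and scoring function are being assessed (both names are placeholders):

```python
import random

def n_fold_cross_validation(examples, n, learn, accuracy):
    """Estimate a learning method's predictive accuracy by n-fold CV.
    `learn(training_set)` returns a hypothesis; `accuracy(h, test_set)`
    returns the fraction of test examples h categorises correctly."""
    examples = examples[:]
    random.shuffle(examples)                    # random partition
    folds = [examples[i::n] for i in range(n)]  # n roughly equal subsets
    scores = []
    for i in range(n):
        test = folds[i]                         # hold back one fold
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        scores.append(accuracy(learn(train), test))
    return sum(scores) / n                      # average over the n runs
```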
For learning methods which produce multiple competing hypotheses for a learning task, in order to perform the cross validation, the method must be forced to choose a single hypothesis after each learning session. As mentioned above, this could be in terms of generality, or in terms of Occam's razor - if two hypotheses have the same predictive accuracy over the training set, then Occam suggests we choose the least complicated one. In cases where everything is equal between hypotheses, a learning method may have to resort to randomly choosing one of a set.
Note that we are assessing the method for learning hypotheses with n-fold cross validation, rather than particular hypotheses learned by the method (because with each test set held back, the method may learn a different hypothesis). The cross-validation measure therefore gives us an estimate of the likelihood that, given all possible data to train on, and given a genuinely unknown example, the method will learn a hypothesis which will correctly categorise the new example.
n-fold cross validation is a useful method when data is limited to, say, a few hundred examples. Often, 10-fold cross validation is used. When the data is even more limited (to fewer than around 30 examples), leave-one-out cross validation is used: this is n-fold cross validation with n equal to the number of examples, so for each test, a single example is left out and the learned hypothesis is tested to see whether it correctly categorises that example. For larger data-sets, cross validation may be unnecessary, and a hold-back method may be employed. In this case, a certain number of examples are held back as the test set, and the learned hypothesis is tested on them. The hold-back set is usually chosen randomly.
One very simple learning method is to look at the training examples, and see which class is larger, positives or negatives, and to construct the hypothesis that an example is always categorised as being a member of the larger class. This trivial method is called majority class categorisation and is a yardstick against which we can test machine learning results: if a method cannot produce hypotheses better than the default categorisation, then it really isn't helping much. We say that a machine learning method is overfitting a particular problem if it produces a hypothesis H, and there is another hypothesis which scores worse than H on the training data, but better than H on the test data. Overfitting is clearly not desirable, and is a problem with all machine learning techniques. Overfitting is often referred to as the problem of memorising answers rather than generalising concepts from them.
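The majority class baseline is simple enough to state directly in code; a sketch (the function name is ours):

```python
from collections import Counter

def majority_class(training_examples):
    """Majority class categorisation: always predict whichever category
    is most common among the (example, category) training pairs."""
    counts = Counter(category for _, category in training_examples)
    majority = counts.most_common(1)[0][0]
    return lambda example: majority  # ignores the example entirely

train = [("e1", "pos"), ("e2", "pos"), ("e3", "neg")]
h = majority_class(train)
print(h("unseen"))  # pos
```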
Machine learning and statistics have much overlap. In particular, some AI techniques (in particular neural networks) can be seen as statistical learning techniques. Also, machine learning draws heavily on statistics in the evaluation of techniques given notions about the data being used. Moreover, certain machine learning techniques, for instance ILP, draw from the statistical theory of probability distributions. And there are many statistical methods which perform prediction tasks such as those undertaken by learning algorithms.
The remaining lectures on machine learning can be described in terms of the representation over which the learning will take place.
Lecture 11: Decision trees.
Lecture 12: Neural Networks.
Lecture 13: Logic Programs.
Some other representations which are very popular in AI are Bayes nets, Hidden Markov Models and Support Vector Machines. Unfortunately, we don't have time to cover them in this course.