Notes on lecture 5

Genetic algorithms

First of all, it should be noticed that genetic algorithms are an optimization technique, which can be used is several ways. For example, we can learn the other model like Bayesian network or Neural network structure by genetic algorithms (The population consist of different graph sructures with optimal parameters and the fittness is measured by some common score function. The problem is to decide, how the best individuals are crossed and the result mutated.) However, here we were interested in how to solve the problem itself by genetic algorithms. The restriction is that we should be able to encode the possible solutions as "chromosomes" and define a fittness function for them.

The nice thing in genetic algorithms is that we don't have to be able to model the problem explicitely, but instead let the genetic algorithm to find a good solution directly. Genetic algorithms work especially well, when

Search space is large, complex or poorly understood
Domain knowledge is scarse or not available
No mathematical analysis of the problem is available
Traditional search methods fail

A natural application area is simulation of living processes, which evolve in time. E.g. we can model development of genes, behaviour of immune system, and several ecological phenomena. A totally different application area is robotics. The basic problem is robot navigation in an unknown environment. The robot is given the starting point and the goal and it should plan a sequence of actions, which leads to the desired goal. An extra problem is that the environment can contain moving objects, which the robot should avoid. This is a really interesting application - a nursing robot!

Case-based reasoning

Case-based reasoning is similar to genetic algorithms in the sense that it doesn't construct any explicit model. It belongs to larger group of learning methods called instance-based learning, and its cousin is the famous k-nearest neihbours method. These methods are sometimes called "lazy learning", because they try to postpone learning new things as long as possible. Instead, they try to solve problems by old solutions to similar problems. Only when this is not possible, they try to construct new solutions from the old ones.

The main problem in case-based reasoning is how to combine or adapt the old solutions? Maybe we could use genetic algorithms for this?! Usually the rules are defined by system designer and the system cannot generate any creative answers.

Case-based reasoning is often compared to human way of learning (see e.g. here) and thus it has been suggested that they would suit especially well for intelligent tutoring systems.

Some material

Hidden Markov models

Hidden Markov models seemed to be so difficult topic that I try explain more.

What is a HMM?

A Hidden Markov model is a temporal probabilistic model - i.e. it models dynamic behaviour of system.

The model consists of following components:

A set of hidden states Q1,...,Qn. These are the actual states of the process, but they cannot be observed directly.
A set of observations O1,...,Om. These can be observed or measured.
Transition probabilities P(Qj|Qi), which tell the probability to move from state Qi to Qi for all i,j=1,...,n.
Observation probabilities P(O_i|Q_j), which tell the probability to observe Oi when the actual state is Qj.

In each time step t, the model in is some hidden state Q[t] and we get observations O[t]. Now we make three assumptions:

(1st order) Markov assumption: P(Q[t+1]|Q[1],Q[2],...,Q[t])=P(Q[t+1]|Q[t]). I.e. the current state depends on only the previous state. In some applications, it is convenient to enlarge this assumption so that the current state depends on k previous states (=kth order HMM).
Stationary assumption: P(Q[t1+1]=Qi|Q[t1]=Qj)=P(Q[t2+1]|Q[t2]) in all time steps t. I.e. the transition probabilities are always same, independetn of time.
Output independence assumption: P(O[t+1]|Q[t],O[t])=P(O[t+1]|Q[t]). I.e. the current observation is independent from previous observations and depends on only the current state.

Now Yuriy's point (if I understood it correctly) on my artificial example was that in different courses we have (probably) different transition probabilities. This violates assumption 2. Thus we would need different models for each course. In addition, it is possible that 1st order model is not sufficient, but we would need a higher order HMM.

Notice that assumption 3 holds rarely in practice, and the predictions can be corrupted. We can enlarge the model by allowing dependencies between observations, but the resulting model is more complex to learn and use.

What we can do with a HMM?

Given a Markov model, we can solve several interesting problems:

How probable is the current state? We are given all observations in the past, and want to know, how common or exceptional the current state is. I.e. we calculate posteriot probability P(Q[t]|O[1],...,O[t]).
Predict the future state: Once again we have all observation so far, but now we want to predict a future state after k time steps: P(Q[t+k]|O[1],...,O[t]).
Reason about past: like the previous cases, but now we want to predict the most probable state k time step ago: P(Q[t-k]|O[1],...,O[t]).
Find explanation: given a sequence of observations O[1],...,O[t], what is the most probable state sequence Q[1],...,Q[t], which has produced it?

How can we get a HMM?

The most interesting question is how to construct a HMM? Unfortunately, it is a very difficult problem. Of course we can construct different kind of models and compute the probability that the current model has produced our data (sequence of observations). (See review task in lecture material.)

If the model structure (state and observation variables) is fixed, we can learn the model parameters quite simply by forward-backward algorithm or Baum-Welch algorithm described in lecture notes.

Finally, could we learn the model structure also from data? Especially, how many hidden states would we need? And which observation variables we should select? Bayesian learning has been suggested as a solution, i.e. we select a model M such that P(M|X) is maximal, given data X. I don't know, if it has been applied in practice? Do you know any other solutions?

How can we use HMMs in expert systems?

As already noticed, HMMs suit especially for pattern recognition purposes, i.e. the low-level elements of ES. For example, they can help a lot in speech recognition, where phonemes have temporal order. (In a simplest form, we define a model for each word and ask which model is the most probable, given the sequence of observations -- i.e. the recorded sounds). Nowadays, they have become popular in bioinformatics, e.g. in recognizing DNA sequences. Russell and Norvig sketch an interesting application for diabetes control system. In this model, we have BloodSugar and StomachContent as hidden states, MeasuredBloodSugar and PulseRate as observations and FoodIntake and InsulinDose as actions. Now the doctor or patient her/himself can predict, when s/he should take insulin or eat something, given the observations and actions in the past.

It is told that Andrei Markov developed Markovian models originally to analyze Pushkin's Eugene Onegin. I don't know, if the original model was Markov chain, hidden Markov model (hardly) or something else, but maybe some of you could find it out!