Notes on lecture 4

TM-systems

An ATMS can be interpreted as multiple JTMSs. While a JTMS calculates which propositions are believed or disbelieved based on our assumptions and justifications, an ATMS calculates, for every proposition, all possible sets of assumptions under which the proposition is believed (given the justifications). Thus the ATMS does more work, but once it has calculated everything, queries are fast.
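To make the difference concrete, here is a minimal Python sketch (the data structures and function names are my own, purely illustrative, not a standard ATMS API) that computes, for a tiny justification network, the assumption sets under which a proposition holds. A real ATMS would additionally keep the labels minimal and filter out environments known to be inconsistent.

    # A JTMS keeps one belief status per node; an ATMS keeps, for every node, a label:
    # the assumption sets under which the node is derivable from the justifications.

    # Justifications: node -> list of antecedent sets (assumptions or other nodes).
    justifications = {
        "q": [{"A1"}, {"A2", "p"}],   # q follows from {A1}, or from {A2, p}
        "p": [{"A3"}],                # p follows from {A3}
    }
    assumptions = {"A1", "A2", "A3"}

    def label(node, just):
        """Collect assumption sets under which `node` is derivable (naive, exponential)."""
        if node in assumptions:
            return [{node}]
        result = []
        for antecedents in just.get(node, []):
            envs = [set()]                 # combine one environment per antecedent
            for a in antecedents:
                envs = [e | la for e in envs for la in label(a, just)]
            result.extend(envs)
        return result

    print(label("q", justifications))      # e.g. [{'A1'}, {'A2', 'A3'}]

A JTMS-style query would only tell whether q is currently in or out under one set of assumptions; the label above answers the query for all assumption sets at once, which is where the extra work (and the fast queries afterwards) comes from.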

Traditional TM-systems do not differ much from rule-based systems; they just offer a nice visual presentation of the structure of beliefs, and efficient and sound algorithms for updating beliefs. In principle, they can be used like decision trees defined by an expert. However, nowadays the designers of expert systems demand more: 1) we would like to represent uncertain information and make probabilistic predictions, and 2) we would like to learn the model from data. Can TM-systems meet these challenges?

Probabilistic extensions of ATMS have been suggested in the literature:
Laskey and Lehner: "Assumptions, beliefs and probabilities", Artificial Intelligence, vol. 41, no. 1, pp. 65-77, 1989. (In fact, they show that any Bayesian or Dempster-Shafer model can be represented by an extended ATMS.)
de Kleer and Williams: "Diagnosing multiple faults", Artificial Intelligence, vol. 32, no. 1, pp. 97-130, 1987.
d'Ambrosio: "Truth maintenance with numeric certainty estimates", Proceedings of the 3rd IEEE Conference on AI Applications, pp. 244-249, 1987.

A nice feature of the ATMS is its ability to maintain consistent knowledge. The database itself is not required to be consistent: the ATMS can find a consistent context for every belief and, in addition, report the reasons for an inconsistency efficiently. For this reason, it has been suggested that the ATMS could be used in constructing e.g. Bayesian networks, where expert knowledge and data can be contradictory.
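As a rough illustration of this point, the sketch below (hypothetical names, not taken from any of the papers above) checks a context of assumptions against recorded nogoods and reports which nogoods make it inconsistent.

    # Nogoods are assumption sets known to lead to a contradiction; a context is
    # consistent if it contains no nogood as a subset. Names are illustrative only.

    nogoods = [{"A1", "A2"}]      # e.g. assuming A1 and A2 together is contradictory

    def consistent(context, nogoods):
        """A context (set of assumptions) is consistent iff it includes no nogood."""
        return not any(ng <= context for ng in nogoods)

    def explain_inconsistency(context, nogoods):
        """Return the nogoods responsible for the inconsistency (the 'reasons')."""
        return [ng for ng in nogoods if ng <= context]

    print(consistent({"A1", "A3"}, nogoods))                    # True
    print(explain_inconsistency({"A1", "A2", "A3"}, nogoods))   # [{'A1', 'A2'}]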

Task: can you find applications of TM systems in current expert systems?

Bayesian networks and Naive Bayes models

Expert knowledge can be combined with data in model construction in the following way: the conditional probabilities are learnt from data, but the prior probabilities are assigned by an expert. This is especially convenient when we know only some rules (dependencies between attributes, measured by conditional probabilities), but the prior probabilities are not known. For example, suppose we want to predict the course outcome O (passed or failed) based on exercise points in three categories A, B, and C. From the previous year's course data we have learnt the probabilities P(O|A,B,C), but this year the course is more difficult. Thus we cannot use the previous year's estimates for P(A), P(B) and P(C). Instead, an expert (the teacher) can assign them values that take this change in difficulty into account (expecting that students get on average fewer points than in the previous year). In the same way, we could take into account information that this year's students are more competent than the previous year's.
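As a hedged sketch of this combination (the numbers, and the simplifying assumption that A, B and C are binary and a priori independent, are mine, purely for illustration), the learned table P(O|A,B,C) can be combined with the expert's priors to give an updated estimate of P(O=pass):

    from itertools import product

    # Hypothetical learned conditionals P(O=pass | A, B, C) for binary A, B, C
    # (1 = many exercise points in the category, 0 = few); values are made up.
    p_pass_given = {
        (1, 1, 1): 0.95, (1, 1, 0): 0.85, (1, 0, 1): 0.80, (1, 0, 0): 0.60,
        (0, 1, 1): 0.75, (0, 1, 0): 0.50, (0, 0, 1): 0.45, (0, 0, 0): 0.15,
    }

    # Expert-assigned priors for a harder course: students are expected to get
    # fewer points than last year, so these are set lower than last year's frequencies.
    p_a, p_b, p_c = 0.50, 0.40, 0.45

    def prior(p, v):
        return p if v == 1 else 1 - p

    # P(O=pass) = sum over (a,b,c) of P(pass|a,b,c) * P(a) * P(b) * P(c),
    # assuming (in this sketch only) that A, B and C are independent a priori.
    p_pass = sum(
        p_pass_given[(a, b, c)] * prior(p_a, a) * prior(p_b, b) * prior(p_c, c)
        for a, b, c in product((0, 1), repeat=3)
    )
    print(round(p_pass, 3))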

An even better approach is to use a Naive Bayes model: learn the probabilities P(A|O), P(B|O), and P(C|O) from data and estimate P(O) according to our domain knowledge. Now it does not matter so much if our prior estimate is wrong, because we can update it three times based on our observations of A, B and C (see the sketch after the conditions below). However, A, B and C should now be conditionally independent given O. That is, they can depend on each other indirectly, through O, but when O is fixed, e.g. O=pass, then A, B and C should be independent. Conditional independence holds, if

P(A|O) = P(A|O,B) = P(A|O,C) = P(A|O,B,C)
P(B|O) = P(B|O,A) = P(B|O,C) = P(B|O,A,C)
P(C|O) = P(C|O,A) = P(C|O,B) = P(C|O,A,B)

for O=pass and O=fail.
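Under these conditions the posterior is simply P(O|A,B,C) ∝ P(O)·P(A|O)·P(B|O)·P(C|O). A minimal Naive Bayes sketch of this update (binary evidence variables and made-up numbers, for illustration only):

    # The expert supplies the prior P(O); the conditionals come from course data.
    p_o = {"pass": 0.6, "fail": 0.4}

    # Learned conditionals P(X=1 | O) for the binary evidence variables A, B, C.
    p_given_o = {
        "A": {"pass": 0.80, "fail": 0.30},
        "B": {"pass": 0.70, "fail": 0.40},
        "C": {"pass": 0.75, "fail": 0.35},
    }

    def posterior(evidence, p_o, p_given_o):
        """P(O | evidence) via P(O) * prod_X P(X|O), followed by normalisation."""
        scores = {}
        for o, prior in p_o.items():
            score = prior
            for var, value in evidence.items():
                p1 = p_given_o[var][o]
                score *= p1 if value == 1 else 1 - p1
            scores[o] = score
        z = sum(scores.values())
        return {o: s / z for o, s in scores.items()}

    # Even if the prior P(O) is somewhat off, each of the three observations updates it.
    print(posterior({"A": 1, "B": 0, "C": 1}, p_o, p_given_o))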

What is the complexity of a (general) Bayesian network? Suppose that all variables have only Boolean values (1, 0). If we have n+1 nodes, the worst case is when one node X depends on all the other n nodes A1,...,An. Then we have to define P(X|A1,...,An) for all truth value combinations (A1,...,An), i.e. 2^n probabilities. The complement probabilities P(~X|A1,...,An) are obtained from these by P(~X|A1,...,An) = 1 - P(X|A1,...,An). In addition, we have to define the prior probabilities P(Ai), i.e. n more probabilities (and once again, the complements are obtained from them). So we have to define 2^n + n probabilities in total. As a rule of thumb, each parameter estimate should be based on at least five instances (data rows). Thus, we would need at least 5*2^n rows of data. If you have many attributes, you will also need a lot of data! Notice also that the more complex the model, the more demanding the reasoning.
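This counting is easy to check mechanically; the short sketch below (a plain illustration of the formulas above, nothing more) prints the 2^n + n parameters and the 5*2^n data rows for a few values of n:

    # Worst case described above: one Boolean node X with n Boolean parents A1,...,An
    # needs 2^n entries for P(X | A1,...,An) plus n prior probabilities P(Ai).

    def parameters_needed(n):
        return 2**n + n

    def rows_needed(n, per_parameter=5):
        # Rule of thumb: at least five data rows per parameter in the largest table.
        return per_parameter * 2**n

    for n in (3, 5, 10, 20):
        print(n, parameters_needed(n), rows_needed(n))
    # n=10 already needs 1034 parameters and 5120 rows; n=20 needs over a million parameters.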