Some notes on exercises 2

Task 1

Score function: simply a function that measures the goodness of a model. Usually it measures the lack of fit: it compares the correct and predicted values and calculates the total error. In addition, the score function can contain terms that penalize overly complex models or give preference to more useful models (according to the utility in the given problem).

For example, in our mouse trap task the natural choice for the score function is the classification rate: the number of correctly classified rodents out of all rodents. However, we have also defined a utility: the system should be safe for Eliomys (it should never misclassify them). In addition, we could favour simple models, because we have a small data set (to avoid overfitting). The problem is how to combine all these goals into one score function! One possibility is sketched below.
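As an illustration only (not the official answer), one could give a misclassified Eliomys such a heavy penalty that the model can never be accepted, and charge a small cost per model parameter. The sketch below assumes this; the weight 0.05 and all names are my own choices:

def combined_score(true_labels, predicted_labels, n_parameters, eliomys_label,
                   complexity_weight=0.05):
    """A hypothetical combined score function (higher is better).

    The classification rate rewards overall accuracy, a misclassified
    Eliomys makes the score minus infinity (hard safety constraint),
    and a small penalty per model parameter favours simple models.
    """
    # Hard constraint: the trap must never misclassify a protected Eliomys.
    for t, p in zip(true_labels, predicted_labels):
        if t == eliomys_label and p != eliomys_label:
            return float("-inf")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels) - complexity_weight * n_parameters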

Overfitting means that the model has adapted to the current training set so well that it doesn't generalize to any other data. This happens typically when we have a small and possibly exceptional training set (e.g. the training data contains several outliers and doesn't represent the typical population). Another reason is too complex a model structure. (Complex models are always less general than simple models.)

For example, if our training set in task 2 had contained only baby mice and voles, we would have learnt quite an erroneous model, even if it had predicted all rodents in our training set correctly.

Inductive bias: This is a difficult but important concept in modelling. It simply means a set of assumptions (concerning the data and the problem) that restricts the set of possible models or prefers some models to others. Under these assumptions the model should work and generalize to new data points (outside the training set). On the other hand, if these assumptions are not met, the model may not work at all!

For example, in linear regression the inductive bias consists of the assumptions that 1) the data has a linear tendency, 2) the independent variables are independent of each other, and 3) the data represents the whole population well. (The same assumptions can be expressed in other ways, too, e.g. as properties of the residuals = errors between the real and predicted Y values.) A small sketch of checking the residuals is given below.
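Just to show what the residual view means in practice, here is a minimal sketch (the function and variable names are mine; it only assumes numpy):

import numpy as np

def fit_line_and_residuals(x, y):
    """Fit y = a*x + b by least squares and return slope, intercept and residuals.

    If the linearity assumption holds, the residuals should look like random
    noise around zero; a clear pattern in them suggests that the inductive
    bias of linear regression is violated.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a, b = np.polyfit(x, y, deg=1)   # least-squares line
    residuals = y - (a * x + b)      # real Y minus predicted Y
    return a, b, residuals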

In decision trees the assumptions are 1) each data point belongs to exactly one class, and 2) simpler models are better than complex ones (if you put the most discriminative attribute near the root, the tree is shorter).

Concept map: Many of you made great concept maps. If you didn't make one at all, or it was very simple, I recommend doing it now. These are very important concepts and principles that we will need in the future (and all the time). It is also a good idea to give names to the relations - it explains why the concepts are connected and helps you to improve your model. When you draw the concept map yourself, you at the same time arrange the things in your mind and learn them!

Task 2

How good is the model? The classification rate is 10/14 (about 71%), which is not much, but the system is safe for the protected Eliomys. If we add the outlier, the resulting model is no longer safe for them.

Improving the model: First of all, we should get more data. However, this small data set already reveals that linear regression does not work very well. We should either add new features (e.g. colour) or try another kind of model. If we have only a small set of data, it is not wise to increase the number of attributes (the model would only overfit the data). In fact, the body/tail ratio classifies most of the rodents correctly. We could construct a decision tree with a "regression node" (which computes a function of the attributes body and tail); a small code sketch of these rules follows them:

If body/tail > 2.2 and < 2.5, then the species is 2
If body/tail > 1.1 and < 1.6, then the species is 4
If body/tail < 1.1, then
    if body > 10 cm, then the species is 5
    else the species is 3
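To make the rules concrete, the same tree can be written as a small function. This is only a sketch: the species codes come from the rules above, and ratios that the rules do not cover are left undecided (None):

def classify(body, tail):
    """Decision tree with a 'regression node' on the body/tail ratio.

    body and tail are lengths in cm; returns a species code from the
    rules above, or None for ratios the rules do not cover.
    """
    ratio = body / tail
    if 2.2 < ratio < 2.5:
        return 2
    if 1.1 < ratio < 1.6:
        return 4
    if ratio < 1.1:
        return 5 if body > 10 else 3
    return None  # the rules above leave the remaining ratios undecided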
   

Evaluation

Task 1: each concept 1/2 p, concept map 1.5 p.
Task 2: max 3 p; -1 p if the model was not constructed with the outlier; -1/2 p if the goodness of the model (classification error) was not estimated. Other shortcomings typically -1/2 p, but other good ideas could compensate for them.