A very simple introduction to machine learning…

Machine learning, quite simply, refers to the use of algorithms to make predictions and solve problems based on the properties of things and how those properties relate to an outcome.

The nature of problems

Properties themselves can take a variety of forms. Sometimes they are categories – whether a person has a disease or holds a higher education qualification, for instance; sometimes they are continuous – such as height or weight. The same goes for outcomes: you could ask whether someone is rich or poor based on their qualifications, or predict a continuous outcome such as overall income.

The outcomes that are subject to prediction are called response variables, and the measurements or groupings that samples fall into are called features. Machine learning is, in essence, the practice of building models that can predict the response using rules fitted on the features.

Problems where the response variable consists of groups are called classification problems, because things are being classified; where the outcome is continuous, it is a regression problem. The kinds of classification or regression rules available depend on the nature of the data – some methods, like linear regression, can handle both categorical and continuous feature types, whereas others require everything to be converted to continuous variables first; there is, simply, a bewildering array of models that can be applied.
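
To make the distinction concrete, here is a minimal sketch in Python – it assumes scikit-learn is installed, and the datasets and models are just illustrative choices, not recommendations:

```python
from sklearn.datasets import load_iris, load_diabetes
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: the response is a set of groups (here, iris species).
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:3]))  # predicted class labels

# Regression: the response is continuous (here, a disease progression score).
X, y = load_diabetes(return_X_y=True)
reg = LinearRegression().fit(X, y)
print(reg.predict(X[:3]))  # predicted continuous values
```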

The underlying assumption

The very effectiveness of machine learning is predicated on there being non-random patterns in the data with respect to the measurements of the features being considered. If there is no link between the features you have chosen to build a model on and the problem you are trying to address, you will see lousy performance.

Measuring performance and overfitting

For classification problems, performance is often measured in terms of accuracy – and if there are only two classes, measures like sensitivity and specificity are often used as well. In regression problems, performance is usually measured by how much of the variation in the outcome a model explains, and by how large the error is between the predicted outcome and the actual outcome.
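
Here is a small sketch of how these measures are computed – the labels and values are made-up toy data, purely for illustration:

```python
# Toy labels for a two-class problem (1 = has the disease, 0 = does not).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

# Count the four outcomes a two-class classifier can produce.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)  # fraction of all predictions that were right
sensitivity = tp / (tp + fn)        # fraction of true positives that were found
specificity = tn / (tn + fp)        # fraction of true negatives that were found

# Toy regression case: variation explained (R^2) and prediction error (RMSE).
y_obs = [10.0, 12.0, 9.5, 14.0]
y_hat = [11.0, 11.5, 10.0, 13.0]
mean_obs = sum(y_obs) / len(y_obs)
ss_res = sum((o - h) ** 2 for o, h in zip(y_obs, y_hat))
ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)
r_squared = 1 - ss_res / ss_tot      # proportion of variation explained
rmse = (ss_res / len(y_obs)) ** 0.5  # typical size of the prediction error

print(accuracy, sensitivity, specificity, r_squared, rmse)
```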

It is important that performance be measured in a way that minimises overfitting. Overfitting refers to a model fitting your data too well – learning inconsistencies, noise included, that are present within a particular dataset but do not hold outside it. This can give you a spurious measure of how good your model is and make it look better than it actually is.
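
A quick sketch of what this looks like in practice, again assuming scikit-learn – an unconstrained decision tree effectively memorises its training data, so its accuracy there looks far better than on data it never saw (held out using the remedy described next):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A tree with no depth limit can carve out a leaf for every training sample.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("training accuracy:", tree.score(X_train, y_train))  # ~1.0, spuriously good
print("held-out accuracy:", tree.score(X_test, y_test))    # noticeably lower
```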

To account for this, it is best to train on one proportion of the data and test on another that the training process never saw – either by leaving out a portion at the outset, or by using an altogether independent dataset. Alternatively, you can use cross-validation: a certain percentage of samples is held out, the model is trained on the remaining data, and the held-out portion is used as a test set; this is repeated over and over to get more reliable estimates of performance.
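
And a minimal cross-validation sketch, once more assuming scikit-learn – cross_val_score repeatedly holds out one fold, trains on the rest, and scores the model on the held-out fold:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
print(scores)         # one accuracy estimate per held-out fold
print(scores.mean())  # averaged estimate of performance
```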

Where can I learn more about machine learning?

There is a very nice introductory LinkedIn tech talk here: http://www.youtube.com/watch?v=wjTJVhmu1JM

And here is a full set of lectures that are a treat

I might do a set of posts looking at very particular applications in the next few months; until then, feel free to knock yourself out.
