Outlier detection

Outlier samples can occur in datasets for a number of reasons, which are usually categorised as either technical or biological. If the biological material is cell cultures, the reason is likely to be technical. If the biological material is human biopsies, the reason can be either technical (something went wrong with one of the samples during preparation) or biological (e.g. a medical condition). If the biopsy is taken from the same part of the tissue for all but one individual, then that one sample could be an outlier due to its cell type composition. This could be termed a technical outlier because the sample was taken from the wrong place, or a biological outlier because the cell type composition is different. Most likely, all samples from biopsies will have slightly different cell type compositions, meaning that all samples could be termed biological outliers. Since this does not help us proceed, such a dataset is usually described as having a lot of individual variance. In these cases we would keep all the samples, knowing that we will probably only be able to identify large changes in the dataset.

The underlying assumption when analysing microarray data is that the distribution of signals is the same for all of the arrays, or at least that the samples are reasonably similar. A sample that behaves differently from the other samples is termed an outlier sample. Outlier samples may cause problems during data analysis, and you should consider leaving them out of downstream analysis. Outlier detection is often based on distance measures, clustering and spatial methods. As you can see from the examples below, sometimes it is clear that a sample is an outlier, while at other times it is more difficult to decide. It is usually a good idea to evaluate the data with several different methods and plots. Here we will show some examples and try to give some guidelines for when to consider excluding a sample from the dataset. Important: If you remove a sample from the dataset, you must report that you removed it and give your reason for doing so!

Short guidelines:

1. Use different plots to look for single outlier samples.
2. If you can find a technical reason explaining why a sample is an outlier, it should be removed from the dataset.
   a. Look at the quality of RNA, labelling and hybridisation for this sample. This was probably evaluated during the experiment, but you should still go back and take another look to see if anything makes this sample deviate from the others at these levels.
   b. If all looks good at step 2a, then check the notes for other types of technical deviations, e.g. was one sample left on the bench for longer than the others?
3. If you cannot find any technical reason why the sample behaves differently, there may be a biological reason. If so, you should try to identify what it could be before deciding whether to remove the sample or not.
   a. Check any type of meta-data that may be available for the sample, e.g. age, sex, medical history.
4. If you cannot find any reason why a sample behaves differently, then do not remove it, unless it is so different that something is clearly very wrong.

Example:

Two-colour microarray

Boxplot

A boxplot is used to compare the distribution of log2 ratios for the different samples. The samples look like they have similar distributions, although some of them, e.g. sample "hyb 24", have thicker tails than the others.
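The idea of "thicker tails" can also be expressed as a number. The following is a minimal sketch, not the procedure used for the plot above: it computes the usual boxplot quantities per sample and the fraction of spots that fall outside the whiskers. The function name `tail_summary`, the 1.5 x IQR whisker rule and all the data are assumptions made for illustration; the data are simulated and are not the "hyb" samples.

```python
import random
import statistics


def tail_summary(log_ratios):
    """Return boxplot quantities for one sample's log2 ratios."""
    q1, median, q3 = statistics.quantiles(log_ratios, n=4)
    iqr = q3 - q1
    # Conventional whisker limits: 1.5 * IQR beyond the quartiles.
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outside = sum(1 for x in log_ratios if x < lower or x > upper)
    return {
        "median": median,
        "iqr": iqr,
        "tail_fraction": outside / len(log_ratios),
    }


# Simulated log2 ratios (NOT the data from the example above).
rng = random.Random(0)
samples = {
    f"sim {i}": [rng.gauss(0.0, 0.5) for _ in range(2000)]
    for i in range(1, 6)
}
# One simulated sample with heavier tails: 20% of its spots are drawn
# from a much wider distribution.
samples["sim heavy"] = [
    rng.gauss(0.0, 0.5 if rng.random() < 0.8 else 2.0) for _ in range(2000)
]

summaries = {name: tail_summary(values) for name, values in samples.items()}
for name, s in summaries.items():
    print(
        f"{name:10s} median={s['median']:+.2f} "
        f"IQR={s['iqr']:.2f} tail fraction={s['tail_fraction']:.3f}"
    )
```

A sample whose tail fraction is clearly larger than that of the others is a candidate for closer inspection, not automatically an outlier.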

CA plot

Next we make a correspondence analysis (CA) plot and see that there are two samples that may be regarded as outliers. The sample named "hyb 24" looks like a clear outlier, while "hyb 12" is less clear.
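For readers who want to see what lies behind such a plot, here is a minimal sketch of correspondence analysis using NumPy. It is not the software that produced the plot above, which the text does not name. It assumes a non-negative genes-by-samples table (e.g. intensities), because correspondence analysis is defined for non-negative data; the function name `correspondence_analysis` and the simulated table are hypothetical.

```python
import numpy as np


def correspondence_analysis(table):
    """Return principal coordinates of the columns (samples) and the
    principal inertias of a non-negative genes-by-samples table."""
    table = np.asarray(table, dtype=float)
    P = table / table.sum()      # correspondence matrix
    r = P.sum(axis=1)            # row masses (genes)
    c = P.sum(axis=0)            # column masses (samples)
    # Standardised residuals from the independence model r * c^T.
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    # Column principal coordinates: D_c^{-1/2} V Sigma.
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
    return col_coords, sv ** 2


# Simulated intensities (NOT the data from the example above).
rng = np.random.default_rng(0)
n_genes, n_samples = 500, 8
base = rng.lognormal(mean=6.0, sigma=1.0, size=n_genes)
table = base[:, None] * rng.lognormal(0.0, 0.1, size=(n_genes, n_samples))
table[:100, 7] *= 3.0            # make the last simulated sample aberrant

names = [f"sim {i + 1}" for i in range(n_samples)]
coords, inertias = correspondence_analysis(table)

# Distance from the origin in the plane of the first two axes,
# i.e. the plane that a CA plot normally shows.
dist = np.sqrt((coords[:, :2] ** 2).sum(axis=1))
most_extreme = names[int(np.argmax(dist))]
print(most_extreme)
```

In the plot, a sample that lies far from the cloud formed by the other samples is a candidate outlier; the distance computed here is only one simple way of summarising that.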

Hierarchical clustering

With hierarchical clustering (using Pearson correlation as the distance measure), we see that "hyb 11" and "hyb 24" do not cluster with the other samples, but instead form two singletons on their own.
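A comparable clustering can be sketched with SciPy. This is an illustration under stated assumptions, not the analysis shown above: the distance is taken to be 1 minus the Pearson correlation, the linkage method is average linkage, and the tree is cut into a fixed number of clusters. The text specifies none of these details, and the helper `singleton_samples` and the data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def singleton_samples(data, names, n_clusters=2, method="average"):
    """Cluster samples (columns of `data`) on 1 - Pearson correlation
    and return the samples that end up alone in their cluster."""
    corr = np.corrcoef(data, rowvar=False)   # sample-by-sample correlation
    dist = 1.0 - corr
    np.fill_diagonal(dist, 0.0)              # remove rounding noise
    tree = linkage(squareform(dist, checks=False), method=method)
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    sizes = np.bincount(labels)
    return [name for name, lab in zip(names, labels) if sizes[lab] == 1]


# Simulated log2 ratios (NOT the data from the example above):
# eight samples share a common per-gene signal, the ninth does not.
rng = np.random.default_rng(1)
n_genes = 300
signal = rng.normal(0.0, 1.0, size=n_genes)
data = signal[:, None] + rng.normal(0.0, 0.3, size=(n_genes, 8))
data = np.column_stack([data, rng.normal(0.0, 1.0, size=n_genes)])

names = [f"sim {i + 1}" for i in range(data.shape[1])]
singletons = singleton_samples(data, names, n_clusters=2)
print(singletons)
```

The number of clusters at which the tree is cut is a free choice, so a singleton found this way should be checked against the dendrogram and the other plots.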

Conclusion: Sample "hyb 24" shows up as an outlier in all of the plots and is therefore regarded as an outlier.
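The reasoning behind this conclusion, keeping track of which samples each method flags and looking at the overlap, can be written down as a small sketch. The sample names are those mentioned in the text above; for the boxplot only "hyb 24" is named, so only that sample is listed. The variable names are illustrative.

```python
# Samples flagged by each method, as named in the example above.
flagged = {
    "boxplot": {"hyb 24"},
    "CA plot": {"hyb 24", "hyb 12"},
    "hierarchical clustering": {"hyb 11", "hyb 24"},
}

# Flagged by every method: strong candidates for being outliers.
consistent = set.intersection(*flagged.values())
# Flagged by at least one method but not all: need closer inspection.
partial = set.union(*flagged.values()) - consistent

print(sorted(consistent))  # ['hyb 24']
print(sorted(partial))     # ['hyb 11', 'hyb 12']
```

Even for a sample flagged by every method, the guidelines above still apply: look for a technical or biological reason before removing it.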