Preparing your data for analysis

If you are bringing your data along to a consultant, there are several things you might consider doing before you meet. Careful attention to the issues listed below will mean that you can get down to the business of data exploration and analysis much more quickly!



Consultants in the SCC work with a variety of software packages, but if you don't have access to a statistical package you can enter your data on an Excel spreadsheet. Even if you have access to a statistical package, it can be useful to enter the data in Excel first.

You may have your data set up in Excel in a way you find easy to use and understand. This doesn't always suit the requirements of a statistical analysis, so you may need to set up a separate Excel file to give to the consultant.

Here are a few tips about entering your data in Excel for a consultant:

Errors in the data

[Figure: error in data spreadsheet]

Before a consultant can start on serious analysis of your data, it is vital to be confident that the data are "clean". There is often some confusion about what this means.

Errors can arise from data entry or from known deviations from a study protocol.

Data entry errors often arise from a simple slip in typing, which careful checking can pick up once the data are entered. Gross errors are usually easy to detect, but other errors may be less apparent. You might record the results from one person (or case) against the next person in the data file; this has happened, and it is easy to do when you are transcribing results from one source to another. You might record the results for one variable in the column meant for another variable, or even enter the same variable in two different columns. Errors like these can be, and have been, identified once data analysis is underway; however, it is better to avoid them by checking your data entry as you proceed.
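If you are comfortable with a little scripting, some of these checks can be automated. The following is only a minimal sketch in Python, not a procedure the SCC prescribes: the column names, the plausible ranges and the function name find_entry_problems are all hypothetical, and the rows are made-up illustrative values. It looks for values outside a plausible range, consecutive cases with identical results, and pairs of columns that hold identical values.

```python
from itertools import combinations


def find_entry_problems(rows, limits, id_col="id"):
    """Return a list of messages describing possible data entry errors.

    rows   -- list of dicts, one per case (hypothetical layout)
    limits -- {column: (lowest plausible, highest plausible)}, assumed by you
    """
    problems = []
    measured = [c for c in rows[0] if c != id_col]

    # Gross errors: values outside the plausible range for a variable.
    for row in rows:
        for col, (low, high) in limits.items():
            value = row[col]
            if value is not None and not low <= value <= high:
                problems.append(
                    f"{row[id_col]}: {col}={value} is outside {low} to {high}"
                )

    # Results from one case possibly recorded again for the next case.
    for prev, curr in zip(rows, rows[1:]):
        if all(prev[c] == curr[c] for c in measured):
            problems.append(f"{curr[id_col]}: same results as {prev[id_col]}")

    # The same variable possibly entered in two different columns.
    for a, b in combinations(measured, 2):
        if all(row[a] == row[b] for row in rows):
            problems.append(f"columns {a} and {b} hold identical values")

    return problems


# Made-up example data, for illustration only.
rows = [
    {"id": "P01", "height_cm": 172.0, "weight_kg": 68.5},
    {"id": "P02", "height_cm": 1650.0, "weight_kg": 59.0},
    {"id": "P03", "height_cm": 181.0, "weight_kg": 80.2},
    {"id": "P04", "height_cm": 181.0, "weight_kg": 80.2},
]
limits = {"height_cm": (100, 220), "weight_kg": (30, 200)}

for problem in find_entry_problems(rows, limits):
    print(problem)
```

A message from a check like this is a prompt to go back to the original records, not proof of an error: two people can legitimately have the same results.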

Sometimes errors arise from known deviations from a study protocol. A biological sample might have become contaminated, for example. Such cases are generally not included in the data set.


[Figure: outlier on dotplot]

You may have done some exploration and analysis of your data before your meeting with a consultant. You might, for example, have examined the distribution of the outcomes you have measured using a visual display such as an individual value plot, dotplot, scatterplot or boxplot.

This kind of data exploration can help to identify errors and values that are relatively unusual. Relatively unusual values are sometimes referred to as "outliers". However, the fact that a value is "unusual" or labelled an "outlier" does not mean that it is incorrect or that it should be removed from the data set. It may mean that it is worth checking that no error has occurred, or whether there is some explanation for why the value is unusual.
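One way to act on this is to flag unusual values for checking while leaving every value in place. The sketch below is an illustration under stated assumptions rather than a recommended rule: it uses the common boxplot convention of marking values more than 1.5 times the interquartile range beyond the quartiles, which is our choice here and not something the text above specifies. The function name flag_unusual and the measurements are hypothetical.

```python
import statistics


def flag_unusual(values, k=1.5):
    """Pair each value with True if it lies beyond the boxplot fences.

    Nothing is removed or adjusted: the result has one entry per input value.
    The multiplier k=1.5 is a convention, assumed here for illustration.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [(v, not low <= v <= high) for v in values]


# Made-up measurements, for illustration only.
measurements = [4.1, 4.4, 4.6, 4.8, 5.0, 5.1, 5.3, 9.7]

for value, unusual in flag_unusual(measurements):
    note = "  <- check against the original record" if unusual else ""
    print(f"{value}{note}")
```

The flag marks a value as worth a second look. Whether it stays in the analysis is a separate question to discuss with the consultant.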

Some textbooks recommend removing outliers or even "adjusting" outliers. We strongly recommend that you do not remove or adjust data. The data you bring to a consultant should contain all the correct original data you observed. A telling example of what can go wrong: Antarctic satellite data collection systems automatically deleted outliers, and as a result the hole in the ozone layer was detected much later than it could have been.

There are many reasons why removing valid but somewhat unusual values is not appropriate. A consultant can help you determine if there are unusual values in the data set that cause problems in the analysis and interpretation of your data.

Removing data which are simply at the extremes or adjusting values at the extremes is an extremely dubious scientific practice. Some people regard it as scientific fraud.

Make an appointment:

Call on 8344 6995, or complete and submit the Graduate Enquiry form.