Logo Quantitative consulting

Dr. B oris V aillant - Quantitative Consulting

[Training] [Short articles]

Dos and don'ts in Excel

[Download the xls-example]

There's no denying it. As much as quantitative consultants would wish that dedicated statistics tools like R, Splus, SAS or business intelligence tools like BO and Cognos played a more prominent role, the program that is virtually ubiquitous when it comes to analyzing business data is Excel. Its large distribution base as part of the Office package, its ease of use and its flexibility make it the most popular analysis tool by far.

The advantages of Excel constitute at the same time its largest drawback. Excel does not prescribe how to store or format data, it does not distinguish between data management and data presentation and in general invites the user to be repetitive and uneconomical with his or her data.

If in a project data sources are given to me in Excel, I usually count in several extra days of work to bring the data into a usable format. Even if you are not working with a consultant: The necessity to share data and to regularly update it, the fact that the data may be useful in answering new questions not thought of before, and not least the fact that we all belong to a rather forgetful species, place a number of restrictions on the way how to Excel should be used.

The following list of recommendations is mainly concerned with how you as the user should use the program: be specific, be precise, be encompassing and be systematic, if you know BO or Access: use Excel like you use BO or Access. Some of the recommendations are concerned with Excel's built-in "intelligence". Here, the message is: Do not trust it.

One remark before we start: Many people use Excel to "build models" by chaining formulas, putting different functions of their models on different pages etc.. While I do believe that most of this modelling should better be done in a statistical programming language like R, the following "commandments" can only in part be applied in that case.

Dos and Don'ts in Excel I

Do not Do
Do not mix data and analysis. Do keep your data in one table and as close to "normal form" as possible. Use the pivot table function for analysis.
If you keep your data in a standardized form, you don't have to reinvent the wheel, every time you make an analysis. You can use standard tools like Pivot-tables, and are able to describe your data concisely.
Do not keep separate sheets of data (e.g. January, February, etc.). Do create a single sheet for your data.
Just imagine you've stored every month in another sheet and somebody asks you to compare Q3 to Q2! You will have to start linking formulas across sheets, but of course the data positions in the sheets will have moved etc. etc.
Do not keep any kind of formatting in your data. Do ensure that your data is completely unformatted by copying the whole data sheet and then using "Paste special: Values" into a completely new sheet
Formatting has a way to lead a life of its own. What was supposed to be a number becomes a text or a date etc. If you want to make sure there is no formatting left, use a completely new sheet.
Do not use linked cells, colors, decimal points or the like. Especially do not use formatting like colors, or bold type to code information Do make your color coded information explicit in a separate column. If you must by using values like "green", "blue" etc.
Have you ever tried to navigate through a sheet with linked cells with your cursor? Or to link a formula to a linked cell? Or to just copy data from an area with linked cells?
Never use the date format for dates. Do use a safe date format, e.g. 2006, 12, 24 in separate columns or 20061224 instead.
No matter how good it looks right now, it will not survive any move to a machine with another localization than yours.
Never use the currency format. Do keep track of currencies as pure text in a separate column.
Again, you won't be able control changes of localization.
Do not use tags or abbreviations that could be construed to be dates. Do make sure there are no ambiguities.
If, for instance, you should decide to label lines by "3.1", "3.2" etc., rest assured that Excel will start interpreting this as the 3rd of January etc.. Usually this happens when you least expect it and you may only find out much later that you have worked with corrupted data.
Do not keep formulas in your data sheet. If you need formulas to calculate some columns (e.g. revenue from pieces sold and price/piece): Do use formulas for the calculation, but paste back the values immediately afterwards. Keep a copy of the first few lines of your data sheet with formulas in a separate sheet, so that you can repeat the procedure, later.
In large data sets, formulas will simply kill your performance. They will also bug you, whenever you interchange columns or sort your data.In the worst case, your results may be wrong after sorting without your noticing it.
Do not hide information in formulas (e.g. by using constants: A1*2,33) Do make all information used in formulas explicit in separate columns
This is just one more instance of the principle that everything in your data should be explicit.
Do not use intensive (not summable) quantities in your data sheet, except for the calculation of extensive quantities Do use extensive (summable) quantities in your data sheet
I'll be somewhat less assertive on this requirement. The point is that you cannot sum quantities in EUR/Kg or the like. You might want to look at minimums or quantiles in a given segment, though. So, be sure what you do.
Do not use different spellings for names (e.g. Region, Product names etc.) Do make sure that names are always written exactly the same way.
It is a real pain when you have to search for all the sales with "Meier&Co", "Meier &Co." and "Meier & Co". A common cause for nonunique names are double spaces. Get rid of double spaces by using Replace: "  " (2 spaces) by " " (1 space). For "Meier&Co", along with all their possible spellings, try replacing ".", "&", " " by "" (nothing).
Do not rely on the "numeric" format. Do make sure that numbers are really numbers by multiplying them with 1 in a separate column and using this result instead of the original data.
What looks like a number in Excel may still not be recognized as a number. The formatting options usually only influence the display but not the data type itself.
Do not hide columns or lines in your data sheet Do introduce a separate column with 1 or 0 indicating, whether a specific line is appropriate for your purpose or not.
Hiding elements may make sense in presentation sheets, not in your data sheet.
Do not allow empty cells as values. Do make your intention explicit.
Did you mean 0, or "to be discussed", "does not apply" etc.?
If you are German and are keen to belong: do not pronounce the name of the program "Ecksl" Do try to pronounce it more like "Exelle"

So, now that you have put your data in a nice unformatted table, you can start making analyses using the pivot table function. Have a look at the [xls-example].

© 2006-2008   www.quantitative consulting.eu