Better practices for sample spreadsheet creation
Target audience: Researchers who frequently create sample spreadsheets, typically "the biologist".
Aim: Help the biologist in the future make better informed decisions when creating sample spreadsheets.
Really short version
- File names: only letters, numbers, - and _. No blank spaces! ex. sample_info.csv Read more about naming
- Columns names: only letters, numbers and _. ex. tumor_size Read more about naming
- Sample identifiers: start with at least one letter, pad with zeros. ex. s003 Read more about naming
- Date format: YYYY-MM-DD. ex. 2013-02-29 Read more about dates
- Formatted information: no colourcoding Read more about storing information
Also short but longer version
Spreadsheets often need to be read and understood by both other humans than the biologist and computer programs used for data analysis. It may not be a topic many biologists pay a lot of attention to, but unfortunate choices in the initial creation of the spreadsheet may later give opportunities for mistakes to be made in the analysis. In addition, it usually means that a significant amount of time must be used in back-and-forth messages between the biologist and the bioinformatician to clear out ambiguities. For those who know how to do it, generating a good spreadsheet is not a difficult task and it can save a lot of valuable time later.
This guide focuses on the creation of tidy spreadsheets that are less error-prone and easier for computers to process, although they may seem slightly less pleasant to the human eye. If you need to show your data in a presentation, you should make a pretty ad-hoc version but keep the main file computer-friendly.
We hope that the following advices help the biologist realize which practices are best avoided, understand why they are problematic and learn how to fix them. At the same time, we acknowledge that some of the "unfortunate" practices have value for the biologist and will be continued, but now after a better informed decision.
A practical example: "the Spreadsheet File from Hell"
In this illustration we have collected examples of the most common problematic practices (highlighted with red numbers) to a create a spreadsheet that would be the nightmare of any data analyst. Below you can find a list with
the corresponding problems and some recommendations on how to fix them. You can click on the
Read more options to get an extended explanation. Under the list you can see how the spreadsheet looks after we have applied the better practices.
Which problems do we find in this spreadsheet? And how can we solve them?
- Weird file name: File name contains blank spaces. It is a good practice to substitute blank spaces with a hyphen ‐ or an underscore _. Read more about naming
- Weird column names: Several variable names have blank spaces, newline and/or special characters (beware of Norwegian characters!). It can cause problems when a different program reads those columns. Blank spaces can be substituted with an underscore _. It is also recommended not to use capitals in column names to prevent spelling mistakes. Read more about naming
- Long column names: Having short and meaningful column names helps to work with the data. If several columns refer to related information (for example different time points), they should have consistent names. Additional information about the variable (description, units...) can be kept in a separate data dictionary. Read more about naming
- Inconsistent sample identifiers: Some of the identifiers start with numbers and some with letters, and some use dots to separate the groups of numbers while others use underscores. Besides, theses numbers have different amounts of digits. This lacking of consistency could lead to problems if the individuals are sorted by their identifiers. Read more about consistency Read more about naming
- None-categorical categorical variables: The disease variable has only two options:
case, but these values are written in several different ways, which could be interpreted by a program as different conditions. Read more about consistency
- Decimal variability: The numeric columns show values with a different number of decimals. Read more about consistency
- Information in formatting: The information about the type of sickness (DCIS or benign) is codified by their position in the spreadsheet. There are also some samples with odd problems shown in green and special cases marked with an asterisk. It is better to store this data in new variables. Read more about storing information
- Horror vacui: Some programs throw mistakes when they read a column with blank spaces. It is better to replace all of them with
NA, which is universally recognized as an empty value. Read more about storing information
- Numbers are numbers: Column
CD4-/CD8-shows two values as
<0.6. Numeric columns should not include non-numeric symbols and specific comments can be stored in an extra
Commentscolumn. Read more about storing information
- Impractical date format: The dates have different formats, and one of the cell shows extra information. The recommended format is YYYY-MM-DD and the extra information can be included in the
Commentscolumn. Read more about dates
- Merged cells: Several cells have been merged into a single one. This kind of changes in the structure of the spreadsheet should be completely avoided. Read more about structure
- Anglo-phobia: Some columns have norwegian names. If the ambition is to make a publication in an international journal from the study, this must be translated to English. Read more about language
This is how the spreadsheet looks after the modifications. We have highlighted the columns or cells modified or added to the previous table:
Naming of files and variables
It is important that the file can be read by most programs. Thus, file names should not have any characters apart from letters and numbers, with the exception of hyphens ‐ or underscores _. Blank spaces can create problems when the data file is read by some computer programs, especially bioinformatic programs which often are not very robust for unexpected input. Instead you can separate the words using hyphens or underscores. If you never use capital letters, a lot of spelling mistakes will be avoided. These advices apply also to column names (variables), since they may end up in file names as plots and tables. In addition variables may be used in the analysis as part of mathematical formulas where a lot of characters, like minus or plus, have a special meaning. All special characters are therefore best avoided in variable names, with the exception of underscore which has no known alternative interpretation.
Naming of Ids and values
Identification numbers for the samples should have the same number of digits. This is critical in order to prevent mistakes when sorting, as sometimes this field is treated as text and then
20. These identifiers should start with a letter, as some programs may remove the zeros at the left of a numeric value. For instance if you have 20 samples, it is better to give them ids like "s01", "s02" ... than just numbering them 1-20. The main concept here is that the alphabetical sort order should be sensible and identical to what a numeric sort order would be (if possible). For instance the common naming of chromosomes, "chr1", "chr2".. violates this and dumb programs (which there are a lot of) would put "chr10" immediately after "chr1". In addition, sample IDs often end up as part of file names and should therefore ideally follow the same principles, i.e no special characters
Categorical values with no meaningful numeric or rank interpretation should be written in a format which makes it impossible for a program to interpret it in a numerical sense. For instance, if patients are given pure numbers as identifiers then some programs may interpret patient "4" to be twice as much patient as patient "2". This will be avoided by calling the patients "p4" and "p2". In some situations categorical values can also have a natural rank, for instance the cancer stage "4" is not twice as much cancer as "2", but it is more severe and in some analysis this rank order is utilized. However, when in doubt what analysis will be done later, try to enforce the non-numeric interpretation, i.e use "stage2" and "stage4" for cancer stages.
You have to be aware that any kind of information in formatting, like encoding data with different colour or font, may get lost when the spreadsheet is read by a different program than the one used to create it. Few bioinformatics programs knows how to interpret excel formatting. This information should instead be encoded in a new variable. Ideally the file should be in a simple text format as csv or tab-text, since this is the format most bioinformatics programs understand. You may still use an excel-like format as long as no important information is lost and the column-row connection is kept when saving as cvs.
Each cell should contain only one piece of information. For example, the column
with a value of "plate07-B03" it is better registered as two or three different columns:
plate with the value plate07 and
well with the value B03 or even split in B and 3.
Several functions in different programs throw errors when they find a blank cell. Avoid blank cells in your spreadsheet by registering missing data with a non-numeric value which is consistent across the table.
NA is widely used and can be recognized by several programs. In addition the biologist sometimes forget to fill in a value. If a blank cell means missing data, it will create an ambiguity and the bioinformatician can not know if it means "no measurement available" or "the biologist forgot to enter the measurement". If the biologist actively denotes an unavailable value as such with NA, then the bioinformatician will correctly interpret the empty cell as a mistake and get it sorted out.
When using a numeric variable, do not write any symbol apart from numeric characters in the cells. Including any other symbol can result in the program considering the values in that column as text instead of numbers and subsequent
calculation would fail. If you need to make a note about an specific cell, use an extra
Being consistent is one of the most important things when creating a spreadsheet. A consistent spreadsheet is much easier to read and understand by others and it can save hours of delay while the bioinformatician and the biologist
exchange questions and answers about the data file. For example, using
dop_20w as column names in the same spreadsheet can be confusing and lead to an avoidable waste of time until all the
doubts are answered. Try to use short and meaningful names for your columns and a consistent structure for their names. The same principle applies to identification names. Having several samples labeled as
rat_3 is unnecessarily confusing. And be aware that consistency is specially critical when you are using categorical variables:
M may mean the same to a human eye, but they are
certainly not the same to a program.
It is not only important to be consistent within a certain spreadsheet, but also among your different files. If you have several spreadsheets with the same information (for example same experiment at different time points), use the same layout. If you have several files that are related, use consistent names.
Dates can be the source of several problems if they are stored in different formats. Although Excel can recognize most of the different formats as dates and sort them accordingly, other programs may treat them as text and sort them alphabetically. Thus, it is important to use the order year-month-day with zero-padding to force the alphabetical sort order to also be chronological. In addition, local conventions of the month and day order may vary. One easy way to avoid problems is to routinely use the ISO 8601 standard YYYY-MM-DD everywhere.
Each spreadsheet should contain only one rectangular table where rows are the samples and columns are the values. There should never be merged cells in your spreadsheet, as it can create a huge mess when the spreadsheet is read by a different program.
Sometimes it could be wise to split the information into separate files. For instance if you have several samples for each patients as in a time series, you could have patient information in one file (disease, age, sex etc) and sample related information (date of sample, dosage etc) in another file. But remember to use consistent identifiers so the connection is not lost. This is the way data is organized in relational databases.
It is important that your data file contains only raw data and you do not modify it after is has been created. Any kind of modification applied to your data file, like using calculations and formulas, can lead to undesired data manipulation. If you need to apply any calculation to your data, use a duplicate file. It is also recommended to avoid having information about the data in the spreadsheet. A good practice is to have in a separate file a data dictionary that can contain variable description, units, etc.
Most biologists want their study to be published in an international journal. Therefore, the information in the sample spreadsheet needs to be in English even though the biologist might speak a different language. The best point in time to do the translation is right at the start when the file is created. Postponing this will often lead to extra work, since the variable names and values may be embedded in tables and plots.
Some of the information reflected in this guide comes from
Karl W. Broman & Kara H. Woo (2017) Data Organization in Spreadsheets The American Statistician, 72:1, 2-10, DOI: 10.1080/00031305.2017.1375989
In this paper you can find additional information, a extended explanation of the typical mistakes and lots of good examples. We encourage everyone to read it.
Data Carpentry has an online tutorial about data organization in spreadsheets with several exercises. It contains notes for instructors who wish to use it for a small workshop or presentation.
Sample File from Hell.xls
This document was made by Manuel Ramirez and Vegard Nygaard.
Bioinformatics Core Facility
Institute for Cancer Research
Oslo University Hospital