Karl W. Broman & Kara H. Woo (2018) Data Organization in Spreadsheets, The American Statistician, 72:1, 2-10, DOI: 10.1080/00031305.2017.1375989
- Be consistent.
- Consistent values for categorical variables (eg: male, female).
- Same for missing variables.
- In R, we would prefer NA.
- You can also use hyphen.
- Never explain why it is missing inplace of the missing value, make a seperate column for that.
- Consistent variable and subject names, consistent layout throughout one file or one ecosystem.
- Consistent date format.
- Consistent phrases in notes.
- No extra spaces within cells.
- Choose informative and succinct names for subjects.
- Prohibited of using spaces.
- Be consistent with either underscores or hyphens.
- Avoid special characters ($, @, %, #, &, *, (, ), !, /, etc.).
| Good name |
Good alternative |
Avoid |
| Max_temp_C |
MaxTemp |
Maximum Temp (◦C) |
| Precipitation_mm |
Precipitation |
precmm |
| Mean_year_growth |
MeanYearGrowth |
Mean growth/year |
| sex |
sex |
M/F |
| weight |
weight |
w. |
| cell_type |
CellType |
Cell Type |
| Observation_01 |
first_observation |
1st Obs. |
-
Write dates in consistent manner.
- Recommend the ISO 8601 standard, yyyy-mm-dd.
- Set text format for column before entering dates will prevent auto conversion.
- If a column already contain dates, setting the text format will convert values into Excel’s underlying numeric representation.
- Add an apostrophe before a value to treat them as text (’2014-06-14).
- Use seperate columns for day, month and year. Or a straight 8-digit integer of form YYYYMMDD.
-
No empty cells
- Use NA or a hyphen to indicate that the data is missing rather than leaving it blank unintentionally.
- Define categorical variables and fill in, rather than leaving it in merging cells, it will be hard to deal with later on if using programming languages.
| id |
date |
glucose |
| 101 |
2015-06-14 |
149.3 |
| 102 |
|
95.6 |
| 103 |
2015-06-18 |
97.5 |
| 104 |
|
121.2 |
| 105 |
|
111.6 |
| 106 |
2015-06-20 |
108.6 |
| 107 |
|
169.4 |
Only a several dates were include
|
1 min |
|
|
|
5 min |
|
|
|
| strain |
normal |
|
mutant |
|
normal |
|
mutant |
|
| A |
261 |
379 |
192 |
340 |
329 |
335 |
443 |
294 |
| B |
407 |
388 |
350 |
257 |
268 |
305 |
276 |
451 |
Complicated layout
| strain |
genotype |
min |
replicate |
response |
| A |
normal |
1 |
1 |
370 |
| A |
normal |
1 |
2 |
160 |
| B |
normal |
1 |
1 |
356 |
| B |
normal |
1 |
2 |
355 |
| A |
mutant |
1 |
1 |
252 |
| A |
mutant |
1 |
2 |
320 |
| B |
mutant |
1 |
1 |
397 |
| B |
mutant |
1 |
2 |
314 |
| A |
normal |
5 |
1 |
227 |
| A |
normal |
5 |
2 |
187 |
| B |
normal |
5 |
1 |
453 |
| B |
normal |
5 |
2 |
283 |
| A |
mutant |
5 |
1 |
267 |
| A |
mutant |
5 |
2 |
425 |
| B |
mutant |
5 |
1 |
283 |
| B |
mutant |
5 |
2 |
273 |
A tidy version
You can see it is melting version of the complicated version earlier.
-
One type of data in a cell, no combining and merging.
-
The best layout for your data within a spreadsheet is as a single big rectangle with rows corresponding to subjects and columns corresponding to variables.
- For CSV saving, keep each table in one file only. You can also consider keep one file with multiple tables in multiple worksheets, as long as it is consistent.

Some layouts that are causing problems.

Reorganization of the D table above.

Instead of this, we can use a single header and name them date_W4, date_W6, …

Or like this
- Create a seperate table explaining all the variables, or metadata. The table can includes:
- The exact variable name as in the data file;
- A version of the variable name that might be used in data visualizations;
- A longer explanation of what the variable means;
- The measurement units;
- Expected minimum and maximum values.
