Data Organization in Spreadsheets

Karl W. Broman & Kara H. Woo (2018) Data Organization in Spreadsheets, The American Statistician, 72:1, 2-10, DOI: 10.1080/00031305.2017.1375989

Be consistent.
- Consistent values for categorical variables (e.g., always "male" and "female", never a mix of "M", "Male", and "male").
- Consistent code for missing values.
  - In R, NA is preferred.
  - A hyphen also works, as long as it is used everywhere.
  - Never write the reason a value is missing in place of the value itself; make a separate column for that.
- Consistent variable and subject names, and a consistent layout within one file and across related files.
- Consistent date format.
- Consistent phrases in notes.
- No extra spaces within cells.
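
The consistency rules above can be checked mechanically before analysis. Below is a minimal sketch in Python using only the standard library. The rows, the column names, the mapping in `SEX_CODES`, and the choice of YYYY-MM-DD as the one date format are all assumptions made for illustration; they are not taken from the paper.

```python
from datetime import date

# Hypothetical raw rows, as they might come out of a messy spreadsheet.
raw_rows = [
    {"subject": "mouse_01", "sex": " Male ", "weight": "21.3", "date": "2015-06-14"},
    {"subject": "mouse_02", "sex": "female", "weight": "-", "date": "2015-06-15"},
    {"subject": "mouse_03", "sex": "F", "weight": "", "date": "6/16/2015"},
]

# Assumed mapping from the spellings seen in the data to one agreed code.
SEX_CODES = {"male": "male", "m": "male", "female": "female", "f": "female"}
# Assumed set of spellings that all mean "missing".
MISSING_CODES = {"", "-", "na", "n/a"}


def clean_cell(value):
    """Trim and collapse whitespace, and map every missing code to NA."""
    text = " ".join(value.split())
    return "NA" if text.lower() in MISSING_CODES else text


def clean_row(row):
    """Return a cleaned copy of one row with a consistent sex code."""
    cleaned = {key: clean_cell(value) for key, value in row.items()}
    if cleaned["sex"] != "NA":
        # Raises KeyError on an unknown spelling, so it cannot slip through.
        cleaned["sex"] = SEX_CODES[cleaned["sex"].lower()]
    return cleaned


def is_iso_date(text):
    """True if text is a valid YYYY-MM-DD date."""
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False


clean_rows = [clean_row(row) for row in raw_rows]
# Subjects whose date does not follow the chosen format and needs manual review.
bad_dates = [row["subject"] for row in clean_rows if not is_iso_date(row["date"])]
```

Dates in another format are only reported, not converted, because a value such as 6/7/2015 is ambiguous without knowing who entered it.
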

Choose informative and succinct names for subjects.
- Do not use spaces.
- Be consistent with either underscores or hyphens.
- Avoid special characters ($, @, %, #, &, *, (, ), !, /, etc.).
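
The naming rules above are easy to express as a pattern. This is a minimal sketch that assumes underscores were chosen as the separator; `is_good_name` and `suggest_name` are hypothetical helper names, and the example names are made up.

```python
import re

# Letters and digits, in groups separated by single underscores,
# starting with a letter. No spaces, hyphens, or special characters.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*(?:_[A-Za-z0-9]+)*")


def is_good_name(name):
    """True if the whole name matches the assumed naming convention."""
    return NAME_PATTERN.fullmatch(name) is not None


def suggest_name(name):
    """Replace each run of disallowed characters with one underscore."""
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", name.strip())
    return cleaned.strip("_")


column_names = ["glucose_10wk", "Body Weight (g)", "Max temp %"]
needs_renaming = [name for name in column_names if not is_good_name(name)]
suggestions = {name: suggest_name(name) for name in needs_renaming}
```

A suggestion still deserves a human look: stripping "%" from a name also strips the unit, which may belong in a data dictionary instead.
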

No empty cells

Use NA or a hyphen to mark missing data instead of leaving the cell blank; a blank cell does not tell the reader whether the value is missing or was simply not entered.
Fill in every cell of a categorical variable rather than merging cells or leaving repeated values blank. Merged and blank cells are hard to handle later when the file is read by a programming language.
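
When a sheet has already been exported with blanks where cells were merged, the gaps can be filled downward. A minimal sketch, assuming a blank in the given column always means "same as the row above" (if a blank can also mean a genuinely missing value, this approach is wrong for that column). `fill_down` is a hypothetical helper and the rows are illustrative.

```python
def fill_down(rows, column, blank=""):
    """Return copies of rows with blanks in `column` replaced by the value above."""
    filled = []
    last_seen = None
    for row in rows:
        value = row[column]
        if value == blank:
            if last_seen is None:
                raise ValueError(f"first row has no value in column {column!r}")
            value = last_seen
        else:
            last_seen = value
        # Build a new dict so the original rows are left untouched.
        filled.append({**row, column: value})
    return filled


# Hypothetical export of a sheet where "strain" was a merged cell.
sheet_rows = [
    {"strain": "A", "response": "370"},
    {"strain": "", "response": "160"},
    {"strain": "B", "response": "356"},
    {"strain": "", "response": "355"},
]
filled_rows = fill_down(sheet_rows, "strain")
```
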

Only several dates were included.

Complicated layout

strain	genotype	min	replicate	response
A	normal	1	1	370
A	normal	1	2	160
B	normal	1	1	356
B	normal	1	2	355
A	mutant	1	1	252
A	mutant	1	2	320
B	mutant	1	1	397
B	mutant	1	2	314
A	normal	5	1	227
A	normal	5	2	187
B	normal	5	1	453
B	normal	5	2	283
A	mutant	5	1	267
A	mutant	5	2	425
B	mutant	5	1	283
B	mutant	5	2	273

A tidy version

The tidy table is a melted (long-format) version of the complicated layout shown earlier: one row per measurement.
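
Melting can be sketched in a few lines of Python. The wide table below is a hypothetical stand-in, since the original complicated layout is not reproduced in these notes; its values are taken from the tidy table, and the column names `min_1` and `min_5` are assumptions.

```python
# Hypothetical wide layout: one column per time point.
wide_rows = [
    {"strain": "A", "genotype": "normal", "replicate": 1, "min_1": 370, "min_5": 227},
    {"strain": "A", "genotype": "normal", "replicate": 2, "min_1": 160, "min_5": 187},
    {"strain": "B", "genotype": "normal", "replicate": 1, "min_1": 356, "min_5": 453},
]


def melt(rows, id_columns, value_prefix, variable_name, value_name):
    """Turn each `value_prefix` column into its own row (wide to long)."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if not column.startswith(value_prefix):
                continue
            new_row = {key: row[key] for key in id_columns}
            # "min_5" becomes the integer 5 in the new variable column.
            new_row[variable_name] = int(column[len(value_prefix):])
            new_row[value_name] = value
            long_rows.append(new_row)
    return long_rows


long_rows = melt(
    wide_rows,
    id_columns=["strain", "genotype", "replicate"],
    value_prefix="min_",
    variable_name="min",
    value_name="response",
)
```

Each wide row yields one long row per time point, so three wide rows become six long rows with the same five columns as the tidy table.
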

Put one piece of data in each cell; do not combine several values in a cell or merge cells.
The best layout for your data within a spreadsheet is as a single big rectangle with rows corresponding to subjects and columns corresponding to variables.
- When saving as CSV, keep one table per file. You can also keep several tables in one file, each on its own worksheet, as long as the layout is consistent.
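
A quick way to confirm that a CSV file really is one rectangle is to compare each row's field count with the header's. A minimal sketch using Python's `csv` module; `find_ragged_rows` is a hypothetical helper, and the sample text is illustrative, with its last row deliberately cut short.

```python
import csv
import io


def find_ragged_rows(csv_text):
    """Return the 1-based line numbers of rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    return [
        line_number
        for line_number, row in enumerate(reader, start=2)
        if len(row) != len(header)
    ]


sample = (
    "strain,genotype,min,replicate,response\n"
    "A,normal,1,1,370\n"
    "A,normal,1,2\n"  # one field short: not part of the rectangle
)
ragged = find_ragged_rows(sample)
```

The line numbers assume one record per physical line, which holds as long as no quoted field contains a line break.
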


Some layouts that cause problems.


Reorganization of the D table above.


Instead of this, we can use a single header and name them date_W4, date_W6, …
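
Collapsing a two-row header into a single row can be sketched as follows. The header contents here are assumptions: only the `date_W4` and `date_W6` style of name comes from the notes, while the `id` and `weight` columns and the helper `flatten_header` are hypothetical.

```python
def flatten_header(top_row, bottom_row):
    """Combine a group row and a field row into single names like date_W4."""
    names = []
    group = ""
    for top, bottom in zip(top_row, bottom_row):
        if not bottom:
            # Stand-alone column: its name sits in the top row only.
            names.append(top)
            group = ""
            continue
        if top:
            # A non-empty top cell starts a new group (e.g. a merged "W4" cell).
            group = top
        names.append(f"{bottom}_{group}")
    return names


# Hypothetical two-row header, with blanks where the top cells were merged.
top_row = ["id", "W4", "", "W6", ""]
bottom_row = ["", "date", "weight", "date", "weight"]
single_header = flatten_header(top_row, bottom_row)
```
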


Or like this
