Working with large piles of complex data can be a difficult task, even for seasoned experts. What happens when a non-specialist is tasked with collecting, managing, and ultimately warehousing large amounts of painstakingly collected data? What happens when multiple non-specialists are concurrently working on these data? How can a revision history be maintained when a small set of files is being passed around via email, or appended to a "master" document?

These are some of the first questions that come to mind when watching most people deal with data storage. Researchers and applied science technicians commonly collect and manage a lot of data. The resulting flurry of spreadsheets doesn't usually cause problems until an error is discovered, or until someone applies an incorrect formula to a range of cells. These are common mistakes with simple solutions: version control systems, a separation of data and computation, and constraints imposed by a relational database management system (RDBMS). Why, then, are these strategies not actively pursued outside of computer science and mathematics?
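As a rough illustration of two of those ideas (constraints imposed by an RDBMS, and keeping computation separate from the stored data), here is a minimal sketch using Python's built-in `sqlite3` module. The `measurement` table, its columns, and the helper names `build_db`, `try_insert`, and `mean_ph` are hypothetical, invented for this example only; a real project would choose its own schema and limits.

```python
import sqlite3


def build_db():
    """Create an in-memory database with a hypothetical measurement table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE measurement (
            sample_id TEXT    NOT NULL,
            depth_cm  INTEGER NOT NULL CHECK (depth_cm >= 0),
            ph        REAL    NOT NULL CHECK (ph BETWEEN 0 AND 14),
            UNIQUE (sample_id, depth_cm)  -- one record per sample and depth
        )
        """
    )
    return conn


def try_insert(conn, row):
    """Insert one row; return False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO measurement VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


def mean_ph(conn):
    """The summary is computed by a query, never stored alongside the data."""
    return conn.execute("SELECT AVG(ph) FROM measurement").fetchone()[0]


if __name__ == "__main__":
    conn = build_db()
    print(try_insert(conn, ("A1", 10, 6.5)))   # valid record
    print(try_insert(conn, ("A1", 30, 71.0)))  # typo: pH outside 0-14
    print(try_insert(conn, ("A1", 10, 7.0)))   # duplicate sample and depth
    print(mean_ph(conn))
```

In a spreadsheet, the mistyped pH value and the duplicated record would sit quietly in their cells until someone noticed them. Here the database refuses both at entry time, and the mean is recomputed from the stored rows each time it is requested, so there is no stale formula to go wrong.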

It has everything to do with training. All students working toward a career in science are required to take several technical writing classes. Writing is an essential part of research, and any student lacking in this respect would be expected to improve their writing skills, or else. Concepts of data management would therefore be a natural extension to technical writing courses, especially for those interested in research or applied science. Although it might take some arm twisting, I think that a well-designed, quarter-long data management course would be time well spent for the majority of new students out there.

A data management class might consist of several data and example-driven modules:

  1. Intro Material
    1. file formats and what they are used for
    2. when to use a plain text file
    3. when to use a spreadsheet
    4. when to use a database
    5. converting between the above three
  2. Revision Control
    1. concepts
    2. SVN
    3. strategies
  3. Stream Editors
    1. awk
    2. sed
    3. UNIX toolbox
  4. Practical Programming (fixing data)
    1. Python
    2. Perl
    3. R
    4. awk
    5. ???
  5. Database Constructs
    1. SQL
    2. tables
    3. constraints
    4. joins

The above outline represents about 15 minutes of thought, and is by no means comprehensive or suitably generalized. However, I think that with a small amount of training it would be possible to educate a critical mass of individuals on the finer points of managing data. Getting past corporate and government agency habits, which are in many cases propped up by hacked-together Excel-Access-VBA applications, would take considerably more effort.

Disclaimer
The opinions on this page are based on extensive conversations with individuals spanning education, the private sector, and government agencies. The generalizations presented are just that: generalizations. I realize that there are plenty of very competent people who keep those institutions running. However, it appears that a required course in data management could improve efficiency, accuracy, and creativity in research and applied science situations. This little bit of semi-structured venting closely parallels the discussion going on here.