Why baking and data archiving really aren’t all that different

I imagine many of you reading this will be familiar with the infamous bake-off technical challenges, but for those who aren’t, let me provide a small summary. Bakers are all given the same ingredients and minimal instructions, and are challenged to create the same baked goods across the board, whether that be scones, bread, cakes, etc. Now this is obviously meant to test the bakers’ in-depth knowledge of how to make these kitchen staples, and is great fun to watch. However, the way in which these highly skilled bakers still often produce different results should demonstrate how, even with significant contextual knowledge, it’s very difficult to reproduce something without having all the pieces of the puzzle.

Applying this same example to our scientific research outputs, we should be wary of letting our scientific records become a technical challenge to future scientists, or even our future selves. Archiving your data in a re-usable fashion is challenging, but archiving your methods and tools alongside that is arguably even more challenging, and if not done properly can cause no end of headaches in the future. I cannot emphasize enough how much this needs to be considered at ALL stages of a project, and ideally as early as possible, as otherwise things get missed. So let’s break this down into the components of a baking task.

1. List the ingredients  

These are the core components that you will need to bake your cake, and that’s pretty simple, right? Flour, eggs, butter and sugar. BOOM.

But wait! How much of each ingredient do you need? Self raising flour or not? What type of sugar? Does it need any flavours?   

You need to apply the same level of detail to your data. Do your data files and headings make sense? How can a researcher ascertain if all of the files are present and correct? Are your files understandable? Is all of your code there? Is it commented?
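One way to help a future researcher check that all of the files are "present and correct" is to deposit a manifest of checksums alongside the data. The sketch below is only an illustration, not a prescribed method: the function names (`sha256_of`, `build_manifest`, `verify_manifest`) are hypothetical, and it assumes your deposit lives in a single folder and that SHA-256 is an acceptable checksum for your archive.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under root (by relative path) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """List files that are missing or whose contents have changed."""
    root = Path(root)
    problems = []
    for name, expected in manifest.items():
        target = root / name
        if not target.is_file():
            problems.append(f"missing: {name}")
        elif sha256_of(target) != expected:
            problems.append(f"changed: {name}")
    return problems
```

Calling `build_manifest` on the folder at deposit time and saving the result (for example as JSON) gives anyone who later downloads the files something to run `verify_manifest` against. An empty list means every listed file is there and unchanged.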

2. Detail the pre-requisites   

Baking a cake isn’t just about having the ingredients, or indeed the method, you need the equipment, and it needs to be appropriate for the task at hand.   

Does the amount of ingredients listed work for the specified cake containers? Do you have an oven (one that meets the temperature requirements)? Do you have the mixing apparatus? Does any of this work rely on outdated equipment that is no longer available?

A painful and frequent issue with trying to reproduce or resurrect legacy data is the lack of pre-requisite information. For example, if you have left database dumps of your tables, is there a schema to go with them to explain how the database SHOULD work? If you have coding scripts, what libraries and/or installations do they require? If you have websites, what installations and drivers are required to run the website locally, and have you included all of the underpinning data/images/code that is required to run it? Is the software you are using OS-specific, and if so, have you tried and tested alternatives? Is any of your data/code stored on outdated media that may no longer be readable, such as USB sticks, RAID drives or (to my utter horror that any were still kicking about) floppy disks?
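For Python scripts specifically, part of the "what libraries and installations do they require?" question can be answered by recording the environment the code actually ran in. This is a minimal sketch using only the standard library. The function names and the JSON layout are my own assumptions rather than any standard, and it records only the Python version, the operating system and the installed packages, so it is a starting point rather than a complete list of pre-requisites.

```python
import json
import platform
from importlib import metadata


def describe_environment():
    """Collect the Python version, OS and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip any installation with unreadable metadata
            packages[name] = dist.version
    return {
        "python_version": platform.python_version(),
        "operating_system": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


def write_environment(path):
    """Save the environment description as JSON next to your code."""
    info = describe_environment()
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(info, handle, indent=2)
    return info
```

Running `write_environment("environment.json")` (a hypothetical filename) at the point where your code works gives a future reader a concrete record of versions to aim for, rather than leaving them to guess.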

3. Write the instructions  

Once you have the ingredients and the methods then you can list out the instructions! Now this sounds simple but also often assumes prior knowledge. For example, “cream together butter and sugar”. Most bakers know what that means, but a newbie does not, and even if they could extrapolate the meaning, it would be much simpler to include an explanation of that terminology, and/or just put it in layman’s terms to begin with.

Take “mix until ready”. Again, experienced bakers will get a feel for when their mixture is ready to put in the oven and the consistency they expect, but a brand new baker, or even someone who is skilled in, say, baking bread, won’t necessarily know what to look for in a cake batter. There are also inbuilt assumptions that people will know certain bits of information. For example, if you put icing on a warm cake, it will melt! This may be common sense, but how many brand new bakers do know that? If any of you watch Failed it to Nailed it, seemingly even people who have been baking for years don’t know that!

Instructions on how to reproduce your work should be clear, not assume prior knowledge and cover ALL steps from start to finish. This is why it’s advantageous to document these things as you go through rather than wait until the end. It’s very easy to look at a working set of code and a finalised dataset and think “great, that’s what I’ll deposit, all the information is here”, but you’ll be forgetting all of the hours, days, weeks and months that went into making all the different components work together. How did you get your database to talk to the rest of your code? What scripts did you use to generate your data in the first place? If you created or curated a dataset, what was your approach? Did you document it? Could anyone else coming to this fresh take your methods and apply them to achieve the same results?
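Documenting as you go doesn’t need heavy tooling. As one possible approach, the sketch below appends a timestamped note to a plain-text log every time you complete a step, recording what you did, what went in and what came out. The function names, field names and the one-JSON-object-per-line format are all assumptions made for illustration; a lab notebook or README that captures the same information would serve just as well.

```python
import json
from datetime import datetime, timezone


def log_step(log_path, description, inputs=(), outputs=()):
    """Append one processing step to the log as a line of JSON."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "description": description,
        "inputs": list(inputs),
        "outputs": list(outputs),
    }
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def read_log(log_path):
    """Return every logged step, oldest first."""
    with open(log_path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

For example, `log_step("steps.log", "Removed duplicate rows", inputs=["raw.csv"], outputs=["clean.csv"])` (filenames hypothetical) leaves a record that `read_log` can later turn back into an ordered list of what was done, which is a good skeleton for the written instructions you deposit.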

Ultimately this is a very complex task, and honestly the more I work in research data management, and particularly the more I get involved with projects to resurrect legacy data, the more things I add to my mental list of “things that MUST BE documented”. “HAVE YOU DOCUMENTED THIS?!” is the most common question I ask my interns and students about their projects, and I would argue that whenever you take a new step in a project, that’s the question you need to ask yourself. And if you are unsure whether you have documented things in a replicable or understandable fashion, ask your peers: “If I gave you this, do you think you could reproduce my work?” If the answer isn’t a resounding yes (and they aren’t fibbing to make you feel better) then you need to have a rethink!

To add to the complications, this article has predominantly focused on code and data, but there is a whole range of different components that need to be considered depending on the project. But despair not! Thankfully there are lots of great resources out there to help with this sort of thing, covering a wide range of media and data types and explaining different types of digital preservation.

Samantha Pearman-Kanza
