Week 3 – File Format and Transformation

Overview

In this module you were introduced to the concepts of data file formats, compression, normalisation and other kinds of data transformation, and to why they are useful and important.

Key Concepts & Definitions

If you need to convert or migrate your data files from one format to another, be aware that data can be lost or corrupted in the process, and take appropriate steps to avoid or minimise that risk.
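One way to guard against loss during conversion is to read the converted file back and compare it with the original. The sketch below is a minimal illustration only: it assumes a small CSV dataset held in memory and converts it to JSON, and the function names and sample data are hypothetical.

```python
import csv
import io
import json


def csv_to_records(csv_text):
    """Parse CSV text into a list of dicts (every value stays a string)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def convert_and_verify(csv_text):
    """Convert CSV text to JSON text, then check the round trip is lossless."""
    original = csv_to_records(csv_text)
    # ensure_ascii=False keeps characters such as 'ë' readable in the output.
    json_text = json.dumps(original, ensure_ascii=False)
    restored = json.loads(json_text)
    if restored != original:
        raise ValueError("conversion changed the data")
    return json_text


# Hypothetical sample data, including a non-ASCII character.
sample = "id,name,height_cm\n1,Zoë,172.5\n2,Ahmed,180\n"
converted = convert_and_verify(sample)
print(converted)
```

A check like this only confirms that the values survived. Other things, such as formulas, formatting or embedded metadata in a richer source format, may still be lost and need to be reviewed separately.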

When you compress your data files for storage, transportation or transmission, you encode the information using fewer bits than the original representation. Commonly used compression programs include zip, GNU Zip (gzip, which produces .gz files, or .tar.gz files when combined with tar) and StuffIt.
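As a rough illustration of lossless compression, the following sketch uses the gzip module from Python's standard library to compress some data in memory, decompress it again, and confirm with a checksum that nothing changed. The sample data is made up and deliberately repetitive, so the size reduction here is much larger than you should expect from typical research data.

```python
import gzip
import hashlib


def sha256(data: bytes) -> str:
    """Return the SHA-256 checksum of a byte string as hex."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical, highly repetitive sample data (compresses very well).
original = ("sample_id,reading\n" + "A1,0.5\n" * 1000).encode("utf-8")

compressed = gzip.compress(original)
restored = gzip.decompress(compressed)

# Lossless compression must give back exactly the same bytes.
if sha256(restored) != sha256(original):
    raise ValueError("decompressed data does not match the original")

ratio = len(compressed) / len(original)
print(f"original: {len(original)} bytes, compressed: {len(compressed)} bytes")
print(f"compressed size is {ratio:.1%} of the original")
```

Recording a checksum of the original file alongside the compressed copy lets you repeat the same check later, after the file has been stored or transferred.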

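In your research you may also use data normalisation. Two meanings of the term may be relevant to you: statistical normalisation (adjusting values measured on different scales to a common scale) and database normalisation (organising the tables of a database to reduce redundant data).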
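To make the statistical sense concrete, the sketch below shows two common approaches, min-max scaling and z-score standardisation, applied to a made-up list of heights. These are illustrative choices rather than the only forms of statistical normalisation, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def min_max(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [(v - lo) / (hi - lo) for v in values]


def z_scores(values):
    """Express each value as a number of standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        raise ValueError("all values are equal; z-scores are undefined")
    return [(v - mu) / sigma for v in values]


# Hypothetical measurements in centimetres.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]

scaled = min_max(heights_cm)
standardised = z_scores(heights_cm)

print(scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
print([round(z, 3) for z in standardised])
```

Min-max scaling keeps values within a fixed range but is sensitive to outliers, while z-scores centre the data on zero, so the better choice depends on the analysis you plan to run.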

You may also need to compute new values from existing ones in your data. This process is called data transformation, and it may be a necessary prelude to analysing your data. Three techniques that can be classified as data transformation are aggregation (combining data into larger units), anonymisation (removing information that identifies human subjects) and perturbation (deliberately distorting the data).
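The three techniques can be sketched on a tiny, invented dataset as follows. This is an illustration under simple assumptions, not a recommended procedure: the records, field names and noise scale are all hypothetical, and removing names alone is rarely enough to anonymise real data about people.

```python
import random
from collections import defaultdict
from statistics import mean

# Hypothetical records about three human subjects.
records = [
    {"name": "Alice", "region": "North", "income": 42000},
    {"name": "Bob", "region": "North", "income": 38000},
    {"name": "Chandra", "region": "South", "income": 51000},
]


def anonymise(rows):
    """Drop the 'name' field and substitute an arbitrary subject number."""
    return [
        {"subject_id": i, "region": r["region"], "income": r["income"]}
        for i, r in enumerate(rows, start=1)
    ]


def perturb(rows, scale=500, seed=0):
    """Distort each income by random noise between -scale and +scale."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [
        {**r, "income": r["income"] + rng.uniform(-scale, scale)}
        for r in rows
    ]


def aggregate(rows):
    """Combine individual incomes into a mean income per region."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["region"]].append(r["income"])
    return {region: mean(values) for region, values in groups.items()}


anonymised = anonymise(records)
perturbed = perturb(anonymised)
by_region = aggregate(records)

print(anonymised)
print(perturbed)
print(by_region)
```

Each step trades detail for something else: aggregation and perturbation reduce the precision of individual values, so it is worth keeping the untransformed data securely if your analysis may need it later.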

Key take-away: If you are saving files in proprietary formats for the long term, consider creating a plain text or open format version to save along with them.

Additional Resources

Arms, C. R., Fleischhauer, C., Murray, K. (2013). ‘Sustainability Factors’ in Sustainability of Digital Formats: Planning for Library of Congress Collections. Library of Congress. Compilation. Last updated 20 March 2013. Retrieved from www.digitalpreservation.gov/formats/sustain/sustain.shtml

DBnormalization.com. Database Design Basics: Introduction. Last viewed 22 March 2016. Retrieved from http://www.dbnormalization.com/

Etzkorn, B. (6 November 2011). Data normalization and standardization. Retrieved from www.benetzkorn.com/2011/11/data-normalization-and-standardization

Francois, D. (29 January 2010). Data normalization for statistical analysis. Retrieved from www.damienfrancois.be/blog/pivot/entry.php?id=8

State Archives of North Carolina. (2012). File format guidelines for management and long-term retention of electronic records. Retrieved from http://archives.ncdcr.gov/Portals/3/PDF/guidelines/file_formats_in-house_preservation.pdf?ver=2016-03-11-084033-067