Structuring a Data Science Repository

A repository is simply a folder for your analysis work. The working copy of the repository should live on your COE computer. Code in the repository (R, Python, Jupyter notebooks, etc.) should be in version control and, if you have Stuart’s approval, should be synced with your remote repository on GitHub. Because the storage on your computer isn’t backed up, a copy should also be kept up-to-date on the shared drive.

This guide is based on the Cookiecutter Data Science structure created by the folks at @drivendata and adapted for COE projects. The guidelines and structure in this post should be followed for COE projects whenever possible but the principles should be helpful regardless of the project or exact implementation.

  • Make your analysis repeatable. Whenever possible, avoid analysis steps that cannot be repeated automatically. Extracting a data set from SQL, transforming it with a pivot table in Excel and then importing it into R to make your regression model may seem simple enough now, but good luck doing it again in three months or explaining the steps to someone else.

    When you have something you think is worth keeping, refactor the code you’ve written so that you can reproduce your work by running a script. If you need a data set for several purposes, keep the code to create the data set in src/data and save the result to data/processed where you can reference it as needed.

    If you use constants to filter data, such as a specific year or record type, define those constants once at the top of your script and use the constant throughout your analysis. It’s easy to rerun the analysis later for new values and you avoid magic constants floating around your code. A short sketch of this pattern follows this list.

  • Make your work easy to follow. Following a common structure for your analysis will make it easier for others to understand your work and will also help you stay organized. You’ll be reading through several reports from previous years over the course of your project, so consider those who will be following your work this year and next.

    We strongly recommend using plain text or markdown README files in folders where necessary to give instructions or communicate your intentions to others. You’ll need to keep them up to date, though: bad or out-of-date documentation can be more misleading than no documentation at all. Commit messages are also a great tool, but they are harder to find.

  • Never modify the original data. The original data source is sacred and should never be altered. Don’t edit the original files and preferably don’t even open them, especially in Excel (since Excel can change things like date formats in unexpected ways). Make a copy and open the copy so you can’t corrupt the original file. Even small changes such as renaming columns could create complications later if someone tries to repeat the analysis. Remember, data isn’t stored in version control, so anyone who clones your repository will need to copy the data from the shared drive and won’t have your local changes. This is an important step to ensure repeatability.

  • Keep sensitive data out of version control. Whether you are using Git version control locally or syncing with a remote repository on GitHub, you should keep all sensitive data local and uncommitted. This is most often extracts of data in CSV, JSON or binary formats (Excel, R Rds, Python pickle, etc.), but it could also be the password and connection string for a database. Use the env folder to store connection strings and other sensitive data in text files that you can load and reference in your code programmatically. If other people use the code, they can use their own text file with their password and connection string. A sketch of this pattern also follows this list.
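
As a rough sketch of the repeatability points above, a small script in src/data might define its filtering constants once at the top and write its output to data/processed. The file and column names below (records.csv, year, record_type) and the constant values are placeholders for illustration only; adapt them to your own project.

    # src/data/make_dataset.py -- a minimal sketch; paths and column names are
    # hypothetical and should be adapted to your own data.
    import pandas as pd

    # Define filtering constants once so the analysis is easy to rerun for new values.
    REPORT_YEAR = 2017
    RECORD_TYPE = "closed"

    ORIGINAL_PATH = "data/original/records.csv"
    PROCESSED_PATH = "data/processed/records_{}.csv".format(REPORT_YEAR)


    def make_dataset():
        """Build the processed data set from the original extract."""
        records = pd.read_csv(ORIGINAL_PATH)

        # Filter with the named constants rather than magic values scattered below.
        subset = records[
            (records["year"] == REPORT_YEAR) & (records["record_type"] == RECORD_TYPE)
        ]

        # Save the result so notebooks and other scripts can reference it as needed.
        subset.to_csv(PROCESSED_PATH, index=False)


    if __name__ == "__main__":
        make_dataset()

Running `python src/data/make_dataset.py` then recreates the processed data set from the original extract in a single, repeatable step.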
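Similarly, for keeping sensitive data out of version control, a script can load the connection string from a file in the env folder at run time. The file name env/database.json and its contents here are hypothetical; use whatever format suits your project, as long as the env folder is listed in .gitignore.

    # src/data/query_db.py -- a minimal sketch, assuming a hypothetical
    # env/database.json that holds the connection string and is never committed.
    import json

    import sqlalchemy


    def get_engine(config_path="env/database.json"):
        """Create a database engine from credentials stored outside version control."""
        with open(config_path) as f:
            config = json.load(f)  # e.g. {"connection_string": "postgresql://user:password@host/db"}
        return sqlalchemy.create_engine(config["connection_string"])

Anyone else who clones the repository creates their own env/database.json with their credentials and the code works unchanged.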

The file structure for the repository should resemble the hierarchy below. Brief descriptions of each folder and file are included, but do check the Cookiecutter Data Science site or ask Will Jenden if you want more information.

YourProjectName2017
├── README.md          <- A high-level README for anyone interested in your project.
├── requirements.txt   <- Requirements for reproducing the computing environment
|                         (e.g. generated with `pip freeze > requirements.txt`). 
|                         This is a must if you are using Python.
├── data*
│   ├── original       <- The original, immutable data dump.
│   ├── external       <- Data from third party sources (often open source).
│   ├── interim        <- Intermediate data sets that have been derived from original 
|   |                     or external data but aren't used directly.
│   └── processed      <- The final, canonical data sets for modeling or analysis.
├── docs               <- Documentation for any code that is part of your deliverables.
|                         Use a tool such as Sphinx or roxygen2 to generate 
|                         documentation from comments in your code so you don't have
|                         to maintain it manually.
├── env*               <- Location for virtualenvs, database info, etc.
├── models*            <- Saved models, model predictions, or model summaries
├── notebooks          <- Jupyter notebooks and experimental scripts.
├── reports*           <- Generated analysis in HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
├── src                <- Source code for use in this project.
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                   predictions
│   │   ├── train_model.py
│   │   └── predict.py
│   └── visualization  <- Scripts to create exploratory and results oriented figures
│       └── visualize.py
Note: directories marked with a star (e.g. data*) must be listed in .gitignore so that they are excluded from the repository.

Please note that this structure is not meant to replace the shared project folder structure (i.e. 001 Client Data, 002 References, etc.) but should be part of the analysis stored in 003 Work in Progress and later may be part of your deliverables to the COE and client.

Use judgement and adapt as needed

Your project may require a slightly different structure so go ahead and change the layout rather than try to force your work to fit this format. Just remember that any new folders should come with appropriate READMEs so that others will be able to understand your work.

If you love the idea of analysis as a directed acyclic graph (DAG), you may want to look at using Make. At a high level, Make lets you specify the dependencies between the steps of your analysis. When you change a source file (such as a data extract), you can rebuild the models or figures that depend on it and Make will rerun only the intermediate steps that are out of date. It requires some familiarity with the command line but may be worth the effort. For more information, check out this post written by Shaun Jackman and Jenny Bryan from STAT545 here at UBC or this explanation from data-vis superstar Mike Bostock.
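
As a minimal sketch of how this might look for the structure above (the targets and file names are hypothetical and reuse the placeholder script names from earlier), a Makefile lists each output alongside the files it depends on; running `make reports/figures/summary.png` then rebuilds only the steps whose inputs have changed.

    # Makefile -- a minimal sketch; targets and file names are hypothetical.
    # Recipe lines must start with a tab character.
    data/processed/records_2017.csv: src/data/make_dataset.py data/original/records.csv
    	python src/data/make_dataset.py

    reports/figures/summary.png: src/visualization/visualize.py data/processed/records_2017.csv
    	python src/visualization/visualize.py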
