{"id":1582,"date":"2017-04-20T09:31:21","date_gmt":"2017-04-20T16:31:21","guid":{"rendered":"https:\/\/blogs.ubc.ca\/coetoolbox\/?p=1582"},"modified":"2017-04-20T09:31:21","modified_gmt":"2017-04-20T16:31:21","slug":"structuring-a-data-science-repository","status":"publish","type":"post","link":"https:\/\/blogs.ubc.ca\/coetoolbox\/2017\/04\/20\/structuring-a-data-science-repository\/","title":{"rendered":"Structuring a Data Science Repository"},"content":{"rendered":"<p class=\"md-end-block md-heading\"><span class=\"md-line md-end-block\">A repository is simply a folder for your analysis work. The working copy of the repository should live on your COE computer. Code in the repository (R, Python, Jupyter notebooks, etc.) should be in version control and, if you have Stuart&#8217;s approval, should be synced with your remote repository on GitHub. Because the storage on your computer isn&#8217;t backed up, a copy should also be kept up to date on the shared drive.<\/span><\/p>\n<p><span class=\"md-line md-end-block\">This guide is based on the <a href=\"http:\/\/drivendata.github.io\/cookiecutter-data-science\/\">Cookiecutter Data Science<\/a> structure created by the folks at <a href=\"https:\/\/twitter.com\/drivendataorg\">@drivendata<\/a> and adapted for COE projects. The guidelines and structure in this post should be followed for COE projects whenever possible, but the principles should be helpful regardless of the project or exact implementation.<\/span><\/p>\n<ul class=\"ul-list\">\n<li><span class=\"md-line md-end-block\"> <strong>Make your analysis repeatable<\/strong>. <\/span><span class=\"md-line md-end-block\">Whenever possible, avoid analysis steps that cannot be repeated automatically. Extracting a data set from SQL, transforming it with a pivot table in Excel and then importing it into R to make your regression model may seem simple enough now, but good luck doing it again in three months or explaining the steps to someone else. 
<\/span>\n<p><span class=\"md-line md-end-block\">When you have something you think is worth keeping, refactor the code you&#8217;ve written so that you can reproduce your work by running a script. If you need a data set for several purposes, keep the code to create the data set in <code>src\/data<\/code> and save the result to <code>data\/processed<\/code> where you can reference it as needed. <\/span><\/p>\n<p><span class=\"md-line md-end-block\">If you use constants to filter data such as a specific year or record type, define those constants once at the top of your script and use the constant throughout your analysis. That makes it easy to rerun the analysis later with new values, and you avoid <a href=\"https:\/\/en.wikipedia.org\/wiki\/Magic_number_(programming)#Unnamed_numerical_constants\">magic constants<\/a> floating around your code.<\/span><\/li>\n<li><span class=\"md-line md-end-block\"><strong>Make your work easy to follow.<\/strong> <\/span><span class=\"md-line md-end-block\">Following a common structure for your analysis will make it easier for others to understand your work and will also help you stay organized. You&#8217;ll be reading through several reports from previous years over the course of your project, so consider those who will be following your work this year and next. <\/span>\n<p><span class=\"md-line md-end-block\">We strongly recommend using plain text or Markdown README files in folders where necessary to give instructions or communicate your intentions to others. You&#8217;ll need to keep them up to date though: bad or out-of-date documentation may be more misleading than none at all. Commit messages are also a great tool but are harder for readers to find.<\/span><\/li>\n<li><span class=\"md-line md-end-block\"><strong>Never modify the original data<\/strong>. <\/span><span class=\"md-line md-end-block\">The original data source is sacred and should never be altered. 
Don&#8217;t edit the original files and preferably don&#8217;t even open them, especially in Excel (since Excel can change things like date formats in unexpected ways). Make a copy and open the copy so you can&#8217;t corrupt the original file. Even small changes such as renaming columns could create complications later if someone tries to repeat the analysis. Remember, data isn&#8217;t stored in version control, so anyone who clones your repository will need to copy the data from the shared drive and won&#8217;t have your local changes. Leaving the original files untouched is an important step in ensuring repeatability.<\/span><\/li>\n<li><span class=\"md-line md-end-block\"><strong>Keep sensitive data out of version control.<\/strong> <\/span><span class=\"md-line md-end-block\">Whether you are using Git version control locally or syncing with a remote repository on GitHub, you should keep all sensitive data local and out of the repository. Sensitive data is most often a data extract in CSV, JSON or a binary format (Excel, R Rds, Python pickle, etc.) but it could also be the password and connection string for a database. Use the <code>env<\/code> folder to store connection strings and other sensitive data in text files that you can load and reference in your code programmatically. If other people use the code, they can use their own text file with their password and connection string.<\/span><\/li>\n<\/ul>\n<p><span class=\"md-line md-end-block\">The file structure for the repository should resemble the hierarchy below. Brief descriptions of each folder and file are included, but do check the Cookiecutter Data Science site or ask Will Jenden if you want more information.<\/span><\/p>\n<pre class=\"md-fences md-end-block\">YourProjectName2017\r\n\u251c\u2500\u2500 README.md \u00a0 \u00a0 \u00a0 \u00a0  &lt;- A high-level README for anyone interested in your project.\r\n\u251c\u2500\u2500 requirements.txt \u00a0 &lt;- Requirements for reproducing the computing environment\r\n|                         (e.g. 
generated with `pip freeze &gt; requirements.txt`). \r\n|                         This is a must if you are using Python.\r\n\u251c\u2500\u2500 data*\r\n\u2502 \u00a0 \u251c\u2500\u2500 original \u00a0 \u00a0 \u00a0 &lt;- The original, immutable data dump.\r\n\u2502 \u00a0 \u251c\u2500\u2500 external \u00a0 \u00a0 \u00a0 &lt;- Data from third-party sources (often open source).\r\n\u2502 \u00a0 \u251c\u2500\u2500 interim \u00a0 \u00a0 \u00a0  &lt;- Intermediate data sets that have been derived from original \r\n| \u00a0 |                     or external data but aren't used directly.\r\n\u2502 \u00a0 \u2514\u2500\u2500 processed \u00a0 \u00a0  &lt;- The final, canonical data sets for modeling or analysis.\r\n\u251c\u2500\u2500 docs \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 &lt;- Documentation for any code that is part of your deliverables.\r\n|                         Use a tool such as Sphinx or roxygen2 to generate \r\n|                         documentation from comments in your code so you don't have\r\n|                         to maintain it manually.\r\n\u251c\u2500\u2500 env* \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 &lt;- Location for virtualenvs, database info, etc.\r\n\u251c\u2500\u2500 models* \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  &lt;- Saved models, model predictions, or model summaries\r\n\u251c\u2500\u2500 notebooks \u00a0 \u00a0 \u00a0 \u00a0  &lt;- Jupyter notebooks and experimental scripts.\r\n\u251c\u2500\u2500 reports* \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 &lt;- Generated analysis in HTML, PDF, LaTeX, etc.\r\n\u2502 \u00a0 \u2514\u2500\u2500 figures \u00a0 \u00a0 \u00a0  &lt;- Generated graphics and figures to be used in reporting\r\n\u251c\u2500\u2500 src \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  &lt;- Source code for use in this project.\r\n\u2502 \u00a0 \u251c\u2500\u2500 data \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 &lt;- Scripts to download or generate data\r\n\u2502 \u00a0 \u2502 \u00a0 \u2514\u2500\u2500 make_dataset.py\r\n\u2502 
\u00a0 \u251c\u2500\u2500 features \u00a0 \u00a0 \u00a0 &lt;- Scripts to turn raw data into features for modeling\r\n\u2502 \u00a0 \u2502 \u00a0 \u2514\u2500\u2500 build_features.py\r\n\u2502 \u00a0 \u251c\u2500\u2500 models \u00a0 \u00a0 \u00a0 \u00a0 &lt;- Scripts to train models and then use trained models to make\r\n\u2502 \u00a0 \u2502 \u00a0 \u2502 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 predictions\r\n\u2502 \u00a0 \u2502 \u00a0 \u251c\u2500\u2500 train_model.py\r\n\u2502 \u00a0 \u2502 \u00a0 \u2514\u2500\u2500 predict.py\r\n\u2502 \u00a0 \u2514\u2500\u2500 visualization  &lt;- Scripts to create exploratory and results-oriented figures\r\n\u2502 \u00a0 \u00a0 \u00a0 \u2514\u2500\u2500 visualize.py\r\n\r\nNote: directories with a star (e.g. data*) must be listed in .gitignore so that they are excluded from the repository.<\/pre>\n<p><span class=\"md-line md-end-block\">Please note that this structure is not meant to replace the shared project folder structure (i.e. <em>001 Client Data<\/em>, <em>002 References<\/em>, etc.) but should be part of the analysis stored in <em>003 Work in Progress<\/em> and later may be part of your deliverables to the COE and client. <\/span><\/p>\n<p><span class=\"md-line md-end-block\"><strong>Use judgement and adapt as needed<\/strong> <\/span><\/p>\n<p><span class=\"md-line md-end-block\">Your project may require a slightly different structure, so go ahead and change the layout rather than trying to force your work to fit this format. Just remember that any new folders should come with appropriate READMEs so that others will be able to understand your work.<\/span><\/p>\n<p><span class=\"md-line md-end-block md-focus\">If you love the idea of analysis as a directed acyclic graph (DAG) then you may want to look at using Make. At a high level, Make allows you to specify dependencies in your analysis. 
When you make changes to a source file (such as a data extract) you can easily regenerate any models or figures that rely on that data without running each intermediate script by hand. It will require some familiarity with the command line but may be worth the effort. For more information, check out <a href=\"http:\/\/stat545.com\/automation00_index.html\">this post<\/a> written by Shaun Jackman and Jenny Bryan from STAT545 here at UBC or <a href=\"https:\/\/bost.ocks.org\/mike\/make\/\">this explanation<\/a><span class=\"md-expand\"> from data-vis superstar Mike Bostock.<\/span><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A repository is simply a folder for your analysis work. The working copy of the repository should live on your COE computer. Code in the repository (R, python, Jupyter notebooks, etc.) should be in version control and, if you have Stuart&#8217;s approval, should be synced with your remote repository on GitHub. Because the storage on [&hellip;]<\/p>\n","protected":false},"author":41749,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[1074139,1054146,1054148],"class_list":["post-1582","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-data-science","tag-industry-projects","tag-version-control"],"_links":{"self":[{"href":"https:\/\/blogs.ubc.ca\/coetoolbox\/wp-json\/wp\/v2\/posts\/1582","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.ubc.ca\/coetoolbox\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.ubc.ca\/coetoolbox\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.ubc.ca\/coetoolbox\/wp-json\/wp\/v2\/users\/41749"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.ubc.ca\/coetoolbox\/wp-json\/wp\/v2\/comments?post=1582"}],"version-history":[{"count":4,"href":"https:\/\/blogs.ubc.ca\/coetoolbox\/wp-json\/
wp\/v2\/posts\/1582\/revisions"}],"predecessor-version":[{"id":1592,"href":"https:\/\/blogs.ubc.ca\/coetoolbox\/wp-json\/wp\/v2\/posts\/1582\/revisions\/1592"}],"wp:attachment":[{"href":"https:\/\/blogs.ubc.ca\/coetoolbox\/wp-json\/wp\/v2\/media?parent=1582"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.ubc.ca\/coetoolbox\/wp-json\/wp\/v2\/categories?post=1582"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.ubc.ca\/coetoolbox\/wp-json\/wp\/v2\/tags?post=1582"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}