1  GitHub Mini Workshop

Author: Ryan McShane, Ph.D.
Affiliation: The University of Chicago
Published: Sep. 30th, 2024
Modified: Oct. 10th, 2024

1.0.1 Overview

  • Reproducible Research and Workflows
  • R Markdown, Quarto (and other notebooks)
  • RStudio and GitHub Desktop
  • Version Control Systems
  • Git[Hub] Terminology
  • GitHub-Specific Tools
    • Issues
    • Projects
    • Activity
    • Organizations and Classrooms, ghclass
    • Websites (e.g., GitHub.io)
  • Exercises

1.1 Reproducible Research and Workflows

1.1.1 Replication

  • The replication crisis: many scientific findings cannot be replicated.

  • Replication means “re-performing the experiment and collecting new data.” “In order for a scientific study to be replicated, however, the method of statistical analysis must be entirely reproducible,” where reproducing means “re-performing the same analysis with the same code using a different analyst.” [1]

  • In order for an analysis to be reproducible, a statistician would need to…

    • Have the original dataset/file.
    • Know which data cleaning procedures were used.
    • Choose the same subset of data.
    • Know what statistical test was applied (and in what style).
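The four requirements above can be sketched as a script in which every step is written down in code. This is a minimal illustration, not a real study: the toy data, the column names, and the function names are all invented here, and a simple difference in group means stands in for the statistical test.

```python
import csv
import io
import statistics

# Toy data invented for this sketch; a real analysis would read the original file.
RAW_CSV = """id,group,score
1,control,4.0
2,control,5.0
3,control,
4,treatment,7.0
5,treatment,8.0
6,pilot,9.0
"""

def load(text):
    """Step 1: read the original dataset."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Step 2: the cleaning rule, written down: drop rows with a missing score."""
    return [dict(r, score=float(r["score"])) for r in rows if r["score"] != ""]

def subset(rows):
    """Step 3: the subset rule, written down: keep only the two study arms."""
    return [r for r in rows if r["group"] in ("control", "treatment")]

def mean_difference(rows):
    """Step 4: the analysis (a stand-in for the test): treatment mean minus control mean."""
    by_group = {}
    for r in rows:
        by_group.setdefault(r["group"], []).append(r["score"])
    return statistics.mean(by_group["treatment"]) - statistics.mean(by_group["control"])

result = mean_difference(subset(clean(load(RAW_CSV))))
print(result)
```

Because each decision lives in a named function, a different analyst can rerun the same steps on the same file without guessing which rows were dropped or which groups were compared.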

1.1.2 Reproducible Workflows

  • A typical workflow has at least two components: [2]
    • A statistical software package for performing the data analysis, and
    • Software for presenting the results.

Quarto (the successor to R Markdown) combines these two components into one program! Therefore, no information is lost in moving from one component to the other.

1.1.3 Statisticians’ [Partial] Solutions to the Replication Crisis

  • Research transparency (methodology, data, process made available, using open source software)

  • The statistical methods should be described in sufficient detail to allow replication by an independent statistician given the same data set. [3]

  • Use R Markdown/Quarto to consolidate workflow and avoid “copy/paste” issues and loss of information.

  • Use version control systems to publish analyses, including multiple versions if needed. See, e.g., workflowr and its demo video.

1.1.4 RStudio with Git Project Overview

1.1.5 RStudio Window Components

Text Editor (e.g., Homework00.qmd)

  • Importantly, every .R, .Rmd, .qmd, .tex, .ipynb, etc. file is simply a plain text file.

Git

  • If you have installed Git on your machine and you have connected your RStudio project to a GitHub repository, this pane will be available and will allow you to use Git[Hub] directly within your RStudio window. This is where our version control can be performed.

Files

  • This is an overview of our file directory.

1.1.6 GitHub Desktop

1.1.7 GitHub Desktop

1.2 Version Control Systems

1.2.1 Homemade Versioning

You name your files:

  • Draft1.tex
  • Draft2.tex
  • Draft2b.tex
  • Draft3_final.tex
  • Draft3_final_final.tex

It’s possible that only you know what the difference is between these files, and future you might have a difficult time retracing steps.

1.2.2 “Track Changes” in MS Word / Google Docs

  • The server automatically keeps track of changes to the file.
  • Who approves which change, and when, is not particularly transparent, and older versions of files can be lost to history.
  • Works best with proprietary (binary) file formats (e.g., .docx, video), which can only be opened by programs that understand how to process their 0’s and 1’s.

1.2.3 Hard Drives “on the Cloud”

  • OneDrive, Google Drive, Box, Dropbox, etc.
  • Depending on the file format, these offer some capability to locate prior versions of files.
  • Your mileage may vary (YMMV):
    • Box charges for version history.
    • Dropbox gives a 30-day recovery window.

1.2.4 Version Control Systems (VCS)

  • Tells you who did what, who gets credit/blame, and who has responsibility/expertise to fix mishaps.
  • Facilitates rewinding multi-file software/data science projects to previous versions, even on a file-by-file basis.
  • Facilitates combining contributions from different people. When the contributions don’t conflict with each other, merging is automatic.
  • If something breaks, allows users to more easily identify where the breaking change occurred.

1.2.5 Git vs GitHub

  • Git is a no-frills, text-interface system that allows you to approve your changes and commit them to the record. Git does not automatically decide when changes should be saved (and committed); you have to do this yourself. Git allows multiple versions of the same file to exist simultaneously and has mechanisms to combine these versions when you are ready.

  • GitHub (founded 2008; acquired by Microsoft in 2018) puts web-based file browsing, editing, downloads, storage, typesetting, issues, projects, continuous integration, documentation, and social media all in one place. Its primary function is to host Git repositories on the cloud.

  • Other web-based Git hosting services exist (GitLab, Bitbucket, etc.), but GitHub is where most data science resides.

1.2.6 Git works best with human-readable [4] text file formats

  • Code – .R, .py, .sas, .c, .cpp, … (the initial primary use case for VCS)
  • Delimited data files – .csv and .tsv
  • Plain text – .txt
  • Typesetting – .tex, .bib, .bst
  • Markup languages – .md, .yml, .html, .xml
  • Interactive notebooks – .ipynb
  • Source code for reproducible workflows – .Rmd, .qmd

Many of these are part of a modern data science toolkit/workflow!

1.2.7 When plain text files are edited, diff view provides rapid transparency
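To see what a diff view reports, here is a minimal sketch using Python's standard `difflib` module. The two "versions" of Homework00.qmd are invented for illustration; Git and GitHub Desktop compute and display their diffs with their own tooling, but the line-by-line idea is the same.

```python
import difflib

# Two hypothetical versions of the same plain text file, one list item per line.
old = ["# Homework 00\n", "x <- c(1, 2, 3)\n", "mean(x)\n"]
new = ["# Homework 00\n", "x <- c(1, 2, 3, 4)\n", "mean(x)\n", "sd(x)\n"]

# unified_diff marks removed lines with "-", added lines with "+",
# and unchanged context lines with a leading space.
diff = list(
    difflib.unified_diff(
        old,
        new,
        fromfile="Homework00.qmd (before)",
        tofile="Homework00.qmd (after)",
    )
)
print("".join(diff))
```

Only the edited line and the new line are flagged, which is why plain text formats make changes easy to review. A binary format would offer no such line-level view.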

1.3 Git[Hub] Terminology

1.3.1 Ignoring Things

The .gitignore file tells Git what files to ignore.

# History files
.Rhistory

# Session Data files
.RData

# knitr and R markdown default cache directories
*_cache/
/cache/

# Temporary files created by R markdown
*.utf8.md
*.knit.md
*nul
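The sketch below illustrates how the patterns above are read. It is a deliberately simplified stand-in for Git's real matching, and the function name `is_ignored` is invented here. It covers only the three pattern shapes that appear in this file: a bare name (matches at any depth), `name/` (a directory at any depth), and `/name/` (a directory at the top of the repository). Negation (`!`), `**`, and other rules are not handled.

```python
import fnmatch
from pathlib import PurePosixPath

# The patterns from the .gitignore shown above (comments and blank lines dropped).
PATTERNS = [".Rhistory", ".RData", "*_cache/", "/cache/", "*.utf8.md", "*.knit.md", "*nul"]

def is_ignored(path, patterns=PATTERNS):
    """Simplified check of a repo-relative file path against a few .gitignore rules."""
    parts = PurePosixPath(path).parts
    dirs = parts[:-1]  # every component except the file name itself
    for pat in patterns:
        if pat.startswith("/") and pat.endswith("/"):
            # "/name/": only a directory at the top of the repository
            if dirs and fnmatch.fnmatchcase(dirs[0], pat.strip("/")):
                return True
        elif pat.endswith("/"):
            # "name/": a directory with a matching name at any depth
            if any(fnmatch.fnmatchcase(d, pat[:-1]) for d in dirs):
                return True
        elif any(fnmatch.fnmatchcase(p, pat) for p in parts):
            # bare name: a file or directory with a matching name at any depth
            return True
    return False

for example in [".Rhistory", "Homework00_cache/html/chunk.rdb", "docs/cache/tmp.rds", "Homework00.qmd"]:
    print(example, is_ignored(example))
```

To check a real repository, `git check-ignore -v <path>` reports which pattern, if any, causes a path to be ignored.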

1.3.2 Other Common Git[Hub] file types

Licensing

The LICENSE, LICENSE.md, or LICENSE.txt file is often used in a repository to indicate how the contents of the repo may be used by others.

Citation

Add a CITATION file to a repository to explain how you want your work cited.

README.md

A page that GitHub renders from GitHub-Flavored Markdown when the repository is viewed on GitHub.com.

1.3.3 Key Verbs in Git[Hub]

  • fork copies a remote repository owned by someone else to a remote repository you control.
  • clone copies a remote repository to create a local repository.
  • branch creates a sequence of version history separate from the main branch. This can be merged or deleted later.
  • commit is the user’s decision to record changes to the repository. This should include a short summary (a log message) of what changes have been made, with space for more details in the description section.
  • fetch [origin] retrieves the set of commits on the remote repository that the local repository does not yet have (if these exist).
  • pull copies changes from a different/remote repository to your local/current repository.
  • push copies changes from your local/current repository to a different/remote repository.
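Several of the verbs above can be tried without a GitHub account. The sketch below drives the `git` command line from Python inside a throwaway folder. The collaborators "alice" and "bob", the file names, and the helper names are invented for illustration, and a local bare repository stands in for the GitHub remote. `fork` is a GitHub feature and is not shown. The function returns `None` if Git is not installed.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def git(*args, cwd):
    """Run one git command and return its trimmed output; raise if git reports an error."""
    cmd = ["git", "-c", "user.name=Demo User", "-c", "user.email=demo@example.com", *args]
    return subprocess.run(cmd, cwd=cwd, check=True, capture_output=True, text=True).stdout.strip()

def demo_verbs():
    """Walk through clone, commit, push, fetch, and pull in temporary folders."""
    if shutil.which("git") is None:
        return None  # Git is not installed, so there is nothing to demonstrate
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        alice, bob = root / "alice", root / "bob"

        # A bare repository plays the role of the remote (e.g., a GitHub repo).
        git("init", "--bare", "remote.git", cwd=root)
        git("symbolic-ref", "HEAD", "refs/heads/main", cwd=root / "remote.git")

        # clone: copy the remote repository to create a local repository
        git("clone", "remote.git", "alice", cwd=root)
        git("symbolic-ref", "HEAD", "refs/heads/main", cwd=alice)  # name the branch main

        # commit: record a change, with a log message
        (alice / "README.md").write_text("# Demo\n")
        git("add", "README.md", cwd=alice)
        git("commit", "-m", "Add README", cwd=alice)

        # push: copy local commits to the remote
        git("push", "-u", "origin", "HEAD", cwd=alice)

        # A second collaborator clones, commits, and pushes.
        git("clone", "remote.git", "bob", cwd=root)
        (bob / "analysis.R").write_text("mean(c(1, 2, 3))\n")
        git("add", "analysis.R", cwd=bob)
        git("commit", "-m", "Add analysis script", cwd=bob)
        git("push", "origin", "HEAD", cwd=bob)

        # fetch: learn about new remote commits without touching the working files
        git("fetch", "origin", cwd=alice)
        behind = int(git("rev-list", "--count", "HEAD..origin/main", cwd=alice))

        # pull: bring those commits into the local branch
        git("pull", "--ff-only", "origin", "main", cwd=alice)
        return behind, (alice / "analysis.R").exists()

print(demo_verbs())
```

After the fetch, alice's repository knows it is one commit behind the remote, but analysis.R only appears in her folder once she pulls.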

1.3.4 A Longer Introduction to Git

https://swcarpentry.github.io/git-novice/aio.html

1.4 GitHub-Specific Tools

1.4.1 Issues

A problem emerges…

1.4.2 Issues

1.4.3 Projects

1.4.4 Activity

1.4.5 Organizations and Classrooms, ghclass

1.4.6 Git and GitHub as learning objectives

1.4.7 Build a Website Hosted by GitHub (or Netlify, etc)

1.4.8 Exercises

Exercise 1: Create a new file on GitHub.com.

We will not take this approach again, but this may be useful for you in the future.

Exercise 2: Create a conflict

GoodBranch + EvilBranch -> Commits -> Conflict Resolution

Hopefully, you don’t need to resolve a lot of conflicts. But this is how you can do it on GitHub.com!
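The same GoodBranch + EvilBranch conflict can be reproduced locally. The sketch below is a rough command-line analogue of the exercise, not the GitHub.com workflow itself: the file name, its contents, and the helper names are invented for illustration. It drives `git` from Python in a throwaway folder and returns `None` if Git is not installed.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def git(*args, cwd, check=True):
    """Run one git command; with check=False a failing command does not raise."""
    cmd = ["git", "-c", "user.name=Demo User", "-c", "user.email=demo@example.com", *args]
    return subprocess.run(cmd, cwd=cwd, check=check, capture_output=True, text=True)

def demo_conflict():
    """Create two branches that edit the same line, merge both, and resolve the conflict."""
    if shutil.which("git") is None:
        return None  # Git is not installed, so there is nothing to demonstrate
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        note = repo / "note.txt"
        git("init", cwd=repo)
        git("symbolic-ref", "HEAD", "refs/heads/main", cwd=repo)  # name the branch main
        note.write_text("The answer is unknown.\n")
        git("add", "note.txt", cwd=repo)
        git("commit", "-m", "Start note", cwd=repo)

        # Each branch starts from main and changes the same line differently.
        edits = [("GoodBranch", "The answer is 42.\n"), ("EvilBranch", "The answer is 13.\n")]
        for branch, text in edits:
            git("checkout", "-b", branch, "main", cwd=repo)
            note.write_text(text)
            git("commit", "-am", f"Edit note on {branch}", cwd=repo)

        git("checkout", "main", cwd=repo)
        git("merge", "GoodBranch", cwd=repo)                       # merges cleanly
        merge = git("merge", "EvilBranch", cwd=repo, check=False)  # stops with a conflict
        conflicted = note.read_text()                              # now holds conflict markers

        # Resolution: a person chooses the final text, then stages and commits it.
        note.write_text("The answer is 42.\n")
        git("add", "note.txt", cwd=repo)
        git("commit", "-m", "Resolve conflict in favor of GoodBranch", cwd=repo)
        clean = git("status", "--porcelain", cwd=repo).stdout == ""
        return merge.returncode != 0, "<<<<<<<" in conflicted, clean

print(demo_conflict())
```

Git merges GoodBranch automatically, but stops on EvilBranch because both branches changed the same line. It marks the competing versions in the file and waits for a person to decide.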

Exercise 3: Take this offline

This is how you will do most of your work from here on out!


  1. Stevens, J. R. (2017). Replicability and reproducibility in comparative psychology. Frontiers in Psychology, 8, 862. DOI: 10.3389/fpsyg.2017.00862.

  2. Baumer, B., Çetinkaya-Rundel, M., Bray, A., Loi, L., and Horton, N. J. (2014). R Markdown: Integrating a reproducible analysis tool into introductory statistics. arXiv preprint arXiv:1402.1894.

  3. Vickers, A. J., and Sjoberg, D. D. (2015). Guidelines for reporting of statistics in European Urology. Eur Urol, 67(2), 181-187.

  4. Human-readability varies by format!