knitr::include_graphics("images/RStudio2024.png")
1 GitHub Mini Workshop
1.0.1 Overview
- Reproducible Research and Workflows
- R Markdown, Quarto (and other notebooks)
- RStudio and GitHub Desktop
- Version Control Systems
- Git[Hub] Terminology
- GitHub-Specific Tools
- Issues
- Projects
- Activity
- Organizations and Classrooms, ghclass
- Websites (e.g., GitHub.io)
- Exercises
1.1 Reproducible Research and Workflows
1.1.1 Replication
The replication crisis: many scientific findings cannot be replicated.
Replication means “re-performing the experiment and collecting new data.” “In order for a scientific study to be replicated, however, the method of statistical analysis must be entirely reproducible.” Reproducibility, in turn, means “re-performing the same analysis with the same code using a different analyst.” [1]
In order for an analysis to be reproducible, a statistician would need to…
- Have the original dataset/file.
- Know which data cleaning procedures were used.
- Choose the same subset of data.
- Know what statistical test was applied (and in what style).
1.1.2 Reproducible Workflows
- A typical workflow has at least two components [2]:
- A statistical software package for performing the data analysis, and
- Software for presenting the results.
Quarto (like its predecessor, R Markdown) combines these two components into one program! Therefore, no information is lost in moving from one component to the other.
1.1.3 Statisticians’ [Partial] Solutions to the Replication Crisis
- Research transparency (methodology, data, and process made available, using open source software).
- The statistical methods should be described in sufficient detail to allow replication by an independent statistician given the same data set. [3]
- Use R Markdown/Quarto to consolidate the workflow and avoid “copy/paste” issues and loss of information.
- Use version control systems to publish analyses, including multiple versions if needed. See, e.g., workflowr, with demo video.
1.1.4 RStudio with Git Project Overview
1.1.5 RStudio Window Components
Text Editor (e.g., Homework00.qmd)
- Importantly, every .R, .Rmd, .qmd, .tex, .ipynb, etc. file is simply a plain text file.
Git
- If you have installed Git on your machine AND you have connected your RStudio project to a GitHub repository, this pane will be available and will allow you to use Git[Hub] directly within your RStudio window. This is where our version control can be performed.
Files
- This is an overview of our file directory
1.1.6 GitHub Desktop
knitr::include_graphics("images/GitHubDesktopViewer.png")
1.1.7 GitHub Desktop
knitr::include_graphics("images/GitHubDesktopHistory.png")
1.2 Version Control Systems
1.2.1 Homemade Versioning
You name your files:
Draft1.tex
Draft2.tex
Draft2b.tex
Draft3_final.tex
Draft3_final_final.tex
- …
It’s possible that only you know what the difference is between these files, and future you might have a difficult time retracing steps.
1.2.2 “Track Changes” in MS Word / Google Doc
- The server automatically keeps track of changes to the file.
- Who approves which change, and when, is not particularly transparent, and older versions of files can be lost to history.
- Works best with proprietary (binary) file formats (e.g., .docx, video), where files are stored in binary and can only be opened by programs that understand how to process these 0’s and 1’s.
1.2.3 Hard Drives “on the Cloud”
- OneDrive, Google Drive, Box, Dropbox, etc.
- Depending on the file format, these offer some capability to locate prior versions of files
- YMMV
- Box charges for version history
- Dropbox gives a 30-day recovery window
1.2.4 Version Control Systems (VCS)
- Tells you who did what, who gets credit/blame, and who has responsibility/expertise to fix mishaps.
- Facilitates rewinding multi-file software/data science projects to previous versions, even on a file-by-file basis.
- Facilitates combining contributions from different people. When the contributions don’t conflict with each other, merging is automatic.
- If something breaks, allows users to more easily identify where the breaking change occurred.
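As a minimal sketch of the "rewinding" point above, the commands below build a throwaway repository, record two versions of one file, and then restore only that file to its earlier version. The file name `analysis.R`, its contents, and the demo identity are hypothetical placeholders chosen for illustration.

```shell
# Work in a throwaway directory so nothing real is touched.
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Workshop Demo"       # placeholder identity (assumption)
git config user.email "demo@example.com"   # placeholder identity (assumption)

# Version 1 of a hypothetical analysis script.
echo 'x <- c(1, 2, 3)' > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"

# Version 2 introduces a mistake.
echo 'x <- c(1, 2, 33)' > analysis.R
git commit -q -am "Update data vector"

# The log records what changed, in which order.
git log --oneline -- analysis.R

# Rewind just this one file to the version before the latest commit.
git checkout -q HEAD~1 -- analysis.R
restored=$(cat analysis.R)
echo "$restored"
```

The rest of the repository is left at the latest commit; only `analysis.R` is taken back one version, which is the file-by-file rewinding described in the list.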
1.2.5 Git vs GitHub
Git is a no-frills text interface system that allows you to approve your changes and commit them to the record. Git does not automatically decide when changes should be saved (and committed); you have to do this yourself. Git allows multiple versions of the same file to exist simultaneously and has mechanisms to combine these files when you are ready.
GitHub, a startup founded in 2008 and acquired by Microsoft in 2018, puts web-based file browsing, editing, downloads, storage, typesetting, issues, projects, continuous integration, documentation, and social media all in one place. Its primary function is to host Git repositories on the cloud.
Other web-based Git hosting services exist (GitLab, Bitbucket, etc.), but GitHub is where most data science work resides.
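To illustrate that Git records nothing until you tell it to, here is a small sketch that checks the repository status before staging, after staging, and after committing. The file `report.txt` and the demo identity are hypothetical.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Workshop Demo"       # placeholder identity (assumption)
git config user.email "demo@example.com"   # placeholder identity (assumption)

echo 'Draft of the report' > report.txt
before=$(git status --porcelain)   # "?? report.txt": Git sees the file but does not track it

git add report.txt                 # approve (stage) the change
staged=$(git status --porcelain)   # "A  report.txt": staged, but not yet committed

git commit -q -m "Add first draft of report"
after=$(git status --porcelain)    # empty: the working tree matches the last commit

printf 'before: %s\nstaged: %s\nafter:  %s\n' "$before" "$staged" "$after"
```

Saving the file on disk changed nothing in the history; only the explicit `git add` and `git commit` steps did.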
1.2.6 Git works best with human-readable [4] text file formats
- Code – .R, .py, .sas, .c, .cpp, … (the initial primary use case for VCS)
- Delimited data files – .csv and .tsv
- Plain text – .txt
- Typesetting – .tex, .bib, .bst
- Markup languages – .md, .yml, .html, .xml
- Interactive notebooks – .ipynb
- Source code for reproducible workflows – .rmd, .qmd
Many of these are part of a modern data science toolkit/workflow!
1.2.7 When plain text files are edited, diff view provides rapid transparency
knitr::include_graphics("images/GitHubDesktopHistory.png")
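The same diff view is available from the command line. The sketch below, using a hypothetical three-line file `notes.txt`, commits the file, edits one line, and asks Git what changed.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Workshop Demo"       # placeholder identity (assumption)
git config user.email "demo@example.com"   # placeholder identity (assumption)

# Commit a three-line plain text file.
printf 'alpha\nbeta\ngamma\n' > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# Edit only the middle line.
printf 'alpha\nBETA\ngamma\n' > notes.txt

# Removed lines are prefixed with "-", added lines with "+".
changes=$(git diff --no-color -- notes.txt)
echo "$changes"
```

Because the file is plain text, the diff pinpoints the single edited line; for a binary format Git can generally only report that the file differs.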
1.3 Git[Hub] Terminology
1.3.1 Ignoring Things
The .gitignore file tells Git which files to ignore.
# History files
.Rhistory
# Session Data files
.RData
# knitr and R markdown default cache directories
*_cache/
/cache/
# Temporary files created by R markdown
*.utf8.md
*.knit.md
*nul
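One way to check what a .gitignore actually does is `git check-ignore`, which prints the paths that match an ignore pattern. The sketch below uses a shortened subset of the patterns listed above; the file and directory names are hypothetical.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q

# A shortened subset of the ignore patterns shown above.
printf '%s\n' '.Rhistory' '.RData' '*_cache/' '*.knit.md' > .gitignore

# Hypothetical project files: some should be ignored, one should not.
touch .Rhistory report.knit.md analysis.R
mkdir report_cache
touch report_cache/chunk.rds

# Prints only the paths that match an ignore pattern.
ignored=$(git check-ignore .Rhistory report.knit.md analysis.R report_cache/chunk.rds)
echo "$ignored"
```

Here `analysis.R` should be absent from the output, since no pattern matches it, while the history file, the temporary knitr file, and everything under the cache directory are listed.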
1.3.2 Other Common Git[Hub] file types
Licensing
The LICENSE, LICENSE.md, or LICENSE.txt file is often used in a repository to indicate how the contents of the repo may be used by others.
Citation
Add a CITATION file to a repository to explain how you want your work cited.
readme.md
A page which is rendered in GitHub-flavored markdown when viewed on GitHub.com.
1.3.3 Key Verbs in Git[Hub]
- fork copies a remote repository owned by someone else to a remote repository you control.
- clone copies a remote repository to create a local repository.
- branch creates a sequence of version history separate from the main branch. This can be merged or deleted later.
- commit is the user's decision to record changes to the repository. This should include a basic description (a log message) of what changes have been made, with space for more details in the description section.
- fetch [origin] retrieves the set of commits on the remote repository that the local repository does not yet have (if these exist).
- pull copies changes from a different/remote repository to your local/current repository.
- push copies changes from your local/current repository to a different/remote repository.
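Most of these verbs can be tried without a network connection. The sketch below is one possible walk-through under stated assumptions: a local bare repository (`remote.git`) stands in for a GitHub-hosted remote, and the directory names `mine` and `theirs`, the branch name `expand-readme`, and the demo identity are all hypothetical. Forking happens on the hosting service rather than in Git itself, so it is not shown.

```shell
work=$(mktemp -d)
cd "$work"

# A bare repository stands in for the remote (e.g., one hosted on GitHub).
git init -q --bare remote.git

# clone: copy the remote repository to create a local repository.
git clone -q remote.git mine 2>/dev/null   # hides the "empty repository" warning
cd mine
git config user.name "Workshop Demo"       # placeholder identity (assumption)
git config user.email "demo@example.com"   # placeholder identity (assumption)

# commit: record a change with a log message.
echo '# Mini Workshop' > README.md
git add README.md
git commit -q -m "Add README"
git branch -M main                         # make sure the branch is named main

# push: copy local commits to the remote.
git push -q -u origin main

# A collaborator clones the remote at this point.
cd "$work"
git clone -q -b main remote.git theirs

# branch: a separate line of history, merged back into main later.
cd "$work/mine"
git checkout -q -b expand-readme
echo 'Notes on Git verbs' >> README.md
git commit -q -am "Expand README"
git checkout -q main
git merge -q expand-readme
git push -q origin main

# fetch: retrieve commits the collaborator's copy does not yet have.
cd "$work/theirs"
git fetch -q origin
behind=$(git rev-list --count HEAD..origin/main)
echo "commits behind the remote after fetch: $behind"

# pull: bring those commits into the collaborator's current branch.
git pull -q --ff-only
```

Note that `fetch` only updates the collaborator's record of the remote; their files change at the `pull` step.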
1.3.4 A Longer Introduction to Git
1.4 GitHub-Specific Tools
1.4.1 Issues
A problem emerges…
knitr::include_graphics("images/IntegralProblem.png")
knitr::include_graphics("images/ProblemIdentification.png")
1.4.2 Issues
knitr::include_graphics("images/GitHubIssue.png")
knitr::include_graphics("images/GitHubCommitDiff.png")
1.4.3 Projects
knitr::include_graphics("images/GitHubProject.png")
1.4.4 Activity
knitr::include_graphics("images/BryanActivity.png")
knitr::include_graphics("images/HadleyActivity.png")
knitr::include_graphics("images/McShaneActivity.png")
1.4.5 Organizations and Classrooms, ghclass
knitr::include_graphics("images/GitHubClassroom.png")
1.4.6 Git and GitHub as learning objectives
knitr::include_graphics("images/VersionControlLearning.png")
1.4.7 Build a Website Hosted by GitHub (or Netlify, etc)
- Hugo Academic Theme for personal resume (e.g., ryanmcshane.com)
- Blogdown for blogs built with R Markdown
- Quarto - Hugo for websites built with Quarto
- Bookdown for books published online. E.g.:
- Quarto websites
1.4.8 Exercises
Exercise 1: Create a new file on GitHub.com.
We will not take this approach again, but this may be useful for you in the future.
Exercise 2: Create a conflict
GoodBranch + EvilBranch -> Commits -> Conflict Resolution
Hopefully, you don’t need to resolve a lot of conflicts. But, this is how you can do it on GitHub.com!
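For comparison, the same kind of conflict can be produced and resolved from the command line. This is only a sketch of one possible flow, not the GitHub.com procedure from the exercise: the file `answer.txt`, its contents, and the demo identity are hypothetical, while the branch names follow the exercise.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Workshop Demo"       # placeholder identity (assumption)
git config user.email "demo@example.com"   # placeholder identity (assumption)

echo 'The answer is 42.' > answer.txt
git add answer.txt
git commit -q -m "Add answer"
base=$(git symbolic-ref --short HEAD)      # initial branch name (depends on Git settings)

# Each branch edits the same line in a different way.
git checkout -q -b GoodBranch
echo 'The answer is 42, as derived in class.' > answer.txt
git commit -q -am "Clarify the answer"

git checkout -q "$base"
git checkout -q -b EvilBranch
echo 'The answer is 41.' > answer.txt
git commit -q -am "Change the answer"

# The first merge is clean; the second stops because both branches changed the same line.
git checkout -q "$base"
git merge -q GoodBranch
git merge EvilBranch > /dev/null || echo "Merge stopped: conflict detected"
conflicted=$(git diff --name-only --diff-filter=U)
echo "conflicted file: $conflicted"

# Resolve by writing the content to keep, then stage and commit the result.
echo 'The answer is 42, as derived in class.' > answer.txt
git add answer.txt
git commit -q -m "Resolve conflict between GoodBranch and EvilBranch"
```

In practice you would open the conflicted file, look at the conflict markers Git inserted, and edit by hand; overwriting the file here just keeps the sketch short.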
Exercise 3: Take this offline
This is how you will do most of your work from here on out!
[1] Stevens, J. R. (2017). Replicability and reproducibility in comparative psychology. Frontiers in Psychology, 8, 862. DOI: 10.3389/fpsyg.2017.00862.
[2] Baumer, B., Cetinkaya-Rundel, M., Bray, A., Loi, L., and Horton, N. J. (2014). R Markdown: Integrating a reproducible analysis tool into introductory statistics. arXiv preprint arXiv:1402.1894.
[3] Vickers, A. J., and Sjoberg, D. D. (2015). Guidelines for reporting of statistics in European Urology. Eur Urol, 67(2), 181-187.
[4] Human-readability varies by format!