In this vignette, we discuss how to use R Markdown to explore data and write reproducible R code. My hope is that you will use .Rmd files as the first step (data exploration) in the “standard” workflow of this project. To really understand the “bones” of this document, it is recommended that you look through the .Rmd file on GitHub that was used to generate this .html document. This will help you understand R Markdown syntax. The second vignette, “Documenting Functions and Data with roxygen2,” will describe how to properly document any data you process or functions you write in the dspgWork package.

What are .Rmd files?

.Rmd (R Markdown) files are a type of file that incorporates both code and a markup language called Markdown (clever name, I know). R Markdown files are to R as IPython Notebooks are to Python. That is to say, we can incorporate text (like what you’re reading right now - note that \(\LaTeX\) is also supported) with executable R code (like what you’ll see below) in the same document. This makes it easier to keep track of insights you have or problems you experience while coding (rather than cramming everything into R comments, which have less flexibility than Markdown). One of the major benefits of using .Rmd files as opposed to a regular R script is that your work can be compiled into an .html, .pdf, or .docx file quite easily (click the “Knit” button in the top-left of your RStudio window).

Executable R code can be added to an .Rmd file by creating what’s called a “code chunk.” The syntax to create an R code chunk in an .Rmd file is shown below (note that the delimiters are backticks, not apostrophes). The RStudio shortcut for creating a code chunk is CTRL + ALT + I on Windows. Code within a code chunk can be executed by pressing CTRL + ENTER (which runs the current line or selection) or by pressing the green “play” button on the top-right of each chunk (which runs the whole chunk). RStudio will most often print the output of a code chunk directly below the chunk, as shown below. Note that, for the rest of the R code in this document, the backtick syntax will be suppressed (and you’ll only see the gray box & any output of the code chunk).

```{r}
1 + 1
```
#> [1] 2

The creator of R Markdown, Yihui Xie, has written multiple books on the package’s functionality (here’s one). He’s also prolific on his blog, where you can find many cool R Markdown and other “Xie-verse” package tricks. One of the most important R Markdown concepts that I have not touched on here is code chunk “options” - I would highly recommend reading about code chunk options here and looking through the .Rmd file used to generate this .html file for examples.
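As a rough illustration of what chunk options look like, the sketch below sets a few document-wide defaults with knitr. The particular option values are arbitrary choices for illustration, not the settings used to build this vignette.

```r
library(knitr)

# Set defaults for every chunk that follows in the document.
opts_chunk$set(
  echo = TRUE,      # show the source code in the output
  message = FALSE,  # hide messages (e.g., package start-up notes)
  warning = FALSE,  # hide warnings
  fig.width = 7,    # figure width in inches
  fig.height = 5    # figure height in inches
)

# An individual chunk can override a default in its header, e.g.
# {r, echo=FALSE} hides that chunk's code but still shows its output.
```

A call like this usually lives in a setup chunk near the top of the .Rmd file so that it applies to everything after it.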

Exploration with .Rmd files

What I anticipate will be the “standard” workflow for this DSPG project will often start with exploration of a data set. Suppose you’re interested in looking into the ACS table B18101, which corresponds to “Sex by Age by Disability Status.” The code below loads the tidyverse and tidycensus packages and downloads the table for Iowa counties.

library(tidyverse)
library(tidycensus)

disabilityStatus <- get_acs(geography = "county",
                            table = "B18101",
                            year = 2019,
                            state = "IA",
                            survey = "acs5",
                            geometry = TRUE,
                            cache_table = TRUE)
head(disabilityStatus)
#>   GEOID               NAME   variable estimate moe
#> 1 19001 Adair County, Iowa B18101_001     6961  52
#> 2 19001 Adair County, Iowa B18101_002     3479  35
#> 3 19001 Adair County, Iowa B18101_003      213  12
#> 4 19001 Adair County, Iowa B18101_004        0  14
#> 5 19001 Adair County, Iowa B18101_005      213  12
#> 6 19001 Adair County, Iowa B18101_006      562  14

The variable column is not human readable. Luckily, one of the data sets available in the dspgWork package contains all 2019 ACS 5-year variable names. These can be added to the disabilityStatus data frame using the dplyr::left_join function. We can then go on to further process and/or visualize this data set.

library(dspgWork) #we'll discuss how you can load this package on your local machine in the next vignette.
load("../../data_clean/dataClean_variableLabels_acs5_2019.rda")

head(dataClean_variableLabels_acs5_2019)
#> # A tibble: 6 x 3
#>   name       label                                   concept   
#>   <chr>      <chr>                                   <chr>     
#> 1 B01001_001 Estimate!!Total:                        SEX BY AGE
#> 2 B01001_002 Estimate!!Total:!!Male:                 SEX BY AGE
#> 3 B01001_003 Estimate!!Total:!!Male:!!Under 5 years  SEX BY AGE
#> 4 B01001_004 Estimate!!Total:!!Male:!!5 to 9 years   SEX BY AGE
#> 5 B01001_005 Estimate!!Total:!!Male:!!10 to 14 years SEX BY AGE
#> 6 B01001_006 Estimate!!Total:!!Male:!!15 to 17 years SEX BY AGE

disabilityStatus <- disabilityStatus %>%
  left_join(dataClean_variableLabels_acs5_2019,
            by = c("variable" = "name"))

head(disabilityStatus) #new `label` and `concept` columns
#>   GEOID               NAME   variable estimate moe
#> 1 19001 Adair County, Iowa B18101_001     6961  52
#> 2 19001 Adair County, Iowa B18101_002     3479  35
#> 3 19001 Adair County, Iowa B18101_003      213  12
#> 4 19001 Adair County, Iowa B18101_004        0  14
#> 5 19001 Adair County, Iowa B18101_005      213  12
#> 6 19001 Adair County, Iowa B18101_006      562  14
#>                                                        label
#> 1                                           Estimate!!Total:
#> 2                                    Estimate!!Total:!!Male:
#> 3                    Estimate!!Total:!!Male:!!Under 5 years:
#> 4 Estimate!!Total:!!Male:!!Under 5 years:!!With a disability
#> 5     Estimate!!Total:!!Male:!!Under 5 years:!!No disability
#> 6                    Estimate!!Total:!!Male:!!5 to 17 years:
#>                           concept
#> 1 SEX BY AGE BY DISABILITY STATUS
#> 2 SEX BY AGE BY DISABILITY STATUS
#> 3 SEX BY AGE BY DISABILITY STATUS
#> 4 SEX BY AGE BY DISABILITY STATUS
#> 5 SEX BY AGE BY DISABILITY STATUS
#> 6 SEX BY AGE BY DISABILITY STATUS
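One possible next processing step: the label strings pack a hierarchy into a single field, with levels separated by `!!`. The sketch below, which uses only base R and a few labels copied from the output above (not a live download), splits the labels and pulls out the most specific level. The object names are made up for illustration.

```r
# Toy labels copied from the printed output (no ACS download involved).
labels <- c(
  "Estimate!!Total:",
  "Estimate!!Total:!!Male:",
  "Estimate!!Total:!!Male:!!Under 5 years:",
  "Estimate!!Total:!!Male:!!Under 5 years:!!With a disability"
)

# Split each label on the literal "!!" separator.
labelParts <- strsplit(labels, "!!", fixed = TRUE)

# Drop trailing colons from every component.
cleanParts <- lapply(labelParts, function(parts) sub(":$", "", parts))

# Keep the most specific (last) component of each label.
lastPart <- vapply(cleanParts, function(parts) parts[length(parts)],
                   character(1))

# Number of components, i.e. how deep each label sits in the hierarchy.
labelDepth <- lengths(cleanParts)
```

Here `lastPart` holds "Total", "Male", "Under 5 years", and "With a disability", which is often a friendlier starting point for plot labels than the full string.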

Note that this sort of variable label joining is something that we will all likely be doing a lot while working on this project. Any process that you anticipate doing multiple times is ripe for a function. A possibly paraphrased, yet resonant quote goes something like:

If you need to copy a piece of code more than twice, write a function. If you need to copy a function more than twice, write a package.

With this in mind, let’s write a simple function that does the data loading and joining for us. Proper function writing etiquette dictates that our function and variable names should be expressive and specific before they are concise. This means having enough willpower to avoid the (sometimes overwhelming) temptation to call everything myfun, foo, x, etc. The name joinVariableLabels below is somewhat clunky, but it works for our current purposes (there is certainly an art to package, function, and variable naming - an art for which I have little acumen). Note that the argument name acs5Data is a subtle reminder to the user that they should be handing the function data from the ACS 5-year survey.

joinVariableLabels <- function(acs5Data){

  #load() places the object in this function's environment, not the global
  #environment, so it disappears once the function returns.
  load("../../data_clean/dataClean_variableLabels_acs5_2019.rda")

  acs5Data <- acs5Data %>%
    left_join(dataClean_variableLabels_acs5_2019,
              by = c("variable" = "name"))

  return(acs5Data)
}

To ensure that the function works as intended, let’s delete the dataClean_variableLabels_acs5_2019 object from the environment and make sure that the object returned by the joinVariableLabels function includes the label column.

#this is an example of a bad variable name
disabilityStatus2 <- get_acs(geography = "county",
                             table = "B18101",
                             year = 2019,
                             state = "IA",
                             survey = "acs5",
                             geometry = TRUE)
rm(dataClean_variableLabels_acs5_2019)

disabilityStatus2 <- joinVariableLabels(disabilityStatus2)

head(disabilityStatus2)
#>   GEOID               NAME   variable estimate moe
#> 1 19001 Adair County, Iowa B18101_001     6961  52
#> 2 19001 Adair County, Iowa B18101_002     3479  35
#> 3 19001 Adair County, Iowa B18101_003      213  12
#> 4 19001 Adair County, Iowa B18101_004        0  14
#> 5 19001 Adair County, Iowa B18101_005      213  12
#> 6 19001 Adair County, Iowa B18101_006      562  14
#>                                                        label
#> 1                                           Estimate!!Total:
#> 2                                    Estimate!!Total:!!Male:
#> 3                    Estimate!!Total:!!Male:!!Under 5 years:
#> 4 Estimate!!Total:!!Male:!!Under 5 years:!!With a disability
#> 5     Estimate!!Total:!!Male:!!Under 5 years:!!No disability
#> 6                    Estimate!!Total:!!Male:!!5 to 17 years:
#>                           concept
#> 1 SEX BY AGE BY DISABILITY STATUS
#> 2 SEX BY AGE BY DISABILITY STATUS
#> 3 SEX BY AGE BY DISABILITY STATUS
#> 4 SEX BY AGE BY DISABILITY STATUS
#> 5 SEX BY AGE BY DISABILITY STATUS
#> 6 SEX BY AGE BY DISABILITY STATUS

Success! We can see that, even though we had removed dataClean_variableLabels_acs5_2019 from the environment, the call to load() within joinVariableLabels loaded it again (into the function’s own environment, not the global one). The data set now contains a label column.
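For anyone curious about where load() puts things, here is a minimal base R sketch. It uses a made-up one-row data frame and a temporary file rather than the real label file, and the names toyLabels and loadInsideFunction are hypothetical.

```r
# Save a small, made-up object to a temporary .rda file, then remove it.
toyLabels <- data.frame(name = "B18101_001", label = "Estimate!!Total:")
toyFile <- tempfile(fileext = ".rda")
save(toyLabels, file = toyFile)
rm(toyLabels)

# load() places objects in the environment it is called from;
# inside a function, that is the function's own environment.
loadInsideFunction <- function(path) {
  load(path)
  exists("toyLabels", inherits = FALSE)  # is it visible inside the function?
}

loadedInside <- loadInsideFunction(toyFile)

# After the call, check whether anything leaked into the global environment.
loadedGlobally <- exists("toyLabels", envir = globalenv(), inherits = FALSE)
```

In a fresh session, `loadedInside` is TRUE and `loadedGlobally` is FALSE: the loaded object lives only for the duration of the function call.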

As an aside, another reason for making variable names expressive is to avoid so-called “masking.” Function masking occurs when you load a package that contains a function of the same name as a function in another package you’ve previously loaded. In effect, the most recently loaded package takes naming precedence (which may have unintended consequences if care is not taken). For example, I was inattentive when naming the plot_sf function in dspgWork, not considering the fact that there is already a plot_sf function in the sf package that has a lot more bells and whistles. Variable masking is similar: if you, say, assign x <- 1 in the global environment and later assign x <- 2 within a function, then the variable x will take on the value 2 within the function call and 1 outside of the function call. Using the superassignment operator, x <<- 2, within a function instead updates x in the enclosing environment (the global environment, for a function defined at the top level). Note that this is seen as a faux pas amongst many programmers who like to have control over their global environment variables (e.g., they may have stored the result of a time-intensive computation in a variable called x in their global environment, so your use of x <<- 2 would completely, and silently, overwrite their work). As such, its use is not recommended for functions that you’ll be sharing with others.
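The difference between ordinary and superassignment is easiest to see by running both. The sketch below is base R only; the function names are made up for illustration.

```r
x <- 1

# Ordinary assignment inside a function creates a local x;
# the outer x is left alone.
setLocally <- function() {
  x <- 2
  x
}
localValue <- setLocally()  # value of x as seen inside the function
afterLocal <- x             # outer x after the call

# `<<-` skips the local environment and assigns in the enclosing one.
setWithSuperAssign <- function() {
  x <<- 2
  invisible(NULL)
}
setWithSuperAssign()
afterSuper <- x             # outer x after the call
```

Here `localValue` is 2 while `afterLocal` is still 1, and `afterSuper` is 2 because the second function silently overwrote the outer x. For function masking, writing the package name explicitly (e.g., sf::plot_sf) removes any ambiguity about which function is being called.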

Function writing is a constantly evolving, iterative process. You will oftentimes think of ways to abstract the usage of a function or make it more “robust” to user misuse. Here are some ways that we could improve the usage of this function:

  • Say that we eventually pull down variable tables from surveys other than the 2019 ACS 5-year survey. We could add a labelFile argument that accepts a string corresponding to a variable/label data set in the dspgWork package.

  • Suppose that the variable code column is named something different (other than variable). We could add a variableColumn argument that allows the user to specify the column name (another, perhaps easier way around this issue would be to ensure that the data set contains a column called variable before going into the function).

  • Add in some tests to ensure that the function can actually do what it is being asked to do (e.g., ensure that the requested data file actually exists locally).
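A rough sketch of how these three ideas might fit together is shown below. It is not the dspgWork implementation: the function name joinVariableLabelsFlexible is hypothetical, labelFile is assumed here to be a path to an .rda file holding a single label table with a name column, and the data are made-up rows written to a temporary file.

```r
library(dplyr)

# Hypothetical generalization of joinVariableLabels (illustration only).
joinVariableLabelsFlexible <- function(acs5Data, labelFile,
                                       variableColumn = "variable") {
  # Fail early, with readable messages, if the inputs are not usable.
  if (!file.exists(labelFile)) {
    stop("Label file not found: ", labelFile)
  }
  if (!variableColumn %in% names(acs5Data)) {
    stop("Column '", variableColumn, "' not found in acs5Data.")
  }

  # Load into a scratch environment so the object name need not be known.
  labelEnv <- new.env()
  loadedNames <- load(labelFile, envir = labelEnv)
  labelData <- get(loadedNames[1], envir = labelEnv)

  # setNames() builds the named vector c(<variableColumn> = "name") for `by`.
  left_join(acs5Data, labelData, by = setNames("name", variableColumn))
}

# Made-up rows standing in for ACS data and the label table.
toyAcs <- data.frame(code = c("B18101_001", "B18101_002"),
                     estimate = c(6961, 3479))
toyLabels <- data.frame(name = c("B18101_001", "B18101_002"),
                        label = c("Estimate!!Total:",
                                  "Estimate!!Total:!!Male:"))
toyFile <- tempfile(fileext = ".rda")
save(toyLabels, file = toyFile)

joined <- joinVariableLabelsFlexible(toyAcs, toyFile, variableColumn = "code")
```

Calling the function without variableColumn = "code" on this toy data stops with the “not found in acs5Data” message, which is the kind of early, explicit failure the third bullet has in mind.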

It is rare that a function is ever “done” being written - there are almost always tweaks that one could make to generalize the function usage or make it more robust. For my own sanity, I find it helpful to write a function until it is both general and robust enough for its immediate purpose, yet remain open to revising the function if I think of a new need for it. Nothing is stopping you from taking a previously written function and incorporating it into a new function. For example, rather than using get_acs and joinVariableLabels in succession, we could write a new function that wraps these two together into a single call. Such is the art (and madness) of function writing.
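To make the wrapping idea concrete without needing a Census API key, the sketch below composes two stand-in functions. Everything here is hypothetical: fetchToyAcs stands in for the download step (no network call is made), joinToyLabels stands in for joinVariableLabels, and the rows are made up.

```r
# Stand-in for the download step; returns made-up rows.
fetchToyAcs <- function(table) {
  data.frame(variable = paste0(table, c("_001", "_002")),
             estimate = c(6961, 3479))
}

# Stand-in for the label-joining step, using base R's match().
joinToyLabels <- function(acs5Data) {
  toyLabels <- data.frame(name = c("B18101_001", "B18101_002"),
                          label = c("Estimate!!Total:",
                                    "Estimate!!Total:!!Male:"))
  acs5Data$label <- toyLabels$label[match(acs5Data$variable, toyLabels$name)]
  acs5Data
}

# The wrapper: one call that downloads and then labels.
getLabeledToyAcs <- function(table) {
  joinToyLabels(fetchToyAcs(table))
}

labeled <- getLabeledToyAcs("B18101")
```

The wrapper itself is a single line, which is typical: most of the work stays in the smaller functions, and those can still be tested and reused on their own.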

In the next vignette, we discuss how you can properly save and document data sets and functions in the dspgWork package.