2. Documenting Functions and Data in Packages • dspgWork

This vignette is a continuation of the first vignette, “Intro to R Markdown.” It discusses proper function and data documentation in the dspgWork package. Refer to Hadley Wickham’s book R Packages for more information. To really understand the “bones” of this document, it is recommended that you look through the .Rmd file on GitHub that was used to generate this .html document. This will help you understand R Markdown syntax.

Intro to roxygen2

Read this vignette to get an introduction to roxygen2. We will use roxygen2 to maintain some structure in documenting functions and data. A number of other packages work well with roxygen2 documentation. For example, pkgdown compiles roxygen2 documentation to .html and displays it on a cool, automatically-generated website (see https://dspg2021.github.io/work/reference/index.html).

When documenting a function with roxygen2, you start each line of documentation with a special #' comment (i.e., a pound sign followed by an apostrophe, no space). To indicate special fields such as arguments, author, etc., you can use “tags.” Tags all begin with the @ symbol (e.g., @title). A discussion of important tags can be found here.

An example of using roxygen2-style documentation for a function is shown below. Note that the final tag, @export, indicates that we want this function to be “exposed” to the user for calling. The list of a package’s exported functions can be viewed by typing packageName:: in the console (e.g., dplyr::mutate()). We may not want all functions to be easily accessible to the user; for example, mini “helper” functions that are created in service of another, larger function. Removing the @export tag will remove the function from the exported list the next time the documentation is regenerated. To access both exported and unexported functions in a package, use packageName::: (three colons). This is rarely recommended or needed.
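To make the exported/unexported distinction concrete, here is a minimal sketch that uses the stats package (which ships with R) as a stand-in for any installed package. The variable names are just for illustration.

```r
# Exported functions: what a user can reach with stats::
exported <- getNamespaceExports("stats")

# Every object defined in the package's namespace, exported or not
allObjects <- ls(asNamespace("stats"), all.names = TRUE)

# Internal helpers: defined in the package but not exported,
# so they are only reachable with stats:::
internalOnly <- setdiff(allObjects, exported)

"median" %in% exported   # median() is exported, so stats::median() works
head(sort(internalOnly)) # a few of the internal-only objects
```

The same three lines should work for any package you have installed; swap "stats" for the package name.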

#' First line briefly describes the function, this will be large font in the help file
#' 
#' Text written after the first line will be smaller font. Change font type by using, for example, \code{some code}.
#' 
#' You can add blank lines between lines to separate text in the help file.
#' 
#' @param arg1 description of the first argument
#' @param arg2 description of the second argument. Make sure the order of arguments listed here agrees with the order in the function definition
#'
#' @return describe object returned by the function
#'
#' @examples
#' # write some R code to demonstrate the function's usage:
#' goodVariableName <- 1
#' 
#' expectedOutput <- expressiveFunctionName(goodVariableName)
#' 
#' print(expectedOutput)
#'
#' # the next tag used will end the examples section
#' @author Firstname Lastname
#'
#' @seealso \url{https://r-pkgs.org/man.html}
#' @export
expressiveFunctionName <- function(arg1,arg2){
  # ...code...
}
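For a filled-in version of the template above, here is a small sketch. The function percentOfTotal and its arguments are hypothetical and are not part of dspgWork; they only show what the tags look like on a function that actually runs.

```r
#' Convert a count to a percentage of a total
#'
#' Divides \code{count} by \code{total} and multiplies by 100.
#'
#' @param count numeric vector of counts (e.g., an ACS estimate)
#' @param total numeric vector of totals, the same length as \code{count} or length 1
#'
#' @return numeric vector of percentages
#'
#' @examples
#' percentOfTotal(count = 1, total = 4)
#'
#' @export
percentOfTotal <- function(count, total) {
  # fail early with a clear error if the inputs are not numbers
  stopifnot(is.numeric(count), is.numeric(total))
  100 * count / total
}

percentOfTotal(count = 1, total = 4)
```

Everything under @examples is run as R code when the package is checked, so any explanatory text there needs to be written as an R comment.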

Keep in mind: You are not only writing functions for yourself, but also for fellow and future DSPG students and stakeholders. Let this color how you write and document your code going forward.

Also keep in mind that documentation is sort of like cleaning up after a party. No one particularly wants to do it, but it’s necessary to ensure that what you did can be used by you and others in the future. Extending this analogy further, roxygen2 is sort of like a Roomba in that it takes some effort to set up initially, but it automates a lot of the really annoying stuff once it gets going.

Saving & Documenting Data

Refer to this vignette for more information about roxygen2 tags (and a bunch of other information).

To keep things concrete, consider a continuation of the example discussed in the last vignette. We pulled data from the 2019 ACS 5-year survey on “Sex by Age by Disability Status.” We also wrote a function, joinVariableLabels, to perform a label joining task. The code below replicates the key parts of the example.

library(tidyverse)
library(tidycensus)

options(tigris_use_cache = TRUE) # caches the downloaded county geometries so that they don't need to be re-downloaded each time the .Rmd file is knitted

disabilityStatus <- get_acs(geography = "county",
                          table = "B18101",
                          year = 2019,
                          state = "IA",
                          survey = "acs5",
                          geometry = TRUE,
                          cache = TRUE)

head(disabilityStatus)
#>   GEOID               NAME   variable estimate moe
#> 1 19001 Adair County, Iowa B18101_001     6961  52
#> 2 19001 Adair County, Iowa B18101_002     3479  35
#> 3 19001 Adair County, Iowa B18101_003      213  12
#> 4 19001 Adair County, Iowa B18101_004        0  14
#> 5 19001 Adair County, Iowa B18101_005      213  12
#> 6 19001 Adair County, Iowa B18101_006      562  14

Suppose that we want to save the disabilityStatus data set in the data_raw folder. To do so, we can use the save function as shown below. Note the naming scheme used to save the object: dataRaw_[variableDescription]_[source]_[year].rda. We ask that you adhere to this naming scheme specifically when saving data in the package so that we avoid conflicting data set names (e.g., disabilityStatus data from the ACS vs. from the Bureau of Transportation Statistics). When working on your own, feel free to name variables whatever you’d like. An rda file contains compressed R data.

Note that you will need to include a relative or absolute path to the data_raw folder. The example below uses ../../. Each ../ is file navigation syntax for “go up one level in the folder system,” so ../../ goes up two levels. A relative path is needed because the .Rmd file you’re currently reading is saved within the vignettes folder, which is on the same level as the data_raw folder in the file system. If you’re unfamiliar with this syntax or with what “relative” and “absolute” paths are, refer to this article.

dataRaw_disabilityStatusByAge_acs5_2019 <- disabilityStatus

save(dataRaw_disabilityStatusByAge_acs5_2019,file = "../../data_raw/dataRaw_sexByAgeByDisabilityStatus_acs5_2019.rda")

joinVariableLabels <- function(acs5Data){
  
  load("../../data_clean/dataClean_variableLabels_acs5_2019.rda")
  
  acs5Data <- acs5Data %>%
    left_join(dataClean_variableLabels_acs5_2019,
              by = c("variable" = "name"))
  
  return(acs5Data)
}

disabilityStatus <- joinVariableLabels(disabilityStatus)

head(disabilityStatus)
#>   GEOID               NAME   variable estimate moe
#> 1 19001 Adair County, Iowa B18101_001     6961  52
#> 2 19001 Adair County, Iowa B18101_002     3479  35
#> 3 19001 Adair County, Iowa B18101_003      213  12
#> 4 19001 Adair County, Iowa B18101_004        0  14
#> 5 19001 Adair County, Iowa B18101_005      213  12
#> 6 19001 Adair County, Iowa B18101_006      562  14
#>                                                        label
#> 1                                           Estimate!!Total:
#> 2                                    Estimate!!Total:!!Male:
#> 3                    Estimate!!Total:!!Male:!!Under 5 years:
#> 4 Estimate!!Total:!!Male:!!Under 5 years:!!With a disability
#> 5     Estimate!!Total:!!Male:!!Under 5 years:!!No disability
#> 6                    Estimate!!Total:!!Male:!!5 to 17 years:
#>                           concept
#> 1 SEX BY AGE BY DISABILITY STATUS
#> 2 SEX BY AGE BY DISABILITY STATUS
#> 3 SEX BY AGE BY DISABILITY STATUS
#> 4 SEX BY AGE BY DISABILITY STATUS
#> 5 SEX BY AGE BY DISABILITY STATUS
#> 6 SEX BY AGE BY DISABILITY STATUS
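To see what the join inside joinVariableLabels is doing, here is a toy sketch in base R. It uses match() as a stand-in for dplyr's left_join, which gives the same result only when each code appears once in the lookup table. The two labels are taken from the output above; the code "B99999_999" is made up to show what happens when no label is found.

```r
# Toy version of the ACS table: two real codes plus one made-up code
acsToy <- data.frame(
  variable = c("B18101_001", "B18101_002", "B99999_999"),
  estimate = c(6961, 3479, 0),
  stringsAsFactors = FALSE
)

# Toy lookup table, keyed by "name" like dataClean_variableLabels_acs5_2019
labelsToy <- data.frame(
  name  = c("B18101_002", "B18101_001"),
  label = c("Estimate!!Total:!!Male:", "Estimate!!Total:"),
  stringsAsFactors = FALSE
)

# For each row of acsToy, find the matching row of labelsToy (NA if none)
rowInLookup <- match(acsToy$variable, labelsToy$name)

# Keep every row of acsToy and attach the label where one was found
joinedToy <- acsToy
joinedToy$label <- labelsToy$label[rowInLookup]

joinedToy
```

Every row of the original table is kept, and rows whose code is missing from the lookup table get NA for the label. That is the "left" part of a left join.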

After performing this join, suppose that we now want to save the disabilityStatus data set for others to use. Since we’ve processed the data, we should save the data set in the data_clean folder. The code below does this. Note that dataClean is now the beginning of the file name: dataClean_[variableDescription]_[source]_[year].rda.

Note that what we’ve performed so far is fairly simple processing on the data set. We want to use the data_clean folder for data that are virtually ready to be used in whatever visualizations/summaries we include in our reports. This means that, realistically, the current form of the disabilityStatus data set isn’t ready for the data_clean folder - we would need to clean up and tidy (yes, those are two different things) the values more.

dataClean_sexByAgeByDisabilityStatus_acs5_2019 <- disabilityStatus

save(dataClean_sexByAgeByDisabilityStatus_acs5_2019,file = "../../data_clean/dataClean_sexByAgeByDisabilityStatus_acs5_2019.rda")
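One detail worth knowing about .rda files: load() restores each object under the name it had when it was saved, not under the file name. The sketch below shows this with a made-up one-row data set written to a temporary folder (so nothing in the package is touched). The object name dataClean_example_acs5_2019 is hypothetical.

```r
# Stand-in for a processed data set (hypothetical values)
dataClean_example_acs5_2019 <- data.frame(GEOID = "19001", estimate = 6961)

# Save to a temporary file rather than the package's data_clean folder
rdaPath <- file.path(tempdir(), "dataClean_example_acs5_2019.rda")
save(dataClean_example_acs5_2019, file = rdaPath)

# Load into a fresh environment to see exactly what the file contains
loaded <- new.env()
restoredNames <- load(rdaPath, envir = loaded)

restoredNames # the object name(s) stored in the file
identical(loaded$dataClean_example_acs5_2019, dataClean_example_acs5_2019)
```

This is one reason to give the object the same name as the file: whoever loads the file later will know which object to look for.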

Now comes the important part: documentation. It’s all well and good to save a processed file in the data_clean or data_raw folder, but it’s another thing entirely to come back to these folders after a few weeks and try to sift through the various file names for a data set you could have sworn you added. To make all of this easier, we are going to try to adhere to documenting our data sets using the roxygen2 package. The documentation of a data set is similar to that of a function: you start a documentation line with #' and include tags to indicate special fields. Below is an example of how we might document the (raw) disability status data set. You can copy + paste this code into the data_clean.R and data_raw.R files when you go to document your own data sets.

#' ACS 5-year survey 2019, Table B18101, Sex by Age by Disability Status for counties in IA
#'
#' @name dataRaw_sexByAgeByDisabilityStatus_acs5_2019.rda
#'
#' @format An sf tibble object with 5544 rows and 6 variables:
#' \describe{
#'   \item{GEOID}{geographic ID of county}
#'   \item{NAME}{name of county}
#'   \item{variable}{ACS code associated with age, sex, and disability status combination}
#'   \item{estimate}{estimate of the prevalence of the associated variable in the associated county}
#'   \item{moe}{margin of error associated with the estimate}
#'   \item{geometry}{county boundary geometry (the sf geometry column)}
#' }
#' @source tidycensus::get_acs(geography = "county",
#'                             table = "B18101",
#'                             year = 2019,
#'                             state = "IA",
#'                             survey = "acs5",
#'                             geometry = TRUE,
#'                             cache = TRUE)
#' @examples
#' \dontrun{
#' load("data_raw/dataRaw_sexByAgeByDisabilityStatus_acs5_2019.rda")
#' }
#'
NULL #this is to tell roxygen2 that this is the last line of the above documentation
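Typing out the \describe block by hand gets tedious for wide data sets. As a rough sketch, a small helper can print a skeleton to fill in. The helper formatSkeleton is hypothetical (it is not part of roxygen2 or dspgWork), and the toy data frame only borrows column names from the example above.

```r
# Hypothetical helper: builds a roxygen2 @format skeleton for a data frame
formatSkeleton <- function(df) {
  header <- sprintf("#' @format A data frame with %d rows and %d variables:",
                    nrow(df), ncol(df))
  # one \item line per column, with a placeholder description to fill in
  items <- sprintf("#'   \\item{%s}{TODO: describe this column}", names(df))
  c(header, "#' \\describe{", items, "#' }")
}

toy <- data.frame(GEOID = "19001", NAME = "Adair County, Iowa", estimate = 6961)
writeLines(formatSkeleton(toy))
```

The printed lines can be pasted into data_raw.R or data_clean.R and the TODO placeholders replaced with real descriptions.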

Saving & Documenting Functions

Now let’s suppose that we want to save the joinVariableLabels as an R function that we can access using the dspgWork package. Since this function handles cleaning/manipulating a data set, we might save it in the processing.R file available in the R/ folder of the package. Below shows how we might document this function.

#' Joins variable labels to a table from the ACS 5-year survey
#' 
#' @param acs5Data data frame object from an ACS 5-year survey (like those returned by tidycensus)
#'
#' @return the acs5Data object with 2 new columns: \code{label}, the English-word translation of the \code{variable} column in acs5Data and \code{concept}, the topic of the table of which the different variables are features.
#'
#' @examples
#' \dontrun{
#'  load("data_raw/dataRaw_sexByAgeByDisabilityStatus_acs5_2019.rda")
#'  head(joinVariableLabels(dataRaw_disabilityStatusByAge_acs5_2019))
#' }
#' @author Joe Zemmels
#'
#' @export
#'
joinVariableLabels <- function(acs5Data){
  
  load("../../data_clean/dataClean_variableLabels_acs5_2019.rda")
  
  acs5Data <- acs5Data %>%
    left_join(dataClean_variableLabels_acs5_2019,
              by = c("variable" = "name"))
  
  return(acs5Data)
}

`Install and Restart`

The purpose of adding functions to dspgWork is to allow you and others to re-use the function in the future. For example, if we want to re-create a plot with a new data set, wrapping the construction of the plot in a function allows us to do so concisely. To make a function re-usable, you must first “build” the package on your local machine (note that when you install a package from, say, CRAN or GitHub, this building is done for you automatically). “Building” a package will allow the package’s functions to be used in another R session on your local machine. To build a package, navigate to the “Build” tab in the top-right panel (assuming you haven’t rearranged the panels in RStudio) and click Install and Restart. Alternatively, the RStudio shortcut is CTRL + SHIFT + B. Following a session restart, you can use the library() function to load the package as you would with any other package.

`devtools::document()`

“Documenting” a package means compiling the function documentation into special files called .Rd (R documentation) files. These .Rd files are what is displayed whenever you use ? to look up the help file of a function. To automatically generate these files, you can call devtools::document() in the console (requires the devtools package - install this if you haven’t already). The .Rd files are written to the man folder (which you should never need to edit by hand).
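If you prefer the console to the Build tab, the document and install steps can be chained together. The wrapper below is a hypothetical convenience function, not something dspgWork provides. It assumes pkgPath points at the package's root folder (the one containing the DESCRIPTION file) and does nothing if devtools is missing or the folder is not a package.

```r
# Hypothetical wrapper around the two steps described above
rebuildPackage <- function(pkgPath = ".") {
  hasDevtools <- requireNamespace("devtools", quietly = TRUE)
  isPackage <- file.exists(file.path(pkgPath, "DESCRIPTION"))

  if (!hasDevtools || !isPackage) {
    message("Skipping: devtools is not installed or no DESCRIPTION file in ", pkgPath)
    return(invisible(FALSE))
  }

  devtools::document(pkgPath) # regenerate man/*.Rd and NAMESPACE from the #' comments
  devtools::install(pkgPath)  # install the package into your local library
  invisible(TRUE)
}

# rebuildPackage()       # run from the package root
# library(dspgWork)
# ?joinVariableLabels    # help page built from the roxygen2 comments
```

Documenting before installing matters: the installed help pages and exported function list come from the files that devtools::document() writes.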