Why use functions
Sarah Endicott and David Hope
functions.Rmd
NOTE!!! This is a draft and is being actively edited
Disclaimer
Much of the examples and code here comes from Hadley Wickham et al’s books R for data science and R packages.
Why use functions
You can give a function an evocative name that makes your code easier to understand.
Simplify code so repetition is hidden and differences are clear
Avoid copy paste errors
Only update code in one place
Reuse code across different projects
When to write a function
Any time you copy paste a block of code more than twice! This often means re-writing the first two spots where you used the code but it is worth it.
Example 1
Here is an example of repetition we can avoid by using a function.
library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.4 ✔ readr 2.1.5
#> ✔ forcats 1.0.0 ✔ stringr 1.5.1
#> ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
#> ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
#> ✔ purrr 1.0.4
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df <- tibble(a = rnorm(5), b = rnorm(5), c = rnorm(5), d = rnorm(5))
df |> mutate(
a = (a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
b = (b - min(a, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE)),
c = (c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),
d = (d - min(d, na.rm = TRUE)) / (max(d, na.rm = TRUE) - min(d, na.rm = TRUE))
)
#> # A tibble: 5 × 4
#> a b c d
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.339 0.387 0.291 0
#> 2 0.880 -0.613 0.611 0.557
#> 3 0 -0.0833 1 0.752
#> 4 0.795 -0.0822 0 1
#> 5 1 -0.0952 0.580 0.394
Can you spot the copy-paste error above?
Now that we fixed that, how can we turn this into a function?
df |> mutate(
a = (a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
b = (b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE)),
c = (c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),
d = (d - min(d, na.rm = TRUE)) / (max(d, na.rm = TRUE) - min(d, na.rm = TRUE))
)
#> # A tibble: 5 × 4
#> a b c d
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.339 1 0.291 0
#> 2 0.880 0 0.611 0.557
#> 3 0 0.530 1 0.752
#> 4 0.795 0.531 0 1
#> 5 1 0.518 0.580 0.394
- Identify parts that are the same across repetitions and “factor out” the parts that are different. This is the function body
For example, if we replace all the letters with x
the
rest is the same.
- Choose a clear name that tells us what the function does. Ideally a verb.
- Name the parts that will change between uses, these are the
arguments.
x
is a conventional name for a numeric vector.
Function template:
name <- function(arguments) {
body
}
Filling in the template we get:
rescale01 <- function(x) {
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
Then when we use the function in our code we can clearly see that we are applying the same rescaling across each column.
df |> mutate(
a = rescale01(a),
b = rescale01(b),
c = rescale01(c),
d = rescale01(d)
)
#> # A tibble: 5 × 4
#> a b c d
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.339 1 0.291 0
#> 2 0.880 0 0.611 0.557
#> 3 0 0.530 1 0.752
#> 4 0.795 0.531 0 1
#> 5 1 0.518 0.580 0.394
Then when your collaborator says they need the output in 0-100 instead of 0-1 we only need to change the code in one place and it will apply to all the columns.
rescale01 <- function(x) {
x <- (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
x*100
}
df |> mutate(
a = rescale01(a),
b = rescale01(b),
c = rescale01(c),
d = rescale01(d)
)
#> # A tibble: 5 × 4
#> a b c d
#> <dbl> <dbl> <dbl> <dbl>
#> 1 33.9 100 29.1 0
#> 2 88.0 0 61.1 55.7
#> 3 0 53.0 100 75.2
#> 4 79.5 53.1 0 100
#> 5 100 51.8 58.0 39.4
Note: you can decrease repetition even more using iteration eg
df |> mutate(across(a:d, rescale01))
. See R for Data Science Chapter
26 for more on this.
What is a function?
A function is an object in R just like a vector. A function can be
defined in a package (eg: dplyr::mutate
) or created in an R
session. As mentioned above a function has three parts that need to be
defined, the name, arguments and body.
name <- function(arguments) {
body
}
There is one additional part of a function that is implicitly created when the function is created which is an environment. An environment is the place where code looks for objects that are called.
For example, above we defined two objects df
and
rescale01
which are both stored in the “Global environment”
and are listed in the Environment pane in RStudio. The Global
environment is the first place code run in the Console will look for an
object.
But inside the body of our function rescale01
we also
defined an object x
.
Does x
exist?
x
#> Error: object 'x' not found
No! It doesn’t exist in the Global environment, it only exists inside the function’s environment. Objects created inside a function are not available in the Global environment. Only the result of the last expression evaluated in the function will be returned.
For example:
rescale01 <- function(x) {
x <- (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
x*100
message("data rescaled")
}
rescale01(df)
#> data rescaled
This returned NULL since that is the value returned by
message()
.
To return the rescaled data we assign it to an object and then evaluate the object last.
rescale01 <- function(x) {
x <- (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
x <- x*100
message("data rescaled")
x
}
rescale01(df)
#> data rescaled
#> a b c d
#> 1 23.03762 79.64117 41.83571 12.75467
#> 2 59.80471 13.66962 68.10416 42.53950
#> 3 0.00000 48.64056 100.00000 52.96555
#> 4 54.01014 48.70998 17.90810 66.19434
#> 5 67.93915 47.85473 65.51536 33.83143
Converting existing scripts to use functions
Example 2
Imagine we have a script to process csv files containing observations of temperature and location from swimmers.
This is the R script:
infile <- "swim.csv"
swims_in <- read.csv(infile)
swims <- swims_in
swims
#> name where temp
#> 1 Adam beach 95
#> 2 Bess coast 91
#> 3 Cora seashore 28
#> 4 Dale beach 85
#> 5 Evan seaside 31
# Assume country based on name for beach
swims$english[swims$where == "beach"] <- "US"
swims$english[swims$where == "coast"] <- "US"
swims$english[swims$where == "seashore"] <- "UK"
swims$english[swims$where == "seaside"] <- "UK"
# Assume Farenheit for US
swims$temp[swims$english == "US"] <- (swims$temp[swims$english == "US"] - 32) * 5/9
swims
#> name where temp english
#> 1 Adam beach 35.00000 US
#> 2 Bess coast 32.77778 US
#> 3 Cora seashore 28.00000 UK
#> 4 Dale beach 29.44444 US
#> 5 Evan seaside 31.00000 UK
# save result with a timestamp
now <- Sys.time()
timestamp <- format(now, "%Y-%B-%d_%H-%M-%S")
(outfile <- paste0(timestamp, "_", sub("(.*)([.]csv$)", "\\1_clean\\2", infile)))
#> [1] "2025-April-04_14-27-48_swim_clean.csv"
write.csv(swims, file = outfile, quote = FALSE, row.names = FALSE)
This is an example of how functions can help to simplify code even when there isn’t a lot of repetition.
First we make a function to assign a country based on the name used
for the swimming location. This uses a look up table rather than
assigning each directly to reduce the repeated
swims$english[swims$where ==
calls.
library(tidyverse)
localize_beach <- function(dat) {
lookup_table <- tribble(
~where, ~english,
"beach", "US",
"coast", "US",
"seashore", "UK",
"seaside", "UK"
)
left_join(dat, lookup_table, by = "where")
}
swims_in %>% localize_beach()
#> name where temp english
#> 1 Adam beach 95 US
#> 2 Bess coast 91 US
#> 3 Cora seashore 28 UK
#> 4 Dale beach 85 US
#> 5 Evan seaside 31 UK
Then we write two functions, one to convert F to C and another that
applies f_to_c()
to US temperatures in a data frame.
f_to_c <- function(x) (x - 32) * 5/9
celsify_temp <- function(dat) {
mutate(dat, temp = if_else(english == "US", f_to_c(temp), temp))
}
swims_in %>% localize_beach() %>% celsify_temp()
#> name where temp english
#> 1 Adam beach 35.00000 US
#> 2 Bess coast 32.77778 US
#> 3 Cora seashore 28.00000 UK
#> 4 Dale beach 29.44444 US
#> 5 Evan seaside 31.00000 UK
Finally, one more function to create the output file name
outfile_path <- function(infile) {
now <- Sys.time()
timestamp <- format(now, "%Y-%B-%d_%H-%M-%S")
paste0(timestamp, "_", str_replace(infile, "(.*)([.]csv$)", "\\1_clean\\2"))
}
outfile_path(infile)
#> [1] "2025-April-04_14-27-48_swim_clean.csv"
To keep our script tidy we can move the functions we have defined
into a separate script. Eg cleaning-funs.R
# define functions
localize_beach <- function(dat) {
lookup_table <- tribble(
~where, ~english,
"beach", "US",
"coast", "US",
"seashore", "UK",
"seaside", "UK"
)
left_join(dat, lookup_table, by = "where")
}
f_to_c <- function(x) (x - 32) * 5/9
celsify_temp <- function(dat) {
mutate(dat, temp = if_else(english == "US", f_to_c(temp), temp))
}
outfile_path <- function(infile) {
now <- Sys.time()
timestamp <- format(now, "%Y-%B-%d_%H-%M-%S")
paste0(timestamp, "_", str_replace(infile, "(.*)([.]csv$)", "\\1_clean\\2"))
}
Putting that all together we get a simpler script where the function names tell us what the code is doing.
library(tidyverse)
source("cleaning-funs.R")
infile <- "swim.csv"
swims <- read.csv(infile)
swims <- swims %>%
localize_beach() %>%
celsify_temp()
swims
#> name where temp english
#> 1 Adam beach 35.00000 US
#> 2 Bess coast 32.77778 US
#> 3 Cora seashore 28.00000 UK
#> 4 Dale beach 29.44444 US
#> 5 Evan seaside 31.00000 UK
write.csv(swims, outfile_path(infile))
The tricky question is how much simplification is useful? Here the script is simpler and makes it pretty clear what the steps of the cleaning process are but it hides the assumption that we can classify country based on beach name and temperature units based on that country. Whether that is desirable depends on the use case of the code.
One useful tool for navigating code containing functions in RStudio is F2. If you place your cursor inside a function name and press F2 you will jump to the definition of the function. This allows readers that care about the details to easily navigate to them.
Benefits of a functional mindset
Once you start using functions in your code you will start to notice more opportunities to reduce repetition and write more efficient code.
Example 3
Let say we want to determine which variable in the
mtcars
data set is the best predictor of mpg
.
To do this we build a linear model for each variable vs mpg
(Note this a toy example not stats advice).
str(mtcars)
#> 'data.frame': 32 obs. of 11 variables:
#> $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
#> $ disp: num 160 160 108 258 360 ...
#> $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
#> $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#> $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
#> $ qsec: num 16.5 17 18.6 19.4 17 ...
#> $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
#> $ am : num 1 1 1 0 0 0 0 0 0 0 ...
#> $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
#> $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
A first attempt might be to right them all out by hand.
library(broom)
mod_cyl <- lm(mpg ~ cyl, data = mtcars)
gl_cyl <- glance(mod_cyl)
mod_disp <- lm(mpg ~ disp, data = mtcars)
gl_disp <- glance(mod_disp)
data.frame(var = c("cyl", "disp"), r.squared = c(gl_cyl$r.squared, gl_disp$r.squared))
#> var r.squared
#> 1 cyl 0.7261800
#> 2 disp 0.7183433
# But for all the variables... sigh
One option is to use a for loop
mtcars_vars <- colnames(mtcars) %>% str_subset("mpg", negate = TRUE)
mtcars_mods <- vector("list", length = length(mtcars_vars))
for (i in seq_along(mtcars_vars)) {
form <- as.formula(paste0("mpg ~ ", mtcars_vars[i]))
mod <- lm(form, data = mtcars)
mtcars_mods[[i]] <- mod
}
Another option is to write a function and iteratively apply it to all the variables
fit_mpg_vs_x <- function(x){
form <- as.formula(paste0("mpg ~ ", x))
lm(form, data = mtcars)
}
# name the mtcars_vars list so output is named
names(mtcars_vars) <- mtcars_vars
mtcars_mods <- map(mtcars_vars, fit_mpg_vs_x)
I like this better than the for loop because there is less boilerplate code. But it becomes really useful if we need to apply the same model to a different set of variables. In a for loop we would need to repeat the whole thing but since a function is saved as an object we can easily apply it to a different vector.
Then we can easily apply additional functions to the results to get our desired output.
library(broom)
mtcars_mods %>% map(glance) %>% list_rbind(names_to = "var") %>%
ggplot(aes(fct_reorder(var, r.squared), r.squared))+
geom_point()+
labs(y = "R squared", x = "Variable")
Or to make diagnostic plots.
save_mod_chk_plts <- function(mod, nm){
png(filename = paste0(nm,"_check_model.png"), width = 7, height = 10, units = "in", res = 300)
performance::check_model(mod)
dev.off()
}
mtcars_mods %>% walk2(names(mtcars_mods), save_mod_chk_plts)
knitr::include_graphics("cyl_check_model.png")
Oooh! With a couple tweaks I could use save_mod_chk_plts
in lots of projects!!
When to make it a Package?
An R package is a collection of functions and metadata that can be easily shared with other R users. It includes a standardized folder structure and set of files.
The copy twice rule of thumb applies to whole functions as well. Once
you have copied a function across two separate script it should at least
go in a separate file that is sourced in both scripts so you only need
to change it in one place. Then if you are copying the same helper
functions into multiple projects they should become a package that you
can call in that script with library(mypackage)
.
Another school of thought is that all analysis projects should use a package structure to organize their code (One example) or a modified package structure with additional folders added for analysis and paper writing (What Sarah uses).
How to create a package
The book R Packages contains
everything you need to know to write a package.
The key files that define a package are:
- DESCRIPTION: Metadata about the package including dependencies
- NAMESPACE: List of imported and exported functions
- R: Code that defines the functions of the package
Two packages, devtools and usethis, and tools in RStudio make
building your own package pretty easy. For example, all the boilerplate
structure of a package can be created with
usethis::create_package()
. You then add new function
scripts with usethis::use_r()
, and create tests with
usethis::use_test
.
Benefits of using a package
- Easy loading and re-loading: call
devtools::load_all()
(Ctrl-Shift-L) to source all the functions in your R directory - Way to record and install dependencies: record dependencies with
usethis::use_package()
, usedevtools::install_deps()
to install all the listed dependencies. (Note this assumes the code works with the most recent version of dependencies!) - Documentation: Functions in a package are documented using Roxygen, special comments that create documentation for your functions. In RStudio use Code > Insert Roxygen Skeleton or Alt-Ctrl-Shift-R to insert a template when your cursor is in a function definition.
- pkgdown website: Run
usethis::use_pkgdown_github_pages()
to create a website with the package README as home page, vignettes rendered as articles and a reference for all the function documentation.(For example the package roads) - Testing: Unit tests ensure your functions continue doing what you
expect when you make changes or update dependencies.
usethis::use_testthat()
sets up the testing architecture andusethis::use_test
opens a new test file.usethis::use_github_action()
sets up GitHub actions to run R CMD check which includes your tests on GitHub every time you commit to the main branch aka Continuous Integration testing.
# clean up
list.files(here::here("vignettes"), full.names = TRUE) %>%
str_subset("check_model|swim_clean") %>% str_subset("cyl", negate = TRUE) %>%
file.remove()
#> [1] TRUE