4 Reporting
Using papaja
, it is possible to run all your analyses from within an R Markdown document. Any output from R is included as you usually would using R Markdown. By default the R code will not be displayed in the final documents. If you wish to show off your code you need to set echo = TRUE
in the chunk options.
This section describes how you can include the results of your analyses can be included in text. It starts with simple numerical values (like a sample mean) and then continues with more-complex output, like figures, tables, and the results from statistical models and hypothesis tests.
4.1 Numerical values
Consider a basic example, where you want to include a mean and a standard deviation into text: printnum()
can be used to include numeric values in the text. The function performs the rounding and optionally omits leading zeros in accordance with APA guidelines:
Participants mean age was `r printnum(age_mean)` years ($SD = `r printnum(age_sd)`$).
The above line is rendered as follows:
Participants mean age was 22.84 years (\(SD = 3.78\)).
printnum()
is quiet flexible and passes its arguments to the underlying base function formatC()
(CAVE: We are currently considering to switch to format()
.), so changing the number of digits after the decimal point or the character to be used to indicate the decimal point (e.g., from a point "."
to a comma ","
) is always possible.
papaja
automatically applies printnum()
to any numbers that a printed in inline code chunks. Hence, `r age_mean`
and `r printnum(age_mean)`
are effectively identical and both yield “22.84”.
papaja
additionally provides a shorthand version of printnum()
, namely printp()
, with defaults for correct typesetting of \(p\) values (e.g., \(p < .001\) instead of \(p = 0.000\)).
Typeset numerical values for greater control - printnum()
- printp()
as shorthand for p values
printnum(c(143234.34557, Inf))
## [1] "143,234.35" "$\\infty$"
printnum(42L, numerals = FALSE, capitalize = TRUE)
## [1] "Forty-two"
printp(c(1, 0.0008, 0))
## [1] "> .999" ".001" "< .001"
4.2 Figures
As in any R Markdown document, you can include figures in your document. Figures can either consist of plots generated in R or external files. In accordance with APA guidelines, figures are not displayed in place but are deferred to the final pages of the document. To change this default behavior set the option figsintext
in the YAML front matter to yes
.
4.2.1 R-generated figures
Any plot that is generated by an R code chunk is automatically included in your document. Hence, a simple call to plot()
is sufficient to create a figure:
plot(cars)
Figure display size, resolution, file format and many other things can be controlled by setting the corresponding chunk options. Refer to the plot-related knitr
chunk options for an overview of all options. In papaja
-documents, by default, all figures as saved as vectorized PDF and pixel-based PNG files at a resolution of 300 DPI, which should in most cases be sufficient for print publication. When the target document format is PDF the vectorized PDF files are included; when the target document format is DOCX the pixel-based PNG files are included. The files can be found in the mydocument_files
folder that is generated when mydocument.Rmd
is knitted.
Note, if you define transparent colors in your plots (e.g., when you defining colors using rgb()
), the rendered document will always display the pixel-based PNG files, regardless of the target document format. Given the lossless compression in PNG and the decent resolution of 300 DPI this should not be a problem in print.
4.2.1.1 papaja
plot functions
Factorial designs are ubiquotous in experimental psychology but notoriously cumbersome to visualize with R base graphics
. That is why papaja
provides a set of functions to facilitate plotting data from factorial designs.
The functions apa_beeplot()
, apa_lineplot()
, apa_barplot()
, and the generic apa_factorial_plot()
are intended to provide a convenient way to obtain a publication-ready figure while maintaining enough flexibility for customization. The basic arguments to these functions are all the same, so you only need to learn one syntax to generate different types of plots. Moreover, it is easy switch between the different types of plots by changing the name of the called function.
Consider the following example of a so-called beeswarm plot:
apa_beeplot(
data = mixed_data
, id = "Subject"
, dv = "Recall"
, factors = c("Task", "Valence", "Dosage")
)
A data.frame
that contains the data in long format is passed to the data
argument. The other arguments expect the names of the columns that contain the subject identifier id
, the dependent variable dv
, and the between- or within-subjects factors
of the design. Currently, zero to four factors are supported.
For each cell of the design, the functions plot
- central tendency,
- dispersion, and
- names of dependent variable, factors and levels, and optionally a legend.
Mesures of central tendency and dispersion default to the mean and 95% confidence intervals. However, the arguments tendency
and dispersion
can be used to overwrite these defaults and plot other statistics. For example, to plot within-subject rather than between-subject confidence intervals you can set dispersion = within_subjects_conf_int
. If the data encompass multiple observations per participant-factor-level-combination, these observations are aggregated. By default, the condition means are computed but you can specify other aggregation functions via the fun_aggregate
argument.
The visual elements of the plots can be customized by passing options to the arguments args_points
, args_error_bars
, args_legend
, and so on. These arugments take named lists of arguments that are passed on to the respective graphics
functions, such as points()
, arrows()
, and legend()
. Consider the following code and the resulting Figure 4.3 for a customized version of the previous beeswarm plot.
apa_beeplot(
data = mixed_data
, id = "Subject"
, dv = "Recall"
, factors = c("Task", "Valence", "Dosage")
, dispersion = within_subjects_conf_int
, main = c(expression(italic(A)~"dosage"), expression(italic(B)~"dosage"), expression(italic(C)~"dosage"))
, ylab = "Number of correctly recalled words"
, args_points = list(cex = 2, bg = c("skyblue4", "indianred3", "seagreen4"), col = "white")
, args_swarm = list(cex = 1.2)
, args_error_bars = list(length = 0.02, col = "grey20")
, args_legend = list(x = "bottom", inset = 0.05)
, args_y_axis = list(las = 1)
)
- Utilize variable labels
These plot functions also utilize variable labels, with some \(\LaTeX\) math support (see ?latex2exp::TeX
)
variable_labels(cosmetic_surgery) <- c(
Post_QoL = "Quality of life post surgery ($\\bar{x}_{post}$)"
)
apa_beeplot(
id = "ID"
, dv = "Post_QoL"
, factors = c("Reason", "Surgery", "Gender")
, data = cosmetic_surgery
# , ylab = "Quality of life post surgery"
, las = 1
, args_legend = list(x = 0.25, y = 30)
, args_points = list(bg = c("skyblue2", "indianred1"))
, args_error_bars = list(length = 0.03)
)
4.2.2 Setting global plot options
As noted in the APA guidelines (p. 153) figures in an article should be consistent and in the same style. Hence, it is desireable to define plot styles once rather than every time a new plot is created. When using graphics
plot functions global options can be set using par()
. For example, the size of labels and symbols can be controled with cex
:
par(cex = 1.2)
apa_beeplot(
data = mixed_data
, id = "Subject"
, dv = "Recall"
, factors = c("Task", "Valence")
)
However, the use of par()
is restricted in knitr
.
knitr
resets par()
for every chunk; par()
needs to be called anew in every chunk that creates a plot.
While there are advantages to this behavior it impedes defining a consistent plot style. Although not very convenient, it is possible to set global plot options in knitr
and therefore also in papaja
. The following example illustrates how to make knitr
execute a specified call to par()
before executing code in a given chunk. Here we rotate the \(y\)-axis label (las = 1
), sets a regular font and increases the size of the main title (font.main = 1
, cex.main = 1.05
).
# Ensure that par()-settings are retained across chunks
knitr::opts_knit$set(global.par = TRUE)
# Define a function that sets desired global plot defaults
par(font.main = 1`, `cex.main = 1.05, las = 1)
The customizations apply to all subsequently generated plots. It is, thus, advisable to add this code at the top of the R Markdown document shortly after the YAML front matter.
These same limitations do not apply to users of ggplot2
. Custom themes can be set as default as usual and are subsequently applied to all plots created with ggplot()
. papaja
provides a template theme_apa()
that we feel is well suited to create publication-ready plots. The following example demonstrates how to set a default theme.
# Cosmetic surgery dataset from
# Field, A. P., Miles, J., & Field, Z. (2012). Discovering statistics using R. London: Sage.
load("data/cosmetic_surgery.rdata")
ggplot(
cosmetic_surgery
, aes(x = Base_QoL, y = Post_QoL, color = Reason)
) +
geom_point() +
geom_smooth(method = "lm") +
labs(
x = "Baseline quality of life"
, y = "Quality of life post surgery"
) +
scale_color_brewer(palette = "Set1") +
theme_apa(box = TRUE) +
theme(legend.position = c(0.2, 0.8))
theme_set(theme_apa())
4.2.3 External files
…
4.3 Tables
For prettier tables, I suggest you try apa_table()
, which builds on knitr
’s kable()
.
descriptives <- mixed_data %>% group_by(Dosage) %>%
summarize(
Mean = mean(Recall)
, Median = median(Recall)
, SD = sd(Recall)
, Min = min(Recall)
, Max = max(Recall)
)
descriptives[, -1] <- printnum(descriptives[, -1])
apa_table(
descriptives
, caption = "Descriptive statistics of correct recall by dosage."
, note = "This table was created with apa_table()."
, escape = TRUE
)
Of course popular packages like xtable
^{3} or tables
can also be used to create tables when knitting PDF documents. These packages, however, cannot be used when you want to create Microsoft Word documents because they rely on \(\LaTeX\) for typesetting. apa_table()
creates tables that conform to APA guidelines and are correctly rendered in PDF and Word documents. But don’t get too excited. In papaja, table formatting is somewhat limited for Word documents due to missing functionality in pandoc (e.g., it is not possible to have cells or headers span across multiple columns).
As required by the APA guidelines, tables are deferred to the final pages of the manuscript when creating a PDF. To place tables and figures in your text instead, set the figsintext
parameter in the YAML header to yes
or true
, as I have done in this document. Again, this is not the case in Word documents due to limited pandoc functionality. The bottom line is, Word documents will be less polished than PDF. The resulting documents should suffice to enable collaboration with Wordy colleagues and prepare a journal submission with limited manual labor.
apa_table()
extends knitr::kable()
- Designed to create table designs from the APA manual
- Does not perform any calculations
- Can merge multiple
data.frame
of the same structure contained in alist
- Much more powerful in PDF documents
(seeapa_table_example.Rmd
) - Remember to set chunk option
results = "asis"
!
kableExtra
is another powerful alternative
4.3.1 Alternatives
Pixidust: https://github.com/nutterb/pixiedust/blob/master/README.md
xtable: https://cran.r-project.org/web/packages/xtable/xtable.pdf
tables: https://cran.r-project.org/web/packages/tables/vignettes/tables.pdf
stargazer: https://cran.r-project.org/web/packages/stargazer/index.html
knitr::kable
Hmisc::latex
https://twitter.com/polesasunder/status/464132152347475968/photo/1
kableExtra: https://cran.r-project.org/web/packages/kableExtra/index.html
4.4 Results from statistical tests
Output from statistical models and tests needs to be properly formatted before it can be included in text. For this purpose, we created the function apa_print()
. Simply drop the results from a statistical test (e.g. a \(t\) test or an ANOVA) into apa_print()
, and you will get an output object (a list
) that contains all you need to report the analysis according to APA guidelines. apa_print()
currently supports output objects from a variety of statistical models/tests, including \(t\) tests, ANOVA, linear regression. For a full list of supported output objects, see Table ??.
A-B | B-H | L-S | S-Z |
---|---|---|---|
afex_aov | BFBayesFactorList^{*} | list | summary_emm^{*} |
anova | BFBayesFactorTop^{*} | lm | summary.glht^{*} |
Anova.mlm | emmGrid^{*} | lsmobj^{*} | summary.glm |
aov | glht^{*} | summary.Anova.mlm | summary.lm |
aovlist | glm | summary.aov | summary.ref.grid^{*} |
BFBayesFactor^{*} | htest | summary.aovlist |
^{*} Not fully tested, don’t trust blindly!
4.4.1 htest (t tests, correlations, etc.)
The results of a \(t\) test (regardless of the exact type of \(t\) test) are typically stored in an htest
object.
# Cosmetic surgery dataset from
# Field, A. P., Miles, J., & Field, Z. (2012). Discovering statistics using R. London: Sage.
load("data/cosmetic_surgery.rdata")
t_out <- t.test(
x = cosmetic_surgery$Post_QoL
, y = cosmetic_surgery$Base_QoL
, paired = TRUE
)
# Let's see what the apa_print() output looks like
apa_print(t_out)
## $estimate
## [1] "$M_d = -3.95$, 95\\% CI $[-4.86$, $-3.04]$"
##
## $statistic
## [1] "$t(275) = -8.52$, $p < .001$"
##
## $full_result
## [1] "$M_d = -3.95$, 95\\% CI $[-4.86$, $-3.04]$, $t(275) = -8.52$, $p < .001$"
##
## $table
## NULL
As you can see, apa_print()
returns a list of four elements: List element estimate
contains the sample estimate of mean difference; element statistic
contains the result of the NHST with test statistic, dfs, and p value; element full_result
contains both information combined. The fourth element, table
, contains a table containing test results; while this is NULL
for a \(t\) test, it will be quiet interesting for ANOVA or linear model results, as you will see in the next subsections.
4.4.2 lm (linear regression)
The results of a linear model are stored in an lm
object.
lm_out <- lm(formula = Post_QoL ~ Base_QoL + BDI, data = cosmetic_surgery)
apa_lm <- apa_print(lm_out)
Again, apa_print
returns a four-elements list. apa_lm$table
contains a complete regression table that can be passed to apa_table()
to get a full regression table, see Table @ref(tab:lm-table; remember to set the chunk option results = "asis"
).
apa_table(
apa_lm$table
, caption = "A full regression table."
, escape = FALSE
)
Predictor | \(b\) | 95% CI | \(t(273)\) | \(p\) |
---|---|---|---|---|
Intercept | 18.50 | \([13.10\), \(23.91]\) | 6.74 | < .001 |
Base QoL | 0.59 | \([0.50\), \(0.67]\) | 13.23 | < .001 |
BDI | 0.17 | \([0.11\), \(0.22]\) | 6.08 | < .001 |
lm_res <- lm(Post_QoL ~ Base_QoL + BDI, data = cosmetic_surgery)
lm_res_apa <- apa_print(lm_res, observed_predictors = TRUE)
lm_res_apa$estimate$Intercept
## [1] "$b = 18.50$, 95\\% CI $[13.10$, $23.91]$"
\(b = 18.50\), 95% CI \([13.10\), \(23.91]\)
–
lm_res_apa$full_result$modelfit$r2
## [1] "$R^2 = .50$, 90\\% CI $[0.42$, $0.57]$, $F(2, 273) = 136.78$, $p < .001$"
\(R^2 = .50\), 90% CI \([0.42\), \(0.57]\), \(F(2, 273) = 136.78\), \(p < .001\)
Tables produced by apa_print()
have variable labels
lm_res_apa$table
## A data.frame with 5 labelled columns:
##
## predictor estimate ci statistic p.value
## 1 Intercept 18.50 $[13.10$, $23.91]$ 6.74 < .001
## 2 Base QoL 0.59 $[0.50$, $0.67]$ 13.23 < .001
## 3 BDI 0.17 $[0.11$, $0.22]$ 6.08 < .001
##
## predictor: Predictor
## estimate : $b$
## ci : 95\% CI
## statistic: $t(273)$
## p.value : $p$
Tables produced by apa_print()
have variable labels
variable_labels(lm_res_apa$table)
## $predictor
## [1] "Predictor"
##
## $estimate
## [1] "$b$"
##
## $ci
## [1] "95\\% CI"
##
## $statistic
## [1] "$t(273)$"
##
## $p.value
## [1] "$p$"
variable_labels(letters) <- "Alphabet"
letters[1:15]
## Variable label : Alphabet
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"
## attr(,"class")
## [1] "papaja_labelled" "character"
variable_labels(cars$speed) <- "Speed [m/h]"
cars[1:15, "speed"]
## Variable label : Speed [m/h]
## [1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12
## attr(,"class")
## [1] "papaja_labelled" "numeric"
Variable labels can be added to any vector
or data.frame
variable_labels(cars) <- c(
speed = "Speed [m/h]"
, dist = "Stopping distance [ft]"
)
cars[1:3, ]
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
Variable labels are attributes of the variables that make up the data.frame
- Conflicting implementations
- Ours is, in principle, compatible with
Hmisc
variable_labels(cars$dist)
## [1] "Stopping distance [ft]"
cars[1:15, "dist"]
## Variable label : Stopping distance [ft]
## [1] 2 10 4 22 16 10 18 26 34 17 28 14 20 24 28
## attr(,"class")
## [1] "papaja_labelled" "numeric"
4.4.2.1 anova (model comparisons)
…
4.4.3 anova, Anova, Anova.mlm (Analysis of Variance – ANOVA)
Foreword – you can skip this if you are not interested in background. Amongst psychologists, analysis of variance is a still-popular and frequently used technique. In base R, however, analysis of variance is not implemented in a way that is too useful for psychologists, as the aov
-function in the stats package does not provide the much sought-after Type-\(\rm{I\!I\!I}\) sums-of-squares. A variety of R packages has emerged to fill the gap, e.g. car
(Fox & Weisberg, 2011), ez
(Lawrence, 2016), and afex
(Singmann et al., 2016). Unfortunately, each ANOVA function provides different output objects that need to be digested by apa_print()
. To do so, we rely on the package authors to return standardized result objects (i.e., S3/S4 classes). For this reason apa_print()
does not support results from the popular ezANOVA()
. The object returned by this function has no class to inform us about its contents and, thus, cannot be processed. We recommend you try one of the other available packages, e.g. aov_ez()
from the afex
-package.
Functionality.
cosmetic_surgery_anova <- afex::aov_ez(
data = cosmetic_surgery
, dv = "Post_QoL"
, id = "ID"
, between = c("Clinic", "Gender")
)
apa_anova <- apa_print(cosmetic_surgery_anova)
Now, you can report the results of your analyses like so:
Clinic (`r apa_anova$full$Clinic`) and patient gender affected post-surgery quality of life, `r apa_anova$full$Gender`. However, the effect of gender differed by clinic, `r apa_anova$full$Clinic_Gender`.
Which will yield the following text:
Clinic (\(F(9, 256) = 19.98\), \(\mathit{MSE} = 47.58\), \(p < .001\), \(\hat{\eta}^2_G = .413\)) and patient gender affected post-surgery quality of life, \(F(1, 256) = 15.29\), \(\mathit{MSE} = 47.58\), \(p < .001\), \(\hat{\eta}^2_G = .056\). However, the effect of gender differed by clinic, \(F(9, 256) = 1.97\), \(\mathit{MSE} = 47.58\), \(p = .043\), \(\hat{\eta}^2_G = .065\).
Again, you can easily create a complete ANOVA table by passing recall_anova_results$table
to apa_table()
, see .
apa_table(
apa_anova$table
, caption = "A really beautiful ANOVA table."
, note = "Note that the column names contain beautiful mathematical copy: This is because the table has variable labels."
)
Effect | \(F\) | \(\mathit{df}_1\) | \(\mathit{df}_2\) | \(\mathit{MSE}\) | \(p\) | \(\hat{\eta}^2_G\) |
---|---|---|---|---|---|---|
Clinic | 19.98 | 9 | 256 | 47.58 | < .001 | .413 |
Gender | 15.29 | 1 | 256 | 47.58 | < .001 | .056 |
Clinic \(\times\) Gender | 1.97 | 9 | 256 | 47.58 | .043 | .065 |
4.4.3.1 glht, lsmeans, ref.grid (post-hoc tests)
…
4.5 Caching expensive computations
If R codes take a long time to run, results can be cached
```{r heavy-computation, .highlight[cache = TRUE]}
# Imagine computationally expensive R code here
```
my_manuscript_cache
directory is created that contains - Computed objects - List of loaded R packages in __packages
The next time the chunk is executed
- If unchanged, cached results are loaded
- If changed, cache is discarded and code is executed anew
–
Packages are not reloaded, which cause problems in subsequent uncached chunks
- Do not include
library()
-calls in cached chunks!
–
Load cache into an interactive R session using lazyLoad()
library()
-calls. Subsequent use of functions from these packages in non-cached chunks will throw errors!
To avoid incorrect results due to caching, dependencies should be specified
```{r chunk1, cache = TRUE}
x <- 5
```
```{r chunk2, cache = TRUE, .highlight[dependson = “chunk1”]}
x <- x + 5
```
If code in chunk1
changes, chunk2
is recomputed
To avoid incorrect results due to caching, dependencies should be specified
knitr::opts_chunk$set(autodep = TRUE)
knitr::dep_auto()
See knitr cache examples for details
Control when cache is discarded
- Ignore changes to comments
knitr::opts_chunk$set(cache.comments = FALSE)
- Discard cache when random seed changes
opts_chunk$set(cache.extra = knitr::rand_seed)
I would recommend to always set these options
- Discard cache if data are modified
opts_chunk$set(cache.extra = file.info("data.csv")$mtime)
- Discard cache if the R environment changes
opts_chunk$set(
cache.extra = list(R.version, sessionInfo())
)
If time permits, always discard all cached results (i.e., delete my_manuscript_cache
directory) and reknit the document before submission to ensure that all results are based on the current R code!
4.5.1 Inserting variables before they have been computed
R code in R Markdown documents in executed from top to bottom. Hence, variables are not available before the code chunk in which they are defined has been executed. However, sometimes it may be desireable to report a variable in a document before the corresponding code chunk. A common example is to include the sample size or focal test statistics in the abstract.
Cached code chunks store their results in a cache database in the subfolder my_document_cache
(default). The function knitr::load_cache()
can be used to access cache databases and report results of a chunk anywhere in the document. For the following example, we’ll assume that we want to report the sample size in the abstract. The sample size is calculated in the method section:
```{r, sample-size, cache = TRUE}
n <- nrow(cosmetic_surgery)
```
To report this variable in the abstract we need to use an inline code chunk in the YAML front matter field:
---
abstract: |
This is the abstract.
Our sample size was n = `r knitr::load_cache(label = "sample-size", object = "n")`.
---
When the document is first rendered the object n
in the code chunk sample-size
has not yet been computed and knitr::load_cache()
will insert "NOT AVAILABLE"
instead. The next time the document is rendered, the results of the previous run have been saved as cache database. These results will be loaded and inserted into the abstract.
knitr::load_cache()
can only ever display the result of the previous run—if the cached code is changed, the document has to be rendered twice to display the new result.
References
Fox, J., & Weisberg, S. (2011). An R companion to applied regression (Second.). Thousand Oaks CA: Sage. Retrieved from http://socserv.socsci.mcmaster.ca/jfox/Books/Companion
Lawrence, M. A. (2016). Ez: Easy analysis and visualization of factorial experiments. Retrieved from https://CRAN.R-project.org/package=ez
Singmann, H., Bolker, B., Westfall, J., & Aust, F. (2016). Afex: Analysis of factorial experiments. Retrieved from https://CRAN.R-project.org/package=afex
When you use
xtable()
, table captions are set to the left page margin.↩