Introduction
Survey data often suffer from dropout: participants stop completing the survey, or leave out sections of it, because of interruptions or omissions. Handling dropouts effectively is crucial for accurate data analysis and interpretation. The dropout package helps by showing where and how participants stop answering during the survey process.
Understanding the Dropout Package
The dropout package empowers you with the capability to extract valuable insights from your dataset, such as:
- Identifying the specific survey points where participants tend to stop completing the survey.
- Detecting sections that are frequently skipped by respondents.
- Quantifying the extent and locations of dropouts within the survey.
- Estimating the proportion of missing values attributed to dropouts in each column.
- Profiling respondents who discontinued the survey and pinpointing their dropout points.
- And much more…
By leveraging these insights, you can:
- Enhance data cleaning procedures by distinguishing dropouts and tailoring your approach accordingly.
- Adjust your analytical strategies to accommodate dropout-related biases in the dataset.
- Analyze omitted sections in conjunction with the collected data, yielding a comprehensive understanding.
In this vignette, we will provide an in-depth overview of the dropout
package’s features and their practical utilization. We will use a sample
dataset named “flying” to illustrate these concepts. This is a modified
version of the Flying Etiquette Survey data behind the story “41 percent
of flyers say it’s rude to recline your seat on an airplane”. You can
load this preinstalled dataset into your environment using the command
data(flying).
Prerequisites
While the dropout package can function independently, integrating it with the tidyverse ecosystem (especially dplyr) can significantly enhance your workflow. However, all of the methods used in this vignette can also be written exclusively in base R.
Exploring Dropout Insights with drop_summary
Let’s take a closer look at the dropout package, starting with the
drop_summary function. This function is the main tool for getting an
overview of dropout patterns within your dataset. To use drop_summary
effectively, you should specify the last column in your dataset that
corresponds to the survey items. If you encounter a warning message
while using this function, it could be attributed to either of the
following reasons:
- Your dataset contains no instances of dropout.
- Additional columns beyond the survey items exist, but they aren’t correctly identified as such.
For example, in the “flying” dataset, the final survey-related item
is stored in the “location_census_region” column. Following this, the
“survey_type” column contains supplementary survey information. Many
datasets incorporate similar non-survey-related data, and it’s crucial
to consider such cases. If the last column is left unspecified, the
drop_summary function will assume that only survey-related items are
present.
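To see why the last survey column matters, here is a minimal base R sketch on made-up data. It assumes that a dropout shows up as an unbroken run of NA values at the end of a row; trailing_na() is a hypothetical helper written for this illustration and is not part of the dropout package.

```r
# Toy survey: three items followed by a metadata column that is never NA.
# (Made-up data, not the flying dataset.)
toy <- data.frame(
  q1 = c(1, 2, 3),
  q2 = c(1, NA, 3),
  q3 = c(1, NA, NA),
  survey_type = c("web", "web", "phone")
)

# Count the unbroken run of NAs at the end of each row.
# Illustrative helper only; the package may work differently internally.
trailing_na <- function(df) {
  apply(is.na(df), 1, function(row) {
    runs <- rle(rev(row))
    if (runs$values[1]) runs$lengths[1] else 0L
  })
}

# With the metadata column included, no row ends in NA.
trailing_na(toy)

# Restricted to the survey items, rows 2 and 3 end in a run of NAs.
trailing_na(toy[, c("q1", "q2", "q3")])
```

In this toy case the filled-in survey_type column hides the trailing NA runs, which is the situation that specifying the last survey column is meant to avoid.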
To gain a comprehensive overview of the dropout patterns within your dataset, consider the following code snippet:
drop_summary(flying, "location_census_region")
#> # A tibble: 27 × 8
#>    column_name           dropout drop_rate drop_na section_na single_na missing
#>    <chr>                   <int>     <dbl>   <dbl>      <dbl>     <dbl>   <dbl>
#>  1 respondent_id               0      0          0          0         0       0
#>  2 travel_frequency            0      0          0          0         0       0
#>  3 seat_recline               18      0.02      18        164         0     182
#>  4 height                      0      0.02      18        164        12     194
#>  5 children_under_18           1      0.02      19        164         6     189
#>  6 two_armrests                1      0.02      20        164         0     184
#>  7 middle_armrest              0      0.02      20        164         0     184
#>  8 window_shade                0      0.02      20        164         0     184
#>  9 moving_to_unsold_seat       1      0.02      21        164         0     185
#> 10 talking_to_seatmate         0      0.02      21        164         0     185
#> # ℹ 17 more rows
#> # ℹ 1 more variable: completion_rate <dbl>
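Because the result is an ordinary data frame or tibble, it can be filtered like any other table. The sketch below retypes the first ten rows of the output above by hand (so it runs without the package installed) and uses base R to list the columns where respondents stopped answering. The object name smry is an arbitrary choice for this example.

```r
# First ten rows of the summary shown above, typed in by hand.
smry <- data.frame(
  column_name = c("respondent_id", "travel_frequency", "seat_recline", "height",
                  "children_under_18", "two_armrests", "middle_armrest",
                  "window_shade", "moving_to_unsold_seat", "talking_to_seatmate"),
  dropout    = c(0, 0, 18, 0, 1, 1, 0, 0, 1, 0),
  drop_na    = c(0, 0, 18, 18, 19, 20, 20, 20, 21, 21),
  section_na = c(0, 0, 164, 164, 164, 164, 164, 164, 164, 164),
  single_na  = c(0, 0, 0, 12, 6, 0, 0, 0, 0, 0),
  missing    = c(0, 0, 182, 194, 189, 184, 184, 184, 185, 185),
  stringsAsFactors = FALSE
)

# Columns at which at least one respondent stopped answering.
smry[smry$dropout > 0, c("column_name", "dropout")]

# In these ten rows, drop_na is the running total of dropout ...
all(smry$drop_na == cumsum(smry$dropout))

# ... and missing is the sum of the three NA categories.
all(smry$missing == smry$drop_na + smry$section_na + smry$single_na)
```

With the package loaded, the same subsetting can be applied directly to the object returned by drop_summary(flying, "location_census_region").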
Now, let’s look at the drop_summary function in more detail and at the
insights it provides in a structured format.
Understanding the Output of drop_summary
When you use the drop_summary function, the output you receive is a
compact yet informative summary, packaged as either a dataframe or a
tibble. This summary consists of multiple columns, each of which
provides insights into a different dimension of dropout analysis within
your dataset.
Output Columns:
- column_name: Lists the names of the columns from your dataset that have been analyzed for dropouts.
- dropout: Contains the number of participants who dropped out at each listed column, allowing you to see where dropouts are most concentrated.
- drop_rate: Shows the share of dropouts up to each column. In the sample output above it is reported as a proportion (0.02, i.e. about 2 percent) and stays constant across columns with no new dropouts, which suggests it is cumulative. This is useful for understanding the relative impact of dropouts in various parts of your dataset.
- drop_na: Provides the missing values in each column that can be attributed specifically to dropouts. In the sample output above it is a count that grows with each additional dropout. This offers insights into the nature of missing data.
- section_na: Indicates occurrences of missing values that span at least n consecutive columns (n defaults to 3). You can adjust this parameter using section_min as shown below:
drop_summary(data, last_col, section_min = n)
  This column is particularly useful for identifying participants who might skip entire sections of a survey without dropping out completely.
- single_na: Reveals the single-instance missing values in each column, which are not associated with systematic dropouts or section skips.
- missing: The total number of missing values in each column. In the sample output above it equals the sum of drop_na, section_na and single_na.
- completion_rate: Denotes the overall data completion rate for each analyzed column, enabling you to gauge the integrity and reliability of your dataset.
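The three kinds of missing values can be made concrete with a small base R sketch for a single respondent. The classification rules in the comments are assumptions made for this illustration, and classify_na() is a hypothetical helper, not the package's implementation.

```r
# Classify the NAs in one respondent's answers.
# Assumed rules, for illustration only:
#   dropout: the NA run that reaches the last survey column
#   section: any other run of at least `section_min` consecutive NAs
#   single : every remaining NA
classify_na <- function(answers, section_min = 3) {
  runs <- rle(is.na(answers))
  n_runs <- length(runs$lengths)
  label <- rep("answered", n_runs)
  label[runs$values] <- "single"
  label[runs$values & runs$lengths >= section_min] <- "section"
  if (runs$values[n_runs]) label[n_runs] <- "dropout"
  rep(label, times = runs$lengths)
}

# One made-up respondent: a single skip, a skipped section, then a dropout.
answers <- c(5, NA, 4, NA, NA, NA, 2, 3, NA, NA)
classify_na(answers)
table(classify_na(answers))
```

Raising section_min to 4 in this sketch would reclassify the three consecutive NAs as single missing values, which mirrors the role the section_min argument plays in the description above.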
Using drop_detect to Identify Dropouts
One of the core tools in the dropout package is the drop_detect
function. This function serves as a comprehensive tool for isolating and
understanding individual participant dropout behaviors. Specifically, it
reveals whether a participant has left the survey prematurely and
pinpoints the exact juncture at which the dropout took place.
The structure and usage of drop_detect intentionally mirror those of
the drop_summary function, ensuring consistency and ease of adoption.
Output Columns:
- dropout: A Boolean column (TRUE or FALSE). A TRUE value signifies that the respective participant exited the survey prematurely.
- dropout_column: For those marked as TRUE in the dropout column, this field specifies the exact column or question at which the dropout occurred.
- dropout_index: Gives the index of that column, i.e. its position in the dataset, which makes it easy to trace the dropout point and to filter by how far a participant got.
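The shape of this output can be sketched in base R on made-up data. The code below is a hypothetical re-implementation of the idea, not the package's code: it assumes that a dropout is an unbroken run of NA values up to the last column and that dropout_index refers to the position of the first column in that run.

```r
# Illustrative helper only; detect_trailing_na() is not part of the package.
detect_trailing_na <- function(df) {
  na_mat <- is.na(df)
  n_col <- ncol(na_mat)
  # Position of the last answered item in each row (0 if nothing was answered).
  last_answered <- unname(apply(na_mat, 1, function(row) {
    answered <- which(!row)
    if (length(answered) == 0) 0L else max(answered)
  }))
  dropped <- last_answered < n_col
  # First column of the trailing NA run, or NA for complete rows.
  index <- ifelse(dropped, last_answered + 1L, NA_integer_)
  data.frame(
    dropout = dropped,
    dropout_column = colnames(df)[index],
    dropout_index = index,
    stringsAsFactors = FALSE
  )
}

# Made-up data: respondent 1 finishes, 2 stops at q2, 3 stops at q3.
toy <- data.frame(q1 = c(1, 2, 3), q2 = c(1, NA, 3), q3 = c(1, NA, NA))
detect_trailing_na(toy)
```

Each row of the result lines up with one row of the input, which is what makes it possible to bind the detection columns back onto the original data.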
Example Usage:
For practical insights, consider applying drop_detect on the “flying”
dataset. Here’s how you can achieve this:
drop_detect(flying, "location_census_region")
Moreover, if you wish to append the extracted dropout details back to
the original dataset, you can employ the bind_cols function from the
dplyr package:
drop_detect(flying, "location_census_region") %>%
  bind_cols(flying, .)
Adding the dropout details to the primary dataset is a useful first step for more detailed analyses, such as zooming in on specific dropout points or assessing what participants who dropped out have in common.
The following sections show examples of this combined approach.
Practical Workflow Examples
Cleaning Early Dropouts
The drop_detect function can be useful for identifying and filtering
out early dropouts, i.e., participants who stopped answering the survey
at a specific column. For example, you can use dropout_index to keep
only participants who did not drop out at all or who had a ‘late’
dropout in the demographic part of the questions.
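A minimal base R sketch of such a filter is shown below. It uses a hand-made stand-in for the combined dataset, with made-up values, and demographics_start is a hypothetical column position chosen for this example; the real position depends on your data.

```r
# Stand-in for the original data with the detection columns appended.
# All values are made up for illustration.
responses <- data.frame(
  respondent_id  = 1:5,
  dropout        = c(FALSE, TRUE, TRUE, FALSE, TRUE),
  dropout_column = c(NA, "seat_recline", "age", NA, "two_armrests"),
  dropout_index  = c(NA, 3L, 24L, NA, 6L),
  stringsAsFactors = FALSE
)

# Hypothetical position of the first demographic question.
demographics_start <- 23L

# Keep respondents who completed the survey or dropped out only in the
# demographic block. For completed rows the index is NA, but TRUE | NA
# evaluates to TRUE, so those rows are kept.
keep <- !responses$dropout | responses$dropout_index >= demographics_start
cleaned <- responses[keep, ]
cleaned$respondent_id
```

With dplyr, the same condition can be passed to filter() after the bind_cols step.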
Analyzing Specific Sections
If you’re interested in a specific section of questions and want to
filter for dropouts and section_na, you have two approaches:
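One way to flag such respondents, sketched here in base R on made-up data, is to check whether every item of the section is missing. This is an illustration with hypothetical column names (s1 to s3 stand for the section of interest), not a function of the package.

```r
# Made-up survey: q1, a three-item section (s1 to s3), then q5.
toy <- data.frame(
  q1 = c(1, 2, 3, 4),
  s1 = c(1, NA, NA, 2),
  s2 = c(2, NA, NA, NA),
  s3 = c(3, NA, NA, 1),
  q5 = c(1, 2, NA, 3)
)
section_cols <- c("s1", "s2", "s3")

# TRUE when a respondent answered none of the section's items, whether
# they skipped the section (row 2) or had already dropped out (row 3).
skipped_section <- rowSums(is.na(toy[section_cols])) == length(section_cols)
skipped_section

# Respondents with at least one answer in the section.
toy[!skipped_section, ]
```

Row 4 has a single missing item inside the section and is therefore not flagged.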
Comparative Analysis: Age and Gender
One practical application is to compare the demographics (e.g., age and gender) between those who left out a section and those who did not. The following code generates a bar graph that shows the number of respondents with and without a dropout, broken down by age group and gender.
library(dropout)
library(dplyr)
library(ggplot2)

flying %>%
  # Detect dropouts up to the "smoking_violation" column
  drop_detect("smoking_violation") %>%
  bind_cols(flying, .) %>%
  filter(!is.na(gender)) %>%
  # Order the age groups for the x axis
  mutate(age = factor(age, levels = c("18-29", "30-44", "45-60", "> 60"))) %>%
  ggplot(aes(x = age, fill = dropout)) +
  geom_bar(position = "dodge") +
  facet_grid(gender ~ .) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
By visualizing the data, you can more easily discern patterns and disparities among different demographic groups with respect to dropout rates.
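If you prefer rates to counts, the share of dropouts per group can be computed directly. The following base R sketch uses made-up respondents; with real data, the age, gender and dropout columns would come from the combined dataset.

```r
# Made-up respondents, for illustration only.
toy <- data.frame(
  age     = c("18-29", "18-29", "30-44", "30-44", "30-44", "45-60", "45-60", "> 60"),
  gender  = c("Female", "Male", "Female", "Female", "Male", "Male", "Male", "Female"),
  dropout = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)
toy$age <- factor(toy$age, levels = c("18-29", "30-44", "45-60", "> 60"))

# The mean of a logical vector is the share of TRUE values, so this gives
# the dropout rate for every age and gender combination (NA if a
# combination does not occur in the data).
rates <- tapply(toy$dropout, list(age = toy$age, gender = toy$gender), mean)
rates
```

Rates are easier to compare than raw counts when the groups differ a lot in size.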
Exploring Relationships: Dropout, Age and Gender
Another interesting avenue is to explore whether there’s a relationship between dropout behavior (or in this case leaving out a section) and demographic variables like age and gender:
library(dropout)
library(dplyr)

# Build the modelling data: one row per respondent with a 0/1 dropout flag
test <- flying %>%
  drop_detect("smoking_violation") %>%
  bind_cols(flying, .) %>%
  filter(!is.na(gender)) %>%
  select(dropout, age, gender) %>%
  mutate(dropout = as.integer(dropout))

# Logistic regression of dropout on gender and age group
glm_model <- glm(dropout ~ gender + age, data = test, family = binomial)
print(summary(glm_model))
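The coefficient table of the model indicates whether gender or age group is associated with the probability of leaving out the section, which can be useful for hypothesis testing and can aid in uncovering patterns in your data.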
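Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below demonstrates this on simulated data (not the flying dataset); the variable names and the simulated effect are assumptions made for the example.

```r
set.seed(1)
n <- 200

# Simulated respondents with the same kind of predictors as above.
sim <- data.frame(
  gender = sample(c("Female", "Male"), n, replace = TRUE),
  age    = sample(c("18-29", "30-44", "45-60", "> 60"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
# Fix the level order so that "18-29" is the reference group.
sim$age <- factor(sim$age, levels = c("18-29", "30-44", "45-60", "> 60"))

# Simulated outcome with a higher dropout probability in the youngest group.
p <- ifelse(sim$age == "18-29", 0.4, 0.15)
sim$dropout <- rbinom(n, size = 1, prob = p)

fit <- glm(dropout ~ gender + age, data = sim, family = binomial)

# Odds ratios relative to the reference levels (Female, 18-29).
odds_ratios <- exp(coef(fit))
round(odds_ratios, 2)
```

An odds ratio below 1 for an age group means lower odds of dropout than in the reference group, and a value above 1 means higher odds. The same exp(coef(...)) call can be applied to glm_model.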