diff --git a/episodes/clean-data.Rmd b/episodes/clean-data.Rmd index d95c75f3..c78753bc 100644 --- a/episodes/clean-data.Rmd +++ b/episodes/clean-data.Rmd @@ -12,28 +12,27 @@ exercises: 10 ::::::::::::::::::::::::::::::::::::: objectives - Explain how to clean, curate, and standardize case data using `{cleanepi}` package -- Perform essential data-cleaning operations on a real case dataset. +- Perform essential data-cleaning operations on a real case dataset :::::::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::: prereq -In this episode, we will use a simulated Ebola dataset that can be: +In this episode, we will use a simulated Ebola dataset. To access it: -- Download the [simulated_ebola_2.csv](https://epiverse-trace.github.io/tutorials-early/data/simulated_ebola_2.csv) +- Download the [simulated_ebola_2.csv](https://epiverse-trace.github.io/tutorials-early/data/simulated_ebola_2.csv) file - Save it in the `data/` folder. Follow instructions in Setup to [configure an RStudio Project and folder](../learners/setup.md#setup-an-rstudio-project-and-folder) ::::::::::::::::::::: ## Introduction -In the process of analyzing outbreak data, it's essential to ensure that the dataset is clean, curated, standardized, and validated. This will ensure that analysis is accurate (i.e. you are analysing what you think you are analysing) and reproducible (i.e. if someone wants to go back and repeat your analysis steps with your code, you can be confident they will get the same results). - This episode focuses on cleaning epidemics and outbreaks data using the - [cleanepi](https://epiverse-trace.github.io/cleanepi/) package, - For demonstration purposes, we'll work with a simulated dataset of Ebola cases. -Let's start by loading the package `{rio}` to read data and the package `{cleanepi}` -to clean it. 
We'll use the pipe `%>%` to connect some of their functions, including others from -the package `{dplyr}`, so let's also call to the tidyverse package: +In the process of analyzing outbreak data, it's essential to ensure that the dataset is clean, curated, standardized, and validated. This will ensure that analysis is accurate (i.e., you are analyzing what you think you are analyzing) and reproducible (i.e., if someone wants to go back and repeat your analysis steps with your code, you can be confident they will get the same results). +This episode focuses on cleaning epidemic and outbreak data using the [cleanepi](https://epiverse-trace.github.io/cleanepi/) package. For demonstration purposes, we'll work with a simulated dataset of Ebola cases. + +Let's start by loading the package `{rio}` to read data and the package `{cleanepi}` +to clean it. We'll use the pipe `%>%` to connect some of their functions, including others from +the package `{dplyr}`, so let's also load the tidyverse package: ```{r,eval=TRUE,message=FALSE,warning=FALSE} # Load packages @@ -48,11 +47,11 @@ library(cleanepi) ### The double-colon (`::`) operator The `::` in R lets you access functions or objects from a specific package without attaching the entire package to the search path. It offers several important -advantages including the followings: +advantages, including the following: -* Telling explicitly which package a function comes from, reducing ambiguity and potential conflicts when several packages have functions with the same name. -* Allowing to call a function from a package without loading the whole package -with library(). 
+* Telling explicitly which package a function comes from, reducing ambiguity and potential conflicts when several packages have functions with the same name +* Allowing you to call a function from a package without loading the whole package +with `library()` For example, the command `dplyr::filter(data, condition)` means we are calling the `filter()` function from the `{dplyr}` package. @@ -60,13 +59,13 @@ the `filter()` function from the `{dplyr}` package. ::::::::::::::::::: -The first step is to import the dataset into working environment. This can be -done by following the guidelines outlined in the [Read case data](../episodes/read-cases.Rmd) episode. It involves loading the dataset into -`R` environment and view its structure and content. +The first step is to import the dataset into the working environment. This can be +done by following the guidelines outlined in the [Read case data](../episodes/read-cases.Rmd) episode. It involves loading the dataset into the +`R` environment and viewing its structure and content. ```{r,eval=FALSE,echo=TRUE,message=FALSE} # Read data -# e.g.: if path to file is data/simulated_ebola_2.csv then: +# e.g., if path to file is data/simulated_ebola_2.csv then: raw_ebola_data <- rio::import( here::here("data", "simulated_ebola_2.csv") ) %>% @@ -102,11 +101,11 @@ You can use the following terms to **diagnose characteristics**: - *Codification*, like the codification of values in columns like 'gender' and 'age' using numbers, letters, and words. Also the presence of multiple dates formats ("dd/mm/yyyy", "yyyy/mm/dd", etc) in the same column like in -'date_onset'. Less visible, but also the column names. +'date_onset'. Less visible, but also the column names - *Missing*, how to interpret an entry like "" in the 'status' column or "-99" in other circumstances? Do we have a data dictionary from the data collection process? -- *Inconsistencies*, like having a date of sample before the date of onset. 
-- *Non-plausible values*, like observations where some dates values are outside of the expected timeframe. +- *Inconsistencies*, like having a date of sample before the date of onset +- *Non-plausible values*, like observations where some dates values are outside of the expected timeframe - *Duplicates*, are all observations unique? You can use these terms to relate to **cleaning operations**: @@ -119,7 +118,7 @@ You can use these terms to relate to **cleaning operations**: :::::::::::::::::::::::::::::: -## A quick inspection +## A quick inspection Quick exploration and inspection of the dataset are crucial to identify potential data issues before diving into any analysis tasks. The `{cleanepi}` @@ -130,17 +129,17 @@ cleanepi::scan_data(raw_ebola_data) ``` -The results provide an overview of the content of all character columns, including column names, and the percent of some data types within them. +The results provide an overview of the content of all character columns, including column names, and the percentage of some data types within them. You can see that the column names in the dataset are descriptive but lack consistency. Some are composed of multiple words separated by white spaces. Additionally, some columns contain more than one data type, and there are missing values in the form of an empty string in others. ## Common operations -This section demonstrate how to perform some common data cleaning operations using the `{cleanepi}` package. +This section demonstrates how to perform some common data cleaning operations using the `{cleanepi}` package. ### Standardizing column names -For this example dataset, standardizing column names typically involves removing with spaces and connecting different words with “_”. This practice helps +For this example dataset, standardizing column names typically involves removing white spaces and connecting different words with “_”. This practice helps maintain consistency and readability in the dataset. 
However, the function used for standardizing column names offers more options. Type `?cleanepi::standardize_column_names` in the console for more details. ```{r} @@ -155,7 +154,7 @@ column names that are intended to be kept unchanged. - What differences can you observe in the column names? -- Standardize the column names of the input dataset, but keep the first column names as it is. +- Standardize the column names of the input dataset, but keep the first column name as it is ::::::::::::::::: hint @@ -168,7 +167,7 @@ You can try `cleanepi::standardize_column_names(data = raw_ebola_data, keep = "V ### Removing irregularities Raw data may contain fields that don't add any variability to the data such as **empty** rows and columns, or **constant** columns (where all entries have the same value). It can also contain **duplicated** rows. Functions from -`{cleanepi}` like `remove_duplicates()` and `remove_constants()` remove such irregularities as demonstrated in the below code chunk. +`{cleanepi}` like `remove_duplicates()` and `remove_constants()` remove such irregularities as demonstrated in the code chunk below. ```{r} # Remove constants @@ -187,9 +186,9 @@ columns. --> ::::::::::::::::::::: spoiler -#### How many rows you removed? What rows where removed? +#### How many rows did you remove? Which rows were removed? -You can get the number and location of the duplicated rows that where found. Run `cleanepi::print_report()`, wait for the report to open in your browser, and +You can get the number and location of the duplicated rows that were found. Run `cleanepi::print_report()`, wait for the report to open in your browser, and find the "Duplicates" tab. To use this information within R, you can print data frames with specific sections of the report in the console using the argument `what`. 
@@ -261,7 +260,7 @@ df <- df %>% cleanepi::remove_constants(cutoff = 0.5) ### Replacing missing values -In addition to the irregularities, raw data may contain missing values, and these may be encoded by different strings (e.g. `"NA"`, `""`, `character(0)`). To ensure robust analysis, it is a good practice to replace all missing values by `NA` in the entire dataset. Below is a code snippet demonstrating how you can achieve this in `{cleanepi}` for missing entries represented by an empty string `""`: +In addition to the irregularities, raw data may contain missing values, and these may be encoded by different strings (e.g., `"NA"`, `""`, `character(0)`). To ensure robust analysis, it is a good practice to replace all missing values by `NA` in the entire dataset. Below is a code snippet demonstrating how you can achieve this in `{cleanepi}` for missing entries represented by an empty string `""`: ```{r} sim_ebola_data <- cleanepi::replace_missing_values( @@ -274,12 +273,12 @@ sim_ebola_data ### Validating subject IDs -Each entry in the dataset represents a subject (e.g. a disease case or study participant) and should be distinguishable by a specific ID formatted in a -particular way. These IDs can contain numbers falling within a specific range, a prefix and/or suffix, and might be written such that they contain a specific number of characters. The `{cleanepi}` package offers the function `check_subject_ids()` designed precisely for this task as shown in the below code chunk. This function checks whether the IDs are unique and meet the required criteria specified by the user. +Each entry in the dataset represents a subject (e.g., a disease case or study participant) and should be distinguishable by a specific ID formatted in a +particular way. These IDs can contain numbers falling within a specific range, a prefix and/or suffix, and might be written such that they contain a specific number of characters. 
The `{cleanepi}` package offers the function `check_subject_ids()` designed precisely for this task as shown in the code chunk below. This function checks whether the IDs are unique and meet the required criteria specified by the user. ```{r} -# check if the subject IDs in the 'case_id' column contains numbers ranging +# check if the subject IDs in the 'case_id' column contain numbers ranging # from 0 to 15000 sim_ebola_data <- cleanepi::check_subject_ids( data = sim_ebola_data, @@ -300,7 +299,7 @@ tab to identify what IDs require an extra treatment. In the console, you can print: ```{r, eval=FALSE} -print_report(data = sim_ebola_data, "incorrect_subject_id") +print_report(data = sim_ebola_data, what = "incorrect_subject_id") ``` After finishing this tutorial, we invite you to explore the package reference guide of [cleanepi::check_subject_ids()](https://epiverse-trace.github.io/cleanepi/reference/check_subject_ids.html) to find the @@ -310,7 +309,7 @@ function that can fix this situation. ### Standardizing dates -An epidemic dataset typically contains date columns for different events, such as the date of infection, date of symptoms onset, etc. These dates can come in different date formats, and it is good practice to standardize them to benefit from the powerful R functionalities designed to handle date values in downstream analyses. The `{cleanepi}` package provides functionality for converting date columns of epidemic datasets into ISO8601 format, ensuring consistency across the different date columns. Here's how you can use it on our simulated dataset: +An epidemic dataset typically contains `Date` columns for different events, such as the date of infection, date of symptoms onset, etc. These dates can come in different date formats, and it is good practice to standardize them to benefit from the powerful R functionalities designed to handle date values in downstream analyses. 
The `{cleanepi}` package provides functionality for converting date columns of epidemic datasets into ISO8601 format, ensuring consistency across the different date columns. Here's how you can use it on our simulated dataset: ```{r} sim_ebola_data <- cleanepi::standardize_dates( @@ -327,7 +326,7 @@ This function converts the values in the target columns into the **YYYY-mm-dd** #### How is this possible? -We invite you to find the key package that makes this standardisation possible inside `{cleanepi}` by reading the “Details” section of the +We invite you to find the key package that makes this standardization possible inside `{cleanepi}` by reading the “Details” section of the [Standardize date variables reference manual](https://epiverse-trace.github.io/cleanepi/reference/standardize_dates.html#details)! Also, check how to use the `orders` argument if you want to target US format character strings. You can explore [this reproducible example](https://github.com/epiverse-trace/cleanepi/discussions/262). @@ -336,10 +335,9 @@ Also, check how to use the `orders` argument if you want to target US format cha ### Converting to numeric values -In the raw dataset, some columns can come with mixture of character and numerical values, and you will often want to convert -character values for numbers explicitly into numeric values (e.g. `"seven"` to `7`). For example, in our simulated data set, in the age column some entries are -written in words. In `{cleanepi}` the function `convert_to_numeric()` does such conversion as illustrated in the below -code chunk. +In the raw dataset, some columns can come with a mixture of character and numerical values, and you will often want to convert +character values for numbers explicitly into numeric values (e.g., `"seven"` to `7`). For example, in the age column of our simulated dataset, some entries are +written in words. In `{cleanepi}` the function `convert_to_numeric()` does such conversion as illustrated in the code chunk below. 
```{r} sim_ebola_data <- cleanepi::convert_to_numeric( @@ -358,7 +356,7 @@ Thanks to the `{numberize}` package, we can convert numbers written in English, ::::::::::::::::::::::::: -## Epidemiology related operations +## Epidemiology-related operations In addition to common data cleansing tasks, such as those discussed in the above section, the `{cleanepi}` package offers additional functionalities tailored specifically for processing and analyzing outbreak and epidemic data. This section covers some of these specialized tasks. @@ -393,7 +391,7 @@ test_dict <- base::readRDS( test_dict ``` -Now, we can use this dictionary to standardize values of the the “gender” column according to predefined categories. Below is an example code chunk demonstrating how to perform this using the `clean_using_dictionary()` function from the {cleanepi} package. +Now, we can use this dictionary to standardize values of the “gender” column according to predefined categories. Below is an example code chunk demonstrating how to perform this using the `clean_using_dictionary()` function from the `{cleanepi}` package. ```{r} sim_ebola_data <- cleanepi::clean_using_dictionary( @@ -411,7 +409,7 @@ This approach simplifies the data cleaning process, ensuring that categorical va #### How to create your own data dictionary? -Note that, when a column in the dataset contains values that are not in the dictionary, the function `cleanepi::clean_using_dictionary()` will raise an error. +Note that when a column in the dataset contains values that are not in the dictionary, the function `cleanepi::clean_using_dictionary()` will raise an error. You can start a custom dictionary with a data frame inside or outside R and use the function `cleanepi::add_to_dictionary()` to include new elements in the dictionary. 
For example: ```{r} @@ -431,7 +429,7 @@ new_dictionary <- tibble::tibble( new_dictionary ``` -You can have more details in the section about "Dictionary-based data substituting" in the package +There are more details in the section about "Dictionary-based data substituting" in the package [vignette](https://epiverse-trace.github.io/cleanepi/articles/cleanepi.html#dictionary-based-data-substituting). :::::::::::::::::::::::::: @@ -439,11 +437,11 @@ You can have more details in the section about "Dictionary-based data substituti ### Calculating time span between different date events -In epidemiological data analysis, it is also useful to track and analyze time-dependent events, such as the progression of a disease outbreak (i.e., the time elapsed between today and the date the first case was reported) or the duration between date of sample collection and analysis (i.e., the time difference between today and the sample collection date). The most common example is to calculate the age of all the subjects given their dates of birth (i.e., the time difference between today and their date of birth). +In epidemiological data analysis, it is also useful to track and analyze time-dependent events, such as the time since the start of a disease outbreak (i.e., the time elapsed between today and the date the first case was reported) or the duration between date of sample collection and analysis (i.e., the time difference between today and the sample collection date). The most common example is to calculate the age of all the subjects given their dates of birth (i.e., the time difference between today and their date of birth). -The `{cleanepi}` package offers a convenient function for calculating the time elapsed between two dated events at different time scales. For example, the below code snippet utilizes the function `cleanepi::timespan()` to compute the +The `{cleanepi}` package offers a convenient function for calculating the time elapsed between two dated events. 
For example, the code snippet below uses the function `cleanepi::timespan()` to compute the time elapsed since the date of sampling of the identified cases until the $3^{rd}$ of January 2025 (`"2025-01-03"`). - + ```{r} sim_ebola_data <- cleanepi::timespan( data = sim_ebola_data, @@ -462,7 +460,7 @@ After executing the function `cleanepi::timespan()`, two new columns named `year ::::::::::::::::::::::::::::::::::::::::::::::: challenge -Age data is useful in many downstream analysis. You can categorize it to generate stratified estimates. +Age data is useful in many downstream analyses. You can categorize it to generate stratified estimates. Read the `test_df.RDS` data frame within the `{cleanepi}` package: @@ -487,7 +485,7 @@ Before calculating the age, you may need to: :::::::::::::::::::::::::: solution -In the solution we added `date_first_pcr_positive_test` as part of the Date columns to be standardised given that it often used in disease outbreak analysis. +In the solution we added `date_first_pcr_positive_test` as one of the date columns to be standardized, given that it is often used in disease outbreak analysis. ```{r} dat_clean <- dat %>% @@ -511,7 +509,7 @@ dat_clean <- dat %>% ) ``` -Now, How would you categorize a numerical variable? +Now, how would you categorize a numerical variable? :::::::::::::::::::::::::: @@ -540,8 +538,8 @@ dat_clean <- dat_clean %>% ) ``` -You can investigate the maximum values of variables from the summary made from the `skimr::skim()` function. Instead of `base::cut()` you can also use -`Hmisc::cut2(x = age_in_years, cuts = c(20,35,60))`, which gives the maximum value and do not require more arguments. +You can investigate the maximum values of variables from the summary made from the `skimr::skim()` function. Instead of `base::cut()` you can also use +`Hmisc::cut2(x = age_in_years, cuts = c(20, 35, 60))`, which automatically uses the maximum value as the upper bound and does not require additional arguments. 
:::::::::::::::::::::::::: @@ -559,8 +557,8 @@ cleaned_data <- cleanepi::clean_data(raw_ebola_data) ``` -Further more, you can combine multiple data cleaning tasks via the base R pipe (`%>%`) or the {magrittr} pipe (`%>%`) operator, as shown in the below code -snippet. +Furthermore, you can combine multiple data cleaning tasks via the base R pipe (`|>`) or the `{magrittr}` pipe (`%>%`) operator, as shown in the code +snippet below. ```{r,warning = FALSE, message = FALSE} # Perform the cleaning operations using the pipe (%>%) operator @@ -617,7 +615,7 @@ Notice that `{cleanepi}` contains a set of functions to **diagnose** the cleanin ## Cleaning report -The `{cleanepi}` package generates a comprehensive report detailing the findings and actions of all data cleansing operations conducted during the analysis. This report is presented as a HTML file that automatically opens in your browser with. Each section corresponds to a specific data cleansing operation, and clicking on each section allows you to access the results of that particular operation. This interactive approach enables users to efficiently review and analyze the effects of individual cleansing steps within the broader data cleansing process. +The `{cleanepi}` package generates a comprehensive report detailing the findings and actions of all data cleansing operations conducted during the analysis. This report is presented as an HTML file that automatically opens in your browser. Each section corresponds to a specific data cleansing operation, and clicking on each section allows you to access the results of that particular operation. This interactive approach enables users to efficiently review and analyze the effects of individual cleansing steps within the broader data cleansing process. You can view the report using: @@ -625,22 +623,20 @@ You can view the report using: cleanepi::print_report(data = cleaned_data) ``` -
-[Figure: Data cleaning report. Caption: Example of data cleaning report generated by `{cleanepi}`]
+[Figure: Data cleaning report (alt="Data cleaning report", width="600"). Caption: Example of data cleaning report generated by {cleanepi}]
::::::::::::::::::::::::::::::::::::: keypoints -- Use `{cleanepi}` package to clean and standardize epidemiological-related data +- Use the `{cleanepi}` package to clean and standardize epidemiological-related data - Understand how to use `{cleanepi}` to perform common data cleansing tasks -- View the data cleaning report in a browser, consult it and make decisions. +- View the data cleaning report in a browser, consult it and make decisions ::::::::::::::::::::::::::::::::::::: diff --git a/episodes/describe-cases.Rmd b/episodes/describe-cases.Rmd index 44a7d299..51f3562a 100644 --- a/episodes/describe-cases.Rmd +++ b/episodes/describe-cases.Rmd @@ -6,9 +6,9 @@ exercises: 10 :::::::::::::::::::::::::::::::::::::: questions -- How to aggregate and summarise case data? +- How to aggregate and summarize case data? - How to visualize aggregated data? -- What is distribution of cases across time, space, gender, and age? +- What is the distribution of cases across time, space, gender, and age? :::::::::::::::::::::::::::::::::::::::::::::::: @@ -21,26 +21,25 @@ exercises: 10 ## Introduction -In an analytic pipeline, exploratory data analysis (EDA) is an important step before formal modelling. EDA helps determine relationships between variables and summarize their main characteristics, often by means of data visualization. +In an analytic pipeline, exploratory data analysis (EDA) is an important step before formal modeling. EDA helps determine relationships between variables and summarize their main characteristics, often by means of data visualization. This episode focuses on EDA of outbreak data using R packages. -A key aspect of EDA in epidemic analysis is 'person, place and time'. It is useful to identify how observed events - such as confirmed cases, hospitalizations, deaths, and recoveries - change over time, and how these vary across different locations and demographic factors, including gender, age, and more. 
+A key aspect of EDA in epidemic analysis is 'person, place, and time'. It is useful to identify how observed events - such as confirmed cases, hospitalizations, deaths, and recoveries - change over time, and how these vary across different locations and demographic factors, including gender, age, and more. -Let's start by loading the `{incidence2}` package to aggregate the linelist data according to specific characteristics, and visualize the resulting epidemic curves (epicurves) that plot the number of new events (i.e. case incidence over time). -We'll use the `{simulist}` package to simulate the outbreak data to analyse, and `{tracetheme}` for figure formatting. We'll use the pipe operator (`%>%`) to connect some of their functions, including others from the `{dplyr}` and `{ggplot2}` packages, so let's also call to the {tidyverse} package. +Let's start by loading the `{incidence2}` package to aggregate the linelist data according to specific characteristics, and visualize the resulting epidemic curves (epicurves) that plot the number of new events (i.e., case incidence over time). +We'll use the `{simulist}` package to simulate the outbreak data to analyze, and `{tracetheme}` for figure formatting. We'll use the pipe operator (`%>%`) to connect some of their functions, including others from the `{dplyr}` and `{ggplot2}` packages, so let's also load the `{tidyverse}` package. 
```{r,eval=TRUE,message=FALSE,warning=FALSE} # Load packages -library(incidence2) # For aggregating and visualising +library(incidence2) # For aggregating and visualizing library(simulist) # For simulating linelist data library(tracetheme) # For formatting figures -library(tidyverse) # For {dplyr} and {ggplot2} functions and the pipe |> +library(tidyverse) # For {dplyr} and {ggplot2} functions and the pipe %>% ``` - ## Synthetic outbreak data -To illustrate the process of conducting EDA on outbreak data, we will generate a line list for a hypothetical disease outbreak utilizing the `{simulist}` package. `{simulist}` generates simulated data for outbreak according to a given configuration. Its minimal configuration can generate a linelist, as shown in the below code chunk: +To illustrate the process of conducting EDA on outbreak data, we will generate a line list for a hypothetical disease outbreak utilizing the `{simulist}` package. `{simulist}` generates simulated data for an outbreak according to a given configuration. Its minimal configuration can generate a linelist, as shown in the code chunk below: ```{r} # Simulate linelist data for an outbreak with size between 1000 and 1500 @@ -52,15 +51,15 @@ sim_data <- simulist::sim_linelist(outbreak_size = c(1000, 1500)) %>% sim_data ``` -This linelist dataset has simulated entries on individual-level events during an outbreak. +This linelist dataset contains simulated individual-level records of events during an outbreak. ::::::::::::::::::: spoiler -## Additional Resources on Outbreak Data +### Additional resources on outbreak data -The above is the default configuration of `{simulist}`. It includes a number of assumptions about the transmissibility and severity of the pathogen. If you want to know more about the `simulist::sim_linelist()` function and other functionalities check the [documentation website](https://epiverse-trace.github.io/simulist/). +The above is the default configuration of `{simulist}`. 
It includes a number of assumptions about the transmissibility and severity of the pathogen. If you want to know more about the `simulist::sim_linelist()` function and other functionalities, check the [documentation website](https://epiverse-trace.github.io/simulist/). -You can also find data sets from past real outbreaks within the [`{outbreaks}`](https://www.reconverse.org/outbreaks/) R package. +You can also find datasets from past real outbreaks within the [outbreaks](https://www.reconverse.org/outbreaks/) R package. ::::::::::::::::::: @@ -68,7 +67,7 @@ You can also find data sets from past real outbreaks within the [`{outbreaks}`]( ## Aggregating the data -Often we want to analyse and visualise the number of events that occur on a particular day or week, rather than focusing on individual cases. This requires grouping the linelist data into incidence data. The [{incidence2}]((https://www.reconverse.org/incidence2/articles/incidence2.html){.external target="_blank"}) package offers a useful function called `incidence2::incidence()` for grouping case data, usually based around dated events and/or other characteristics. The code chunk provided below demonstrates the creation of an `<incidence2>` class object from the simulated Ebola `linelist` data based on the date of onset. +Often we want to analyze and visualize the number of events that occur on a particular day or week, rather than focusing on individual cases. This requires grouping the linelist data into incidence data. The [incidence2](https://www.reconverse.org/incidence2/articles/incidence2.html){.external target="_blank"} package offers a useful function called `incidence2::incidence()` for aggregating case data around dated events. It can also group aggregated data on other characteristics (e.g., gender). The code chunk provided below demonstrates the creation of an `<incidence2>` object from the simulated `linelist` data based on the date of onset. 
```{r} # Create an incidence object by aggregating case data based on the date of onset @@ -82,7 +81,7 @@ daily_incidence <- incidence2::incidence( daily_incidence ``` -With the `{incidence2}` package, you can specify the desired interval (e.g. day, week) and categorize cases by one or more factors. Below is a code snippet demonstrating weekly cases grouped by the date of onset, sex, and type of case. +With the `{incidence2}` package, you can specify the desired time interval (e.g., day, week, etc.) and categorize cases by one or more factors. Below is a code snippet demonstrating weekly cases grouped by the date of onset, sex, and type of case. ```{r} # Group incidence data by week, accounting for sex and case type @@ -99,7 +98,7 @@ weekly_incidence ::::::::::::::::::::::::::::::::::::: callout -## Dates Completion +### Dates completion When cases are grouped by different factors, it's possible that the events involving these groups may have different date ranges in the resulting `incidence2` object. The `{incidence2}` package provides a function called `incidence2::complete_dates()` to ensure that an incidence object has the same range of dates for each group. By default, missing counts for a particular group will be filled with 0 for that date. @@ -132,23 +131,25 @@ daily_incidence_2_complete <- incidence2::complete_dates( ::::::::::::::::::::::::::::::::::::: challenge -## Challenge 1: Can you do it? +### Challenge 1: Can you do it? - - **Task**: Calculate the __biweekly__ incidence of cases from the `sim_data` linelist based on their admission date and outcome. Save the result in an object called `biweekly_incidence`. +- **Task**: Calculate the **biweekly** incidence of cases from the `sim_data` linelist based on their admission date and outcome. 
Save the result in an object called `biweekly_incidence` :::::::::::::::::::::::::::::::::::::::::::::::: ## Visualization -The `incidence2` objects can be visualized using the `plot()` function from the base R package. -The resulting graph is referred to as an epidemic curve, or epi-curve for short. The following code snippets generate epi-curves for the `daily_incidence` and `weekly_incidence` incidence objects mentioned above. +The `incidence2` objects can be visualized using the `plot()` function from base R. +The resulting graph is referred to as an epidemic curve, or epicurve for short. The following code snippets generate epicurves for the `daily_incidence` and `weekly_incidence` incidence objects mentioned above. + +Plotting an `<incidence2>` object relies on the `{ggplot2}` package, so [`ggplot` layers](https://ggplot2-book.org/layers.html) can be added to the plot as shown below. ```{r} # Plot daily incidence data base::plot(daily_incidence) + ggplot2::labs( x = "Time (in days)", # x-axis label - y = "Dialy cases" # y-axis label + y = "Daily cases" # y-axis label ) + theme_bw() ``` @@ -159,16 +160,16 @@ base::plot(daily_incidence) + base::plot(weekly_incidence) + ggplot2::labs( x = "Time (in weeks)", # x-axis label - y = "weekly cases" # y-axis label + y = "Weekly cases" # y-axis label ) + theme_bw() ``` :::::::::::::::::::::::: callout -#### Easy aesthetics +### Easy aesthetics -We invite you to take a look at the `{incidence2}` [package vignette](https://www.reconverse.org/incidence2/articles/incidence2.html). Find how you can use the arguments within the `plot()` function to provide aesthetics to your incidence2 class objects. +We invite you to take a look at the [incidence2 package vignette](https://www.reconverse.org/incidence2/articles/incidence2.html) to find out how you can use the arguments within the `plot()` function to provide aesthetics to your `incidence2` object. 
```{r} base::plot(weekly_incidence, fill = "sex") @@ -180,15 +181,15 @@ Some of them include `show_cases = TRUE`, `angle = 45`, and `n_breaks = 5`. Try ::::::::::::::::::::::::::::::::::::: challenge -## Challenge 2: Can you do it? +### Challenge 2: Can you do it? - - **Task**: Visualize the `biweekly_incidence` object. + - **Task**: Visualize the `biweekly_incidence` object :::::::::::::::::::::::::::::::::::::::::::::::: ## Curve of cumulative cases -The cumulative number of cases can be calculated using the `incidence2::cumulate()` function on an `incidence2` object and visualized it, as in the example below. +The cumulative number of cases can be calculated using the `incidence2::cumulate()` function on an `incidence2` object and visualized, as in the example below. ```{r} # Calculate cumulative incidence @@ -198,7 +199,7 @@ cum_df <- incidence2::cumulate(daily_incidence) base::plot(cum_df) + ggplot2::labs( x = "Time (in days)", # x-axis label - y = "weekly cases" # y-axis label + y = "Cumulative cases" # y-axis label ) + theme_bw() ``` @@ -208,14 +209,14 @@ Note that this function preserves grouping, i.e., if the `incidence2` object con ::::::::::::::::::::::::::::::::::::: challenge -## Challenge 3: Can you do it? - - **Task**: Visulaize the cumulative cases from the `biweekly_incidence` object. +### Challenge 3: Can you do it? + - **Task**: Visualize the cumulative cases from the `biweekly_incidence` object :::::::::::::::::::::::::::::::::::::::::::::::: -## Peak time estimation +## Peak time estimation -You can estimate the peak -- the time with the highest number of recorded cases -- using the `incidence2::estimate_peak()` function from the {incidence2} package. This function uses a bootstrapping method to determine the peak time (i.e. by resampling dates with replacement, resulting in a distribution of estimated peak times). 
+You can estimate the peak (i.e., the time with the highest number of recorded cases) using the `incidence2::estimate_peak()` function from the `{incidence2}` package. This function uses a bootstrapping method to determine the peak time (i.e., by resampling dates with replacement, resulting in a distribution of estimated peak times). ```{r} # Estimate the peak of the daily incidence data @@ -231,12 +232,12 @@ peak <- incidence2::estimate_peak( print(peak) ``` -This example demonstrates how to estimate the peak time using the `incidence2::estimate_peak()` function at $95%$ confidence interval and using 100 bootstrap samples. +This example demonstrates how to estimate the peak time using the `incidence2::estimate_peak()` function, specifying a 95% confidence interval and using 100 bootstrap samples. ::::::::::::::::::::::::::::::::::::: challenge -## Challenge 4: Can you do it? - - **Task**: Estimate the peak time from the `biweekly_incidence` object. +### Challenge 4: Can you do it? + - **Task**: Estimate the peak time from the `biweekly_incidence` object :::::::::::::::::::::::::::::::::::::::::::::::: @@ -251,8 +252,8 @@ The example below demonstrates how to configure these three elements for a simpl ```{r} # Define date breaks for the x-axis breaks <- seq.Date( - from = min(as.Date(daily_incidence$date_index, na.rm = TRUE)), - to = max(as.Date(daily_incidence$date_index, na.rm = TRUE)), + from = min(as.Date(daily_incidence$date_index), na.rm = TRUE), + to = max(as.Date(daily_incidence$date_index), na.rm = TRUE), by = 20 # every 20 days ) @@ -291,7 +292,7 @@ ggplot2::ggplot(data = daily_incidence) + ) ``` -Use the `group` option in the mapping function to visualize an epicurve with different groups. If there is more than one grouping factor, use the `facet_wrap()` option, as demonstrated in the example below: +Use the `group` option in the mapping function to visualize an epicurve with different groups. 
If there is more than one grouping factor, use the `ggplot2::facet_wrap()` function, as demonstrated in the example below: ```{r} # Plot daily incidence faceted by sex @@ -334,16 +335,16 @@ ggplot2::ggplot(data = daily_incidence_2) + ::::::::::::::::::::::::::::::::::::: challenge -## Challenge 5: Can you do it? +### Challenge 5: Can you do it? - - **Task**: Produce an annotated figure for the `biweekly_incidence` object using the `{ggplot2}` package. + - **Task**: Produce an annotated figure for the `biweekly_incidence` object using the `{ggplot2}` package :::::::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::: keypoints -- Use `{simulist}` package to generate synthetic outbreak data -- Use `{incidence2}` package to aggregate case data based on a date event, and other variables to produce epidemic curves. -- Use `{ggplot2}` package to produce better annotated epicurves. +- Use the `{simulist}` package to generate synthetic outbreak data +- Use the `{incidence2}` package to aggregate case data based on date events, and other variables to produce epidemic curves +- Use the `{ggplot2}` package to produce better annotated epicurves :::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/episodes/validate.Rmd b/episodes/validate.Rmd index 3aaa2a2b..cc33e795 100644 --- a/episodes/validate.Rmd +++ b/episodes/validate.Rmd @@ -13,7 +13,7 @@ exercises: 2 ::::::::::::::::::::::::::::::::::::: objectives -- Demonstrate how to covert case data into `linelist` data +- Demonstrate how to convert case data into `linelist` data - Demonstrate how to tag and validate data to make analysis more reliable @@ -24,21 +24,21 @@ exercises: 2 This episode requires you to: - Download the [cleaned_data.csv](https://epiverse-trace.github.io/tutorials-early/data/cleaned_data.csv) file -- and save it in the `data/` folder. 
+- Save it in the `data/` folder ::::::::::::::::::::: ## Introduction -In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data, it's essential to establish an additional fundamental layer to ensure the integrity and reliability of subsequent analyses. Otherwise you might encounter issues during the analysis process due to creation or removal of specific variables, changes in their underlying data types (like `<date>` or `<chr>`), etc. Specifically, this additional step involves: +In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data, it's essential to establish an additional fundamental layer to ensure the integrity and reliability of subsequent analyses. Otherwise you might encounter issues during the analysis process due to creation or removal of specific variables, changes in their underlying data types (like `<date>` or `<chr>`), etc. Specifically, this additional step involves: 1. Verifying the presence and correct data type of certain columns within -your dataset, a process commonly referred to as **tagging**; +your dataset, a process commonly referred to as **tagging**. 2. Implementing measures to make sure that these tagged columns are not inadvertently deleted during further data processing steps, known as **validation**. This episode focuses on tagging and validating outbreak data using the [linelist](https://epiverse-trace.github.io/linelist/) package. Let's start by loading the package `{rio}` to read data and the `{linelist}` package -to create a linelist object. We'll use the pipe operator (`%>%`) to connect some of their functions, including others from the package `{dplyr}`. For this reason, we will also load the {tidyverse} package. +to create a `linelist` object. We'll use the pipe operator (`%>%`) to connect some of their functions, including others from the package `{dplyr}`. For this reason, we will also load the `{tidyverse}` package. 
```{r,eval=TRUE,message=FALSE,warning=FALSE} @@ -54,23 +54,23 @@ library(linelist) # for tagging and validating ### The double-colon (`::`) operator -The`::`in R lets you access functions or objects from a specific package without attaching the entire package to the search path. It offers several important -advantages including the followings: +The `::` in R lets you access functions or objects from a specific package without attaching the entire package to the search path. It offers several important +advantages, including the following: -* Telling explicitly which package a function comes from, reducing ambiguity and potential conflicts when several packages have functions with the same name. -* Allowing to call a function from a package without loading the whole package -with library(). +* Telling explicitly which package a function comes from, reducing ambiguity and potential conflicts when several packages have functions with the same name +* Allowing you to call a function from a package without loading the whole package +with `library()` For example, the command `dplyr::filter(data, condition)` means we are calling the `filter()` function from the `{dplyr}` package. ::::::::::::::::::: -Import the dataset following the guidelines outlined in the [Read case data](../episodes/read-cases.Rmd) episode. This involves loading the dataset into the working environment and view its structure and content. +Import the dataset following the guidelines outlined in the [Read case data](../episodes/read-cases.Rmd) episode. This involves loading the dataset into the working environment and viewing its structure and content. 
```{r, eval=FALSE} # Read data -# e.g.: if path to file is data/simulated_ebola_2.csv then: +# e.g., if path to file is data/cleaned_data.csv then: cleaned_data <- rio::import( here::here("data", "cleaned_data.csv") ) %>% @@ -92,9 +92,9 @@ cleaned_data -### An unexpected change +### Example scenario: an unexpected change -You are in an emergency response situation. You need to generate daily situation reports. You automated your analysis to read data directly from the online server :grin:. However, the people in charge of the data collection/administration needed to **remove/rename/reformat** one variable you found helpful :disappointed:! +You are in an emergency response situation. You need to generate daily situation reports. You automated your analysis to read data directly from the online server. However, the people in charge of the data collection/administration needed to **remove/rename/reformat** one variable you found helpful! How can you detect if the input data is **still valid** to replicate the analysis code you wrote the day before? @@ -104,18 +104,20 @@ How can you detect if the input data is **still valid** to replicate the analysi If learners do not have an experience to share, we as instructors can share one. -A scenario like this usually happens when the institution doing the analysis is not the same as the institution collecting the data. The later can make decisions about the data structure that can affect downstream processes, impacting the time or the accuracy of the analysis results. +A scenario like this usually happens when the institution doing the analysis is not the same as the institution collecting the data. The latter can make decisions about the data structure that can affect downstream processes, impacting the time or the accuracy of the analysis results. 
:::::::::::::::::::::::: ## Creating a linelist and tagging columns -Once the data is loaded and cleaned, we can convert the cleaned case data into a `linelist` object using `{linelist}` package, as in the below code chunk. +Before diving in, it helps to distinguish the two steps: **tagging** attaches a semantic role (such as *case ID* or *date of onset*) to a column in your dataset, while **validation** checks that the tagged columns still exist and have the expected data types. Tagging is done once when you build the `linelist` object; validation is something you can run repeatedly as the underlying data evolves. + +Once the data is loaded and cleaned, we can convert the cleaned case data into a `linelist` object using the `{linelist}` package, as in the code chunk below. ```{r} # Create a linelist object from cleaned data linelist_data <- linelist::make_linelist( - x = cleaned_data, # Input data + x = cleaned_data, # Input data id = "case_id", # Column for unique case identifiers date_onset = "date_onset", # Column for date of symptom onset gender = "gender" # Column for gender @@ -126,21 +128,22 @@ linelist_data ``` The `{linelist}` package supplies tags for common epidemiological variables -and a set of appropriate data types for each. You can view the list of available tags by the variable name and their acceptable data types using the `linelist::tags_types()` function. +and a set of appropriate data types for each. You can view the list of available tag names and their acceptable data types using the `linelist::tags_types()` function. ::::::::::::::::::::::::::::::::::::: challenge Let's **tag** more variables. In some datasets, it is possible to encounter variable names that are different from the available tag names. In such cases, we can associate them based on how variables were defined for data collection. Now: --**Explore** the available tag names in `{linelist}`. 
--**Find** what other variables in the input dataset can be associated with any of these available tags. --**Tag** those variables as shown above using the `linelist::make_linelist()` -function. + +- **Explore** the available tag names in `{linelist}` +- **Find** what other variables in the input dataset can be associated with any of these available tags +- **Tag** those variables as shown above using the `linelist::make_linelist()` +function :::::::::::::::::::: hint -Your can get access to the list of available tag names in `{linelist}` using: +You can get access to the list of available tag names in `{linelist}` using: ```{r, eval = FALSE} # Get a list of available tags names and data types linelist::tags_types() @@ -167,7 +170,7 @@ linelist::make_linelist( Are these additional tags visible in the output? -< !--Do you want to see a display of available and tagged variables? You can explore the function `linelist::tags()` and read its [reference documentation](https://epiverse-trace.github.io/linelist/reference/tags.html).- -> + ::::::::::::::::::::: @@ -176,6 +179,8 @@ Are these additional tags visible in the output? ## Validation +Recall the scenario above, where an upstream change to the data (a removed, renamed, or reformatted variable) could quietly break your analysis. Validation is the check that catches this: running `linelist::validate_linelist()` confirms that every tagged column is still present and still has the expected data type. In an ongoing analysis, you can re-run it each time fresh data arrives, so that any breaking change is flagged immediately rather than propagating downstream. 
+ To ensure that all tagged variables are standardized and have the correct data types, use the `linelist::validate_linelist()` function, as shown in the example below: @@ -184,23 +189,21 @@ linelist::validate_linelist(linelist_data) ``` If your dataset requires a new tag other than those defined in the `{linelist}` -package, use `allow_extra = TRUE` when creating the linelist object with its -corresponding datatype using the `linelist::make_linelist()` function. +package, use `allow_extra = TRUE` when creating the `linelist` object with its +corresponding data type using the `linelist::make_linelist()` function. ::::::::::::::::::::::::: challenge Let's assume the following scenario during an ongoing outbreak. You notice at some point that the data stream you have been relying on has a set of new entries (i.e., rows or observations), and the data type of one variable has changed. -Let's consider the example where the type `age` variable has changed from a double (`<dbl>`) to character (`<chr>`). +Let's consider the example where the type of the `age` variable has changed from a double (`<dbl>`) to character (`<chr>`). To simulate this situation: -- **Change** the data type of the variable, - -- **Tag** the variable into a linelist, and then - -- **Validate** it. +- **Change** the data type of the variable +- **Tag** the variable into a `linelist` +- **Validate** the `linelist` Describe how `linelist::validate_linelist()` reacts when there is a change in the data type of one variable of the input data. @@ -240,15 +243,12 @@ Why are we getting an `Error` message? Should we have a `Warning` message instead? Explain why. -Explore other situations to understand this behavior by converting:-`date_onset` from `<date>` to character (`<chr>`), -`gender` character (`<chr>`) to integer (`<int>`). - -Then tag them into a linelist for validation. Does the `Error` message suggest a fix to the issue? 
+Explore other situations to understand this behavior by converting: -Why are we getting an `Error` message? 
-Should we have a `Warning` message instead? Explain why? -Explore other situations to understand this behavior by converting:-`date_onset` from `<date>` to character (`<chr>`), -`gender` character (`<chr>`) to integer (`<int>`). +- `date_onset` from `<date>` to `<chr>` +- `gender` from `<chr>` to `<int>` -Then tag them into a linelist for validation. Does the `Error` message suggest a fix to the issue? +Then tag them into a `linelist` for validation. Does the `Error` message suggest a fix to the issue? ::::::::::::::::::::::::: solution @@ -278,23 +278,23 @@ cleaned_data %>% linelist::validate_linelist() ``` -We get `Error` messages because the default type of these variable in `linelist::tags_types()` is different from the one we set them at. +We get `Error` messages because the default type of these variables in `linelist::tags_types()` is different from the one we have assigned. -The `Error` message inform us that in order to **validate** our linelist, we must fix the input variable type to fit the expected tag type. In a data analysis script, we can do this by adding one cleaning step into the pipeline. +The `Error` message informs us that in order to **validate** our linelist, we must fix the input variable type to fit the expected tag type. In a data analysis script, we can do this by adding one cleaning step into the pipeline. ::::::::::::::::::::::::: ::::::::::::::::::::::::::::: challenge -Beyond tagging and validating the linelist object, what extra step do we needed when building the object? +Beyond tagging and validating the linelist object, what extra step do we need when building the object? 
:::::::::::::::::::::::::: solution -Let's simulate a scenario about losing a variable : +Let's simulate a scenario about losing a variable: ```{r} cleaned_data %>% # remove the variable 'age' - select(-age) %>% + dplyr::select(-age) %>% # tag variable 'age' that no longer exist linelist::make_linelist( age = "age" @@ -310,14 +310,14 @@ cleaned_data %>% ## Safeguarding -Safeguarding is implicitly built into the linelist objects. If you try to drop any of the tagged columns, you will receive an error or warning message, as shown in the example below. +Safeguarding is implicitly built into the `linelist` objects. If you try to drop any of the tagged columns, you will receive an error or warning message, as shown in the example below. ```{r, warning=TRUE} new_df <- linelist_data %>% dplyr::select(case_id, gender) ``` -This `Warning` message above is the default output option when we lose tags in a `linelist` object. However, it can be changed to an `Error` message using the `linelist::lost_tags_action()` function. +The `Warning` message above is the default output option when we lose tags in a `linelist` object. However, it can be changed to an `Error` message using the `linelist::lost_tags_action()` function. ::::::::::::::::::::::::::::::::::::: challenge @@ -338,7 +338,7 @@ linelist_data %>% linelist::lost_tags_action(action = "error") ``` -- Now, re - run the above code chunk with `dplyr::count()`. +- Now, re-run the above code chunk with `dplyr::count()`. Identify: @@ -348,11 +348,11 @@ Identify: :::::::::::::::::::::::: solution -Deciding between `Warning` or `Error` message will depend on the level of attention or flexibility you need when losing tags. One will alert you about a change but will continue running the code downstream. The other will stop your analysis pipeline and the rest will not be executed. +Deciding between `Warning` or `Error` message will depend on the level of attention or flexibility you need when losing tags. 
A `Warning` will alert you about a change but will continue running the code downstream. An `Error` will stop your analysis pipeline and the rest will not be executed. A data reading, cleaning and validation script may require a more stable or fixed pipeline. An exploratory data analysis may require a more flexible approach. These two processes can be isolated in different scripts or repositories to adjust the safeguarding according to your needs. -Before you continue, set the configuration back again to the default option of `Warning`: +Before you continue, set the configuration back to the default option of `Warning`: ```{r} # set behavior to the default option: "warning" linelist::lost_tags_action() @@ -363,31 +363,29 @@ linelist::lost_tags_action() ::::::::::::::::::::::::::::::::::::: A `linelist` object resembles a data frame but offers richer features -and functionalities. Packages that are linelist - aware can leverage these +and functionalities. Packages that are `linelist`-aware can leverage these features. For example, you can extract a data frame of only the tagged columns using the `linelist::tags_df()` function, as shown below: ```{r, warning = FALSE} linelist::tags_df(linelist_data) ``` -This allows for the use of tagged variables only in downstream analysis, which will be useful for the next episode! +This allows for the use of tagged variables only in downstream analysis, which will be useful for the next episode (Aggregate and visualize)! :::::::::::::::::::::::::::::::::::: checklist ### When should I use `{linelist}`? -Data analysis during an outbreak response or mass - gathering surveillance demands a different set of "data safeguards" if compared to usual research situations. For example, your data will change or be updated over time (e.g. new entries, new variables, renamed variables). 
+Data analysis during an outbreak response or mass-gathering surveillance demands a different set of _data safeguards_ if compared to usual research situations. For example, your data will change or be updated over time (e.g., new entries, new variables, renamed variables). -`{linelist}` is more appropriate for this type of ongoing or long - lasting analysis. Check the "Get started" vignette section about +`{linelist}` is more appropriate for this type of ongoing or long-lasting analysis. Check the "Get started" vignette section about [When I should consider using `{linelist}`? ](https://epiverse-trace.github.io/linelist/articles/linelist.html#should-i-use-linelist) for more information. :::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::::::: keypoints -- Use the `{linelist}` package to tag, -validate, -and prepare case data for downstream analysis. +- Use the `{linelist}` package to tag, validate, and prepare case data for downstream analysis ::::::::::::::::::::::::::::::::::::::::::::::::