An R function to removing qPCR outliers within technical replicates

The problem

When performing qPCR you typically have three technical replicates of every sample. The Cq values within replicates are expected to be very close to each other. However, sometimes outliers occur thru for example, inaccurate pipetting, accidentally pipetting in the wrong well or seal detachment. Typically a Cq value more than 0.5 apart from the other two is considered an outlier. Some people prefer an ever stricter cutoff. On the other hand, if the target gene has very low expression - and thus high Cq values - high variation is unavoidable and a more lenient threshold might be preferred. In this post we will make a function that finds and removes outliers based on the deviation from the median value, given a certain threshold value. The function will use the following rules:

  • If only one Cq value is present (i.e. the other replicates failed to produce a Cq value), it will be removed.
  • If only two Cq values are present, they need to be less than a threshold apart.
  • For three or more technical replicates:
    • If the absolute distance between a Cq value and the median Cq is greater than a set threshold, than this value will be removed.
    • If all Cq values within a technical replicate are more than a threshold apart, they will all be removed.

For manipulating the data we will mostly use functions from the tidyverse.

library(tidyverse)

In this example we will use some randomly generated data. It contain Cq values for three genes: gene_a, gene_b and gene_hk (hk = housekeeping). There are three treatment conditions: a, b and ctrl, where ctrl is the untreated control. Three biological replicates are included. The data contains outliers representing all conditions previously specified, which we will try to remove.

The data is in a tidy format, meaning that every variable has its own column, and every measurement has its own row.

Table 1: First 5 rows of the raw data
treatmentbio_reptech_repprimer_paircq_values
ctrl11gene_aNA
ctrl12gene_a35.0
ctrl13gene_aNA
ctrl21gene_a34.9
ctrl22gene_a34.3

First steps

For every sample we will calculate:

  • The median Cq value.
  • The number of replicates that have a value at all. The purpose of this step is to enable removing those samples where only one or no wells resulted in a Cq value.
median_cq <- raw_data %>%
  # first group the data to perform the calculation for every sample separately. 
  group_by(treatment, primer_pair, bio_rep) %>%
  # then calculate the median and count
  summarise(median_cq = median(cq_values, na.rm = TRUE), 
            count = sum(!is.na(cq_values))) 
Table 2: First 5 rows of median and count values
treatmentprimer_pairbio_repmedian_cqcount
agene_a122.33
agene_a221.73
agene_a321.53
agene_b122.13
agene_b222.33

Next we will join the table back to the original data, so that the median and count will become their own column. We can then simply remove those samples with only one or zero technical replicates, and calculate the distance from the median for every remaining technical replicate. Finally, a column is created that contains TRUE if the distance from the median is within a threshold value or FALSE if it is not.

Note: if there are only two technical replicates within a sample, the median falls exactly between these two values. We want the two replicates to be less than the threshold apart. To get the distance between the replicates we multiply the distance from the median by 2.

#As an example we will use a threshold value of 1. 
#This is probably too high for most practical applications. 

threshold <- 1

dev_test <- raw_data %>%
    full_join(median_cq, by = c("treatment", "primer_pair", "bio_rep")) %>%
    # take out samples with only 1 or 0 tech reps
    filter(count > 1) %>% 
    # calculate the distance from the median
    mutate(distance_med = abs(cq_values - median_cq)) %>%  
    # check if the distance from the median is within the threshold. 
    mutate(keep = if_else(count == 2,
                          if_else(distance_med*2 < threshold, TRUE, FALSE),
                          if_else(distance_med   < threshold, TRUE, FALSE)))
Table 3: Median and count are now a column in the original table. (scroll for the keep column)
treatmentbio_reptech_repprimer_paircq_valuesmedian_cqcountdistance_medkeep
ctrl21gene_a34.934.530.4TRUE
ctrl22gene_a34.334.530.2TRUE
ctrl23gene_a34.534.530.0TRUE
ctrl31gene_a34.334.330.0TRUE
ctrl32gene_a34.734.330.4TRUE

There is one condition left we need to be able to remove, namely where all three technical replicates are more than the threshold apart from each other. Using the table we have generated so far the middle replicate will be kept, because it is the median and therefore the distance from the median is 0. To get rid of this we will only keep samples where at least two replicates are marked as TRUE in the keep column.

# count the number of TRUEs 
count_true <- dev_test %>%
    group_by(treatment, primer_pair, bio_rep) %>%
    summarise(count_keep = sum(keep, na.rm = TRUE)) %>%
    ungroup()

# finally, filter out all outliers
clean_data <- dev_test %>%
    full_join(count_true, by = c("treatment", "primer_pair", "bio_rep")) %>%
    # remove outliers
    filter(keep == TRUE,         
           count_keep > 1) %>%   
    # remove the now unnecessary columns
    select(-(median_cq:count_keep)) 
Table 4: First 5 rows of the cleaned data
treatmentbio_reptech_repprimer_paircq_values
ctrl21gene_a34.9
ctrl22gene_a34.3
ctrl23gene_a34.5
ctrl31gene_a34.3
ctrl32gene_a34.7

To find all Cq values that have been removed:

outliers <- raw_data %>%
  setdiff(clean_data)
Table 5: All removed Cq values
treatmentbio_reptech_repprimer_paircq_values
ctrl11gene_aNA
ctrl12gene_a35.0
ctrl13gene_aNA
a12gene_a25.0
b13gene_a20.0
ctrl12gene_bNA
a21gene_b18.0
a22gene_b22.3
a23gene_b27.0
b11gene_b26.0
b12gene_b24.0
b13gene_bNA

Turning it into a function

Throughout this post we have almost exclusively use functions from the tidyverse to manipulate the data. Using tidyverse function inside custom functions is possible, but we do need the rlang package.

library(rlang)

Our function will have the following arguments:

  • The first argument is name of the data we want to use.
  • cq = tells the function the name of the column containing the Cq values.
  • With threshold = we can set the maximum distance from the median.
  • Finally, the function needs the names of all other columns that are not the cq_values or denote technical replicates. These will be used to make groups. Calculations will be made for each unique combination.

Which and how many experimental variables there are depends on the experimental setup. Therefore, the final argument of the function is .... Now we can supply any number of column names (unquoted and separated by a comma). Inside the qpcr_clean() function, ... can be passed to group_by(). There is just one problem. The tidyverse function full_join() needs the same set of column names to join the tables by, but as a character vector. Luckely ... can be converted to such a vector with:

dots <- names(enquos(..., .named = TRUE))

Finally the name of the column containing the Cq values can be passed to the tidyverse functions by surrounding cq with double braces: {{cq}}.

qpcr_clean <- function(.data, cq, threshold, ...){
  # to do: add if statements to check input data etc.
  
  # convert ... to character vector
  dots <- names(enquos(..., .named = TRUE))
  
  # calculate the median and count for every sample
  median_cq <- .data %>%
    group_by(...) %>%
    summarise(median_cq = median({{cq}}, na.rm = TRUE), 
            count = sum(!is.na({{cq}}))) %>% 
    ungroup()
  
  # remove solo tech reps, test if distance from the median is < threshold
  dev_test <- .data %>%
    full_join(median_cq, by = dots) %>%
    filter(count > 1) %>% 
    mutate(distance_med = abs({{cq}} - median_cq)) %>%  
    mutate(keep = if_else(count == 2,
                          if_else(distance_med*2 < threshold, TRUE, FALSE),
                          if_else(distance_med   < threshold, TRUE, FALSE)))
  
  # count the # of TRUEs per sample
  count_true <- dev_test %>%
    group_by(...) %>%
    summarise(count_keep = sum(keep, na.rm = TRUE)) %>%
    ungroup()
  
  # remove all outliers
  clean_data <- dev_test %>%
    full_join(count_true, by = dots) %>%
    filter(count_keep > 1, 
           keep == TRUE) %>% 
    select(-(median_cq:count_keep)) 
  clean_data
}

Lets test if it works:

clean_data <- qpcr_clean(raw_data, 
                         cq = cq_values, 
                         threshold = 1, 
                         treatment, primer_pair, bio_rep)
Table 6: It worked!
treatmentbio_reptech_repprimer_paircq_values
ctrl21gene_a34.9
ctrl22gene_a34.3
ctrl23gene_a34.5
ctrl31gene_a34.3
ctrl32gene_a34.7

Outlier context

For context, you might want to see the other Cq values within a technical replicate that contains an outlier. For this we can make a separate function called qpcr_outlier_context().

The function arguments will be:

  • With raw_data = we supply the raw unfiltered data
  • With clean_data = we supply the cleaned data from the qpcr_clean() function
  • cq_values = is used to give name of the column containing the cq values
  • tech_rep = gives the name of the column containing the technical replicate information
  • ... all other column names, excluding the columns containing the Cq values or technical replicates.
qpcr_outlier_context <- function(raw_data, clean_data, cq_values, tech_rep, ...){
  # turn some values into strings for joining tables
  dots <- names(enquos(..., .named = TRUE))
  string_tech_rep <- names(enquos(tech_rep, .named = TRUE))
  join2 <- append(dots, string_tech_rep)
  
  # first find all the outliers
  outliers <- raw_data %>%
    setdiff(clean_data)
  
  # give outliers extra columns, all TRUE
  extra_col <- outliers %>%
    mutate(outlier = TRUE) %>%
    select(-{{cq_values}})
  
  # find all triplets with outliers
  triplets <- outliers %>%
    # remove some columns to prevent double columns after joining
    select(-{{cq_values}}) %>%
    select(-{{tech_rep}}) %>%
    
    # left_join() every sample not in outliers
    left_join(raw_data, by = dots) %>%
    distinct() %>%
    
    # add a column called `outlier` that marks outliers with TRUE
    left_join(extra_col, by = join2)
  triplets
}

Lets try it out!

outlier_triplets <- qpcr_outlier_context(raw_data = raw_data, 
                                         clean_data = clean_data, 
                                         cq_values = cq_values, 
                                         tech_rep = tech_rep, 
                                         treatment, bio_rep, primer_pair)
treatmentbio_repprimer_pairtech_repcq_valuesoutlier
ctrl1gene_a1TRUE
ctrl1gene_a235.0TRUE
ctrl1gene_a3TRUE
a1gene_a121.7
a1gene_a225.0TRUE
a1gene_a322.3
b1gene_a125.3
b1gene_a225.5
b1gene_a320.0TRUE
ctrl1gene_b135.1
ctrl1gene_b2TRUE
ctrl1gene_b335.6
a2gene_b118.0TRUE
a2gene_b222.3TRUE
a2gene_b327.0TRUE
b1gene_b126.0TRUE
b1gene_b224.0TRUE
b1gene_b3TRUE

We indeed find back every kind of outlier condition specified in the beginning of this post. Now we don’t have to visually inspect our qPCR results in excel anymore!

Jorik Bot
Jorik Bot
PhD student

Site under construction.

comments powered by Disqus

Related