Mark "exclude" (non-distinct / non-unique / ambiguous) taxa

Takes as an input data frame with Sample ID, Taxa ID, and phlogenetic name fields and returns a similar dataframe with a column for "exclude" taxa (TRUE or FALSE).

Exclude taxa are refered to by multiple names; ambiguous, non-distinct, and non-unique. The "exclude" name was chosen so as to be consistent with "non-target" taxa. That is, taxa marked as "TRUE" are treated as undesireables. Exclude taxa are those that are present in a sample when taxa of the same group are present in the same sample are identified finer level. That is, the parent is marked as exclude when child taxa are present in the same sample.

markExcluded(
  df_samptax,
  SampID = "SAMPLEID",
  TaxaID = "TAXAID",
  TaxaCount = "N_TAXA",
  Exclude = "EXCLUDE",
  TaxaLevels,
  Exceptions = NA
)

Arguments

df_samptax: Input data frame.
SampID: Column name in df_samptax for sample identifier. Default = "SAMPLEID".
TaxaID: Column name in df_samptax for organism identifier. Default = "TAXAID".
TaxaCount: Column name in df_samptax for organism count. Default = "N_TAXA".
Exclude: Column name for Exclude Taxa results in returned data frame. Default = "Exclude".
TaxaLevels: Column names in df_samptax that for phylogenetic names to be evaluated. Need to be in order from coarse to fine (i.e., Phylum to Species).
Exceptions: NA or two column data frame of synonyms or other exceptions. Default = NA Column 1 is the name used in the TaxaID column of df_samptax. Column 2 is the name used in the TaxaLevels columns of df_samptax.

Value

Returns a data frame of df_samptax with an additional column, Exclude.

Details

The exclude taxa are referenced in the metric values function. These taxa are removed from the taxa richness metrics. This is because these are coarser level taxa when fine level taxa are present in the same sample.

Exceptions is a 2 column data frame of synonyms or other exceptions. Column 1 is the name used in the TaxaID column the input data frame (df_samptax). Column 2 is the name used in the TaxaLevels columns of the input data frame (df_samptax). The phylogenetic columns (TaxaLevels) will be modified from Column 2 of the Exceptions data frame to match Column 1 of the Exceptions data frame. This ensures that the algorithm for markExcluded works properly. The changes will not be stored and the original names provided in the input data frame (df_samptax) will be returned in the final result. The function example below includes a practical case.

Taxa Levels are phylogenetic names that are to be checked. They should be listed in order from course (kingdom) to fine (species). Names not appearing in the data will be skipped.

The spelling of names must be consistent (including case) for this function to produce the intended output.

Examples


# Packages
library(readxl)
library(dplyr)
#> 
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union
library(lazyeval)
library(knitr)

# Data
df_samps_bugs <- read_excel(system.file("./extdata/Data_Benthos.xlsx"
                                        , package="BioMonTools")
                           , guess_max=10^6)

# Variables
SampID     <- "SampleID"
TaxaID     <- "TaxaID"
TaxaCount  <- "N_Taxa"
Exclude    <- "Exclude_New"
TaxaLevels <- c("Kingdom"
                , "Phylum"
                , "SubPhylum"
                , "Class"
                , "SubClass"
                , "Order"
                , "SubOrder"
                , "SuperFamily"
                , "Family"
                , "SubFamily"
                , "Tribe"
                , "Genus"
                , "SubGenus"
                , "Species"
                , "Variety")
# Taxa that should be treated as equivalent
Exceptions <- data.frame("TaxaID"=c("Sphaeriidae")
                                    , "PhyloID"=c("Pisidiidae"))

# Filter Data
# df_samptax <- filter(df_samps_bugs
#                       , !!as.name(SampID) == "08BEA3478__2013-08-21_0")
# df_tst_small <- markExcluded(df_samptax
#                              , SampID
#                              , TaxaID
#                              , TaxaCount
#                              , TaxaLevels
#                              , Exceptions
#                              , Exclude)

# EXAMPLE 1
df_tst <- markExcluded(df_samps_bugs
                       , SampID = "SampleID"
                       , TaxaID = "TaxaID"
                       , TaxaCount = "N_Taxa"
                       , Exclude = "Exclude_New"
                       , TaxaLevels = TaxaLevels
                       , Exceptions = Exceptions)
#> *WARNING*, 3 taxa levels missing from input data frame; 
#>  KINGDOM, SUBORDER, VARIETY
#> [1] "Working on item (1/12); PHYLUM"
#> [1] "Working on item (2/12); SUBPHYLUM"
#> [1] "Working on item (3/12); CLASS"
#> [1] "Working on item (4/12); SUBCLASS"
#> [1] "Working on item (5/12); ORDER"
#> [1] "Working on item (6/12); SUPERFAMILY"
#> [1] "Working on item (7/12); FAMILY"
#> [1] "Working on item (8/12); SUBFAMILY"
#> [1] "Working on item (9/12); TRIBE"
#> [1] "Working on item (10/12); GENUS"
#> [1] "Working on item (11/12); SUBGENUS"
#> [1] "Working on item (12/12); SPECIES"

# Compare
df_compare <- dplyr::summarise(dplyr::group_by(df_tst, SampleID)
                               , Exclude_Import = sum(Exclude)
                               , Exclude_R = sum(Exclude_New))
df_compare$Diff <- df_compare$Exclude_Import - df_compare$Exclude_R
#
tbl_diff <- table(df_compare$Diff)
kable(tbl_diff)
#> 
#> 
#> |Var1 | Freq|
#> |:----|----:|
#> |0    |  678|
# sort
df_compare <- df_compare %>% arrange(desc(Diff))

# Number with issues
sum(abs(df_compare$Diff))
#> [1] 0
# total samples
nrow(df_compare)
#> [1] 678

# confusion matrix
tbl_results <- table(df_tst$Exclude, df_tst$Exclude_New, useNA = "ifany")
#
# Show differences
kable(tbl_results)
#> 
#> 
#> |      | FALSE| TRUE|
#> |:-----|-----:|----:|
#> |FALSE | 24653|    0|
#> |TRUE  |     0| 2350|
knitr::kable(df_compare[1:10, ])
#> 
#> 
#> |SampleID                   | Exclude_Import| Exclude_R| Diff|
#> |:--------------------------|--------------:|---------:|----:|
#> |01103CSR_Bug_2001-08-27_0  |              1|         1|    0|
#> |02087REF_Bug_2002-08-22_0  |              0|         0|    0|
#> |03013CSR_Bug_2003-07-01_0  |              1|         1|    0|
#> |03053CSR_Bug_2003-08-14_0  |              6|         6|    0|
#> |03054CSR_Bug_2003-08-18_0  |              2|         2|    0|
#> |06012CSR_Bug_2006-09-05_0  |              1|         1|    0|
#> |06019CSRw_Bug_2006-09-14_0 |              2|         2|    0|
#> |06021CSR_Bug_2006-08-01_0  |              2|         2|    0|
#> |06022CSR_Bug_2006-08-17_0  |              1|         1|    0|
#> |06024CSR_Bug_2006-07-27_0  |              2|         2|    0|
knitr::kable(df_compare[672:678, ])
#> 
#> 
#> |SampleID                | Exclude_Import| Exclude_R| Diff|
#> |:-----------------------|--------------:|---------:|----:|
#> |ricci__2013-09-04_0     |              7|         7|    0|
#> |silvr192__2012-08-24_0  |              3|         3|    0|
#> |skytrmann__2010-08-12_0 |              6|         6|    0|
#> |skytrmann__2013-08-28_0 |              8|         8|    0|
#> |swampcart__2012-08-20_0 |              6|         6|    0|
#> |woodshand__2010-08-06_0 |             12|        12|    0|
#> |woodshand__2013-09-03_0 |              4|         4|    0|
# samples with differences
samp_diff <- as.data.frame(df_compare[df_compare[,"Diff"] != 0, "SampleID"])
# results for only those with differences
df_tst_diff <- df_tst[df_tst[,"SampleID"] %in% samp_diff$SampleID, ]
# add diff field
df_tst_diff$Exclude_Diff <- df_tst_diff$Exclude - df_tst_diff$Exclude_New

# Classification Performance Metrics
class_TP <- tbl_results[2,2] # True Positive
class_FN <- tbl_results[2,1] # False Negative
class_FP <- tbl_results[1,2] # False Positive
class_TN <- tbl_results[1,1] # True Negative
class_n <- sum(tbl_results)  # total
#
# sensitivity (recall); TP / (TP+FN); measure model to ID true positives
class_sens <- class_TP / (class_TP + class_FN)
# precision; TP / (TP+FP); accuracy of model positives
class_prec <- class_TP / (class_TP + class_FP)
# specifity; TN / (TN + FP); measure model to ID true negatives
class_spec <- class_TN  / (class_TN + class_FP)
# overall accuracy; (TP + TN) / all cases; accuracy of all classifications
class_acc <- (class_TP + class_TN) / class_n
# F1; 2 * (class_prec*class_sens) / (class_prec+class_sens)
## balance of precision and recall
class_F1 <- 2 * (class_prec * class_sens) / (class_prec + class_sens)
#
results_names <- c("Sensitivity (Recall)"
                   , "Precision"
                   , "Specificity"
                   , "Overall Accuracy"
                   , "F1")
results_values <- c(class_sens
                   , class_prec
                   , class_spec
                   , class_acc
                   , class_F1)
#
tbl_class <- data.frame(results_names, results_values)
names(tbl_class) <- c("Performance Metrics", "Percent")
tbl_class$Percent <- round(tbl_class$Percent * 100, 2)
kable(tbl_class)
#> 
#> 
#> |Performance Metrics  | Percent|
#> |:--------------------|-------:|
#> |Sensitivity (Recall) |     100|
#> |Precision            |     100|
#> |Specificity          |     100|
#> |Overall Accuracy     |     100|
#> |F1                   |     100|

#~~~~~~~~~~~~~~~~~~~~~~~~~~

# EXAMPLE 2
## No Exceptions

df_tst2 <- markExcluded(df_samps_bugs
                        , SampID = "SampleID"
                        , TaxaID = "TaxaID"
                        , TaxaCount = "N_Taxa"
                        , Exclude = "Exclude_New"
                        , TaxaLevels = TaxaLevels
                        , Exceptions = NA)
#> *WARNING*, 3 taxa levels missing from input data frame; 
#>  KINGDOM, SUBORDER, VARIETY
#> [1] "Working on item (1/12); PHYLUM"
#> [1] "Working on item (2/12); SUBPHYLUM"
#> [1] "Working on item (3/12); CLASS"
#> [1] "Working on item (4/12); SUBCLASS"
#> [1] "Working on item (5/12); ORDER"
#> [1] "Working on item (6/12); SUPERFAMILY"
#> [1] "Working on item (7/12); FAMILY"
#> [1] "Working on item (8/12); SUBFAMILY"
#> [1] "Working on item (9/12); TRIBE"
#> [1] "Working on item (10/12); GENUS"
#> [1] "Working on item (11/12); SUBGENUS"
#> [1] "Working on item (12/12); SPECIES"

# Compare
df_compare2 <- dplyr::summarise(dplyr::group_by(df_tst2, SampleID)
                               , Exclude_Import = sum(Exclude)
                               , Exclude_R=sum(Exclude_New))
df_compare2$Diff <- df_compare2$Exclude_Import - df_compare2$Exclude_R
#
tbl_diff2 <- table(df_compare2$Diff)
kable(tbl_diff2)
#> 
#> 
#> |Var1 | Freq|
#> |:----|----:|
#> |0    |  670|
#> |1    |    8|
# sort
df_compare2 <- df_compare2 %>% arrange(desc(Diff))

# Number with issues
sum(abs(df_compare2$Diff))
#> [1] 8
# total samples
nrow(df_compare2)
#> [1] 678

# confusion matrix
tbl_results2 <- table(df_tst2$Exclude, df_tst2$Exclude_New, useNA = "ifany")
#
# Show differences
kable(tbl_results2)
#> 
#> 
#> |      | FALSE| TRUE|
#> |:-----|-----:|----:|
#> |FALSE | 24653|    0|
#> |TRUE  |     8| 2342|
knitr::kable(df_compare2[1:10, ])
#> 
#> 
#> |SampleID                  | Exclude_Import| Exclude_R| Diff|
#> |:-------------------------|--------------:|---------:|----:|
#> |08BEA3478__2013-08-21_0   |              3|         2|    1|
#> |08BEA3571__2014-09-03_0   |              5|         4|    1|
#> |08LIT2692__2012-09-10_0   |              8|         7|    1|
#> |08SAM0000__2012-07-11_0   |              4|         3|    1|
#> |08SAM0000__2013-07-25_0   |              4|         3|    1|
#> |09BLA0716__2014-08-14_0   |              3|         2|    1|
#> |95C__2012-10-02_0         |              1|         0|    1|
#> |95T__2012-10-03_0         |              1|         0|    1|
#> |01103CSR_Bug_2001-08-27_0 |              1|         1|    0|
#> |02087REF_Bug_2002-08-22_0 |              0|         0|    0|
knitr::kable(tail(df_compare2))
#> 
#> 
#> |SampleID                | Exclude_Import| Exclude_R| Diff|
#> |:-----------------------|--------------:|---------:|----:|
#> |silvr192__2012-08-24_0  |              3|         3|    0|
#> |skytrmann__2010-08-12_0 |              6|         6|    0|
#> |skytrmann__2013-08-28_0 |              8|         8|    0|
#> |swampcart__2012-08-20_0 |              6|         6|    0|
#> |woodshand__2010-08-06_0 |             12|        12|    0|
#> |woodshand__2013-09-03_0 |              4|         4|    0|
# samples with differences
(samp_diff2 <- as.data.frame(df_compare2[df_compare2[,"Diff"]!=0
                                                  , "SampleID"]))
#>                  SampleID
#> 1 08BEA3478__2013-08-21_0
#> 2 08BEA3571__2014-09-03_0
#> 3 08LIT2692__2012-09-10_0
#> 4 08SAM0000__2012-07-11_0
#> 5 08SAM0000__2013-07-25_0
#> 6 09BLA0716__2014-08-14_0
#> 7       95C__2012-10-02_0
#> 8       95T__2012-10-03_0
# results for only those with differences
df_tst_diff2 <- filter(df_tst2, SampleID %in% samp_diff2$SampleID)
# add diff field
df_tst_diff2$Exclude_Diff <- df_tst_diff2$Exclude - df_tst_diff2$Exclude_New

# Classification Performance Metrics
class_TP2 <- tbl_results2[2,2] # True Positive
class_FN2 <- tbl_results2[2,1] # False Negative
class_FP2 <- tbl_results2[1,2] # False Positive
class_TN2 <- tbl_results2[1,1] # True Negative
class_n2 <- sum(tbl_results2)  # total
#
# sensitivity (recall); TP / (TP+FN); measure model to ID true positives
class_sens2 <- class_TP2 / (class_TP2 + class_FN2)
# precision; TP / (TP+FP); accuracy of model positives
class_prec2 <- class_TP2 / (class_TP2 + class_FP2)
# specifity; TN / (TN + FP); measure model to ID true negatives
class_spec2 <- class_TN2 / (class_TN2 + class_FP2)
# overall accuracy; (TP + TN) / all cases; accuracy of all classifications
class_acc2 <- (class_TP2 + class_TN2) / class_n2
# F1; 2 * (class_prec*class_sens) / (class_prec+class_sens)
## balance of precision and recall
class_F12 <- 2 * (class_prec2 * class_sens2) / (class_prec2 + class_sens2)
#
results_names2 <- c("Sensitivity (Recall)"
                    , "Precision"
                    , "Specificity"
                    , "Overall Accuracy"
                    , "F1")
results_values2 <- c(class_sens2
                    , class_prec2
                    , class_spec2
                    , class_acc2
                    , class_F12)
#
tbl_class2 <- data.frame(results_names2, results_values2)
names(tbl_class2) <- c("Performance Metrics", "Percent")
tbl_class2$Percent <- round(tbl_class2$Percent * 100, 2)
kable(tbl_class2)
#> 
#> 
#> |Performance Metrics  | Percent|
#> |:--------------------|-------:|
#> |Sensitivity (Recall) |   99.66|
#> |Precision            |  100.00|
#> |Specificity          |  100.00|
#> |Overall Accuracy     |   99.97|
#> |F1                   |   99.83|