Purpose

The BioMonTools package was created to enable users to have a common set of tools for bioassessment and biomonitoring data.

Installation

The package is hosted on GitHub (https://github.com/leppott/BioMonTools) and can be installed using the following lines of code. It is necessary to install the devtools package.

# Installing the BioMonTools library (with the vignette) from GitHub
library(devtools) 
install_github("leppott/BioMonTools", force=TRUE, build_vignettes=TRUE)

Help

After the BioMonTools package has been installed running the following line of code will open the help file with links to all functions.

help(package="BioMonTools")

Package Functions

The suite of functions in the BioMonTools package are presented below.

  • metric.values
  • markExcluded
  • rarify

metric.values

Calculate metric values from a data frame with taxa by sample with fully qualified master taxa information. All percent metric results are 0-1.

Data Preparation

There are a number of required fields for the input file. If any fields are missing the user will be prompted as to which are missing and the user can decide whether to continue or quit. If the user continues, the missing fields will be added but will be filled with zero or NA (as appropriate). Any metrics based on the missing fields will not be valid.

Required Fields:

  • SAMPLEID

    • Unique sample identifier

    • Valid values: character or number, must be unique

  • TAXAID

    • Unique taxa identifier

    • Valid values: character or number, must be unique

  • N_TAXA

    • Number of individuals

    • Valid values: number

  • EXCLUDE

    • Non-unique/non-distinct taxa are excluded from richness metric calculations but are counted in the other metrics. Appendix B of the ‘BCGcalc_README_20180919’ Word document describes the Exclude Taxa Decision Criteria that was used during BCG model calibration.

    • Valid values: TRUE or FALSE

    • Non-unique/non-distinct taxa should be entered as “TRUE”

  • NONTARGET

    • Non-target taxa are not part of the intended capture list; e.g., fish, herps, water column taxa. They are excluded from all metric calculations.

    • Valid values: TRUE or FALSE.

    • NonTarget taxa should be entered as “TRUE”

  • INDEX_NAME

    • Name of the BCG rules worksheet in the ‘Rules’ file (in the ‘extdata’ folder) that the R code is referencing.

    • Valid values for the Puget Lowlands/Willamette Valley model: BCG_PacNW_v1_500ct or BCG_PacNW_v1_300ct.

  • SITE_TYPE

    • Select which BCG model to apply: Low (lo) gradient (<1% NHD+ v2 flowline slope) or high (hi) gradient (≥ 1% NHD+ v2 flowline slope).

    • Valid values: “hi” or “lo”.

  • PHYLUM, SUBPHYLUM, CLASS, ORDER, FAMILY, SUBFAMILY, TRIBE, GENUS

    • Phylogeny

    • Other phylogenetic rankings (e.g., SubOrder or SuperFamily) can be included but are not used in the current metric calculations.

    • Valid values: text

  • BCG_Attr

    • BCG attribute assignments (for example, Stamp and Gerritsen 2018).

    • Valid values: 1i, 1m, 2, 3, 4, 5, 6; if not available, leave blank or enter ‘NA’.

  • FFG, HABIT, LIFE_CYCLE, TOLVAL, THERMAL_INDICATOR, UFC, ELEVATION_ATTR, GRADIENT_ATTR, WSAREA_ATTR, REPRODUCTION, CONNECTIVITY

    • FFG

      • Valid values: CG, CF, PR, SC, SH

      • Function feeding group entries.

    • HABIT

      • Valid values: BU, CB, CN, SP, SW

      • Habit designations.

    • LIFE_CYLCE

      • Valid values: UNI, SEMI, MULTI

      • Life cycle designations.

    • THERMAL_INDICATOR

      • Valid values: STENOC, COLD, COOL, WARM, STENOW , EURYTHERMAL, INCONCLUSIVE, NA
    • LONGLIVED

      • Valid values: TRUE, FALSE
    • NOTEWORTHY

      • Valid values: TRUE, FALSE
    • HABITAT

      • Valid values: BRAC, DEPO, GENE, HEAD, RHEO, RIVE, SPEC, UNKN
    • UFC

      • Valid values: numeric; 1, 2, 3, 4, 5, 6

      • Taxonomic Uncertainty Frequency Class

    • ELEVATION_ATTR

      • Valid values: LOW, HIGH
    • GRADIENT_ATTR

      • Valid values: LOW, MOD, HIGH
    • WSAREA_ATTR

      • Valid values: SMALL, MEDIUM, LARGE, XLARGE
    • REPRODUCTION

      • Valid values: BROADCASTER, SIMPLE NEST, COMPLEX NEST, BEARER, MIGRATORY
    • CONNECTIVITY

      • Valid values: TRUE, FALSE

Fields that are optional can be in the output (see examples below).

# Packages
# library(readxl)
# library(knitr)
# library(BioMonTools)
# library(dplyr)

# Load Data
df_data <- readxl::read_excel(system.file("./extdata/Data_Benthos.xlsx"
                                       , package = "BioMonTools")
                      , guess_max = 10^6)
# Columns to keep
myCols <- c("Area_mi2", "SurfaceArea", "Density_m2", "Density_ft2")

# Run Function
df_metval <- BioMonTools::metric.values(df_data, "bugs", fun.cols2keep = myCols)
#> Updated col class; TOLVAL2 to numeric
#> Joining with `by = join_by(SAMPLEID, INDEX_NAME, INDEX_CLASS)`
#> Warning: The `.dots` argument of `group_by()` is deprecated as of dplyr 1.0.0.
#>  The deprecated feature was likely used in the dplyr package.
#>   Please report the issue at <https://github.com/tidyverse/dplyr/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
# Metrics of Interest
## thermal indicator (_ti_)
#names(df.metval)[grepl("_ti_", names(df.metval))]
col_met2keep <- c("ni_total"
                  , "nt_total"
                  , "nt_ti_stenocold" # renamed from corecold
                  , "nt_ti_cold"
                  , "nt_ti_cool"
                  , "pi_ti_stenocold" # renamed from corecold
                  , "pi_ti_cold"
                  , "pi_ti_cool")
col_ID <- c("SAMPLEID"
            , toupper(myCols)
            , "INDEX_NAME"
            , "INDEX_CLASS")
# Ouput
df_metval_ci <- df_metval[, c(col_ID, col_met2keep)]
# RMD table
knitr::kable(head(df_metval_ci), caption = "Metric Calculation, select metrics")
Metric Calculation, select metrics
SAMPLEID AREA_MI2 SURFACEAREA DENSITY_M2 DENSITY_FT2 INDEX_NAME INDEX_CLASS ni_total nt_total nt_ti_stenocold nt_ti_cold nt_ti_cool pi_ti_stenocold pi_ti_cold pi_ti_cool
01103CSR_Bug_2001-08-27_0 44.084184 NA NA NA BCG_PacNW_v1_500ct Lo 589 27 0 1 10 0 40.237691 60.95076
02087REF_Bug_2002-08-22_0 1.791475 NA NA NA BCG_PacNW_v1_500ct Hi 542 38 0 23 28 0 18.265683 20.66421
03013CSR_Bug_2003-07-01_0 9.613734 8 NA NA BCG_PacNW_v1_500ct Hi 507 16 0 3 5 0 2.958580 13.21499
03053CSR_Bug_2003-08-14_0 3.061723 8 NA NA BCG_PacNW_v1_500ct Hi 573 36 0 19 24 0 46.073298 58.46422
03054CSR_Bug_2003-08-18_0 71.249741 8 NA NA BCG_PacNW_v1_500ct Lo 558 13 0 4 4 0 6.989247 11.82796
06012CSR_Bug_2006-09-05_0 82.338353 8 NA NA BCG_PacNW_v1_500ct Hi 545 40 0 18 28 0 33.944954 75.96330

markExcluded

Takes as an input data frame with Sample ID, Taxa ID, and phlogenetic name fields and returns a similar dataframe with a column for “exclude” taxa (TRUE or FALSE).

Exclude taxa are refered to by multiple names; ambiguous, non-distinct, and non-unique. The “exclude” name was chosen so as to be consistent with “non-target” taxa. That is, taxa marked as “TRUE” are treated as undesirables. Exclude taxa are those that are present in a sample when taxa of the same group are present in the same sample are identified at a finer level. That is, the parent is marked as exclude when child taxa are present in the same sample.

The exclude taxa are referenced in the metric values function. These taxa are removed from the taxa richness metrics. This is because these are coarser level taxa.

Exceptions is a 2 column data frame of synonyms or other exceptions. Column 1 is the name used in the TaxaID column the input data frame (df_samptax). Column 2 is the name used in the TaxaLevels columns of the input data frame (df_samptax). The phylogenetic columns (TaxaLevels) will be modified from Column 2 of the Exceptions data frame to match Column 1 of the Exceptions data frame. This ensures that the algorithm for markExcluded works properly. The changes will not be stored and the original names provided in the input data frame (df_samptax) will be returned in the final result. The function example below includes a practical case.

Taxa Levels are phylogenetic names that are to be checked. They should be listed in order from course (kingdom) to fine (species). Names not appearing in the data will be skipped.

The spelling of names must be consistent (including case) for this function to produce the intended output.

Returns a data frame of df_samptax with an additional column, Exclude.

# Packages
#library(readxl)
#library(dplyr)
#library(lazyeval)
#library(knitr)

# Define pipe
`%>%` <- dplyr::`%>%`

# Data
df_samps_bugs <- readxl::read_excel(system.file("./extdata/Data_Benthos.xlsx"
                                        , package = "BioMonTools")
                            , guess_max = 10^6)

# Variables
SampID     <- "SampleID"
TaxaID     <- "TaxaID"
TaxaCount  <- "N_Taxa"
Exclude    <- "Exclude_New"
TaxaLevels <- c("Kingdom"
                , "Phylum"
                , "SubPhylum"
                , "Class"
                , "SubClass"
                , "Order"
                , "SubOrder"
                , "SuperFamily"
                , "Family"
                , "SubFamily"
                , "Tribe"
                , "Genus"
                , "SubGenus"
                , "Species"
                , "Variety")
# Taxa that should be treated as equivalent
Exceptions <- data.frame("TaxaID" = "Sphaeriidae", "PhyloID" = "Pisidiidae")

# Filter Data
# df_samptax <- filter(df_samps_bugs, !!as.name(SampID) == 
# "08BEA3478__2013-08-21_0")
# df_tst_small <- markExcluded(df_samptax, SampID, TaxaID, TaxaCount, TaxaLevels
#, Exceptions, Exclude)

# EXAMPLE 1
df_tst <- BioMonTools::markExcluded(df_samps_bugs
                                    , SampID = "SampleID"
                                    , TaxaID = "TaxaID"
                                    , TaxaCount = "N_Taxa"
                                    , Exclude = "Exclude_New"
                                    , TaxaLevels = TaxaLevels
                                    , Exceptions = Exceptions)
#> *WARNING*, 3 taxa levels missing from input data frame; 
#>  KINGDOM, SUBORDER, VARIETY
#> [1] "Working on item (1/12); PHYLUM"
#> [1] "Working on item (2/12); SUBPHYLUM"
#> [1] "Working on item (3/12); CLASS"
#> [1] "Working on item (4/12); SUBCLASS"
#> [1] "Working on item (5/12); ORDER"
#> [1] "Working on item (6/12); SUPERFAMILY"
#> [1] "Working on item (7/12); FAMILY"
#> [1] "Working on item (8/12); SUBFAMILY"
#> [1] "Working on item (9/12); TRIBE"
#> [1] "Working on item (10/12); GENUS"
#> [1] "Working on item (11/12); SUBGENUS"
#> [1] "Working on item (12/12); SPECIES"

# Compare
df_compare <- dplyr::summarise(dplyr::group_by(df_tst, SampleID)
                               , Exclude_Import = sum(Exclude)
                               , Exclude_R = sum(Exclude_New))
df_compare$Diff <- df_compare$Exclude_Import - df_compare$Exclude_R
#
tbl_diff <- table(df_compare$Diff)
#kable(tbl_diff)
# sort
df_compare <- df_compare %>% dplyr::arrange(desc(Diff))

# Number with issues
#sum(abs(df_compare$Diff))
# total samples
#nrow(df_compare)

# confusion matrix
tbl_results <- table(df_tst$Exclude, df_tst$Exclude_New, useNA = "ifany")
#
# Show differences
knitr::kable(tbl_results, caption = "Confusion Matrix")
Confusion Matrix
FALSE TRUE
FALSE 24653 0
TRUE 0 2350
# samples with differences
samp_diff <- as.data.frame(df_compare[df_compare[,"Diff"] != 0, "SampleID"])
# results for only those with differences
df_tst_diff <- df_tst[df_tst[,"SampleID"] %in% samp_diff$SampleID, ]
# add diff field
df_tst_diff$Exclude_Diff <- df_tst_diff$Exclude - df_tst_diff$Exclude_New

# Classification Performance Metrics
class_TP <- tbl_results[2,2] # True Positive
class_FN <- tbl_results[2,1] # False Negative
class_FP <- tbl_results[1,2] # False Positive
class_TN <- tbl_results[1,1] # True Negative
class_n <- sum(tbl_results)  # total
#
# sensitivity (recall); TP / (TP+FN); measure model to ID true positives
class_sens <- class_TP / (class_TP + class_FN)
# precision; TP / (TP+FP); accuracy of model positives
class_prec <- class_TP / (class_TP + class_FP)
# specifity; TN / (TN + FP); measure model to ID true negatives
class_spec <- class_TN  / (class_TN + class_FP)
# overall accuracy; (TP + TN) / all cases; accuracy of all classifications
class_acc <- (class_TP + class_TN) / class_n
# F1; 2 * (class_prec*class_sens) / (class_prec+class_sens)
## balance of precision and recall
class_F1 <- 2 * (class_prec * class_sens) / (class_prec + class_sens)
#
results_names <- c("Sensitivity (Recall)"
                   , "Precision", "Specificity"
                   , "OVerall Accuracy"
                   , "F1")
results_values <- c(class_sens
                    , class_prec
                    , class_spec
                    , class_acc
                    , class_F1)
#
tbl_class <- data.frame(results_names, results_values)
names(tbl_class) <- c("Performance Metrics", "Percent")
tbl_class$Percent <- round(tbl_class$Percent * 100, 2)
knitr::kable(tbl_class, caption = "Classification Performance Metrics")
Classification Performance Metrics
Performance Metrics Percent
Sensitivity (Recall) 100
Precision 100
Specificity 100
OVerall Accuracy 100
F1 100

The same data with no “exceptions” leaves 8 records misclassified.

# Packages
#library(readxl)
#library(dplyr)
#library(lazyeval)
#library(knitr)

# Define pipe
`%>%` <- dplyr::`%>%`

# Data
df_samps_bugs <- readxl::read_excel(system.file("./extdata/Data_Benthos.xlsx"
                                              , package = "BioMonTools")
                                  , guess_max = 10^6)

# Variables
SampID     <- "SampleID"
TaxaID     <- "TaxaID"
TaxaCount  <- "N_Taxa"
Exclude    <- "Exclude_New"
TaxaLevels <- c("Kingdom"
                , "Phylum"
                , "SubPhylum"
                , "Class"
                , "SubClass"
                , "Order"
                , "SubOrder"
                , "SuperFamily"
                , "Family"
                , "SubFamily"
                , "Tribe"
                , "Genus"
                , "SubGenus"
                , "Species"
                , "Variety")
# Taxa that should be treated as equivalent
Exceptions <- NA

# EXAMPLE 2
## No Exceptions

df_tst2 <- BioMonTools::markExcluded(df_samps_bugs
                                     , SampID = "SampleID"
                                     , TaxaID = "TaxaID"
                                     , TaxaCount = "N_Taxa"
                                     , Exclude = "Exclude_New"
                                     , TaxaLevels = TaxaLevels
                                     , Exceptions = NA)
#> *WARNING*, 3 taxa levels missing from input data frame; 
#>  KINGDOM, SUBORDER, VARIETY
#> [1] "Working on item (1/12); PHYLUM"
#> [1] "Working on item (2/12); SUBPHYLUM"
#> [1] "Working on item (3/12); CLASS"
#> [1] "Working on item (4/12); SUBCLASS"
#> [1] "Working on item (5/12); ORDER"
#> [1] "Working on item (6/12); SUPERFAMILY"
#> [1] "Working on item (7/12); FAMILY"
#> [1] "Working on item (8/12); SUBFAMILY"
#> [1] "Working on item (9/12); TRIBE"
#> [1] "Working on item (10/12); GENUS"
#> [1] "Working on item (11/12); SUBGENUS"
#> [1] "Working on item (12/12); SPECIES"

# Compare
df_compare2 <- dplyr::summarise(dplyr::group_by(df_tst2, SampleID)
                               , Exclude_Import = sum(Exclude)
                               , Exclude_R = sum(Exclude_New))
df_compare2$Diff <- df_compare2$Exclude_Import - df_compare2$Exclude_R
#
tbl_diff2 <- table(df_compare2$Diff)
#kable(tbl_diff2)
# sort
df_compare2 <- df_compare2 %>% dplyr::arrange(desc(Diff))

# Number with issues
#sum(abs(df_compare2$Diff))
# total samples
#nrow(df_compare2)

# confusion matrix
tbl_results2 <- table(df_tst2$Exclude, df_tst2$Exclude_New, useNA = "ifany")
#
# Show differences
knitr::kable(tbl_results2, caption = "Confusion Matrix")
Confusion Matrix
FALSE TRUE
FALSE 24653 0
TRUE 8 2342
knitr::kable(df_compare2[1:10, ])
SampleID Exclude_Import Exclude_R Diff
08BEA3478__2013-08-21_0 3 2 1
08BEA3571__2014-09-03_0 5 4 1
08LIT2692__2012-09-10_0 8 7 1
08SAM0000__2012-07-11_0 4 3 1
08SAM0000__2013-07-25_0 4 3 1
09BLA0716__2014-08-14_0 3 2 1
95C__2012-10-02_0 1 0 1
95T__2012-10-03_0 1 0 1
01103CSR_Bug_2001-08-27_0 1 1 0
02087REF_Bug_2002-08-22_0 0 0 0
knitr::kable(tail(df_compare2))
SampleID Exclude_Import Exclude_R Diff
silvr192__2012-08-24_0 3 3 0
skytrmann__2010-08-12_0 6 6 0
skytrmann__2013-08-28_0 8 8 0
swampcart__2012-08-20_0 6 6 0
woodshand__2010-08-06_0 12 12 0
woodshand__2013-09-03_0 4 4 0
# samples with differences
(samp_diff2 <- as.data.frame(df_compare2[df_compare2[,"Diff"] != 0, "SampleID"]))
#>                  SampleID
#> 1 08BEA3478__2013-08-21_0
#> 2 08BEA3571__2014-09-03_0
#> 3 08LIT2692__2012-09-10_0
#> 4 08SAM0000__2012-07-11_0
#> 5 08SAM0000__2013-07-25_0
#> 6 09BLA0716__2014-08-14_0
#> 7       95C__2012-10-02_0
#> 8       95T__2012-10-03_0
# results for only those with differences
df_tst_diff2 <- dplyr::filter(df_tst2, SampleID %in% samp_diff2$SampleID)
# add diff field
df_tst_diff2$Exclude_Diff <- df_tst_diff2$Exclude - df_tst_diff2$Exclude_New

# Classification Performance Metrics
class_TP2 <- tbl_results2[2,2] # True Positive
class_FN2 <- tbl_results2[2,1] # False Negative
class_FP2 <- tbl_results2[1,2] # False Positive
class_TN2 <- tbl_results2[1,1] # True Negative
class_n2 <- sum(tbl_results2)  # total
#
# sensitivity (recall); TP / (TP+FN); measure model to ID true positives
class_sens2 <- class_TP2 / (class_TP2 + class_FN2)
# precision; TP / (TP+FP); accuracy of model positives
class_prec2 <- class_TP2 / (class_TP2 + class_FP2)
# specifity; TN / (TN + FP); measure model to ID true negatives
class_spec2 <- class_TN2 / (class_TN2 + class_FP2)
# overall accuracy; (TP + TN) / all cases; accuracy of all classifications
class_acc2 <- (class_TP2 + class_TN2) / class_n2
# F1; 2 * (class_prec*class_sens) / (class_prec+class_sens)
## balance of precision and recall
class_F12 <- 2 * (class_prec2 * class_sens2) / (class_prec2 + class_sens2)
#
results_names2 <- c("Sensitivity (Recall)"
                    , "Precision"
                    , "Specificity"
                    , "OVerall Accuracy"
                    , "F1")
results_values2 <- c(class_sens2
                     , class_prec2
                     , class_spec2
                     , class_acc2
                     , class_F12)
#
tbl_class2 <- data.frame(results_names2, results_values2)
names(tbl_class2) <- c("Performance Metrics", "Percent")
tbl_class2$Percent <- round(tbl_class2$Percent * 100, 2)
knitr::kable(tbl_class2, caption = "Classification Performance Metrics")
Classification Performance Metrics
Performance Metrics Percent
Sensitivity (Recall) 99.66
Precision 100.00
Specificity 100.00
OVerall Accuracy 99.97
F1 99.83

rarify

The rarify function subsamples count data to a fixed count per sample. It takes as an input a 3 column data frame (SampleID, TaxonID, Count) and returns a similar dataframe with revised Counts. The names of the columns does not matter as they are specified in the code. Any non-count taxa (e.g., fish in a bug sample) should be removed prior to using the rarify function. The function code is from USEPA Corvallis John Van Sickle’s R code for RIVPACS (v1.0, 2005-06-10) and was tweaked for the addition of a user provided seed so repeatable results can be obtained.

The other function inputs are subsample size (target number of organisms in each sample) and seed. The seed is given so the results can be reproduced from the same input file. If no seed is given a random seed is used. An example seed is the date of admission to the Union for each state where the data is collected (e.g., Washington is 18891111). These values can be found on Wikipedia on the right sidebar for each State.

If you are running the 500-count BCG model and any of your samples have more than 600 organisms (the upper limit for the model), you should randomly subsample your data to 600 (600 is +20% of the 500-count target). This is done to make richness metrics comparable across the 500-count samples. You can do this with the Rarify routine.

# Subsample to 500 organisms (from over 500 organisms) for 12 samples.

# Packages
#library(BioMonTools)
#library(knitr)

# load bio data
df_biodata <- BioMonTools::data_bio2rarify
#dim(df_biodata)
#kable(head(df_biodata))

# subsample
mySize <- 500
Seed_OR <- 18590214
Seed_WA <- 18891111
Seed_US <- 17760704
bugs_mysize <- BioMonTools::rarify(inbug = df_biodata
                                   , sample.ID = "SampleID"
                                   , abund = "N_Taxa"
                                   , subsiz = mySize
                                   , mySeed = Seed_US)
#> Rarify of samples complete. 
#>  Number of samples =  12 
#>  Execution time (sec) =  0.0169999999999995

# view results
#dim(bugs_mysize)
#kable(head(bugs_mysize))

# Compare pre- and post- subsample counts
df_compare <- merge(df_biodata
                    , bugs_mysize
                    , by = c("SampleID", "TaxaID")
                    , suffixes = c("_Orig","_500"))
df_compare <- df_compare[,c("SampleID", "TaxaID", "N_Taxa_Orig", "N_Taxa_500")]
knitr::kable(head(df_compare), caption = "Comparison, by Sample")
Comparison, by Sample
SampleID TaxaID N_Taxa_Orig N_Taxa_500
01103CSR_Bug_2001-08-27_0 Ceratopogoninae 237 142
01103CSR_Bug_2001-08-27_0 Cheumatopsyche 2 0
01103CSR_Bug_2001-08-27_0 Chironomini 8 4
01103CSR_Bug_2001-08-27_0 Cladocera 2 1
01103CSR_Bug_2001-08-27_0 Coenagrionidae 2 1
01103CSR_Bug_2001-08-27_0 Corbicula 2 1

# compare totals
tbl_totals <- aggregate(cbind(N_Taxa_Orig, N_Taxa_500) ~ SampleID
                        , df_compare
                        , sum)
knitr::kable(head(tbl_totals), caption = "Comparison, sample totals")
Comparison, sample totals
SampleID N_Taxa_Orig N_Taxa_500
01103CSR_Bug_2001-08-27_0 816 500
02087REF_Bug_2002-08-22_0 542 500
03013CSR_Bug_2003-07-01_0 508 500
03053CSR_Bug_2003-08-14_0 573 500
03054CSR_Bug_2003-08-18_0 558 500
06012CSR_Bug_2006-09-05_0 545 500

## Not run: 
# save the data
#write.table(bugs_mysize, paste("bugs",mySize,"txt",sep="."),sep="\t")
## End(Not run)

Flags

Results should be interpreted with caution if they are flagged for any of the criteria listed below. It is up to the user’s discretion as to how to handled flagged samples.

# Packages
#library(readxl)
#library(reshape2)
#library(knitr)
#library(BioMonTools)

# Import
df.samps.bugs <- readxl::read_excel(system.file("extdata/Data_Benthos.xlsx"
                                                , package = "BioMonTools")
                           , guess_max = 10^6)

# Calculate Metrics
# Extra columns to keep in results
keep.cols <- c("Area_mi2", "SurfaceArea", "Density_m2", "Density_ft2")
# Run Function
df.metrics <- BioMonTools::metric.values(df.samps.bugs
                                         , "bugs"
                                         , fun.cols2keep = keep.cols)
#> Updated col class; TOLVAL2 to numeric
#> Joining with `by = join_by(SAMPLEID, INDEX_NAME, INDEX_CLASS)`

# Flags
# Import QC Checks
df.checks <- readxl::read_excel(system.file("extdata/MetricFlags.xlsx"
                                          , package = "BioMonTools")
                                , sheet = "Flags") 
# Run Function
df.flags <- BioMonTools::qc.checks(df.metrics, df.checks)

# Change terminology; PASS/FAIL to NA/flag
df.flags[,"FLAG"][df.flags[,"FLAG"] == "FAIL"] <- "flag"
df.flags[, "FLAG"][df.flags[,"FLAG"] == "PASS"] <- NA
# long to wide format
df.flags.wide <- reshape2::dcast(df.flags
                                 , SAMPLEID ~ CHECKNAME
                                 , value.var = "FLAG")
# Calc number of "flag"s by row.
df.flags.wide$NumFlags <- rowSums(df.flags.wide == "flag", na.rm = TRUE)
# Rearrange columns
NumCols <- ncol(df.flags.wide)
df.flags.wide <- df.flags.wide[, c(1, NumCols, 2:(NumCols - 1))]
# View(df.flags.wide)

# Summarize Results
knitr::kable(table(df.flags[,"CHECKNAME"], df.flags[,"FLAG"], useNA = "ifany"))
flag NA
brackish organisms present 7 671
catchment, Large 0 678
catchment, small 158 520
individuals, dominant 02, Large 204 474
individuals, Large 0 678
individuals, small 80 598
Low density (ft2) 0 678
Low density (m2) 0 678
Ramellogammarus 8 670
surface area, small 112 566