Calculate metric statistics — metric.stats • BioMonTools

This function calculates metric statistics for use in developing a multi-metric index.

Inputs are a data frame with columns for the sample identifier, reference status, data type, and metric values.

metric.stats(
  fun.DF,
  col_metrics,
  col_SampID = "SAMPLEID",
  col_RefStatus = "Ref_Status",
  RefStatus_Ref = "Ref",
  RefStatus_Str = "Str",
  RefStatus_Oth = "Oth",
  col_DataType = "Data_Type",
  DataType_Cal = "Cal",
  DataType_Ver = "Ver",
  col_Subset = NULL,
  Subset_Value = NULL
)

Arguments

fun.DF: Data frame.
col_metrics: Column names for metrics.
col_SampID: Column name for unique sample identifier. Default = "SAMPLEID".
col_RefStatus: Column name for Reference Status. Default = "Ref_Status".
RefStatus_Ref: Reference Status name for Reference used in col_RefStatus. Default = "Ref". Use NULL if you don't use this value.
RefStatus_Str: Reference Status name for Stressed used in col_RefStatus. Default = "Str". Use NULL if you don't use this value.
RefStatus_Oth: Reference Status name for Other used in col_RefStatus. Default = "Oth". Use NULL if you don't use this value.
col_DataType: Column name for Data Type (Calibration vs. Verification). Default = "Data_Type".
DataType_Cal: Data Type name for Calibration used in col_DataType. Default = "Cal". Use NULL if you don't use this value.
DataType_Ver: Data Type name for Verification used in col_DataType. Default = "Ver". Use NULL if you don't use this value.
col_Subset: Column name used to subset the data. Default = NULL. If NULL, no subset is generated.
Subset_Value: Value in col_Subset used to create the subset. Default = NULL.

Value

A data frame of metrics (rows) and statistics (columns), in long format with columns for INDEX_CLASS, RefStatus, and DataType.

Details

Summary statistics for the data are calculated.

The data are filtered on the column given in col_Subset, keeping only the single value given in Subset_Value. If further subsets are needed, re-run the function once per value. If no subset is given, the entire data set is used.
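
The subsetting step can be pictured with a small base R sketch. This is not the package's internal code; the toy data frame df_demo and its values are made up for illustration, and the pre-check is only an assumption about how a missing subset value might be caught before the statistics are run.

```r
# Toy data; the values here are illustrative only
df_demo <- data.frame(
  SAMPLEID    = c("S1", "S2", "S3"),
  INDEX_CLASS = c("A", "B", "A"),
  stringsAsFactors = FALSE
)
col_Subset   <- "INDEX_CLASS"
Subset_Value <- "A"

if (is.null(col_Subset)) {
  # No subset requested: use the whole data set
  df_sub <- df_demo
} else {
  # Fail early if the requested value is not present in the column
  stopifnot(Subset_Value %in% df_demo[[col_Subset]])
  # Keep only rows matching the single subset value
  df_sub <- df_demo[df_demo[[col_Subset]] == Subset_Value, , drop = FALSE]
}
nrow(df_sub)
```

Checking unique(df[[col_Subset]]) before calling the function is a cheap way to confirm the spelling and case of the subset value.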

Statistics will be generated for up to 6 combinations for RefStatus (Ref, Oth, Str) and DataType (Cal, Ver).
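
As a rough illustration of where the 6 comes from, the sketch below enumerates the pairings using the default labels from the usage section. The object names are hypothetical and this is not how the package builds its groups internally.

```r
# Labels taken from the function defaults
ref_status <- c("Ref", "Oth", "Str")
data_type  <- c("Cal", "Ver")

# Every RefStatus x DataType pairing: 3 x 2 = 6 rows
combos <- expand.grid(
  RefStatus = ref_status,
  DataType  = data_type,
  stringsAsFactors = FALSE
)
nrow(combos)
```

Setting one of the RefStatus_* or DataType_* arguments to NULL leaves fewer labels, which is why the count is "up to" 6.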

The resulting data frame has the statistics in columns, with the first 4 columns being INDEX_CLASS (if col_Subset is not provided), col_RefStatus, col_DataType, and Metric_Name.

The following statistics are generated with na.rm = TRUE.

* n = number

* min = minimum

* max = maximum

* mean = mean

* median = median

* range = range (max - min)

* sd = standard deviation

* cv = coefficient of variation (sd/mean)

* q05 = 5th percentile

* q10 = 10th percentile

* q25 = 25th percentile

* q50 = 50th percentile

* q75 = 75th percentile

* q90 = 90th percentile

* q95 = 95th percentile
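
The list above can be reproduced for a single numeric vector with base R. The helper calc_stats below is hypothetical and not part of BioMonTools; it assumes that n counts non-missing values and that quantiles use R's default method (type 7), neither of which is confirmed by this page.

```r
# Hypothetical helper: the statistics listed above for one metric
calc_stats <- function(x) {
  x <- x[!is.na(x)]  # same effect as na.rm = TRUE in each call
  q <- stats::quantile(
    x,
    probs = c(0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95),
    names = FALSE
  )
  c(
    n      = length(x),           # assumption: non-missing count
    min    = min(x),
    max    = max(x),
    mean   = mean(x),
    median = stats::median(x),
    range  = max(x) - min(x),
    sd     = stats::sd(x),
    cv     = stats::sd(x) / mean(x),
    q05 = q[1], q10 = q[2], q25 = q[3], q50 = q[4],
    q75 = q[5], q90 = q[6], q95 = q[7]
  )
}

# Made-up values with one missing observation
s <- calc_stats(c(2, 4, 4, NA, 10))
s[c("n", "mean", "median", "range")]
```

Note that cv is undefined when the mean is zero, so metrics that are all zero within a group will give NaN for that column under this sketch.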

Examples

# data, benthos
df_bugs <- data_mmi_dev

# Munge Names
names(df_bugs)[names(df_bugs) %in% "BenSampID"] <- "SAMPLEID"
names(df_bugs)[names(df_bugs) %in% "TaxaID"] <- "TAXAID"
names(df_bugs)[names(df_bugs) %in% "Individuals"] <- "N_TAXA"
names(df_bugs)[names(df_bugs) %in% "Exclude"] <- "EXCLUDE"
names(df_bugs)[names(df_bugs) %in% "Class"] <- "INDEX_CLASS"
names(df_bugs)[names(df_bugs) %in% "Unique_ID"] <- "SITEID"

# Calc Metrics
cols_keep <- c("Ref_v1", "CalVal_Class4", "SITEID", "CollDate", "CollMeth")
# INDEX_NAME and INDEX_CLASS kept by default
df_metval <- metric.values(df_bugs, "bugs", fun.cols2keep = cols_keep)
#>  
#> There are 7 missing fields in the data:
#> ELEVATION_ATTR, GRADIENT_ATTR, WSAREA_ATTR, HABSTRUCT, BCG_ATTR2, AIRBREATHER, UFC
#>  
#> If you continue the metrics associated with these fields will be invalid.
#> For example, if the HABIT field is missing all habit related metrics will not be correct.
#> Do you wish to continue (YES or NO)?
#> boo.Shiny == TRUE and interactive == FALSE
#>                so prompt skipped and value set to '1'.
#> Warning: Metrics related to the following fields are invalid:
#>    ELEVATION_ATTR
#>    GRADIENT_ATTR
#>    WSAREA_ATTR
#>    HABSTRUCT
#>    BCG_ATTR2
#>    AIRBREATHER
#>    UFC
#> Joining with `by = join_by(SAMPLEID, INDEX_NAME, INDEX_CLASS)`
#> Warning: The `.dots` argument of `group_by()` is deprecated as of dplyr 1.0.0.
#> ℹ The deprecated feature was likely used in the dplyr package.
#>   Please report the issue at <https://github.com/tidyverse/dplyr/issues>.

# Calc Stats
col_metrics   <- names(df_metval)[9:ncol(df_metval)]
col_SampID    <- "SAMPLEID"
col_RefStatus <- "REF_V1"
RefStatus_Ref <- "Ref"
RefStatus_Str <- "Strs"
RefStatus_Oth <- "Other"
col_DataType  <- "CALVAL_CLASS4"
DataType_Cal  <- "cal"
DataType_Ver  <- "verif"
col_Subset    <- "INDEX_CLASS"
Subset_Value  <- "CENTRALHILLS"
df_stats <- metric.stats(df_metval
                         , col_metrics
                         , col_SampID
                         , col_RefStatus
                         , RefStatus_Ref
                         , RefStatus_Str
                         , RefStatus_Oth
                         , col_DataType
                         , DataType_Cal
                         , DataType_Ver
                         , col_Subset
                         , Subset_Value)
#> Error in metric.stats(df_metval, col_metrics, col_SampID, col_RefStatus,     RefStatus_Ref, RefStatus_Str, RefStatus_Oth, col_DataType,     DataType_Cal, DataType_Ver, col_Subset, Subset_Value): Values missing from column 'INDEX_CLASS'; CENTRALHILLS

if (FALSE) {
# Save Results
write.table(df_stats
            , file.path(tempdir(), "metric.stats.tsv")
            , col.names = TRUE
            , row.names = FALSE
            , sep = "\t")
}