Data formatting

library(forestexplorR)
library(dplyr)

This vignette describes how mapped forest stand datasets should be formatted to ensure full compatibility with forestexplorR and introduces some data checking functions to assist with the formatting process.

Data file types

The forestexplorR package expects raw mapped forest stand data to consist of at least two distinct files:

Mapping data - coordinate locations of individual trees within stands
Tree census data - measurements of individual trees taken during repeated censuses of the stands

Mapping data

The mapping dataset should be a data frame where each row represents a unique tree and contains its identifying and location information. The built-in dataset mapping is a cleaned and correctly formatted example:

head(mapping)
#>         tree_id stand_id  tag species year  dbh x_coord y_coord
#> 1 AB08000100022     AB08   19    TSHE 2012 15.1   18.11   95.68
#> 2 AB08000100001     AB08 4001    TSHE 2012 50.4   11.78   96.77
#> 3 AB08000100002     AB08 4002    TSHE 2012 81.1   12.47   97.95
#> 4 AB08000100003     AB08 4003    TSHE 2012 20.0   14.94   91.91
#> 5 AB08000100004     AB08 4004    TSHE 2012 53.7   22.09   99.06
#> 6 AB08000100005     AB08 4005    TSHE 2012 24.4   12.16   91.48

To ensure full compatibility with forestexplorR functions, the mapping dataset should contain the columns: tree_id, stand_id, tag, species, x_coord, and y_coord. Additional columns can also be included without disrupting function performance and column order is not important. For details on what each required column represents, use ?mapping. If only a few functions from forestexplorR will be applied, some columns in mapping may not be required (e.g. the column tag is only required for stand_map()) so please read individual function documentation before spending time generating any missing columns.
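Before running any analysis, it can help to confirm that the required columns are present. The following is a minimal base R sketch, not part of forestexplorR: the data frame my_mapping and its values are hypothetical, and the list of required names is copied from the paragraph above.

```r
# Toy mapping data (hypothetical values) that lacks the `tag` column
my_mapping <- data.frame(
  tree_id  = c("ST01000100001", "ST01000100002"),
  stand_id = c("ST01", "ST01"),
  species  = c("TSHE", "ABAM"),
  x_coord  = c(12.5, 48.1),
  y_coord  = c(3.2, 77.9),
  stringsAsFactors = FALSE
)

# Column names listed as required in this vignette
required_cols <- c("tree_id", "stand_id", "tag", "species", "x_coord", "y_coord")

# setdiff() returns the required names that are absent from the data;
# column order and any extra columns do not affect the result
missing_cols <- setdiff(required_cols, names(my_mapping))
missing_cols
```

Here the only missing column is tag, which according to the text above matters only if stand_map() will be used.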

Tree census data

The tree census dataset should be a data frame where each row represents the measurement data for a single tree during a specific census of the stand. This means each tree will appear in x rows, where x is the number of censuses in which that tree was measured. The required columns are: tree_id, stand_id, species, year and dbh. For details on what each required column represents, use ?tree. The built-in dataset tree is a cleaned and correctly formatted example:

head(tree)
#>         tree_id stand_id  tag species year  dbh mort
#> 1 AB08000100022     AB08   19    TSHE 2002 15.0    0
#> 2 AB08000100022     AB08   19    TSHE 2008 15.2    0
#> 3 AB08000100022     AB08   19    TSHE 2012 15.1    0
#> 4 AB08000100022     AB08   19    TSHE 2017 15.3    0
#> 5 AB08000100001     AB08 4001    TSHE 1978 48.2    0
#> 6 AB08000100001     AB08 4001    TSHE 1984 48.2    0

Additional columns can also be included in the tree census dataset without disrupting function performance (e.g. tag in tree). A mort column containing the mortality status of each tree during each census is required to use mortality_model(). Column order is not important.
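The one-row-per-tree-per-census layout can be verified with a couple of base R calls. This is an illustrative sketch on a hypothetical data frame (my_tree), not a forestexplorR function: it counts how many censuses each tree appears in and checks that no tree has two records for the same year.

```r
# Toy census data (hypothetical values): one row per tree per census
my_tree <- data.frame(
  tree_id  = c("ST01000100001", "ST01000100001", "ST01000100001",
               "ST01000100002", "ST01000100002"),
  stand_id = "ST01",
  species  = c("TSHE", "TSHE", "TSHE", "ABAM", "ABAM"),
  year     = c(2002, 2008, 2012, 2008, 2012),
  dbh      = c(15.0, 15.2, 15.1, 30.4, 31.0),
  stringsAsFactors = FALSE
)

# Number of rows, i.e. censuses, recorded for each tree
censuses_per_tree <- table(my_tree$tree_id)
censuses_per_tree

# TRUE for any row that repeats an earlier tree_id/year combination,
# which would break the one-row-per-census layout
repeated <- duplicated(my_tree[, c("tree_id", "year")])
any(repeated)
```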

Data checking

To obtain accurate and useful neighborhood descriptions, it is important that the mapping and tree census datasets are cleaned of any unusual or missing data. The forestexplorR package contains functions to assist with the cleaning process, but these functions only highlight cases of missing data; they do not automatically remove the associated observations. This is because not every case of missing data prevents every type of analysis, and some missing data can be inferred from data collection records (e.g. missing year of measurement in tree census data).
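The flag-rather-than-remove approach can be illustrated in base R. The sketch below uses a hypothetical data frame (census) and hypothetical flag column names. It does not reproduce how the package's checking functions work internally; it only shows why keeping flagged rows can be useful.

```r
# Toy census data (hypothetical values) with one missing year and one missing dbh
census <- data.frame(
  tree_id = c("ST01000100001", "ST01000100001", "ST01000100002"),
  year    = c(2002, NA, 2008),
  dbh     = c(15.0, 15.2, NA),
  stringsAsFactors = FALSE
)

# Flag rather than drop: record which rows have gaps and keep every observation
census$missing_year <- is.na(census$year)
census$missing_dbh  <- is.na(census$dbh)

# Rows still usable for an analysis that needs dbh but not year
usable_for_size <- census[!census$missing_dbh, ]
nrow(usable_for_size)
```

The row with a missing year stays available for size-based summaries, and its year could later be filled in from field records.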

`mapping_check()`: Check mapping datasets

The built-in dataset messy_mapping contains examples of common data errors in mapping datasets. For instance, there are 10 tree ids that are connected to more than one mapping record:

messy_mapping %>%
  group_by(tree_id) %>%
  summarize(count = n()) %>%
  filter(count > 1)
#> # A tibble: 10 x 2
#>   tree_id       count
#>   <chr>         <int>
#> 1 AE10001100015     2
#> 2 AM16000400020     2
#> 3 AM16001200016     2
#> 4 AV02000100011     2
#> # ... with 6 more rows

The mapping_check() function checks a mapping dataset for a variety of common errors and returns a list containing two elements. The first element ($problem_trees) is a data frame containing the rows of the input mapping dataset that have issues, with an additional final column describing the issue. The arguments max_x and max_y must be provided so the function can check for x and y coordinates beyond the stand boundary.

map_issues <- mapping_check(messy_mapping, max_x = 100, max_y = 100)
#> [1] "This is version 0.2.0!"
#> [1] "Potential formatting problems detected: please review output and correct errors or remove problem trees if necessary before continuing analysis"
head(map_issues$problem_trees)
#>         tree_id stand_id  tag species year  dbh x_coord y_coord
#> 1 TO04000600004     TO04 4476    ABAM 2012 24.3   61.81   67.76
#> 2 AV02000300011     AV02 2045    ABAM 2012 34.6   31.14   90.37
#> 3 AV14001200030     AV14   34    TSHE 2013  5.6   57.20    4.48
#> 4 AV02000100011     AV02 2011    TSHE 2008  8.6    4.22   92.57
#> 5 AV02001300002     AV02 1612    ABAM 2012 23.7   58.51   49.61
#> 6 AM16000400020     AM16 3528    ABAM 2012 37.5   37.62   56.15
#>                issue
#> 1 duplicated tree_id
#> 2 duplicated tree_id
#> 3 duplicated tree_id
#> 4 duplicated tree_id
#> 5 duplicated tree_id
#> 6 duplicated tree_id

The second element ($issue_summary) is a data frame summarizing the number and percentage of trees in the mapping dataset that have at least one problem, followed by the number and percentage affected by each specific type of problem.

head(map_issues$issue_summary)
#>                      issue count       pct
#> 1       at least one issue    71 0.7746018
#> 2 coordinates out of range    26 0.2836570
#> 3       duplicated tree_id    10 0.1090988
#> 4      missing coordinates    15 0.1636483
#> 5          missing species    10 0.1090988
#> 6         missing stand_id    10 0.1090988
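To make the issue types above more concrete, here is a by-hand sketch of three of them in base R. It is not the implementation of mapping_check(), whose internal logic may differ: the toy data are hypothetical, and the lower coordinate bound of 0 is an assumption made for this example.

```r
# Toy mapping data (hypothetical values)
toy_map <- data.frame(
  tree_id = c("ST01000100001", "ST01000100002", "ST01000100002",
              "ST01000100003", "ST01000100004"),
  x_coord = c(10.2, 55.0, 56.1, 104.3, NA),
  y_coord = c(20.5, 60.3, 61.0, 40.0, 12.0),
  stringsAsFactors = FALSE
)
max_x <- 100
max_y <- 100

# Duplicated tree_id: flag every row whose id occurs more than once
dup_ids <- unique(toy_map$tree_id[duplicated(toy_map$tree_id)])
is_dup  <- toy_map$tree_id %in% dup_ids

# Missing coordinates
is_missing <- is.na(toy_map$x_coord) | is.na(toy_map$y_coord)

# Coordinates out of range (ASSUMPTION: valid values run from 0 to max_x / max_y)
is_out <- !is_missing &
  (toy_map$x_coord < 0 | toy_map$x_coord > max_x |
   toy_map$y_coord < 0 | toy_map$y_coord > max_y)

# Percentage of unique trees with at least one issue
flagged     <- unique(toy_map$tree_id[is_dup | is_missing | is_out])
pct_flagged <- 100 * length(flagged) / length(unique(toy_map$tree_id))
pct_flagged
```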

`tree_check()`: Check tree census datasets

The built-in dataset messy_tree contains examples of common data errors in tree census datasets. For instance, there are 10 tree measurement records that have no dbh information:

messy_tree %>%
  filter(is.na(dbh))
#>          tree_id stand_id  tag species year dbh mort
#> 1  AB08000200006     AB08 4023    TSHE 1990  NA    0
#> 2  AB08000200008     AB08 4025    TSHE 2012  NA    1
#> 3  AB08000300029     AB08 8671    ABAM 2012  NA    0
#> 4  AB08000600020     AB08 4107    THPL 1984  NA    0
#> 5  AB08000700016     AB08 4130    THPL 2002  NA    0
#> 6  AB08001500015     AB08 4176    TSHE 2012  NA    0
#> 7  AB08001600003     AB08 4191    TSHE 1990  NA    0
#> 8  AB08001600008     AB08 4196    TSHE 1995  NA    0
#> 9  PP17000600019     PP17 9620    PSME 2013  NA    0
#> 10 PP17001100021     PP17    4    PIMO 2017  NA    0

The tree_check() function checks a tree census dataset for a variety of common errors. A mapping dataset needs to be supplied to the function so that trees in the tree census data that have no associated mapping record can be identified. tree_check() returns a list containing two elements. The first element ($problem_trees) is a data frame listing the tree ids that were flagged as having a data issue, with a description of each issue.

tree_issues <- tree_check(tree_data = messy_tree, map_data = mapping)
#> [1] "Potential formatting problems detected: please review output and correct errors or remove problem trees if necessary before continuing analysis"
head(tree_issues$problem_trees)
#> # A tibble: 6 x 2
#>   tree_id       issue                   
#>   <chr>         <chr>                   
#> 1 AB08000100012 refers to multiple trees
#> 2 AB08000300005 refers to multiple trees
#> 3 AB08000300020 refers to multiple trees
#> 4 AB08000300024 refers to multiple trees
#> # ... with 2 more rows

The second element ($issue_summary) is a data frame summarizing the number and percentage of trees in the tree census dataset that have at least one problem, followed by the number and percentage affected by each specific type of problem. Note that a tree id will be flagged if just one of its measurement records contains an issue, so many flagged trees are likely to be usable for most analyses.

head(tree_issues$issue_summary)
#>                      issue count       pct
#> 1       at least one issue    59 5.7392996
#> 2              missing dbh    10 0.9727626
#> 3 missing measurement year    10 0.9727626
#> 4 missing mortality record    20 1.9455253
#> 5       missing species id    10 0.9727626
#> 6         missing stand id    10 0.9727626
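The reason tree_check() needs a mapping dataset can be shown with a small base R sketch. This is an illustration on hypothetical data (toy_mapping and toy_census), not the function's actual implementation: it finds tree ids present in the census data but absent from the mapping data, then drops their records as one possible way of handling them.

```r
# Toy mapping and census data (hypothetical values)
toy_mapping <- data.frame(
  tree_id = c("ST01000100001", "ST01000100002"),
  x_coord = c(10.2, 55.0),
  y_coord = c(20.5, 60.3),
  stringsAsFactors = FALSE
)
toy_census <- data.frame(
  tree_id = c("ST01000100001", "ST01000100001",
              "ST01000100002", "ST01000100009"),
  year    = c(2008, 2012, 2012, 2012),
  dbh     = c(15.2, 15.1, 31.0, 8.4),
  stringsAsFactors = FALSE
)

# Tree ids measured in a census but never mapped
unmapped_ids <- setdiff(unique(toy_census$tree_id), toy_mapping$tree_id)
unmapped_ids

# One option: keep only census records whose tree has a mapped location
census_mapped <- toy_census[!toy_census$tree_id %in% unmapped_ids, ]
nrow(census_mapped)
```

Whether to remove such trees or track down their coordinates depends on the analysis; as noted above, many flagged trees remain usable.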