This vignette describes how mapped forest stand datasets should be formatted to ensure full compatibility with forestexplorR and introduces some data checking functions to assist with the formatting process.
The forestexplorR package expects raw mapped forest stand data to consist of at least two distinct files:
Mapping data - coordinate locations of individual trees within stands
Tree census data - measurements of individual trees taken during repeated censuses of the stands
The mapping dataset should be a data frame where each row represents a unique tree and contains its identifying and location information. The built-in dataset mapping is a cleaned and correctly formatted example:
head(mapping)
#> tree_id stand_id tag species year dbh x_coord y_coord
#> 1 AB08000100022 AB08 19 TSHE 2012 15.1 18.11 95.68
#> 2 AB08000100001 AB08 4001 TSHE 2012 50.4 11.78 96.77
#> 3 AB08000100002 AB08 4002 TSHE 2012 81.1 12.47 97.95
#> 4 AB08000100003 AB08 4003 TSHE 2012 20.0 14.94 91.91
#> 5 AB08000100004 AB08 4004 TSHE 2012 53.7 22.09 99.06
#> 6 AB08000100005 AB08 4005 TSHE 2012 24.4 12.16 91.48
To ensure full compatibility with forestexplorR functions, the mapping dataset should contain the columns tree_id, stand_id, tag, species, x_coord, and y_coord. Additional columns can be included without disrupting function performance, and column order is not important. For details on what each required column represents, use ?mapping. If only a few forestexplorR functions will be applied, some of these columns may not be required (e.g. tag is only required for stand_map()), so please read the individual function documentation before spending time generating any missing columns.
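Before running any forestexplorR function, it can help to confirm that the required columns are present. The following is a minimal base R sketch, not part of forestexplorR: the helper name missing_columns() and the toy data frame are hypothetical and only illustrate the column requirements listed above.

```r
# Hypothetical helper (not a forestexplorR function): returns the names of
# required columns that are absent from a data frame.
missing_columns <- function(data, required) {
  setdiff(required, names(data))
}

# Columns this vignette lists as required for a mapping dataset
mapping_cols <- c("tree_id", "stand_id", "tag", "species", "x_coord", "y_coord")

# Toy mapping data with the tag column deliberately left out
toy_mapping <- data.frame(
  tree_id  = c("AB08000100001", "AB08000100002"),
  stand_id = c("AB08", "AB08"),
  species  = c("TSHE", "TSHE"),
  x_coord  = c(11.78, 12.47),
  y_coord  = c(96.77, 97.95)
)

missing_columns(toy_mapping, mapping_cols)
#> [1] "tag"
```

Because column order does not matter, a set difference on column names is sufficient here; whether a missing column actually matters depends on which functions will be used.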
The tree census dataset should be a data frame where each row represents the measurement data for a single tree during a specific census of the stand. This means each tree will appear on x rows, where x is the number of censuses in which that tree was measured. The required columns are tree_id, stand_id, species, year, and dbh. For details on what each required column represents, use ?tree. The built-in dataset tree is a cleaned and correctly formatted example:
head(tree)
#> tree_id stand_id tag species year dbh mort
#> 1 AB08000100022 AB08 19 TSHE 2002 15.0 0
#> 2 AB08000100022 AB08 19 TSHE 2008 15.2 0
#> 3 AB08000100022 AB08 19 TSHE 2012 15.1 0
#> 4 AB08000100022 AB08 19 TSHE 2017 15.3 0
#> 5 AB08000100001 AB08 4001 TSHE 1978 48.2 0
#> 6 AB08000100001 AB08 4001 TSHE 1984 48.2 0
Additional columns can be included in the tree census dataset without disrupting function performance (e.g. tag in tree). A mort column containing the mortality status of each tree during each census is required to use mortality_model(). Column order is not important.
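The long format described above (one row per tree per census) can be inspected with a few base R calls. This is a rough sketch on hypothetical toy data, not forestexplorR code. The uniqueness check assumes that each census falls in a distinct year, which may not hold for every dataset.

```r
# Toy census data in long format: one row per tree per census
toy_tree <- data.frame(
  tree_id  = c("AB08000100022", "AB08000100022", "AB08000100022", "AB08000100001"),
  stand_id = "AB08",
  species  = "TSHE",
  year     = c(2002, 2008, 2012, 1978),
  dbh      = c(15.0, 15.2, 15.1, 48.2)
)

# Number of censuses in which each tree was measured (the "x rows" per tree)
censuses_per_tree <- table(toy_tree$tree_id)

# Assumption: one census per year, so a tree_id/year pair should not repeat
duplicated_census <- any(duplicated(toy_tree[, c("tree_id", "year")]))

# mort is only needed if mortality_model() will be used
has_mort <- "mort" %in% names(toy_tree)
```

Here censuses_per_tree reports three records for the first tree and one for the second, and has_mort is FALSE because the toy data omits the optional mort column.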
To obtain accurate and useful neighborhood descriptions it is important that the mapping and tree census datasets are cleaned of any unusual or missing data. The forestexplorR package contains functions to assist with the cleaning process but these functions only highlight cases of missing data rather than automatically removing the associated observations. This is because not all cases of missing data prevent all types of analysis and some missing data can be inferred from data collection records (e.g. missing year of measurement in tree census data).
mapping_check(): Check mapping datasets
The built-in dataset messy_mapping contains examples of common data errors in mapping datasets. For instance, there are 10 tree ids that are connected to more than one mapping record:
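library(dplyr)  # provides %>%, group_by(), summarize() and filter()
# count mapping records per tree_id and keep the ids that occur more than once
messy_mapping %>%
  group_by(tree_id) %>%
  summarize(count = n()) %>%
  filter(count > 1)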
#> # A tibble: 10 x 2
#> tree_id count
#> <chr> <int>
#> 1 AE10001100015 2
#> 2 AM16000400020 2
#> 3 AM16001200016 2
#> 4 AV02000100011 2
#> # ... with 6 more rows
The mapping_check() function checks a mapping dataset for a variety of common errors and returns a list containing two elements. The first element ($problem_trees) is a data frame containing the rows of the input mapping dataset that contain issues, with an additional final column describing the issue. The arguments max_x and max_y must be provided so the function can check for x and y coordinates beyond the stand boundary.
map_issues <- mapping_check(messy_mapping, max_x = 100, max_y = 100)
#> [1] "This is version 0.2.0!"
#> [1] "Potential formatting problems detected: please review output and correct errors or remove problem trees if necessary before continuing analysis"
head(map_issues$problem_trees)
#> tree_id stand_id tag species year dbh x_coord y_coord
#> 1 TO04000600004 TO04 4476 ABAM 2012 24.3 61.81 67.76
#> 2 AV02000300011 AV02 2045 ABAM 2012 34.6 31.14 90.37
#> 3 AV14001200030 AV14 34 TSHE 2013 5.6 57.20 4.48
#> 4 AV02000100011 AV02 2011 TSHE 2008 8.6 4.22 92.57
#> 5 AV02001300002 AV02 1612 ABAM 2012 23.7 58.51 49.61
#> 6 AM16000400020 AM16 3528 ABAM 2012 37.5 37.62 56.15
#> issue
#> 1 duplicated tree_id
#> 2 duplicated tree_id
#> 3 duplicated tree_id
#> 4 duplicated tree_id
#> 5 duplicated tree_id
#> 6 duplicated tree_id
The second element ($issue_summary) is a data frame summarizing the number and percentage of trees in the mapping dataset that have one or more problems and each specific type of problem.
head(map_issues$issue_summary)
#> issue count pct
#> 1 at least one issue 71 0.7746018
#> 2 coordinates out of range 26 0.2836570
#> 3 duplicated tree_id 10 0.1090988
#> 4 missing coordinates 15 0.1636483
#> 5 missing species 10 0.1090988
#> 6 missing stand_id 10 0.1090988
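Since mapping_check() only highlights problems, any removal of flagged trees is left to the user. One way this could be done in base R is sketched below. The data frames are hypothetical stand-ins: toy_problems only mimics the tree_id and issue columns of $problem_trees and is not real package output.

```r
# Toy mapping data and a toy stand-in for map_issues$problem_trees
toy_mapping <- data.frame(
  tree_id = c("T1", "T2", "T2", "T3", "T4"),
  x_coord = c(10, 20, 25, 130, 40),
  y_coord = c(15, 30, 35, 50, 60)
)
toy_problems <- data.frame(
  tree_id = c("T2", "T2", "T3"),
  issue   = c("duplicated tree_id", "duplicated tree_id",
              "coordinates out of range")
)

# Keep only the mapping rows whose tree_id was never flagged
clean_mapping <- toy_mapping[!toy_mapping$tree_id %in% toy_problems$tree_id, ]
as.character(clean_mapping$tree_id)
#> [1] "T1" "T4"
```

Dropping every flagged tree is the most conservative option. Where an error can be corrected from field records, fixing the row and rerunning the check keeps more data.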
tree_check(): Check tree census datasets
The built-in dataset messy_tree contains examples of common data errors in tree census datasets. For instance, there are 10 tree measurement records that have no dbh information:
messy_tree %>%
filter(is.na(dbh))
#> tree_id stand_id tag species year dbh mort
#> 1 AB08000200006 AB08 4023 TSHE 1990 NA 0
#> 2 AB08000200008 AB08 4025 TSHE 2012 NA 1
#> 3 AB08000300029 AB08 8671 ABAM 2012 NA 0
#> 4 AB08000600020 AB08 4107 THPL 1984 NA 0
#> 5 AB08000700016 AB08 4130 THPL 2002 NA 0
#> 6 AB08001500015 AB08 4176 TSHE 2012 NA 0
#> 7 AB08001600003 AB08 4191 TSHE 1990 NA 0
#> 8 AB08001600008 AB08 4196 TSHE 1995 NA 0
#> 9 PP17000600019 PP17 9620 PSME 2013 NA 0
#> 10 PP17001100021 PP17 4 PIMO 2017 NA 0
The tree_check() function checks a tree census dataset for a variety of common errors. A mapping dataset needs to be supplied to the function so that trees in the tree census data that have no associated mapping record can be identified. tree_check() returns a list containing two elements. The first element ($problem_trees) is a data frame listing the tree ids that were flagged as having a data issue, along with a description of the issue.
tree_issues <- tree_check(tree_data = messy_tree, map_data = mapping)
#> [1] "Potential formatting problems detected: please review output and correct errors or remove problem trees if necessary before continuing analysis"
head(tree_issues$problem_trees)
#> # A tibble: 6 x 2
#> tree_id issue
#> <chr> <chr>
#> 1 AB08000100012 refers to multiple trees
#> 2 AB08000300005 refers to multiple trees
#> 3 AB08000300020 refers to multiple trees
#> 4 AB08000300024 refers to multiple trees
#> # ... with 2 more rows
The second element ($issue_summary) is a data frame summarizing the number and percentage of trees in the tree census dataset that have one or more problems and each specific type of problem. Note that a tree id will be flagged if just one of its measurement records contains an issue, so many flagged trees are likely to be usable for most analyses.
head(tree_issues$issue_summary)
#> issue count pct
#> 1 at least one issue 59 5.7392996
#> 2 missing dbh 10 0.9727626
#> 3 missing measurement year 10 0.9727626
#> 4 missing mortality record 20 1.9455253
#> 5 missing species id 10 0.9727626
#> 6 missing stand id 10 0.9727626
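Because a tree id is flagged when only one of its records has an issue, removing whole trees can discard more data than necessary. The sketch below contrasts the two options for the missing dbh case, using hypothetical toy data in base R; which option is appropriate depends on the analysis.

```r
# Toy census data: tree A has one record with missing dbh
toy_tree <- data.frame(
  tree_id = c("A", "A", "A", "B", "B"),
  year    = c(2002, 2008, 2012, 2002, 2008),
  dbh     = c(15.0, NA, 15.1, 48.2, 48.6)
)

# Trees with at least one missing dbh record
flagged_ids <- unique(toy_tree$tree_id[is.na(toy_tree$dbh)])

# Option 1: drop every record belonging to a flagged tree
drop_trees <- toy_tree[!toy_tree$tree_id %in% flagged_ids, ]

# Option 2: drop only the incomplete records, keeping the rest of each tree's history
drop_records <- toy_tree[!is.na(toy_tree$dbh), ]

nrow(drop_trees)
#> [1] 2
nrow(drop_records)
#> [1] 4
```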