Facundo Muñoz authored
output: github_document
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
library(topomatch)
topomatch
Helper function for matching toponyms from different sources, which may be written in slightly different ways. It lets you inspect the matching and act accordingly.
countries1 <- spData::world$name_long
countries2 <- unique(maps::world.cities$country.etc)
(country_matches <- topomatch(countries1, countries2))
Some toponyms are not matched correctly and need manual fixes.
Write the fixes in a named vector, using the original toponym as the name.
If a toponym has no correct match, set its value to NA.
## Inspect the competing candidates for the unmatched countries
(bm <- best_matches(country_matches)[unmatched(country_matches)])
cnames_fixes <- setNames(
c("Congo Democratic Republic", NA, "Laos", "Korea North",
"Korea South", NA),
names(bm)
)
## Fix the incorrect matches from similarity as well
cnames_fixes <- c(
cnames_fixes,
"United States" = "USA",
"French Southern and Antarctic Lands" = "France",
"Côte d'Ivoire" = "Ivory Coast",
"United Kingdom" = "UK",
"Antarctica" = NA,
"Northern Cyprus" = "Cyprus",
"Somaliland" = "Somalia",
"South Sudan" = "Sudan"
)
Now you can transcribe the original toponyms to the matched terms.
translate <- transcribe(country_matches, fixes = cnames_fixes)
translate(c("United Kingdom", "Kosovo"))
## "Translate" all of the original toponyms
countries1_trans <- translate(countries1)
## Only those "fixed" as NA are not found in the second list
countries1[!countries1_trans %in% countries2]
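The way a named vector of fixes overrides automatic matches can be sketched in base R. This is only an illustration of the idea, not the package's implementation: `auto`, `lookup` and `translate_sketch()` are hypothetical names, and the automatic matches shown are made up for the example.

```r
# Hypothetical automatic matches: names are original toponyms,
# values are the terms they were matched to (made up for illustration).
auto <- c(
  "France"        = "France",
  "United States" = "United Arab Emirates",
  "Antarctica"    = "Antigua"
)

# Manual fixes in the same shape; NA means "no counterpart exists".
fixes <- c("United States" = "USA", "Antarctica" = NA)

# Fixes take precedence over the automatic matches.
lookup <- auto
lookup[names(fixes)] <- fixes

# Look up each original toponym; unknown names yield NA.
translate_sketch <- function(x) unname(lookup[x])

translate_sketch(c("United States", "France", "Antarctica"))
```

Because the fixes are indexed by name, their order does not matter, and an NA fix removes a wrong automatic match instead of leaving it in place.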
Method
Wraps the local-global alignment algorithm borrowed from the Bioconductor
package Biostrings. It works better than global alignment and requires
less fine-tuning (although it is considerably slower too). See
https://ro-che.info/articles/2016-12-11-local-alignment.
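To see why whole-string (global) comparison can mislead when one name is a shortened form of another, here is a rough analogue using base R's `utils::adist()`. This is not the Biostrings algorithm that the package wraps. It uses plain edit distance, where `partial = TRUE` compares the query against the best-fitting substring of each target, which is similar in spirit. The strings are toy examples.

```r
# Toy strings chosen for illustration only
query   <- "Korea"
targets <- c("Korea North", "Kenya")

# Whole-string edit distance: extra words in the target are penalised
global_dist <- utils::adist(query, targets)

# Partial matching: the query is compared with the best substring of each target
partial_dist <- utils::adist(query, targets, partial = TRUE)

targets[which.min(global_dist)]   # closest target by whole-string distance
targets[which.min(partial_dist)]  # closest target by partial distance
```

Here the whole-string distance prefers "Kenya", since "Korea North" is penalised for its extra word, while the partial distance finds "Korea" exactly inside "Korea North".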
Installation
remotes::install_gitlab("umr-astre/topomatch", host = "forgemia.inra.fr")