GNverifier release v0.6.2, Advanced Search

GNverifier can help to answer the following questions:

  • Is a name-string a real name?
  • Is it spelled correctly, and if not, what might be the correct spelling?
  • Is the name currently in use?
  • If it is a synonym, what do data sources consider to be the currently accepted name?
  • Is a name a homonym?
  • What taxon does the name point to, and where is it placed in various classifications?

The biannual GNverifier database update has been done for 14 datasets; more datasets are pending addition.

GNverifier v0.6.2 is out. It brings new features and some changes to the API and the input and output formats. The main change is the ability to search by name details, such as an abbreviated genus, an author, or a year.

For example, g:M. sp:galloprovincialis au:Oliv. y:-1800 searches for a name with a genus starting with M, galloprovincialis as the specific epithet, an author starting with Oliv, and a year of 1800 or earlier.
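A query string in this form is easy to take apart on the client side. The sketch below is a hypothetical helper (not GNverifier's own code) that splits such a query into its fields, using the keys from the example above:

```python
# Client-side sketch: split an advanced-search query like
# "g:M. sp:galloprovincialis au:Oliv. y:-1800" into a field dictionary.
def parse_query(query):
    fields = {}
    for token in query.split():
        key, _, value = token.partition(":")
        fields[key] = value
    return fields

print(parse_query("g:M. sp:galloprovincialis au:Oliv. y:-1800"))
# {'g': 'M.', 'sp': 'galloprovincialis', 'au': 'Oliv.', 'y': '-1800'}
```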

Both name-verification and search return results in the same format. Because of that, we needed to rename some fields in the output so that their meaning corresponds to both verification and search:

  • inputId changed to id
  • input changed to name
  • preferredResults changed to results
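A downstream client that stored records with the old field names can adapt them with a small helper. This is a hypothetical migration sketch, not part of GNverifier:

```python
# Map the renamed fields (old v1 names -> new names); all other keys pass through.
RENAMES = {"inputId": "id", "input": "name", "preferredResults": "results"}

def upgrade_record(record):
    return {RENAMES.get(key, key): value for key, value in record.items()}

old = {"inputId": "abc", "input": "Mus musculus", "matchType": "Exact"}
print(upgrade_record(old))
# {'id': 'abc', 'name': 'Mus musculus', 'matchType': 'Exact'}
```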

GNverifier’s API v1 can still be used (it did not undergo any format changes), but the web page and the command line app gnverifier moved to the new API v0. When the new API stabilizes, it will be renamed API v2.

Install GNverifier with Homebrew

brew tap gnames/gn
brew install gnverifier

or

brew upgrade gnverifier

Changes in functionality since v0.3.0

  • The web page shows the date when a name was imported into the GNverifier database.
  • To make GNverifier easier to cite, there is DOI information on its GitHub page.
  • Tab-delimited values (TSV) format is now supported.
  • AlgaeBase was added to the data sources.
  • There are now options to return all matched results (use with caution, as the output might be excessively big).
  • Score details are shown in the output and on the web page.
  • Advanced search was added to both the command line and web-based user interfaces.

Deprecation of services

GNI is the oldest version of the GN name-verification algorithms. Most of its functionality now exists in GNverifier, so the GNI website is going to be removed at the beginning of 2022.

The Scala version of GNI will also be scheduled for removal.

GNresolver will continue to run the longest. It will not be deprecated until GNverifier API v2 is released.

If you use the old systems, consider switching to GNverifier, because the older systems will eventually be deprecated and shut down.

GNparser (Go language) release 1.5.0

GNparser v1.5.0 is out. The following changes happened since 1.3.3:

v1.5.0

Courtesy of Toby Marsden (@tobymarsden), GNparser in ‘cultivars mode’ is able to parse graft-chimeras. An example: “Cytisus purpureus + Laburnum anagyroides”. Note that cultivar-specific names are not recognized outside of cultivars mode.

v1.4.2

Added support for authors with the prefix ‘ver’. An example: “Cryptopleura farlowiana (J.Agardh) ver Steeg & Jossly”.

v1.4.1

Fixed parsing of multinomials where the authorship is not separated by a space. An example: “Paeonia daurica coriifolia(Rupr.) D.Y.Hong”.

v1.4.0

Added support for output in tab-separated values (TSV) format. Quite often, TSV is much easier to parse than CSV: the tab character is much less common inside scientific names than the comma, so simply splitting a row by \t correctly separates it into its fields in most cases. It is still recommended to use the CSV libraries of any given language to avoid unexpected problems.
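The point can be illustrated in a few lines of Python (the row content here is invented for the example):

```python
import csv
import io

row = "085e38af\tPlantago major\t2\tPlantago maior\n"

# A naive split usually works, because tabs almost never occur inside
# scientific names ...
naive = row.rstrip("\n").split("\t")
print(naive)  # ['085e38af', 'Plantago major', '2', 'Plantago maior']

# ... but a CSV library configured with a tab delimiter is still the
# safer, recommended choice.
reader = csv.reader(io.StringIO(row), delimiter="\t")
print(next(reader))
```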

Authors that contain prefixes do and de los are now parsed correctly. An example: “… de Cássia Silva do Nascimento …”

Authors with the suffix ter are now parsed correctly. An example: “Dematiocladium celtidicola Crous, M.J. Wingf. & Y. Zhang ter”.

Added support for non-ASCII apostrophes in Authors’ names. An example: “Galega officinalis (L.) L`Hèr.”.

New GNparser C-binding package for Node.js by Toby Marsden

Toby Marsden also created GNparser wrapper for Node.js.

Update of C-binding for Ruby-based parser

The new v5.3.4 biodiversity Ruby gem is released using C-binding to GNparser v1.4.2.

GNparser (Go language) release 1.3.3

GNparser v1.3.3 is out. The following changes happened since 1.3.0:

  • GNparser received a citable DOI (v1.3.1)

  • Name-exceptions that are hard to parse because they use nomenclatural or biochemical terms as specific epithets are now covered. Some examples:

Navicula bacterium
Xestia cfuscum
Bolivina prion
Bembidion satellites
Acrostichum nudum
Gnathopleustes den
  • 2-letter generic names are extended with 3 more genera (Do, Oo, Nu):
Do holotrichius (beetle)
Oo spinosum (arachnid)
Nu aakhu (annelid)
  • Known authorship prefixes are extended with 3 more prefixes, adding support for authors like:
delle Chiaje
dos Santos
ten Broeke
ten Hove
  • Parsing of names with “ms in”, like Crisia eburneodenticulata Smitt ms in Busk, 1875, is supported (normalized to Crisia eburneodenticulata Smitt ex Busk, 1875).

  • More annotation ‘stop’ words are added, fixing parsing for names like:

Crisina excavata (d'Orbigny, 1853) non (d'Orbigny, 1853)
Eulima excellens Verkrüzen fide Paetel, 1887
Porina reussi Meneghini in De Amicis, 1885 vide Neviani (1900)

Many thanks to @diatomsRcool, @KatjaSchulz and @joelnitta for feature requests and bug reports!

Clib libraries are now provided with each new release

GNparser can be incorporated via C-binding into many other languages. To make such incorporation easier, the clib files for macOS, Linux, and MS Windows are now provided with every new release.

macos-latest-clib.zip
ubuntu-latest-clib.zip
windows-latest-clib.zip

Update of C-binding for Ruby-based parser

The new v5.3.3 biodiversity Ruby gem is released using C-binding to GNparser v1.3.3.

GNparser for JavaScript

@tobymarsden incorporated GNparser C-binding into a Node.js package. He plans to release the new package for NPM.

GNparser (Go language) release 1.3.0

GNparser v1.3.0 is out. The major new functionality is the ability to recognize and parse botanical cultivar names. This ability was added to GNparser by Toby Marsden; thanks for a great patch, Toby!

In addition to the ICN nomenclatural code for botanical scientific names, there is the ICNCP nomenclatural code for cultivated plants. ICNCP supports names like:

Dahlia ‘Doris Day’
Fragaria 'Cambridge Favourite'
Rosa multiflora cv. 'Crimson Rambler'

Now, if these names are parsed as cultivars, the cultivar epithet is included in the canonical form, as in Rosa multiflora ‘Crimson Rambler’. However, such an addition would create problems for users who are more interested in the canonical form according to the ICN: Rosa multiflora. Therefore, by default GNparser processes such names according to the ICN code and provides a warning:

{
  "quality": 2,
  "warning": "Cultivar epithet"
}

If a user does need to treat such names as cultivars, there is a flag in the command line app: gnparser "Rosa multiflora cv. 'Crimson Rambler'" -C. With this flag, the warning disappears and canonical forms include the cultivar information. The GNparser web interface now has a “cultivar” checkbox, and there is a “cultivar” option in the GNparser RESTful API.

Detailed parsed data for cultivars:

gnparser "Rosa multiflora cv. 'Crimson Rambler'" -C -d -f pretty
{
  "parsed": true,
  "quality": 1,
  "verbatim": "Rosa multiflora cv. 'Crimson Rambler'",
  "normalized": "Rosa multiflora ‘Crimson Rambler’",
  "canonical": {
    "stemmed": "Rosa multiflor ‘Crimson Rambler’",
    "simple": "Rosa multiflora ‘Crimson Rambler’",
    "full": "Rosa multiflora ‘Crimson Rambler’"
  },
  "cardinality": 3,
  "details": {
    "species": {
      "genus": "Rosa",
      "species": "multiflora",
      "cultivar": "‘Crimson Rambler’"
    }
  },
  "words": [
    {
      "verbatim": "Rosa",
      "normalized": "Rosa",
      "wordType": "GENUS",
      "start": 0,
      "end": 4
    },
    {
      "verbatim": "multiflora",
      "normalized": "multiflora",
      "wordType": "SPECIES",
      "start": 5,
      "end": 15
    },
    {
      "verbatim": "Crimson Rambler",
      "normalized": "‘Crimson Rambler’",
      "wordType": "CULTIVAR",
      "start": 21,
      "end": 36
    }
  ],
  "id": "38ff69c4-7e1a-5a26-bfc4-ee641fed6ba7",
  "parserVersion": "nightly"
}

In addition, Toby found and helped to fix problems with the stemming of hybrid formulas and with providing correct output for hybrid signs in the “details:words” section. Again, thanks for this contribution, Toby Marsden!

You can grab GNparser v1.3.0 binaries and follow installation instructions, or use Homebrew to install it on operating systems that support it:

brew tap gnames/gn
brew install gnparser

GNfinder release v0.14.1

Find scientific names in plain texts, PDF files, MS documents etc.

GNfinder is a program for finding scientific names in texts. It has existed for several years and is responsible for creating the name indices of the Biodiversity Heritage Library (BHL) and the HathiTrust Digital Library. The program is fast enough to process 5 million BHL pages in just a couple of hours.

The new GNfinder v0.14.1 can find names not only in plain UTF-8-encoded texts but also in a large variety of files, including PDF, MS Word, MS Excel, and images. In this blog post, we describe how it can be used.

The GNfinder code follows Semantic Versioning practices, so users need to be aware that backward-incompatible changes might happen in 0.x.x versions.

Summary

GNfinder is a command line application that can also be used as a RESTful service. In the near future, it will also have a web-based user interface and will run at https://finder.globalnames.org.

The program uses heuristic and Natural Language Processing (NLP) algorithms for name finding.

Performance

For a test, we used a 4 MB PDF file that contains ~2,000 unique names. These names are mentioned in the text ~13,000 times.

Time for conversion to UTF-8-encoded plain text: 2.5 sec

Time for name-finding: 0.4 sec

Time for name-verification of the ~2,000 uniquely found names: 2.5 sec

Installation

The program consists of one stand-alone file, so it is easy to install. The binaries for MS Windows, macOS, or Linux can be downloaded from GitHub. In addition, GNfinder can be installed with the Homebrew package manager using the following terminal commands:

brew tap gnames/gn
brew install gnfinder

For more detailed installation instructions see the documentation on GitHub.

Usage

GNfinder is a command line application. It requires an internet connection for converting files to UTF-8-encoded text and for name-verification.

To get help use:

gnfinder -h

To get names from a UTF-8-encoded text file (with the -U flag, no internet connection is required):

gnfinder -U file-with-names.txt

To get names from any other kind of file:

gnfinder file.pdf

To find names, verify them, and output the results in JSON format:

gnfinder file.pdf -v -f pretty > file-names.json

To convert a PDF file into text:

gnfinder -I file.pdf > file.txt

Tutorial

I wrote a tutorial on how to extract scientific names in parallel from a large number of PDF files.
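The general idea can be sketched in a few lines of Python: run one gnfinder process per file across a pool of workers. This is an illustrative sketch, not the tutorial's code; the file names are hypothetical, and the flags are the ones shown above (it assumes gnfinder is on the PATH):

```python
import subprocess
from multiprocessing import Pool

def build_cmd(pdf_path):
    # gnfinder accepts PDFs directly; -v verifies the found names,
    # -f pretty emits formatted JSON (flags from this post).
    return ["gnfinder", pdf_path, "-v", "-f", "pretty"]

def extract(pdf_path):
    # Run gnfinder on a single file and return its JSON output.
    out = subprocess.run(build_cmd(pdf_path), capture_output=True, text=True)
    return pdf_path, out.stdout

def extract_all(pdf_paths, workers=4):
    # Process many PDFs in parallel, one gnfinder process per file.
    with Pool(workers) as pool:
        return dict(pool.imap_unordered(extract, pdf_paths))

# extract_all(["paper1.pdf", "paper2.pdf"])  # hypothetical input files
```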

For more information read the documentation on GitHub.

RESTful API

You can also run GNfinder as a RESTful API service. For now, the API works only with UTF-8-encoded texts, but other file formats will become available via the API after the web-based user interface is completed.

GNverifier release v0.3.0

Very fast scientific name checker is out.

There are millions of checklists in use by scientists and nature enthusiasts. Very often, such lists contain misspellings or outdated names. To help researchers clean up their checklists and monitor their quality, we are releasing GNverifier v0.3.0, written in the Go language.

The GNverifier code follows Semantic Versioning practices, so users need to be aware that backward-incompatible changes might happen in 0.x.x versions.

We have released several implementations of name-verification (reconciliation/resolution) before. None of them were fast enough for verifying massive lists of scientific names. This release provides 10x to 100x throughput improvement compared to the older implementations.

Summary

GNverifier can help to answer the following questions:

  • Is a name-string a real name?
  • Is it spelled correctly, and if not, what might be the correct spelling?
  • Is the name currently in use?
  • If it is a synonym, what do data sources consider to be the currently accepted name?
  • Is a name a homonym?
  • What taxon does the name point to, and where is it placed in various classifications?

Name verification and reconciliation involve several steps:

  • Exact match: the input name matches a canonical form located in one or more data sources.
  • Fuzzy match: if no exact match is found, canonical forms are matched fuzzily.
  • Partial exact match: if the previous two steps failed, we remove words from the end or from the middle of a name and try to match what is left, until we end up with a bare genus.
  • Partial fuzzy match: if the partial exact match did not work and the remaining name is not a uninomial, we apply fuzzy matching algorithms.

A scoring algorithm then sorts matched results. The “About” page contains more detailed information about matching and scoring.

Performance

We observe speeds of ~2,500 names per second for checklists that come from an optical character recognition (OCR) process and contain many misspellings.

Usage

The simplest way to use GNverifier is via its web interface. The online application emits results in HTML, CSV, and JSON formats and can process up to 5,000 names per request.

For larger datasets, and as an alternative, there is a command line application that can be downloaded for Windows, Mac, and Linux.

gnverifier file-with-names.txt

This version adds an option, -c or --capitalize, to fix the capitalization of name-strings before verification. It is beneficial for the web interface, as it allows users “to be lazy” when they try to match names.

$ gnverifier "drsophila melanogaster" -c -f pretty
INFO[0000] Using config file: /home/dimus/.config/gnverifier.yaml.
{
  "inputId": "b20a7c40-f593-5a68-a048-0a24742b4283",
  "input": "drsophila melanogaster",
  "inputCapitalized": true,
  "matchType": "Fuzzy",
  "bestResult": {
    "dataSourceId": 1,
    "dataSourceTitleShort": "Catalogue of Life",
    "curation": "Curated",
    "recordId": "2586298",
    "localId": "69bbaee49e7c2f749ee7712f3f168920",
    "outlink": "http://www.catalogueoflife.org/annual-checklist/2019/details/species/id/69bbaee49e7c2f749ee7712f3f168920",
    "entryDate": "2020-06-15",
    "matchedName": "Drosophila melanogaster Meigen, 1830",
    "matchedCardinality": 2,
    "matchedCanonicalSimple": "Drosophila melanogaster",
    "matchedCanonicalFull": "Drosophila melanogaster",
    "currentRecordId": "2586298",
    "currentName": "Drosophila melanogaster Meigen, 1830",
    "currentCardinality": 2,
    "currentCanonicalSimple": "Drosophila melanogaster",
    "currentCanonicalFull": "Drosophila melanogaster",
    "isSynonym": false,
    "classificationPath": "Animalia|Arthropoda|Insecta|Diptera|Drosophilidae|Drosophila|Drosophila melanogaster",
    "classificationRanks": "kingdom|phylum|class|order|family|genus|species",
    "classificationIds": "3939792|3940206|3940214|3946159|3946225|4031785|2586298",
    "editDistance": 1,
    "stemEditDistance": 1,
    "matchType": "Fuzzy"
  },
  "dataSourcesNum": 28,
  "curation": "Curated"
}

It is possible to map a checklist to one of 100+ data sources aggregated in GNverifier.

The following command will match all names from file-with-names.txt against the Catalogue of Life.

gnverifier file-with-names.txt -s 1 -o -f pretty

It is also possible to run the web interface locally:

gnverifier -p 4000

After running the command above, the interface can be accessed in a browser at http://localhost:4000.

One can find a complete list of gnverifier options by running:

gnverifier -h

Application Programming Interface (API)

GNverifier does not keep all the data needed for processing name-strings locally. It uses a remote API located at https://verifier.globalnames.org/api/v1.

The RESTful API is public. It has an OpenAPI description and is available for external scripts.

Deprecation of old systems

There are several older approaches that solve the same problem.

If you use any of these, consider switching to GNverifier because older systems will eventually be deprecated and stopped.

GNparser (Go language) release 1.2.0

Version 1.2.0 of GNparser is out. It adds an option to parse lowercase names in case a checklist does not follow nomenclatural standards.

$ gnparser "plantago major" --capitalize
Id,Verbatim,Cardinality,CanonicalStem,CanonicalSimple,CanonicalFull,Authorship,Year,Quality
085e38af-e19b-56e5-9fec-5d81a467a656,plantago major,2,Plantago maior,Plantago major,Plantago major,,,4

Capitalization is not applied to named hybrids:

$ gnparser "xAus bus" -c
Id,Verbatim,Cardinality,CanonicalStem,CanonicalSimple,CanonicalFull,Authorship,Year,Quality
9b24b828-88a6-58b7-ac76-1342c8ac135d,xAus bus,2,Aus bus,Aus bus,× Aus bus,,,3

GNparser assigns Quality=4 (the worst) and issues a warning.

$ gnparser "plantago major" -c -f pretty
{
  "parsed": true,
  "quality": 4,
  "qualityWarnings": [
    {
      "quality": 4,
      "warning": "Name starts with low-case character"
    }
  ],
  "verbatim": "plantago major",
  "normalized": "Plantago major",
  "canonical": {
    "stemmed": "Plantago maior",
    "simple": "Plantago major",
    "full": "Plantago major"
  },
  "cardinality": 2,
  "id": "085e38af-e19b-56e5-9fec-5d81a467a656",
  "parserVersion": "nightly"
}

GNparser (Go language) release 1.1.0

Scientific name parsing makes it possible to determine the canonical form and the authorship of a name, and to receive other meta-information. Canonical forms are crucial for comparing names from different data sources.
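Why canonical forms matter can be shown with a toy example. The canonical strings below are written by hand for illustration (they are what a parser like GNparser would emit); two sources that spell the same name differently can then be joined on the canonical form:

```python
# Two hypothetical sources mapping verbatim name-strings to canonical forms.
source_a = {"Homo sapiens Linnaeus, 1758": "Homo sapiens"}
source_b = {"Homo sapiens L.": "Homo sapiens"}

# The verbatim strings differ, but the canonical forms intersect.
shared = set(source_a.values()) & set(source_b.values())
print(shared)  # {'Homo sapiens'}
```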

We are releasing GNparser v1.1.0, written in the Go language. We support Semantic Versioning; therefore, this is a stable version. The output format, functions, and settings are going to be backward compatible for many years (until v2).

This is the 3rd implementation of name-parsing for the Global Names Architecture project. The first one, the biodiversity gem written in Ruby, now uses the Go code of GNparser. The second one, written in Scala, is archived and awaits a new maintainer.

Summary

GNparser is sophisticated software, able to parse the most complex scientific names. It is also very fast, parsing more than 200 million names per hour. The parser is a core component of many other Global Names Architecture projects.

It can be used in several ways.

We also provide a C-binding to its code. This approach allows GNparser to be incorporated natively into all languages that support C-binding (such as Java, Python, Ruby, etc.).

Improvements since the last Scala-based release of GNparser

  • Speed — about 2 times faster than the Scala-based version for CSV output, and about 8 times faster for JSON output.

  • Issue #27 — support for agamosp. agamossp. agamovar. ranks.
  • Issue #28 — support for non-ASCII apostrophes.
  • Issue #36 — support _ as a space for files in Newick format.
  • Issue #40 — support names where one of parentheses is missing.
  • Issue #43 — support for notho- (hybrid) ranks.
  • Issue #45 — support for natio rank.
  • Issue #46 — support for subg. rank.
  • Issue #48 — improve transliteration of diacritical characters.
  • Issue #49 — support for outdated names with several hyphens in specific epithet.
  • Issue #51 — distinguish between Aus (Bus) cus in botany and zoology (author or subgenus).
  • Issue #52 — support hyphen in outdated genus names.
  • Issue #57 — warn when f. might mean either filius or forma.
  • Issue #58 — distinguish between Aus (Bus) in ICN and ICZN (author or subgenus).
  • Issue #63 — normalize format to f. instead of fm..
  • Issue #60 — allow outdated ranks in form of Greek letters.
  • Issue #61 — support authors’ names with bis suffix.
  • Issue #66 — remove HTML tags from names, unless asked otherwise.
  • Issue #67 — add name’s authorship to the “root” of JSON structure.
  • Issue #68 — provide stemmed canonical form.
  • Issue #69 — provide shared C library to bind GNparser to other languages.
  • Issue #72 — parse surrogate names from BOLD project.
  • Issue #75 — normalize subspecies to subsp.
  • Issue #74 — support CSV output.
  • Issue #78 — parse virus-like non-virus names correctly.
  • Issue #79 — make CSV as a default output.
  • Issue #80 — add cardinality to output.
  • Issue #81 — support year ranges like ‘1778/79’.
  • Issue #82 — parse authors with prefix zu.
  • Issue #89 — allow subspec. as a rank.
  • Issue #90 — allow ß in names.
  • Issue #93 — parse y from Spanish papers as an author separator.
  • Issue #127 — release a stable 1.0.0 version.
  • Issue #162 — support bacterial Candidatus names.

gnfinder release 0.9.1 -- bug fixes

We are releasing a bug-fixing version (v0.9.1) of gnfinder, a project written in Go that provides the ability to search for scientific names in plain UTF-8-encoded texts. It now has a better version and build-timestamp report, its blacklist dictionary is expanded by 5 more words, and a bug is fixed that broke client programs when name-verification returned an error instead of a result. We also started experimenting with making gnfinder available to other languages through a C-shared library.

More can be found in gnfinder’s CHANGELOG file.

The gnfinder code is used to create the scientific name indices of the Biodiversity Heritage Library and the HathiTrust Digital Library, and serves as the engine for GNRD.

GNRD release (v0.9.0) switched to gnfinder from TaxonFinder/NetiNeti and became 25 times faster.

A big change came to GNRD, a program David Shorthouse and Dmitry Mozzherin released back in 2012. GNRD is a web application that finds scientific names in UTF-8-encoded plain texts, PDFs, MS Word and MS Excel documents, and even images.

For a long time it used two name-finding libraries: TaxonFinder (developed by Patrick Leary) and NetiNeti (developed by Lakshmi Manohar Akella). Both projects served us well all these years, using complementary heuristic and natural language processing algorithms. The Biodiversity Heritage Library, BioStor, and many others used GNRD successfully for detecting scientific names for many years. However, its speed for large-scale name-finding was not satisfactory. To make large-scale name-detection possible, we developed gnfinder, which also uses both heuristic and NLP algorithms. With this new release of GNRD, we substitute the TaxonFinder and NetiNeti engines with gnfinder.

We tried hard to keep the API as close as possible to how it was before; however, there are a few changes, especially in the name-verification (reconciliation and resolution) part. This change made both name-finding and name-verification much faster, with increased quality. For example, it used to take 15 seconds to find names in a 1,000-page biological book. Now it takes only 0.5 seconds. GNRD tries to catch names with OCR errors as well; as a result, you might get false positives. We recommend using the name-verification option to weed out such false results.

If you need to cite GNRD in a paper, v0.9.0 has a DOI attached: 10.5281/zenodo.3569619

gnparser release (v0.12.0) can be used in most modern languages

A few days ago we released v0.12.0 of gnparser (Go version). This version made it possible to compile gnparser algorithms into a C-compatible library. Such a library makes it possible to use gnparser at its native speed in any language that supports binding to C. Such languages include Python, Ruby, Java (via JNI), Rust, C, C++, and many others.

We have already updated Ruby’s biodiversity parser gem to benefit from the dramatic speed increase and parsing quality of gnparser.

Here are quick benchmarks that compare how biodiversity performed before and now:

Program        Version   Full/Simple   Names/min
gnparser       0.12.0    Simple        3,000,000
biodiversity   4.0.1     Simple        2,000,000
biodiversity   4.0.1     Full JSON       800,000
biodiversity   3.5.1     n/a              40,000

With this improved speed Encyclopedia of Life, which is written in Ruby, can process all their names using Ruby in less than 15 minutes.

The README file of gnparser contains instructions on how to build such a C-shared library, and the biodiversity code is a good example of connecting the library to other languages.

gnparser release v0.3.3

We are happy to announce the release of gnparser. Changes in the v0.3.3 release:

  • optionally show the canonical name UUID
scala> fromString("Homo sapiens").render(compact=false, showCanonicalUuid=true)
res0: String = 
// ...
  "canonical_name" : {
    "id" : "16f235a0-e4a3-529c-9b83-bd15fe722110",
    "value" : "Homo sapiens"
  },
// ...
  • add a year range to the AST node, encoded with the Year field rangeEnd: Option[CapturePosition]

  • parse names ending on hybrid sign (#88)

  • support hybrid abbreviation expansions (#310)

  • support raw hybrid formula parsing (#311)

  • continuous build is moved to CircleCI

  • and many structural changes, bug-fixes and quality improvements. They are described in the release documentation.

Scala-based gnparser v.0.3.0

To avoid confusion – gnparser is a new project, different from the formerly released biodiversity parser.

We are happy to announce the second public release of Scala-based Global Names Parser or gnparser. There are many significant changes in the v. 0.3.0 release.

  • Speed improvements. The parser is about 50% faster than the already quite fast 0.2.0 version. We were able to parse 30 million names per CPU per hour with this release.

  • Compatibility with Scala 2.10.6: it was important for us to make the parser backward compatible with this older version of Scala, because we wanted to support the Spark project.

  • Compatibility with Spark v. 1.6.1. Now the parser can be used in BigData projects running on Spark, massively parallelizing the parsing process on the Spark platform. We added documentation describing how to use the parser with either Scala or Python natively on Spark.

  • Simplified parsing output in addition to the “standard output”: it analyzes a name-string and returns its id, canonical form, canonical form with infraspecific ranks, authorship, and year.

  • Improved and stabilized JSON fields. You can find a complete description of the parser’s JSON output in its JSON schema. We based the field names on TDWG’s Taxon Concept Schema, and we intend to keep the JSON format stable from now on.

  • There were many structural changes, bug-fixes and quality improvements. They are described in the release documentation.

Scala-based gnparser v.0.2.0

(Please note that gnparser is a new project, different from the formerly released biodiversity parser.)

We are happy to announce a public release of the Global Names Parser, or gnparser – the first project that marks the transition of Global Names reconciliation and resolution services from “prototype” to “production”. The gnparser project is developed by @alexander-myltsev and @dimus in the Scala language. GNParser can be used as a library, a command line tool, a socket server, a web program, and a RESTful service. It is easiest to try at parser.globalnames.org.

Scientific names might be expressed by quite different name strings. Sometimes the difference is just one comma, sometimes authors are included or excluded, sometimes ranks are omitted. With all this variability “in the wild” we need to figure out how to group all these different spelling variants. Name parsing is an unexpectedly complex and absolutely necessary step for connecting biological information via scientific names.

In 2008, Global Names released the biodiversity gem – a scientific name parser written in Ruby for these purposes. The library, in its 3 variants, enjoyed significant success: about 150,000 downloads and a reputation as the most popular bio-library for the Ruby language. It allowed parsing of about 2-3 million names an hour and has been the basis of name reconciliation for many projects since its publication.

GNParser is a direct descendant of the biodiversity gem. It serves the same purpose, and the input/output formats of the two projects are similar. It also marks the eventual discontinuation of the biodiversity gem project and the migration of all Global Names code to the new gnparser library.

Why did we go through the pain of making a completely new parser from scratch? The short answer is scalability and portability. We want the parsing step to stop being a bottleneck for any number of names thrown at resolution services. For example, finding all names in the Biodiversity Heritage Library took us 43 days three years ago; the parsing step alone took more than a day. If we want to improve the algorithms for finding names in BHL, we cannot wait 40 days. We want to be able to do it within one day and improve the whole BHL index every time our algorithms are enhanced significantly.

We have an ambitious goal: the time spent sending names to resolution services over the internet, and then transferring the answers back, should become the only bottlenecks of our name matching services. For such speeds, we need very fast parsing. Scala allows us to dramatically improve the speed and scalability of the parsing step.

Having a parser running in the Java Virtual Machine environment allows us to give the biodiversity community a much more portable parsing tool. Out of the box, the parser library works with Scala, Java, R, Jython, and JRuby directly. We hope that it will speed up and simplify many biodiversity projects.

This is the first public release of the library. Please download it, run it, test it and give us your feedback, so we can improve it further. Happy parsing :smile:

WARNING: JSON output format might change slightly or dramatically, as we are in the process of refining it. The JSON format should be finalized for version 0.3.0

Sysopia release v0.5.0

Last week was the end of the Google Summer of Code season. Of the two projects we had been mentoring, one was not really about biology. It was a project for system administrators: a visualization tool for statistics about CPU usage, memory, disk space, etc.

Sysopia

Everybody who runs involved biodiversity informatics projects knows how important it is to monitor your systems. There are several open source tools for that: Nagios, Sensu, Graphite, Systemd, Collectd…

Our monitoring system of choice is Sensu. It is a very flexible and powerful tool, well designed and suitable for a large number of tasks. One of these is collecting statistics from computers and storing them in almost any kind of database. As a result, Sensu can be used for monitoring critical events and for collecting data about systems. The question, however, is how to visualize all the collected data.

We designed Sysopia to do exactly that. During the summer, @vpowerrc expanded the original prototype and created a powerful and flexible visualization tool that gives a system administrator an understanding of what is happening with 2-20 computers at a glance, receives live updates, and compares today’s statistics with up to one year of data. We already use Sysopia in production, and we are going to deploy it for Global Names as soon as our new computers are in place.

You can read more about Sysopia on its help page.

GN Parser v.3.4.1

New version 3.4.1 of the GlobalNames Parser gem biodiversity is out. It adds the ability to parse authors’ names starting with d', like Cirsium creticum d'Urv. subsp. creticum, which is now parsed correctly.

GN Parser v.3.4.0

New version 3.4.0 of the GlobalNames Parser gem biodiversity is out. It adds a new method that allows adding infraspecific ranks to canonical forms after the fact.

It was already possible to include ranks in canonical forms using the following code:

require "biodiversity"
parser = ScientificNameParser.new(canonical_with_rank: true)
parsed = parser.parse("Carex scirpoidea subsp. convoluta (Kük.)")
parsed[:scientificName][:canonical]
#output: Carex scirpoidea subsp. convoluta

Now it is also possible to add ranks to canonical forms after the fact, using the static method ScientificNameParser.add_rank_to_canonical:

require "biodiversity"
parser = ScientificNameParser.new
parsed = parser.parse("Carex scirpoidea subsp. convoluta (Kük.)")
parsed[:scientificName][:canonical]
#output: Carex scirpoidea convoluta
ScientificNameParser.add_rank_to_canonical(parsed)
parsed[:scientificName][:canonical]
#output: Carex scirpoidea subsp. convoluta

Global Names Parser v3.2.1

A new version of the Scientific Name Parser is out.

This release introduces some backward-incompatible changes to the output.

The “verbatim” field is no longer preprocessed in any way

In previous versions we stripped spaces and newline characters around the name to generate the “verbatim” field. Now the name stays exactly as it was entered into the parser.

Old behavior:

“Homo sapiens “ -> …“verbatim”: “Homo sapiens”

“Homo sapiens\r\n” -> …“verbatim”: “Homo sapiens”

New behavior:

“Homo sapiens “ -> …“verbatim”: “Homo sapiens “

“Homo sapiens\r\n” -> …“verbatim”: “Homo sapiens\r\n”

Global Names UUID v5 is added to the output as “id” field

{
    "scientificName": {
        "id": "16f235a0-e4a3-529c-9b83-bd15fe722110",
        "parsed": true,
        "parser_version": "3.2.1",
        "verbatim": "Homo sapiens",
        "normalized": "Homo sapiens",
        "canonical": "Homo sapiens",
        "hybrid": false,
        "details": [{
            "genus": {
                "string": "Homo"
            },
            "species": {
                "string": "sapiens"
            }
        }],
        "parser_run": 1,
        "positions": {
            "0": ["genus", 4],
            "5": ["species", 12]
        }
    }
}

Read more about UUID v5 in another blog post

Names with underscores instead of spaces are supported

Such names are often used in representations of phylogenetic trees. The parser now substitutes underscores with spaces during the normalization phase.
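The substitution itself is trivial. Here is a minimal plain-Ruby sketch of this normalization step (`normalize_underscores` is a hypothetical helper name, not the parser's actual code):

```ruby
# Hypothetical sketch of the underscore-normalization step;
# the real parser performs this during its normalization phase.
def normalize_underscores(name)
  name.tr("_", " ")
end

puts normalize_underscores("Homo_sapiens")
# => Homo sapiens
```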

Normalized canonical forms do not have apostrophes anymore

I am removing the behavior introduced in v3.1.10, which preserved apostrophes in the normalized version of names like “Arca m’coyi Tenison-Woods”. Apostrophes are not compliant with nomenclatural codes.

New behavior:

{
    "scientificName": {
        "id": "b3a9b1a3-f73c-5333-8194-a84c6583d130",
        "parsed": true,
        "parser_version": "3.2.1",
        "verbatim": "Arca m'coyi Tenison-Woods",
        "normalized": "Arca mcoyi Tenison-Woods",
        "canonical": "Arca mcoyi",
        "hybrid": false,
        "details": [{
            "genus": {
                "string": "Arca"
            },
            "species": {
                "string": "mcoyi",
                "authorship": "Tenison-Woods",
                "basionymAuthorTeam": {
                    "authorTeam": "Tenison-Woods",
                    "author": ["Tenison-Woods"]
                }
            }
        }],
        "parser_run": 1,
        "positions": {
            "0": ["genus", 4],
            "5": ["species", 11],
            "12": ["author_word", 25]
        }
    }
}

New UUID v5 Generation Tool -- gn_uuid v0.5.0

We are releasing a new tool – gn_uuid – to simplify the creation of UUID version 5 identifiers for scientific name strings. UUID v5 has features which are particularly useful for the biodiversity community.

UUID version 5: Description

Universally unique identifiers are very popular because, for all practical purposes, they guarantee globally unique IDs without any negotiation between different entities. There are several ways UUIDs can be created:

UUID version   Uniqueness is achieved by
version 1      using the computer’s MAC address and time
version 2      like v1, plus info about the user and local domain
version 3      using an MD5 hash of a string in combination with a name space
version 4      using pseudo-random algorithms
version 5      using a SHA1 hash of a string in combination with a name space

UUID v5 is generated from the information in a string, so everyone who uses this method will generate exactly the same ID from the same string. Interested parties only need to agree on a name space; after that, no matter which programming language they use, they will be able to exchange data about a string using its identifier.

This gem already has the DNS domain “globalnames.org” defined as a name space, so generation of UUID v5 becomes simpler.
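For illustration, here is a minimal plain-Ruby sketch of the UUID v5 algorithm itself. This is not the gn_uuid API; it just follows the RFC 4122 recipe (SHA1 over name-space bytes plus the name, with version and variant bits set), deriving the Global Names name space from the DNS domain above. The result matches the id shown for “Homo sapiens” in the JSON example earlier in this post:

```ruby
require "digest/sha1"

# RFC 4122 predefined name space UUID for DNS names
DNS_NAMESPACE = "6ba7b810-9dad-11d1-80b4-00c04fd430c8"

def uuid_v5(namespace_uuid, name)
  # Convert the hexadecimal name-space UUID into its 16 raw bytes
  ns_bytes = [namespace_uuid.delete("-")].pack("H*")
  # SHA1 over name-space bytes + name; keep the first 16 bytes
  bytes = Digest::SHA1.digest(ns_bytes + name).bytes[0, 16]
  bytes[6] = (bytes[6] & 0x0f) | 0x50 # set version to 5
  bytes[8] = (bytes[8] & 0x3f) | 0x80 # set RFC 4122 variant
  hex = bytes.pack("C*").unpack("H*").first
  [hex[0, 8], hex[8, 4], hex[12, 4], hex[16, 4], hex[20, 12]].join("-")
end

# Derive the Global Names name space from the DNS domain,
# then generate the ID for a name string
gn_namespace = uuid_v5(DNS_NAMESPACE, "globalnames.org")
puts uuid_v5(gn_namespace, "Homo sapiens")
# => 16f235a0-e4a3-529c-9b83-bd15fe722110
```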

I believe UUID v5 creates very exciting opportunities for the biodiversity community. For example, if one expert annotates a string or attaches data to it, this information can be linked globally and then harvested by anybody, without any preliminary negotiation.

Quite often researchers argue that a scientific name is an identifier in its own right and there is no need for another level of indirection like a UUID. They are right that a scientific name string can be an identifier; however, scientific names have severe shortcomings in that role.

Why Scientific Names are Bad Identifiers for Computers

Scientific name strings have different lengths

More often than not, identifiers end up in databases, where they are used as a primary index to sort, connect and search data. Scientific name strings vary from 2 bytes to more than 500 bytes in length. If they are used as database keys they are inefficient: they waste a lot of space and slow down searching and sorting, because an index’s key size is usually determined by the largest key.

UUIDs always have the same, rather small size – 16 bytes. Even when UUIDs are used in their “standard” string representation, they are still reasonably small – 36 characters. Storing them in a database as a number is obviously more efficient.
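The size difference is easy to check, using the UUID from the earlier JSON example:

```ruby
uuid_string = "16f235a0-e4a3-529c-9b83-bd15fe722110"
# Strip the dashes and pack the 32 hex digits into raw bytes
binary = [uuid_string.delete("-")].pack("H*")

puts uuid_string.length # 36 characters in the "standard" string form
puts binary.bytesize    # 16 bytes when stored as a number
```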

It is hard to spot differences in name strings

It is very hard for the human eye to spot the difference between strings like these:

  • Corchoropsis tomentosa var. psilocarpa (Harms & Loes.) C.Y.Wu & Y.Tang

  • Corchoropsis tomentosa var. psilocanpa (Harms & Loes.) C.Y.Wu & Y.Tang

It is much easier with their corresponding UUIDs:

  • 5edecb2b-903f-54f1-a087-b47b3b021fcd

  • 833c664b-7d00-5c3b-97a4-98b0ab7d0a9a

Name strings come in different encodings

Currently Latin1, UTF-8 and UTF-16 are the most popular encodings used in biodiversity. If the authorship or the name itself has characters outside of the 128 characters of ASCII, identically-looking names will be quite different for computers.
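A quick Ruby illustration: the same accented string encoded two ways reads identically to a person, but the raw bytes a computer compares are different (“Vánová” here is a simplified stand-in for an accented authorship):

```ruby
# The same accented name in UTF-8 and in Latin-1 (ISO-8859-1):
# visually identical, yet the byte sequences differ.
utf8   = "Absidia macrospora Vánová"
latin1 = utf8.encode("ISO-8859-1")

puts utf8.bytesize   # 27 – each "á" takes two bytes in UTF-8
puts latin1.bytesize # 25 – "á" is a single byte in Latin-1
```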

Name strings are less stable because of their encoding

When names are moved from one database to another, or from one paper to another, they sometimes do not survive the trip. If you have spent any time looking at scientific names in electronic form, you have probably seen something like this:

  • Acacia ampliceps ? Acacia bivenosa

  • Absidia macrospora V�nov� 1968

  • Absidia sphaerosporangioides Man<acute>ka & Truszk., 1958

  • Cnemisus kaszabi Endr?di 1964

Usually names like these were submitted in a “wrong” encoding and some of their characters were misinterpreted. A UUID, on the other hand, is just a hexadecimal number, which can be moved between various encodings more safely.

Name strings might look the same in print or on screen, but be different

  • Homo sapiens

  • Homo sаpiens

These two strings might look exactly the same on a screen or printed on paper, but in reality they are different. Here are their UUIDs:

  • 16f235a0-e4a3-529c-9b83-bd15fe722110

  • 093dc7f7-5915-56a5-87de-033e20310b14

The difference is that the second name has a Cyrillic а character, which in most cases looks exactly the same as the Latin a character. When the names are printed on paper there is absolutely no way to tell the difference. The UUIDs, however, tell us that these two name strings are not the same.
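The homoglyph problem is easy to demonstrate in Ruby:

```ruby
# Two visually identical strings: the second uses U+0430,
# the Cyrillic small letter а, instead of the Latin "a".
latin    = "Homo sapiens"
cyrillic = "Homo s\u0430piens"

puts latin == cyrillic # false – the strings differ
puts latin.bytesize    # 12
puts cyrillic.bytesize # 13 – the Cyrillic а takes two bytes in UTF-8
```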

Nothing prevents the continued use of name strings for human interaction

One argument people often give is that it is much easier for users to type

http://biosite.org/Parus_major

than

http://biosite.org/47d61c81-5a0f-5448-964a-34bbfb54ce8b

For most of us this is definitely true, and nothing prevents developers from creating links of the first type while still using UUIDs behind the scenes.

Why UUIDs v5 are better than any other UUIDs as a scientific name identifier

  • They can be generated independently by anybody and will still be identical for the same name string

  • They use the SHA1 algorithm, which does not have the (extremely rare) collisions found for the MD5 algorithm

  • The same ID can be generated in any popular language by following a well-defined algorithm

Crossmap Tool v0.1.5

A new version of the gn_crossmap tool is out.

The Global Names Crossmap tool allows matching names from a spreadsheet against names from any data source in the Global Names Resolver.

The main change in this version: the output file with crossmap data now contains all fields from the original input document, which allows filtering and sorting by any field from the input.

Other changes:

  • @dimus - #5 - All original fields are now preserved in the output file

  • @dimus - #3 - If an ingest has more than 10K rows, the user will see logging events

  • @dimus - #4 Bug - Added error messages if headers don’t have the necessary fields

  • @dimus - #2 - Header fields are now allowed to have trailing spaces

  • @dimus - #7 Bug - An empty rank no longer breaks crossmapping

  • @dimus - #1 Bug - Added the missing rest-client gem