Scientific name parsing makes it possible to determine a name's canonical form and authorship, and to extract other metadata. Canonical forms are crucial for comparing names from different data sources.
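To illustrate why canonical forms help when comparing data sources, here is a minimal sketch in Go. It does not use GNparser. The `toyCanonical` function is a hypothetical, deliberately naive heuristic written only for this example: it keeps the first word plus the lowercase words that follow it. A real parser has to handle ranks, hybrids, and many other cases that this sketch ignores.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isLowerWord reports whether w consists only of lowercase letters and hyphens.
func isLowerWord(w string) bool {
	for _, r := range w {
		if !unicode.IsLower(r) && r != '-' {
			return false
		}
	}
	return w != ""
}

// toyCanonical is a hypothetical stand-in for a real parser (NOT GNparser's
// algorithm). It keeps the first word and every following lowercase word,
// and stops at the first word that looks like authorship or a year.
func toyCanonical(name string) string {
	words := strings.Fields(name)
	if len(words) == 0 {
		return ""
	}
	kept := []string{words[0]}
	for _, w := range words[1:] {
		if !isLowerWord(w) {
			break
		}
		kept = append(kept, w)
	}
	return strings.Join(kept, " ")
}

func main() {
	// The same name as it might be written in two different data sources.
	a := "Bubo bubo (Linnaeus, 1758)"
	b := "Bubo bubo L."

	fmt.Println(a == b)                             // raw strings differ
	fmt.Println(toyCanonical(a) == toyCanonical(b)) // canonical forms match
}
```

The raw strings differ only in how the authorship is written, so a direct string comparison fails while a comparison of canonical forms succeeds.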

We are releasing GNparser v1.1.0, written in Go. Because the project follows Semantic Versioning, this is a stable release: the output format, functions, and settings will remain backward compatible for many years (until v2).

This is the third implementation of name parsing for the Global Names Architecture project. The first, the biodiversity gem written in Ruby, now uses the Go code of GNparser. The second, written in Scala, is archived and awaits a new maintainer.

Summary

GNparser is sophisticated software that can parse the most complex scientific names. It is also very fast, parsing more than 200 million names in an hour. The parser is a core component of many other Global Names Architecture projects.

It can be used via:

We also provide C bindings to its code. This approach makes it possible to incorporate GNparser natively into any language that supports C bindings (such as Java, Python, Ruby, etc.).

Improvements since the last Scala-based release of GNparser

  • Speed — about 2 times faster than the Scala-based version for CSV output, and about 8 times faster for JSON output.

  • Issue #27 — support for agamosp., agamossp., and agamovar. ranks.
  • Issue #28 — support for non-ASCII apostrophes.
  • Issue #36 — support _ as a space for files in Newick format.
  • Issue #40 — support names where one of the parentheses is missing.
  • Issue #43 — support for notho- (hybrid) ranks.
  • Issue #45 — support for natio rank.
  • Issue #46 — support for subg. rank.
  • Issue #48 — improve transliteration of diacritical characters.
  • Issue #49 — support for outdated names with several hyphens in the specific epithet.
  • Issue #51 — distinguish between Aus (Bus) cus in botany and zoology (author or subgenus).
  • Issue #52 — support hyphen in outdated genus names.
  • Issue #57 — warn when f. might mean either filius or forma.
  • Issue #58 — distinguish between Aus (Bus) in ICN and ICZN (author or subgenus).
  • Issue #63 — normalize format to f. instead of fm..
  • Issue #60 — allow outdated ranks in the form of Greek letters.
  • Issue #61 — support authors’ names with bis suffix.
  • Issue #66 — remove HTML tags from names, unless asked otherwise.
  • Issue #67 — add name’s authorship to the “root” of JSON structure.
  • Issue #68 — provide stemmed canonical form.
  • Issue #69 — provide shared C library to bind GNparser to other languages.
  • Issue #72 — parse surrogate names from BOLD project.
  • Issue #75 — normalize subspecies to subsp.
  • Issue #74 — support CSV output.
  • Issue #78 — parse virus-like non-virus names correctly.
  • Issue #79 — make CSV the default output.
  • Issue #80 — add cardinality to output.
  • Issue #81 — support year ranges like ‘1778/79’.
  • Issue #82 — parse authors with prefix zu.
  • Issue #89 — allow subspec. as a rank.
  • Issue #90 — allow ß in names.
  • Issue #93 — parse y from Spanish papers as an author separator.
  • Issue #127 — release a stable 1.0.0 version.
  • Issue #162 — support bacterial Candidatus names.
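Several of the items above (Issues #28, #36, #63, and #75) describe normalizing input before or during parsing. The following Go sketch shows what such normalization could look like. It is an illustration only and is not taken from GNparser's source: the `inputCleaner` mapping, the `rankAliases` table, and the `normalize` function are all hypothetical names and choices made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// inputCleaner rewrites single characters before further processing.
// The mapping is an assumption for illustration, not GNparser's actual table.
var inputCleaner = strings.NewReplacer(
	"_", " ", // underscore used as a space in Newick files (Issue #36)
	"\u2019", "'", // right single quotation mark as apostrophe (Issue #28)
	"\u2018", "'", // left single quotation mark
	"\u02BC", "'", // modifier letter apostrophe
)

// rankAliases maps rank spellings to one normalized spelling.
// Hypothetical table; only the two targets come from the list above.
var rankAliases = map[string]string{
	"subspecies": "subsp.", // Issue #75
	"fm.":        "f.",     // Issue #63
}

// normalize cleans characters, collapses whitespace, and rewrites ranks.
func normalize(name string) string {
	words := strings.Fields(inputCleaner.Replace(name))
	for i, w := range words {
		if norm, ok := rankAliases[w]; ok {
			words[i] = norm
		}
	}
	return strings.Join(words, " ")
}

func main() {
	fmt.Println(normalize("Aus_bus_subspecies_cus")) // Aus bus subsp. cus
	fmt.Println(normalize("Aus bus O\u2019Connor"))  // Aus bus O'Connor
}
```

A lookup table keeps each normalization rule visible in one place, which makes it easy to see which spelling is treated as the normalized one.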