Parsing a scientific name makes it possible to determine its canonical form and authorship, and to extract other metadata. Canonical forms are crucial for comparing names from different data sources.
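To illustrate why canonical forms matter for comparison, here is a deliberately naive Go sketch. It is not GNparser's algorithm, and the helper names `naiveCanonical` and `isLowerWord` are hypothetical. It assumes the simplest possible input: a genus, then lowercase epithets, then authorship. Under that assumption, two name strings that differ only in how the authorship is written reduce to the same canonical form.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// naiveCanonical is a hypothetical, deliberately simplistic helper.
// It keeps the first word (genus) and the lowercase words that follow it
// (epithets), and stops at the first token that looks like authorship or a year.
// Real-world names need a real parser such as GNparser.
func naiveCanonical(name string) string {
	words := strings.Fields(name)
	if len(words) == 0 {
		return ""
	}
	kept := []string{words[0]}
	for _, w := range words[1:] {
		if !isLowerWord(w) {
			break
		}
		kept = append(kept, w)
	}
	return strings.Join(kept, " ")
}

// isLowerWord reports whether w consists only of lowercase letters and hyphens.
func isLowerWord(w string) bool {
	for _, r := range w {
		if r != '-' && !unicode.IsLower(r) {
			return false
		}
	}
	return true
}

func main() {
	a := "Homo sapiens Linnaeus, 1758"
	b := "Homo sapiens L."
	fmt.Println(a == b)                                 // false: the raw strings differ
	fmt.Println(naiveCanonical(a) == naiveCanonical(b)) // true: both reduce to "Homo sapiens"
}
```

The sketch breaks on anything more complex (lowercase author prefixes, ranks, hybrids, missing parentheses), which is exactly the ground a dedicated parser has to cover.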
We are releasing GNparser v1.1.0, written in Go. We follow Semantic Versioning, so this is a stable version: the output format, functions, and settings will stay backward compatible for many years (until v2).
This is the third implementation of name parsing for the Global Names Architecture project. The first one, the Ruby biodiversity gem, now uses the Go code of GNparser. The second one, written in Scala, is archived and awaits a new maintainer.
Summary
GNparser is sophisticated software that can parse the most complex scientific names. It is also very fast, parsing more than 200 million names in an hour. The parser is a core component of many other Global Names Architecture projects.
It can be used via:
We also provide C bindings to its code. This approach makes it possible to incorporate GNparser natively into any language that supports C bindings (such as Java, Python, Ruby, etc.).
Improvements since the last Scala-based release of GNparser
- Speed: about 2 times faster than the Scala-based version for CSV output, and about 8 times faster for JSON output.
- Issue #27: support for `agamosp.`, `agamossp.`, `agamovar.` ranks.
- Issue #28: support for non-ASCII apostrophes.
- Issue #36: support `_` as a space for files in Newick format.
- Issue #40: support names where one of the parentheses is missing.
- Issue #43: support for `notho-` (hybrid) ranks.
- Issue #45: support for `natio` rank.
- Issue #46: support for `subg.` rank.
- Issue #48: improve transliteration of diacritical characters.
- Issue #49: support for outdated names with several hyphens in specific epithet.
- Issue #51: distinguish between `Aus (Bus) cus` in botany and zoology (author or subgenus).
- Issue #52: support hyphen in outdated genus names.
- Issue #57: warn when `f.` might mean either `filius` or `forma`.
- Issue #58: distinguish between `Aus (Bus)` in ICN and ICZN (author or subgenus).
- Issue #63: normalize `format` to `f.` instead of `fm.`.
- Issue #60: allow outdated ranks in form of Greek letters.
- Issue #61: support authors’ names with `bis` suffix.
- Issue #66: remove HTML tags from names, unless asked otherwise.
- Issue #67: add name’s authorship to the “root” of JSON structure.
- Issue #68: provide stemmed canonical form.
- Issue #69: provide shared C library to bind GNparser to other languages.
- Issue #72: parse surrogate names from BOLD project.
- Issue #75: normalize subspecies to subsp.
- Issue #74: support CSV output.
- Issue #78: parse virus-like non-virus names correctly.
- Issue #79: make CSV the default output.
- Issue #80: add cardinality to output.
- Issue #81: support year ranges like ‘1778/79’.
- Issue #82: parse authors with prefix `zu`.
- Issue #89: allow `subspec.` as a rank.
- Issue #90: allow `ß` in names.
- Issue #93: parse `y` from Spanish papers as an author separator.
- Issue #127: release a stable 1.0.0 version.
- Issue #162: support bacterial `Candidatus` names.
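
Two of the items above (non-ASCII apostrophes in #28 and the underscore treated as a space in #36) amount to normalizing the input string before it is parsed. The following Go sketch illustrates that idea only. It is not taken from GNparser; the name `normalizeInput` is hypothetical, and the set of apostrophe-like characters is an assumption made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// replacer maps a few apostrophe-like characters to the ASCII apostrophe and
// the underscore to a space. The character selection is an assumption made
// for this sketch, not GNparser's actual list.
var replacer = strings.NewReplacer(
	"\u2019", "'", // right single quotation mark
	"\u2018", "'", // left single quotation mark
	"\u02BC", "'", // modifier letter apostrophe
	"_", " ", // underscore used in place of a space, as in Newick files
)

// normalizeInput is a hypothetical pre-processing step applied before parsing.
func normalizeInput(name string) string {
	return strings.TrimSpace(replacer.Replace(name))
}

func main() {
	fmt.Println(normalizeInput("Solanum_tuberosum"))     // Solanum tuberosum
	fmt.Println(normalizeInput("Aus bus d\u2019Orbigny")) // Aus bus d'Orbigny
}
```

In a real parser the underscore replacement would presumably be limited to inputs known to come from Newick files, since applying it everywhere could hide genuine data errors.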