(Please note that gnparser is a new project, different from the formerly released biodiversity parser.)
We are happy to announce a public release of Global Names Parser, or gnparser, the first project that marks the transition of Global Names reconciliation and resolution services from “prototype” to “production”. The gnparser project is developed by @alexander-myltsev and @dimus in the Scala language. GNParser can be used as a library, a command line tool, a socket server, a web application, and a RESTful service. It is easiest to try it at parser.globalnames.org.
Scientific names might be expressed by quite different name strings. Sometimes the difference is just one comma, sometimes authors are included or excluded, sometimes ranks are omitted. With all this variability “in the wild” we need to figure out how to group all these different spelling variants. Name parsing is an unexpectedly complex and absolutely necessary step for connecting biological information via scientific names.
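To make the grouping problem concrete, here is a toy sketch in Python. It is not gnparser's algorithm or API: the function names, the list of rank markers, and the rule "stop at the first token that does not look like an epithet" are all assumptions made for illustration. It reduces each name string to a naive canonical form and groups the variants that share it.

```python
from collections import defaultdict

# Hypothetical, deliberately incomplete list of rank markers (illustration only).
RANK_MARKERS = {"var.", "subsp.", "ssp.", "subvar."}


def naive_canonical(name_string: str) -> str:
    """Reduce a name string to genus plus epithets, dropping ranks and authorship.

    This is a toy heuristic, not a real parser: it assumes the first token is
    the genus and that epithets are purely alphabetic, lowercase tokens.
    """
    tokens = name_string.split()
    if not tokens:
        return ""
    canonical = [tokens[0]]  # assumed to be the genus
    for token in tokens[1:]:
        if token in RANK_MARKERS:
            continue  # rank markers are omitted from the canonical form
        if token.isalpha() and token.islower():
            canonical.append(token)  # looks like an epithet
        else:
            break  # anything else is treated as the start of authorship
    return " ".join(canonical)


def group_variants(name_strings):
    """Group name strings by their naive canonical form."""
    groups = defaultdict(list)
    for name_string in name_strings:
        groups[naive_canonical(name_string)].append(name_string)
    return dict(groups)


variants = [
    "Homo sapiens Linnaeus, 1758",
    "Homo sapiens Linnaeus 1758",
    "Homo sapiens",
    "Carex scirpoidea var. convoluta Kük.",
    "Carex scirpoidea convoluta",
]
groups = group_variants(variants)
```

Even this small heuristic breaks quickly on real data: a lowercase author particle such as the "de" in "de Candolle" would be mistaken for an epithet, and hybrids, abbreviated genera, and misspellings are not handled at all. This is why a dedicated parser is needed.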
In 2008 Global Names released Biodiversity Gem – a scientific name parser written in Ruby for these purposes. The library in its 3 variants enjoyed significant success: about 150 000 downloads and a reputation as the most popular bio-library for the Ruby language. It could parse about 2-3 million names an hour, and it has been the basis of name reconciliation for many projects since its publication.
GNParser is a direct descendant of the biodiversity gem. It serves the same purpose, and the input/output formats of both projects are similar. It also marks the eventual discontinuation of the biodiversity gem project and the migration of all Global Names code to the new gnparser library.
Why did we go through the pain of making a completely new parser from scratch? The short answer is scalability and portability. We want to keep the parsing step from being a bottleneck for any number of names thrown at resolution services. For example, finding all names in the Biodiversity Heritage Library (BHL) took us 43 days 3 years ago. The parsing step alone took more than 1 day. If we want to improve the algorithms for finding names in BHL, we cannot wait 40 days. We want to be able to do it within one day and improve the whole BHL index every time our algorithms are enhanced significantly.
We have an ambitious goal: the only bottlenecks of our name matching services should be the time spent sending names to resolution services over the internet and the time spent transferring the answers back. Such speeds require very fast parsing. Scala allows us to dramatically improve the speed and scalability of the parsing step.
Having a parser running on the Java Virtual Machine allows us to give the biodiversity community a much more portable parsing tool. Out of the box, the parser library will work with Scala, Java, R, Jython and JRuby directly. We hope that it will speed up and simplify many biodiversity projects.
This is the first public release of the library. Please download it, run it, test it and give us your feedback, so we can improve it further. Happy parsing!
WARNING: The JSON output format might change slightly or dramatically, as we are in the process of refining it. The JSON format should be finalized for version 0.3.0.