Scala-based gnparser v.0.2.0

(Please note that gnparser is a new project, different from the formerly released biodiversity parser.)

We are happy to announce a public release of Global Names Parser (gnparser), the first project that marks the transition of Global Names reconciliation and resolution services from “prototype” to “production”. The gnparser project is developed by @alexander-myltsev and @dimus in Scala. GNParser can be used as a library, a command line tool, a socket server, a web application and a RESTful service. The easiest way to try it is at parser.globalnames.org.

Scientific names might be expressed by quite different name strings. Sometimes the difference is just one comma, sometimes authors are included or excluded, sometimes ranks are omitted. With all this variability “in the wild” we need to figure out how to group all these different spelling variants. Name parsing is an unexpectedly complex and absolutely necessary step for connecting biological information via scientific names.
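To make the grouping problem concrete, here is a minimal sketch in Python. It is not gnparser code and does not use its API: `naive_canonical` and `group_variants` are hypothetical helpers written for this illustration, and the name strings are only examples. The sketch assumes that the genus comes first, that epithets are lowercase words, and that authors are capitalized.

```python
import re
from collections import defaultdict

# Rank markers that this toy example drops (an illustrative, incomplete set).
RANK_MARKERS = {"var", "subsp", "ssp", "f"}

def naive_canonical(name_string):
    """Hypothetical helper: reduce a name string to genus plus lowercase epithets.

    Assumes the first word is the genus, epithets are all-lowercase words,
    and authors start with a capital letter. Years and punctuation are ignored.
    """
    tokens = re.findall(r"[A-Za-z]+", name_string)
    if not tokens:
        return ""
    genus = tokens[0]
    epithets = [t for t in tokens[1:] if t.islower() and t not in RANK_MARKERS]
    return " ".join([genus] + epithets)

def group_variants(name_strings):
    """Hypothetical helper: group name strings that share a naive canonical form."""
    groups = defaultdict(list)
    for s in name_strings:
        groups[naive_canonical(s)].append(s)
    return dict(groups)

variants = [
    "Homo sapiens Linnaeus, 1758",      # author and year, with a comma
    "Homo sapiens Linnaeus 1758",       # same, comma missing
    "Homo sapiens",                     # author omitted
    "Carex scirpoidea var. convoluta",  # rank included
    "Carex scirpoidea convoluta",       # rank omitted
]

for canonical, members in group_variants(variants).items():
    print(canonical, "<-", members)
```

The assumptions above break quickly on real data: author names with lowercase particles, hybrid formulas, abbreviations and misspellings would all be grouped wrongly. That is why a dedicated parser is needed rather than a few lines of string handling.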

In 2008 Global Names released Biodiversity Gem, a scientific name parser written in Ruby for these purposes. The library, in its 3 variants, enjoyed significant success: about 150,000 downloads and a reputation as the most popular bio-library for the Ruby language. It could parse about 2-3 million names an hour, and it has been the basis of name reconciliation for many projects since its publication.

GNParser is a direct descendant of the biodiversity gem. It serves the same purpose, and the input and output formats of the two projects are similar. It also marks the eventual discontinuation of the biodiversity gem project and the migration of all Global Names code to the new gnparser library.

Why did we go through the pain of making a completely new parser from scratch? The short answer is scalability and portability. We want to keep the parsing step from becoming a bottleneck, however many names are thrown at the resolution services. For example, finding all names in the Biodiversity Heritage Library (BHL) took us 43 days three years ago. The parsing step alone took more than one day. If we want to improve the algorithms for finding names in BHL, we cannot wait 40 days. We want to be able to do it within one day and to improve the whole BHL index every time our algorithms are significantly enhanced.

We have an ambitious goal: the only bottlenecks of our name matching services should be the time spent sending names to the resolution services over the internet and the time spent transferring the answers back. Such speeds require very fast parsing. Scala allows us to dramatically improve the speed and scalability of the parsing step.

Running the parser on the Java Virtual Machine allows us to give the biodiversity community a much more portable parsing tool. Out of the box, the parser library will work directly with Scala, Java, R, Jython and JRuby. We hope that it will speed up and simplify many biodiversity projects.

This is the first public release of the library. Please download it, run it, test it and give us your feedback, so we can improve it further. Happy parsing!

WARNING: The JSON output format might change, slightly or dramatically, as we are still refining it. The JSON format should be finalized for version 0.3.0.
