Find scientific names in plain texts, PDF files, MS documents etc.

GNfinder is a program for finding scientific names in texts. GNfinder exists for several years now and is responsible for creation of name indices for Biodiversity Heritage Library (BHL) and HathiTrust Digital Library. The program is fast enough to process 5 million pages of BHL in just a couple of hours.

The new GNfinder v0.14.1 can find names not only in plain UTF8-encoded texts, but also in a large variety of files including PDF, MS Word, MS Excel, and images. In this blog post we describe how it can be used.

GNfinder code follows Semantic Versioning practices. So users need to be aware that for versions 0.x.x backward incompatible changes might happen.

Summary

GNfinder is a command line application, that can also be used as a RESTful service. In the near future it will also have a web-based user interface and will run at https://finder.globalnames.org

The program uses heuristic and Natural Language Processing (NLP) algorithms for name finding.

Performance

For a test we used 4MB PDF file that contains ~2000 unique names. These names are mentioned in text ~13000 times.

Time for conversion to UTF8-encoded plain text: 2.5 sec

Time for name-finding: 0.4 sec

Time for name-verification of 2000 uniquely found names: 2.5 sec

Installation

The program consists of one stand-alone file, so it is easy to install. The binaries for MS WIndows, Mac OS or Linux can be downloaded from GitHub. In addition GNfinder can be installed using a Homebrew package manager with the following terminal commands:

brew tap gnames/gn
brew install gnfinder

For more detailed installation instructions see the documentation on GitHub.

Usages

GNfinder is a command line application. It requires an internet connection for converting files to UTF8-encoded text and for name-verification.

To get help use:

gnfinder -h

To get names from a UTF8-encoded text file (with -U flag no internet connection is required):

gnfinder -U file-with-names.txt

To get names from any other kind of file:

gnfinder file.pdf

To find names, to verify them and to output results in JSON format:

gnfinder file.pdf -v -f pretty > file-names.csv

To convert PDF file into text:

gnfinder -I file.pdf > file.txt

Tutorial

I wrote a tutorial how to exctract scientific names in parallel from a large number of PDF files.

For more information read the documentation on GitHub.

RESTful (API)

You can run GNfinder as a RESTful API service as well. For now API can work only with UTF-8 encoded texts, but other file formats will be available via API as well after completion of a web-based user interface.