Find scientific names in plain text, PDF files, MS Office documents, etc.
GNfinder is a program for finding scientific names in texts. It has existed for several years and was used to create the name indices for the Biodiversity Heritage Library (BHL) and the HathiTrust Digital Library. The program is fast enough to process 5 million pages of BHL in just a couple of hours.
The new GNfinder v0.14.1 can find names not only in plain UTF8-encoded texts, but also in a large variety of files including PDF, MS Word, MS Excel, and images. In this blog post we describe how it can be used.
GNfinder code follows Semantic Versioning practices, so users should be aware that backward-incompatible changes might happen in 0.x.x versions.
Summary
GNfinder is a command line application that can also be used as a RESTful service. In the near future it will also have a web-based user interface and will run at https://finder.globalnames.org
The program uses heuristic and Natural Language Processing (NLP) algorithms for name finding.
Performance
For a test we used a 4 MB PDF file that contains ~2000 unique names. These names are mentioned in the text ~13000 times.
Time for conversion to UTF8-encoded plain text: 2.5 sec
Time for name-finding: 0.4 sec
Time for name-verification of the 2000 unique names found: 2.5 sec
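As a rough sanity check, the three timings above can be added up and turned into rates. The snippet below only restates the reported numbers; the variable names are illustrative, and the rates are approximate because the counts are given as ~2000 and ~13000.

```python
# Timings reported above for the 4 MB test PDF (seconds).
conversion_sec = 2.5      # PDF -> UTF8-encoded plain text
finding_sec = 0.4         # name-finding
verification_sec = 2.5    # verification of the unique names

unique_names = 2000       # approximate ("~2000" in the text)
mentions = 13000          # approximate ("~13000" in the text)

total_sec = conversion_sec + finding_sec + verification_sec
mentions_per_sec = mentions / finding_sec
verified_per_sec = unique_names / verification_sec

print(f"total: {total_sec:.1f} sec")
print(f"name-finding: ~{mentions_per_sec:.0f} mentions/sec")
print(f"verification: ~{verified_per_sec:.0f} names/sec")
```

Name-finding itself accounts for well under a tenth of the total time; conversion and verification, the two steps that need an internet connection, dominate.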
Installation
The program consists of one stand-alone file, so it is easy to install. The binaries for MS Windows, Mac OS, or Linux can be downloaded from GitHub. In addition, GNfinder can be installed with the Homebrew package manager using the following terminal commands:
brew tap gnames/gn
brew install gnfinder
For more detailed installation instructions see the documentation on GitHub.
Usage
GNfinder is a command line application. It requires an internet connection for converting files to UTF8-encoded text and for name-verification.
To get help use:
gnfinder -h
To get names from a UTF8-encoded text file (with the -U flag, no internet connection is required):
gnfinder -U file-with-names.txt
To get names from any other kind of file:
gnfinder file.pdf
To find names, verify them, and output the results in JSON format:
gnfinder file.pdf -v -f pretty > file-names.json
To convert a PDF file into text:
gnfinder -I file.pdf > file.txt
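The JSON output of the verification command above can be post-processed with a short script. The sketch below assumes, without confirmation from this post, that the output is an object with a top-level `names` list whose items carry a `name` field; check the documentation on GitHub for the actual schema. The function name and the sample data are hypothetical.

```python
import json
from collections import Counter


def count_names(gnfinder_json: str) -> Counter:
    """Count how many times each name occurs in GNfinder JSON output.

    Assumption: the output is an object with a "names" list, and each
    item has a "name" key. Adjust the keys if the real schema differs.
    """
    data = json.loads(gnfinder_json)
    return Counter(item["name"] for item in data.get("names", []))


# Hypothetical sample that mimics the assumed structure.
sample = json.dumps({
    "names": [
        {"name": "Pomatomus saltator"},
        {"name": "Parus major"},
        {"name": "Pomatomus saltator"},
    ]
})

counts = count_names(sample)
print(len(counts), "unique names,", sum(counts.values()), "mentions")
```

For a real run, pass the contents of the saved output file to `count_names` instead of the sample string.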
Tutorial
I wrote a tutorial on how to extract scientific names in parallel from a large number of PDF files.
For more information read the documentation on GitHub.
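To give a feel for parallel extraction, here is a minimal Python sketch. It is not the approach from the tutorial, only one possible way to run the command from the Usage section on many files at once. It assumes `gnfinder` is on the PATH; the helper names, the `.names.json` output naming, and the default of 4 workers are hypothetical choices.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(pdf: Path) -> list[str]:
    """Return the gnfinder invocation shown in the Usage section."""
    return ["gnfinder", str(pdf), "-v", "-f", "pretty"]


def output_path(pdf: Path) -> Path:
    """Hypothetical naming scheme: paper.pdf -> paper.names.json."""
    return pdf.with_suffix(".names.json")


def find_names(pdf: Path) -> Path:
    """Run gnfinder on one PDF and save its standard output."""
    out = output_path(pdf)
    with out.open("w", encoding="utf-8") as fh:
        subprocess.run(build_command(pdf), stdout=fh, check=True)
    return out


def find_names_in_dir(directory: str, workers: int = 4) -> list[Path]:
    """Process every *.pdf in a directory, a few files at a time."""
    pdfs = sorted(Path(directory).glob("*.pdf"))
    # Threads suffice: each worker only waits for an external process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(find_names, pdfs))
```

Because file conversion and name-verification need an internet connection, a modest number of workers is probably a sensible starting point.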
RESTful (API)
You can also run GNfinder as a RESTful API service. For now the API works only with UTF-8 encoded texts, but other file formats will become available via the API after the web-based user interface is completed.