For the last couple of years I have been working on making scientific-name detection possible on a massive scale. This work resulted in several tools that dramatically increase the speed of name-finding:
- name-parsing: gnparser
- name-detection: gnfinder
- name-curation: gntagger
- name-verification (built by Alex Myltsev): gnindex
As a result we are able to scan large corpora of biodiversity literature, such as the Biodiversity Heritage Library (BHL, 200 000 volumes) and the HathiTrust Digital Library (16 000 000 volumes), in a matter of hours.
Recently, we presented our achievements at the Biodiversity Next conference, an expanded version of the traditional yearly TDWG meeting. I gave a talk at the symposium "Improving access to hidden scientific data in the Biodiversity Heritage Library". You can read about it in more detail in a BHL blog post.
A collection of biodiversity literature as significant as BHL gains dramatically in usability from data-mining efforts. For many years BHL has used the scientific names index generated by our project, Global Names Architecture. Several years ago, generating such an index was a very slow and laborious task that could not be repeated easily. With our recent developments we are able to index BHL repeatedly with ease. This gives us an opportunity to listen to feedback from BHL users, make incremental improvements to our algorithms, and continuously increase the quality of the scientific names index.
We are very interested in working with other people who try to enhance the usability of literature aggregators like BHL by developing natural language processing and machine learning algorithms. During the conference several researchers in the field agreed to participate in a brainstorming workshop at the Illinois Natural History Survey, Champaign/Urbana, to develop new technologies for data mining in BHL and other similar corpora. We are planning to organize such a workshop in April/May 2020.