Power down

On May 20th all our servers will be down because of power maintenance at our computer center. Sorry for the inconvenience! We will be back as soon as possible.

Names in November meeting

Global Names developers Dmitry Mozzherin and Richard Pyle were invited to attend a workshop called “Names in November”, organized by the Catalog of Life and GBIF and hosted in Leiden. The three-day meeting involved more than twenty people from key taxonomic and nomenclatural organizations, and focused on discussing ways that a global information system of taxonomy, including both names data and accepted species information, could be designed to more seamlessly interconnect biodiversity data through organism names. Although the theme of the meeting was certainly not new (many participants in this meeting had attended similar meetings going back decades discussing essentially the same idea), the tone of the discussion was refreshing in that it focused comparatively little on politics and technical details, and instead concentrated on identifying whether such a shared taxonomic infrastructure was even possible (given the political, financial, and technical circumstances currently existing within the main likely partners), and what conditions would need to be met.

Many of the points that participants agreed on in terms of needs and services very closely matched the fundamental goals and infrastructure we have developed (and continue to develop) within the context of Global Names. Now that GN is much more closely coordinating with the Catalog of Life, GN data indexes and services will likely play an important role in implementing the shared global taxonomy resource envisioned during the meeting. Following this meeting, we have a renewed sense of focus within GN development to finish harmonizing integration of GNI and GNUB services, and especially to rapidly increase the effort to bulk-populate GNUB from existing data.

gnparser release v0.3.3

We are happy to announce the release of gnparser. Changes in the v0.3.3 release:

  • add an option to show the canonical name UUID:
scala> fromString("Homo sapiens").render(compact=false, showCanonicalUuid=true)
res0: String = 
// ...
  "canonical_name" : {
    "id" : "16f235a0-e4a3-529c-9b83-bd15fe722110",
    "value" : "Homo sapiens"
  },
// ...
  • add a year range to the AST: the Year node is extended with a rangeEnd: Option[CapturePosition] field

  • parse names ending with a hybrid sign (#88)

  • support hybrid abbreviation expansions (#310)

  • support raw hybrid formula parsing (#311)

  • the continuous build has moved to CircleCI

  • and many structural changes, bug fixes and quality improvements, described in the release documentation.

Scala-based gnparser v.0.3.0

To avoid confusion – gnparser is a new project, different from the formerly released biodiversity parser.

We are happy to announce the second public release of Scala-based Global Names Parser or gnparser. There are many significant changes in the v. 0.3.0 release.

  • Speed improvements. The parser is about 50% faster than the already quite fast 0.2.0 version; we were able to parse 30 million names per CPU per hour with this release.

  • Compatibility with Scala 2.10.6: it was important for us to make the parser backward compatible with this older version of Scala, because we wanted to support the Spark project.

  • Compatibility with Spark v. 1.6.1. The parser can now be used in BigData projects running on Spark, massively parallelizing the parsing process on the Spark platform. We added documentation describing how to use the parser with either Scala or Python natively on Spark.

  • Simplified parsing output in addition to the “Standard output”: it analyzes a name-string and returns its id, canonical form, canonical form with infraspecific ranks, authorship and year.

  • Improved and stabilized JSON fields. You can find a complete description of the parser’s JSON output in its JSON schema. We based the field names on TDWG’s Taxon Concept Schema, and we intend to keep the JSON format stable from now on.

  • There were many structural changes, bug fixes and quality improvements. They are described in the release documentation.

uBio and Nomenclator Zoologicus are back

The uBio and Nomenclator Zoologicus sites experienced difficulties this year and have been down a lot lately, mostly because there is no longer a system administrator at the Marine Biological Laboratory to look after them.

I moved uBio from old hardware at the Marine Biological Laboratory to Google Container Engine, and it is running again. Some functionality is not back yet, mostly due to some hard-coded configuration parameters in files. I hope the problems with the code will be fixed eventually by interested parties (I do not plan to rewrite the code). I will coordinate my efforts with Dave Remsen and Patrick Leary, and hopefully together we will preserve uBio for the community.

Please note that being a system administrator for uBio is not part of my job. I like the project and consider it a ‘precursor’ of GN, so I will try my best to keep it running in my spare time. The MBL/WHOI library pays for the cloud.

Docker containers to run uBio are located on Docker Hub. We use Docker and Kubernetes at Google Container Engine to keep it alive.

Scala-based gnparser v.0.2.0

(Please note that gnparser is a new project, different from the formerly released biodiversity parser.)

We are happy to announce a public release of the Global Names Parser, or gnparser – the first project that marks the transition of Global Names reconciliation and resolution services from “prototype” to “production”. The gnparser project is developed by @alexander-myltsev and @dimus in the Scala language. GNParser can be used as a library, a command line tool, a socket server, a web program and a RESTful service. It is easiest to try it at parser.globalnames.org

Scientific names might be expressed by quite different name strings. Sometimes the difference is just one comma, sometimes authors are included or excluded, sometimes ranks are omitted. With all this variability “in the wild” we need to figure out how to group all these different spelling variants. Name parsing is an unexpectedly complex and absolutely necessary step for connecting biological information via scientific names.

In 2008 Global Names released the biodiversity gem – a scientific name parser written in Ruby for these purposes. The library, in its 3 variants, enjoyed significant success – about 150 000 downloads and a reputation as the most popular bio-library for the Ruby language. It could parse about 2-3 million names an hour, and it has been the basis of name reconciliation for many projects from the moment of its publication.

GNParser is a direct descendant of the biodiversity gem. It serves the same purpose, and the input/output formats of both projects are similar. It also marks the eventual discontinuation of the biodiversity gem project and the migration of all Global Names code to the new gnparser library.

Why did we go through the pain of making a completely new parser from scratch? The short answer is scalability and portability. We want the parsing step to stop being a bottleneck for any number of names thrown at the resolution services. For example, finding all names in the Biodiversity Heritage Library took us 43 days 3 years ago; the parsing step alone took more than 1 day. If we want to improve the algorithms for finding names in BHL, we cannot wait 40 days. We want to be able to do it within one day and rebuild the whole BHL index every time our algorithms are enhanced significantly.

We have an ambitious goal: the time spent sending names to the resolution services over the internet, and then transferring the answers back, should be the bottleneck of our name matching services. For such speeds we need very fast parsing. Scala allows us to dramatically improve the speed and scalability of the parsing step.

Having a parser running in the Java Virtual Machine environment also allows us to give the biodiversity community a much more portable parsing tool. Out of the box, the parser library will work with Scala, Java, R, Jython and JRuby directly. We hope that it will speed up and simplify many biodiversity projects.

This is the first public release of the library. Please download it, run it, test it and give us your feedback, so we can improve it further. Happy parsing :smile:

WARNING: the JSON output format might change slightly or dramatically, as we are in the process of refining it. The JSON format should be finalized for version 0.3.0.

Writing Papers as Open Source -- Solutions

I decided to figure out how I can write scientific papers in a truly Open Source fashion. Here are the practical decisions that allowed me to do it:

Criteria

  • The paper draft is under a true revision control system
  • Open access from the very beginning
  • Open tools/standards

Solutions

Revision control

To use the full power of a revision control system, a project should be mostly in a text format of some sort. We currently keep practically all our code on GitHub, so Git was a natural choice.

Open tools/formats

I decided to go with LaTeX, as it is a tried and powerful markup language, very well suited for scientific writing. It works with plain text, so we can easily keep revisions in Git.

Vim is my editor of choice, but nothing prevents me or co-authors from using any other modern text editor for LaTeX.

Open access

With LaTeX and Git it is easy to provide early access to work in progress, especially with 2 commercial products that give free access to open projects – GitHub and Overleaf. Overleaf supports git, although not as well as GitHub does, so currently it is better to have GitHub as the main repository and keep Overleaf as a glorified viewer, using it only as a secondary repository. Another useful tool is Mendeley, for finding and organizing bibliography.

Final Result

I am still learning the ropes, but I am excited about the progress. The paper about the Global Names Parser is now on GitHub and Overleaf! Overleaf allows anybody interested to see the paper in a user-friendly PDF format. It also simplifies submission of papers to a large variety of open access journals.

I created a post on my personal blog describing how I set up my system with LaTeX, Vim and tmux.

Writing Papers as Open Source

Does a culture exist out there which considers the process of writing scientific papers to be akin to writing open source code?

For the last 8 years I have been blessed to be paid for doing open source development. It means that for that long nearly everything I do has been almost instantly available publicly. This model fits my way of thinking and my values, and I see an advantage in making everything I do available for the public to see, comment on, and enhance.

Now I am writing a paper, and I feel thrown into the “dark ages”. The whole paper writing and publishing culture was one of the reasons I left molecular biology and went into programming. I assume the following is usually true when people write a scientific paper –

  • People normally do not share publicly what they work on until it is published.
  • People normally use proprietary software to write papers.
  • People often lose their copyright, or the ability to share their work, when their paper is accepted by a journal.

Obviously there is progress on the last point, but what about the other two?

  • Can I use a public revision control system when I am writing a paper?
  • Can I publish using a revision control system from the very first paragraph for all to see?
  • What open standards/tools can I use (LaTeX, or even markdown?) for writing a paper?
  • Can I consider publishing a paper to be a ‘release’, like for a program?
  • Should the electronic version be frozen? Can it evolve after publication?

Honolulu Workshop

Last week (October 5-9 2015) @deepreef, @dimus, and @alexander-myltsev had a workshop in Honolulu at the Bishop Museum to sync ideas, learn more about each other’s work, and design a new generation of services. The meeting was productive, and I think in the end our two GN groups got integrated. We are moving all our code under one roof on GitHub now.

We had an interesting meeting with @jar398 from Open Tree of Life, trying to figure out how we can connect OToL with all the other resources on the web, and @deepreef suggested using his BioGuid project for these purposes. We moved BioGuid to GitHub and added all the GitHubbish bells and whistles to the project, such as a blog and gitter. I think it is pretty cool that we will have a downloadable csv file of all the IDs @deepreef collected, which can be used by all other projects in new exciting ways.

Another interesting conversation was with the Phylotastic project. We worked on an idea for an application that would convert pictures of scientific names, taken at museums or from pages of research papers, into text, extract the names that appear there, and build phylo-trees from these names using Open Tree. The app would also show pictures from the Encyclopedia of Life and pages from Wikipedia. Such an app would mash up the interfaces of Global Names to find and reconcile names, Open Tree to build trees, and EOL to get information about species.

BioGUID Wins Award!

As we announced previously, BioGUID.org has been incorporated into the Global Names suite of indexes and services. Within two days of this happening, we received some wonderful news: BioGUID.org won second place in the GBIF Ebbe Nielsen Challenge! We’re very excited about this recognition, and it reinforces our decision to incorporate BioGUID into the GNA system. You can follow continuing developments on the new BioGUID Blog.

BioGUID merges with GN

BioGUID.org, an indexing service that cross-links identifiers assigned to data objects in the biodiversity information universe, has now been incorporated into the Global Names suite of indexes and services. BioGUID.org represents the third major data component of Global Names (alongside GNI and GNUB), and replaces a less robust identifier linking service that had previously been included within GNUB. In addition to the crucial role of cross-linking identifiers within the general GN architecture itself, the broader function of BioGUID.org falls within the scope of Global Names in the sense that identifiers can be thought of as names, and names play the same functional role as identifiers.

We are currently in the process of porting BioGUID into the GN GitHub, and you can follow developments on the new BioGUID Blog.

GNUB proposal submitted to NSF

Bishop Museum, in partnership with the Catalogue of Life, iDigBio, GBIF, WoRMS, PLAZI, BHL, the International Congresses of Dipterology, and Pensoft Publishers, submitted a proposal to the U.S. National Science Foundation’s Advances in Biological Informatics (ABI) program to develop the Global Names Usage Bank. This proposed project will dramatically improve the core infrastructure behind GNUB in particular, and Global Names in general.

Global Names Gains Stable Funding

Moving Global Names to stable ground… On October 1st I signed a job offer from the Species File Group and am going to start the new position on November 16th. The position is supported by a fund created by David Eades (thank you, David) and allows us to think of grants not as a vehicle for GN’s survival, but as a means of further enhancing the system. I am honored and touched that David and the Species File Group created this amazing opportunity for Global Names. This move should also lead to a tight integration between the Catalogue of Life and Global Names, which in my view is a win/win situation for both projects and for biodiversity.

Our next steps are releasing the JVM-based scientific name parser, building a scalable, reliable and fast name resolution service, and integrating name resolution with the key Global Names project developed by Rich Pyle and Rob Whitton – the Global Names Usage Bank – for which Rich just submitted a grant proposal.

Crossmapping Names by IDs

Yesterday @hyanwong and @jhpoelen mentioned on the EOL and Global Names gitter chats that it would be great to be able to crossmap names from different sources by IDs.

What would be the use of such a crossmap? It would allow quickly mashing up data from various projects in interesting ways – for example, showing images of species from the EOL API on phylo-trees generated using the Open Tree of Life API.


The OpenTree Taxonomy has a mapping of OTT IDs to NCBI IDs. The Encyclopedia of Life also has a mapping of NCBI IDs to EOL IDs. So if someone wants to map NCBI names to EOL using the same algorithm EOL used, they only need to query data about IDs. Even better, it would create a very fast connection from one aggregator (Open Tree) to another aggregator (EOL) through the IDs of other sources, without doing explicit name resolution.

Such queries would be much faster, as they would just compare indexed columns in a table. However, the quality of the results of such an approach would depend on the quality of the name resolution used by the aggregators.

I am thinking about trying just that. As a pilot, we can generate Darwin Core Archive files from OTT and EOL containing information about IDs from other sources. Then we will need to add an API that makes it possible to run queries over this information.

Another good suggestion from @jhpoelen is to publish this kind of crossmap as a csv file that can easily be loaded into some kind of database and used separately on its own; a minimal sketch of producing such a file is below.
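
For illustration only, here is a minimal Ruby sketch of producing such a file, assuming two made-up inputs – ott_ncbi.csv with ott_id,ncbi_id columns, and eol_ncbi.csv with eol_id,ncbi_id columns; the file names and headers are hypothetical:

require "csv"

# Build a lookup from NCBI IDs to EOL IDs (eol_ncbi.csv is a made-up file).
ncbi_to_eol = {}
CSV.foreach("eol_ncbi.csv", headers: true) do |row|
  ncbi_to_eol[row["ncbi_id"]] = row["eol_id"]
end

# Join the two mappings on their shared NCBI IDs and write the crossmap.
CSV.open("ott_eol.csv", "w") do |out|
  out << %w[ott_id eol_id]
  CSV.foreach("ott_ncbi.csv", headers: true) do |row|
    eol_id = ncbi_to_eol[row["ncbi_id"]]
    out << [row["ott_id"], eol_id] if eol_id
  end
end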

Alexander Myltsev

Global Names is happy to welcome a new member – Alexander Myltsev (@alexander-myltsev on GitHub). Alexander is of parboiled2 fame. Parboiled2 is a Parsing Expression Grammar parser for Scala, and it originated from Alexander’s code, which he wrote as a Google Summer of Code participant in 2013.


Alexander lives in Moscow and currently works on a port of the biodiversity parser to Scala; the project is called gnparser. The new parser is compatible with Java, JRuby, Jython and everything else written for the Java virtual machine environment. When the parser is ready, it will be the basis of a new Scala-based collection of GN tools.

Alexander has been working with us for a few months now, but I was waiting to make the announcement until major paperwork hurdles were resolved.

Sysopia release v0.5.0

Last week was the end of the Google Summer of Code season. Of the two projects we mentored, one was not really about biology: it was a project for system administrators – a visualization tool for statistics about CPU usage, memory, disk space, etc.

Sysopia

Everybody who runs nontrivial biodiversity informatics projects knows how important it is to monitor your systems. There are several open source tools for that – Nagios, Sensu, Graphite, Systemd, Collectd…

Our monitoring system of choice is Sensu. It is a very flexible and powerful tool, well designed and suitable for a large number of tasks. One of these is collecting statistics from computers and storing them in almost any kind of database. As a result, Sensu can be used both for monitoring critical events and for collecting data about systems. The question, however, is how to visualize all the collected data.

We designed Sysopia to do exactly that. During the summer @vpowerrc expanded the original prototype and created a powerful and flexible visualization tool which gives a system administrator an understanding of what is happening with 2-20 computers at a glance, receives live updates, and compares today’s statistics with up to one year of data. We already use Sysopia in production, and we are going to deploy it for Global Names as soon as our new computers are in place.

You can read more about Sysopia on its help page.

Site globalnames.org merged with GN blog

For quite a while we had a Drupal-based site for Global Names. As we now have a Jekyll-based blog, it was logical to move our static site as well. And now it has happened – both are accessible via globalnames.org

This new site will continue to be the ‘official’ blog for news about GNA; we will publish information about new software releases here, as well as documents and discussions about scientific names.

One great thing about this move is that anybody with a GitHub account can participate – if you want to add a document or a blog item, just fork the repository, add a post to the _posts directory and send a pull request. At some point we will add detailed instructions on how to do that.

GN Parser v.3.4.1

New version 3.4.1 of the GlobalNames Parser gem biodiversity is out. It adds the ability to parse author names starting with d' – like Cirsium creticum d'Urv. subsp. creticum – which are now parsed correctly.

GN Parser v.3.4.0

New version 3.4.0 of the GlobalNames Parser gem biodiversity is out. It adds a new method that allows adding infraspecific ranks to canonical forms after the fact.

It was already possible to include ranks in canonical forms using the following code:

require "biodiversity"
parser = ScientificNameParser.new(canonical_with_rank: true)
parsed = parser.parse("Carex scirpoidea subsp. convoluta (Kük.)")
parsed[:scientificName][:canonical]
#output: Carex scirpoidea subsp. convoluta

Now it is also possible to add ranks to canonical forms after the fact, using the static method ScientificNameParser.add_rank_to_canonical:

require "biodiversity"
parser = ScientificNameParser.new
parsed = parser.parse("Carex scirpoidea subsp. convoluta (Kük.)")
parsed[:scientificName][:canonical]
#output: Carex scirpoidea convoluta
ScientificNameParser.add_rank_to_canonical(parsed)
parsed[:scientificName][:canonical]
#output: Carex scirpoidea subsp. convoluta

Catalogue of Life Meeting in Champaign/Urbana

There was a short meeting about future directions of the Catalogue of Life, organized at the Species File Group in Champaign/Urbana. Concerning Global Names it was a very productive meeting. It was great to understand the current state of the Catalogue of Life and to see that CoL is not losing momentum in spite of the financial problems of biodiversity informatics in general. There was definitely interest in creating more bridges between various projects.

Yuri Roskov presented a ‘pilot’ project of cooperation between the Encyclopedia of Life species pages group and CoL. Data about ~2000 species of scorpions had been harvested from an HTML-based site to be used in both projects. I think it was a great exercise, and I do hope it will be just the first example of such cooperation.

From the Global Names perspective there was good news too. I think it was everybody’s feeling that Global Names resolution is an important complementary service for the Catalogue of Life. Cooperation between various biodiversity projects was brought up again and again – organizing biodiversity infrastructure as a mix of several projects, where GBIF, EOL, CoL, GN, etc. work as modules of a bigger puzzle and complement and enhance each other.

One thing that was brought up is the lack of a nomenclatural component in GN. I talked about our plans to integrate GN Usage Bank and GN Resolver and to demonstrate the flow of nomenclatural data into the resolution/reconciliation process. We will try to make such a connection by November and demonstrate the workflow at the upcoming GBIF/CoL workshop.

Global Names Parser v3.2.1

A new version of the Scientific Name Parser is out.

This release introduces some backward-incompatible changes in the output.

Field “verbatim” is not preprocessed in any way

In previous versions we stripped spaces and newline characters around the name to generate the “verbatim” field. Now the name stays exactly the way it was entered into the parser.

Old behavior:

"Homo sapiens " -> …"verbatim": "Homo sapiens"

"Homo sapiens\r\n" -> …"verbatim": "Homo sapiens"

New behavior:

"Homo sapiens " -> …"verbatim": "Homo sapiens "

"Homo sapiens\r\n" -> …"verbatim": "Homo sapiens\r\n"

Global Names UUID v5 is added to the output as the “id” field

{
    "scientificName": {
        "id": "16f235a0-e4a3-529c-9b83-bd15fe722110",
        "parsed": true,
        "parser_version": "3.2.1",
        "verbatim": "Homo sapiens",
        "normalized": "Homo sapiens",
        "canonical": "Homo sapiens",
        "hybrid": false,
        "details": [{
            "genus": {
                "string": "Homo"
            },
            "species": {
                "string": "sapiens"
            }
        }],
        "parser_run": 1,
        "positions": {
            "0": ["genus", 4],
            "5": ["species", 12]
        }
    }
}

Read more about UUID v5 in another blog post

Names with underscores instead of spaces are supported

Such names are often used in representations of phylo-trees. The parser now substitutes underscores with spaces during the normalization phase; see the illustration below.
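
A minimal illustration of that normalization step (this is not the parser’s internal code):

# Minimal illustration only: underscores are replaced with spaces
# during normalization.
name = "Homo_sapiens"
puts name.tr("_", " ") # prints "Homo sapiens"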

Normalized canonical forms do not have apostrophes anymore

I am removing the behavior introduced in v3.1.10 which preserved apostrophes in the normalized version of names like “Arca m'coyi Tenison-Woods”. Apostrophes are not compliant with the nomenclatural codes.

New behavior:

{
    "scientificName": {
        "id": "b3a9b1a3-f73c-5333-8194-a84c6583d130",
        "parsed": true,
        "parser_version": "3.2.1",
        "verbatim": "Arca m'coyi Tenison-Woods",
        "normalized": "Arca mcoyi Tenison-Woods",
        "canonical": "Arca mcoyi",
        "hybrid": false,
        "details": [{
            "genus": {
                "string": "Arca"
            },
            "species": {
                "string": "mcoyi",
                "authorship": "Tenison-Woods",
                "basionymAuthorTeam": {
                    "authorTeam": "Tenison-Woods",
                    "author": ["Tenison-Woods"]
                }
            }
        }],
        "parser_run": 1,
        "positions": {
            "0": ["genus", 4],
            "5": ["species", 11],
            "12": ["author_word", 25]
        }
    }
}

New UUID v5 Generation Tool -- gn_uuid v0.5.0

We are releasing a new tool – gn_uuid – to simplify the creation of UUID version 5 identifiers for scientific name strings. UUID v5 has features which are particularly useful for the biodiversity community.

UUID version 5: Description

Universally unique identifiers are very popular because, for all practical purposes, they guarantee globally unique IDs without any negotiation between different entities. There are several ways UUIDs can be created, differing in how uniqueness is achieved:

  • version 1 – the computer’s MAC address and time
  • version 2 – like v1, plus information about the user and local domain
  • version 3 – an MD5 hash of a string in combination with a name space
  • version 4 – pseudo-random numbers
  • version 5 – a SHA1 hash of a string in combination with a name space

UUID v5 is generated from the information in a string, so everyone who uses this method will generate exactly the same ID from the same string. Interested parties do need to agree on a name space, but after that, no matter which programming language they use, they will be able to exchange data about a string using its identifier.

This gem already has the DNS domain “globalnames.org” defined as a name space, so generation of UUID v5 becomes simpler.
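
As a rough illustration of what happens under the hood, here is a minimal UUID v5 implementation in Ruby following RFC 4122, deriving the name space from the DNS name “globalnames.org” as described above (a sketch for illustration, not the gem’s actual code):

require "digest/sha1"

# Standard RFC 4122 name space UUID for DNS names.
DNS_NS = "6ba7b810-9dad-11d1-80b4-00c04fd430c8".freeze

# UUID v5: SHA1 of (name space bytes + name), with version and variant bits set.
def uuid_v5(namespace, name)
  ns_bytes = [namespace.delete("-")].pack("H*")
  bytes = Digest::SHA1.digest(ns_bytes + name).bytes.first(16)
  bytes[6] = (bytes[6] & 0x0f) | 0x50 # set version to 5
  bytes[8] = (bytes[8] & 0x3f) | 0x80 # set the RFC 4122 variant
  hex = bytes.map { |b| format("%02x", b) }.join
  [hex[0, 8], hex[8, 4], hex[12, 4], hex[16, 4], hex[20, 12]].join("-")
end

gn_namespace = uuid_v5(DNS_NS, "globalnames.org")
puts uuid_v5(gn_namespace, "Homo sapiens")
# expected: 16f235a0-e4a3-529c-9b83-bd15fe722110 – the ID quoted elsewhere in this post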

I believe UUID v5 creates very exciting opportunities for the biodiversity community. For example, if one expert annotates a string or attaches data to it, this information can be linked globally and then harvested by anybody, without any preliminary negotiation.

Quite often researchers argue that a scientific name is an identifier in its own right, and that there is no need for another level of indirection like a UUID. They are right that a scientific name string can be an identifier; however, scientific names have severe shortcomings in such a role.

Why Scientific Names are Bad Identifiers for Computers

Scientific name strings have different lengths

More often than not, identifiers end up in databases, used as a primary index to sort, connect and search data. Scientific name strings vary from 2 bytes to more than 500 bytes in length. So if they are used as database keys they are inefficient: they waste a lot of space and become less efficient for finding or sorting information, because index key size is usually determined by the largest key.

UUIDs always have the same, rather small size – 16 bytes. Even when UUIDs are used in their “standard” string representation, they are still reasonably small – 36 characters. Storing them in a database as a number is obviously more efficient.

It is hard to spot differences in name strings

It is very hard for the human eye to spot the difference between strings like these:

  • Corchoropsis tomentosa var. psilocarpa (Harms & Loes.) C.Y.Wu & Y.Tang

  • Corchoropsis tomentosa var. psilocanpa (Harms & Loes.) C.Y.Wu & Y.Tang

It is much easier with their corresponding UUIDs:

  • 5edecb2b-903f-54f1-a087-b47b3b021fcd

  • 833c664b-7d00-5c3b-97a4-98b0ab7d0a9a

Name strings come in different encodings.

Currently Latin1, UTF-8 and UTF-16 are the encodings most often used in biodiversity. If the authorship or the name itself has characters outside of the 128 characters of ASCII, identical-looking names will be quite different for computers.

Name strings are less stable because of their encoding

When names are moved from one database to another, or from one paper to another, sometimes they do not survive the trip. If you have spent any time looking at scientific names in electronic form, you have seen something like this:

  • Acacia ampliceps ? Acacia bivenosa

  • Absidia macrospora V�nov� 1968

  • Absidia sphaerosporangioides Man<acute>ka & Truszk., 1958

  • Cnemisus kaszabi Endr?di 1964

Usually names like these were submitted in the “wrong” encoding, and some characters in them were misinterpreted. A UUID, on the other hand, is just a hexadecimal number, which can move between various encodings more safely.

Name strings might look the same in print or on screen, but be different

  • Homo sapiens

  • Homo sаpiens

These two strings might look exactly the same on a screen or printed on paper, but in reality they are different. Here are their UUIDs:

  • 16f235a0-e4a3-529c-9b83-bd15fe722110

  • 093dc7f7-5915-56a5-87de-033e20310b14

The difference is that the second name has a Cyrillic а character, which in most cases will look exactly the same as the Latin a character. And when the names are printed on paper there is absolutely no way to tell the difference. The UUIDs tell us that these two name strings are not the same.

Nothing prevents the continued use of name strings for human interaction

One argument that people often give is that it is much easier for users to type

http://biosite.org/Parus_major

than

http://biosite.org/47d61c81-5a0f-5448-964a-34bbfb54ce8b

For most of us that is definitely true, and nothing prevents developers from creating links of the first type while still using UUIDs behind the scenes.

Why UUID v5 is better than any other UUID as a scientific name identifier

  • They can be generated independently by anybody and still be identical for the same name string

  • They use the SHA1 algorithm, which does not suffer from the (extremely rare) collisions found for the MD5 algorithm

  • The same ID can be generated in any popular language by following a well-defined algorithm

Crossmap Tool v0.1.5

A new version of the gn_crossmap tool is out.

The Global Names Crossmap tool allows matching names from a spreadsheet to names from any data source in the Global Names Resolver.

The main change in this version: the output file with crossmap data now contains all fields from the original input document, which allows filtering and sorting the data using any field from the input.

Other changes are:

  • @dimus - #5 - All original fields are now preserved in the output file.

  • @dimus - #3 - If an ingest has more than 10K rows, the user will see logging events

  • @dimus - #4 Bug - Add error messages if headers don’t have necessary fields

  • @dimus - #2 - Header fields are now allowed to have trailing spaces

  • @dimus - #7 Bug - Empty rank does not break crossmapping anymore

  • @dimus - #1 Bug - Add missing rest-client gem

iDigBio API Client v0.1.1

In a few weeks there will be an iDigBio API hackathon. As I mentioned earlier, we decided to add another API client, written in Ruby, before the hackathon starts. Greg Traub and I are releasing the iDigBio API Client written in Ruby today.

Greg started making a Ruby client at iDigBio. I took his code and refactored it into a Ruby gem. So now, instead of 0, we have 2 Ruby clients :smile:

This is the very first release, so if you start using it and find something wrong or missing, please submit an issue. The gem uses the beta API, so sometimes it might get ‘stuck’. This problem will go away when the beta API moves to production.

Scientific Name Parser v3.1.10

A new version of the Scientific Name Parser is out.

It addresses most of Issue #7.

Do not parse non-virus names containing RNA

If a name was not detected as a virus but contains the word RNA, it will no longer be parsed. This is a problem for some surrogate names, like Candida albicans RNA_CTR0-3, but they are very rare.

Name                                    Action
Candida albicans RNA_CTR0-3             Not parsed
Alpha proteobacterium RNA12             Not parsed
Ustilaginoidea virens RNA virus         Not parsed, marked as virus
Calathus (Lindrothius) KURNAKOV 1961    Parsed as before

Better detection of virus names

Names containing virophage, *NPV, *satellite, or *particle are marked as ‘viruses’ and not parsed:

Gossypium mustilinum symptomless alphasatellite
Okra leaf curl Mali alphasatellites-Cameroon
Bemisia betasatellite LW-2014
Tomato leaf curl Bangladesh betasatellites [India/Patna/Chilli/2008]
Intracisternal A-particles
Saccharomyces cerevisiae killer particle M1
Uranotaenia sapphirina NPV
Spodoptera exigua nuclear polyhedrosis virus SeMNPV
Spodoptera frugiperda MNPV
Rachiplusia ou MNPV (strain R1)
Orgyia pseudotsugata nuclear polyhedrosis virus OpMNPV
Mamestra configurata NPV-A
Helicoverpa armigera SNPV NNg1
Zamilon virophage
Sputnik virophage 3

Better handling of species/infraspecies epithets with apostrophe

Names like the ones below are now parsed correctly; their normalized/canonical forms preserve the apostrophe:

Junellia o'donelli Moldenke, 1946
Trophon d'orbignyi Carcelles, 1946
Arca m'coyi Tenison-Woods, 1878
Nucula m'andrewii Hanley, 1860
Eristalis l'herminierii Macquart
Odynerus o'neili Cameron
Serjania meridionalis Cambess. var. o'donelli F.A. Barkley

Global Names Grant Goes to Illinois

I did some soul-searching, advice-gathering, thinking, planning, and crystal ball gazing. And it seems that moving the Global Names grant and myself to the Species File Group is the right decision. Why? Because the Marine Biological Laboratory is a hard-core research institute, which depends completely on grants and as such is not well suited for infrastructure projects. Global Names is definitely an infrastructure project, and I know very well how bad it is to be responsible for an infrastructure project and not be able to work on it. It is just not a good way to do business.

It is my 8th year at MBL. I enjoy MBL, and I love living on Cape Cod. I love the immense energy of the MBL collective mind. I met really amazing people and amazing scientists here. I worked with great people on the Encyclopedia of Life project. And yet I was never sure if I would be there the next year, or sometimes the next month. I had weeks and months when I had no ability to move forward with Global Names, because it had no financial support at the time.

The Species File Group has long-term financing, allowing a long-term commitment. They are interested in Global Names; they want me to continue to develop it and integrate it with the Catalogue of Life. These are my goals too. David Eades understands that Global Names will need a long-term investment in hardware, and he provides a generous annual fund for that. It means no more 7-year-old computers running Global Names services. I also hope it will help to integrate the Global Names Usage Bank, a crucial GN component developed by Rich Pyle and Rob Whitton.

Another big factor is the ability to work closely with the programmers and taxonomists of the SFG group. At MBL I am now the only one on the EOL project (Jeremy is remote), and I feel I am getting stale without nomenclators/taxonomists around.

Of course we need to figure out how to move the current GN computers without shutting down services for a few weeks. I imagine I will have to rent an expensive cloud setup for a month or two and run GN from there while the machines are in transit. We will have to figure out how to transfer the grant, make a new hire for the project, etc. But all of these are good problems to solve. I believe GN suddenly has a brighter future ahead.

New tool to crossmap checklists

Yesterday I released a new command line tool for name resolution called gn_crossmap. It is designed for people who work with checklists of scientific names in spreadsheet software (MS Excel, Apple Numbers, OpenOffice, LibreOffice, Google Sheets, etc.) and want to compare their names with another reference source. The program takes a spreadsheet saved as a csv file as input and generates another csv-based spreadsheet with resolution data. Examples of input and output are included in the code. The README file describes how to use the project from the command line or as a Ruby library.

This program requires an internet connection and Ruby >= 2.1 installed on the machine.

Basic usage is:

$ gem install gn_crossmap
$ crossmap -i input.csv -o output.csv -d 1

where


short  long               description
-i     --input            the checklist spreadsheet saved as a csv file
-o     --output           path to the output file; default is output.csv in the current directory
-d     --data-source-id   ID of one of the GN Resolver data sources; the Catalogue of Life id (1) is the default

A web interface to this program is also in the works.

This project started at the Catalogue of Life workshop in Leiden, which happened in March 2015. The main focus of the hackathon was to figure out how to help national checklist teams create, maintain and compare the data in their checklists. We determined 3 main approaches:

  1. Crossmapping checklists against other checklists and/or reference sources
  2. Annotation of crossmapped data – ability to share metadata, report mistakes
  3. Distribution of species – how to fix occurrence errors for a country

A hackathon group which worked on crossmapping produced code that compares checklists against the Catalogue of Life. The gn_crossmap program I am releasing is based heavily on what we learned during the hackathon. The crossmapping code is mostly based on use cases from Rui Figueira and Wouter Koch. During the hackathon we also determined ways to further improve the quality of name resolution by:

  • Using the infraspecies’ rank (var., f., subsp., etc.) in the matching and penalizing the score if ranks are different
  • Taking into account whether matching authors are basionym or combination authors
  • Using meta-information attached to names via sensu…, not … etc. to distinguish name usages

Kickoff Meeting for Disseminating Phylogenetic Knowledge Project

Yesterday Arlin Stoltzfus organized a kickoff meeting for the project that got funded by NSF this year – “Collaborative Research: ABI Development: An open infrastructure to disseminate phylogenetic knowledge”. Global Names is participating in the project, and I believe it will be an interesting ride.

The idea behind it is pretty cool. Imagine that someone works on a group of organisms. They submit the names of the organisms to a service, and the service builds a phylogenetic tree out of the names. Once created, the tree starts its own life, similar to a repo on GitHub. People will be able to reuse it, annotate it for their own purposes, and create derivative trees. It would be a pretty nice feature for the Encyclopedia of Life to show how species belonging to a particular clade are related to each other through phylogeny. One problem with the creation of such trees is name normalization. Scientific names can have many alternative spellings, so to find phylo-information we need to be able to map names from a user’s list to names recognized by the service.

I suspect that the crossmapping tool I am working on this week might be adjusted for this particular task, but as usual the devil is in the details, and we will find out the requirements during the design process.

New Higher Level Classification for Catalogue of Life

Bob Corrigan sent around an email pointing at a paper in PLOS which describes the new classification adopted by the Catalogue of Life. After looking through the paper, my understanding is that it is a step forward and, at the same time, business as usual for CoL.

The Catalogue of Life needs a solid managerial classification for its data, and according to the article the goal is achieved:

Our goal, therefore, is to provide a hierarchical classification for the CoL and its contributors that (a) is ranked to encompass ordinal-level taxa to facilitate a seamless import of contributing databases; (b) serves the needs of the diverse public-domain user community, most of whom are familiar with the Linnaean conceptual system of ordering taxon relationships; and (c) is likely to be more or less stable for the next five years. Such a modern comprehensive hierarchy did not previously exist at this level of specificity.

Classifications are a dirty business, so as usual –

These actual complexities of phylogenetic history emphasize that classification is a practical human enterprise where compromises must be made

Altogether it looks like CoL is getting a new hierarchical face.

Starting GSOC 2015 Sysopia Project

Our Google Summer of Code student Viduranga Wijesooriya and I had our first meeting today to start the Google Summer of Code project – Sysopia. The purpose of the project is not names; however, I do consider it important for EOL and for GN, as it allows us to spend less time on the administration of computers and more time on writing code.

The idea behind the project is to create a dashboard that allows us to see what is going on with all the computers in a system at a glance. The system shows several metrics graphs, each of which shows information about all machines at the same time. By default it shows data for 24 hours, so if everything works well, it is enough for a sysadmin to check Sysopia once a day to have a very good idea of what is happening with the system from the moment Sysopia is installed. We installed it for EOL, and I find it very useful.


Not much functionality is there yet, but the graphs display well, it is possible to get point data by hovering over a line, and a particular machine can be highlighted by hovering over the machine name in the dialog box.

Currently the only backend for Sysopia is Sensu, but we are going to expand it to other backends after we nail down the user interface.

Species File Group -- new home for GN?

Trying to find a permanent home for GN, I travelled to Champaign-Urbana on Thursday and Friday to visit the Species File Group at the University of Illinois. I know this group rather well, as Lisa Walley and I went there more than a year ago for a hackathon organized by Matt Yoder.

I am quite impressed with the work this group does, and when Matt suggested that I join them, my first thought was: this might be a way to make the Global Names financial situation more reliable!

Currently Global Names depends completely on grants, and grants come and go. It is a pretty bad way to finance an infrastructure project: you do not want roads or electricity to depend on unstable funding. We always want to be able to drive to a store or a concert, and we always want to have lights in our homes. The same goes for projects like GN. If people start using them, they start to depend on them, and it is a really bad situation when funding dries up and service deteriorates as a result.

The visit went well. I had a great opportunity to talk to David Eades, Matt, Dmitry Dmitriev, and Yuri Roskov. It was especially great to talk to Yuri, as he is the main person behind the Catalogue of Life content. I consider the Catalogue of Life to be one of the most important use cases and partners for GN, and talking to Yuri for an extended amount of time was extremely helpful.

Originally the position was about helping Yuri automate his workflow; however, when Matt and I talked on Skype, the emphasis started to shift towards supporting Global Names. David Eades, Matt, and Yuri all believe that GN is a missing link in the Catalogue of Life functionality, and as such, working closely with Yuri and figuring out what GN can do for the Catalogue of Life actually does help to automate some of the hard parts of Yuri’s work.

The meeting was very encouraging and inspiring. Now I have to think hard and make a decision. I would love to be able to keep my house on the Cape, and I would love to be able to come in summer and work with MBL and RISD. It seems nothing prevents me from spending 3 months on the Cape in summer if I move. On Monday I am going to talk about my trip at my work at MBL.

Good names, bad names?

I just had a great conversation with Jorrit Poelen from GloBI. Jorrit uses the GN Resolver to clean up names for GloBI, and on top of knowing that a name exists in the GN Resolver, he also needs to know whether the name is actually valid.

The Resolver by design contains ‘good’ and ‘bad’ names. We do need to know what kinds of misspellings exist in the wild, and to map information associated with them to good names. These misspellings and outright wrong names make Jorrit’s life much harder, as we do not have a tool that clearly marks ‘good’ names as good. There are ways to be more or less sure that a name is good:

  • If the names belong to a specific clade, use only highly curated sources
  • Count how many sources know about a name
  • See if a name appears in at least one curated source
  • Check if the name got parsed

But none of these approaches is universal, and none gives a clear answer. So what would be a solution?

It seems that a good solution would be to write a classifier which takes into account all the relevant features and meta-features of a name, considers them, and then puts the name into the ‘good’ or ‘bad’ bucket. Every name has several features associated with it, and we can train a Bayes classifier to decide whether a name is ‘good’ or ‘bad’ using these features. When it is done running through our ~20 million names, each of them will be marked as trusted or not.
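
To make the idea concrete, here is a toy naive Bayes classifier in Ruby; the feature names and training data are invented for illustration, and the real classifier would be trained on features like the ones listed above:

# Toy naive Bayes classifier; features and training data are made up.
class NameClassifier
  def initialize
    @feature_counts = { good: Hash.new(0), bad: Hash.new(0) }
    @label_counts = Hash.new(0)
  end

  def train(label, features)
    @label_counts[label] += 1
    features.each { |f| @feature_counts[label][f] += 1 }
  end

  # Log-space naive Bayes score with add-one smoothing.
  def score(label, features)
    total = @label_counts.values.reduce(0, :+).to_f
    prior = Math.log(@label_counts[label] / total)
    features.reduce(prior) do |sum, f|
      sum + Math.log((@feature_counts[label][f] + 1.0) / (@label_counts[label] + 2.0))
    end
  end

  def good?(features)
    score(:good, features) > score(:bad, features)
  end
end

classifier = NameClassifier.new
classifier.train(:good, [:parsed, :in_curated_source, :in_many_sources])
classifier.train(:good, [:parsed, :in_many_sources])
classifier.train(:bad,  [:not_parsed, :single_source])
puts classifier.good?([:parsed, :in_many_sources]) # prints "true"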

I am pretty sure that such a classifier, especially in its first iteration, will make mistakes. How can we deal with them? Here is an idea: when the API returns data to a user, the data will have two new fields – ‘trusted’ as yes/no, and a URL to complain about the decision, something like:

http://resolver.globalnames.org/trusted?name_id=123&wrong_value=1

People can just copy and paste this URL into a browser, or set it as a “Report a mistake” button for every name in their results HTML. If this button is pushed, the GN Resolver will register a human curation event, and data from this event will be used to improve the performance of the classifier algorithm. Human curations will trump the computer algorithm, and they can be collected in a new data source for feedback…

Details of the interface can be decided later, when we build the classifier. I know that the problem of separating trusted names from untrusted ones is something almost everybody who actively uses the Resolver has asked me about at one time or another. So who can build it, and when? I am now thinking that our Google Summer of Code student might be interested in making it happen, instead of improving NetiNeti. I personally think automatic curation of names is more important.

Jorrit submitted an issue about this idea: GloBI issue.

iDigBio hackathon preparation

On June 3rd I am going to the iDigBio hackathon, a meeting about finding ways to enhance their API. Today there was a pre-hackathon meeting where the iDigBio folks explained how they implemented their API and its backend, and how they use their own API for their GUIs.

I was very impressed with what they have done. The backend is based on Elasticsearch; the API is RESTful and JSON-based. What was a surprise for me: the API calls often take pure JSON as arguments. It was also great to see how they simplified Elasticsearch queries for the API, keeping API queries simple and powerful at the same time.

They have also made Python and R clients for the API, so I will try to make a Ruby version of the API client before the hackathon.

Launching NetiNeti Google Summer of Code Project

Today we had our first meeting to start the NetiNeti enhancement project funded by Google Summer of Code. The student selected to do the job is Wencan Luo, a 4th-year graduate student from the University of Philadelphia.

The purpose of the project is to improve the performance of our NLP-based scientific name finding tool – NetiNeti, developed 5 years ago by Lakshmi Manohar Acella. Let’s see how it goes…

Official coding starts on May 27th; for now we are going through a design phase – figuring out who the users of the application are, then trying to produce an idealized design of its features, finding implementation paths and existing limitations, and then Wencan will explore the features.

For the process we are going to use ZenHub to manage issues, obviously GitHub for the code, and a project-related blog built with GitHub and Jekyll.

Google Summer of Code 2015 starts

Today is the official start of Google Summer of Code work. Up to now: organizations submitted their ideas, organizations were chosen by Google, students decided which ideas and organizations they liked and submitted their proposals, Google decided how many projects from each organization they were willing to fund, and finally the best student proposals were matched to the funded ideas. Encyclopedia of Life and Global Names submitted 4 proposals and got 3 of them funded, so congratulations to us and to the students :)

Student: Avinash Daiict
Mentor: Amr Morad

DevOps Dashboard Sysopia

Student: Viduranga Wijesooriya
Mentor: dimus

Finding Scientific Names

Student: Wencan Luo
Mentor: dimus

I was very happy to see how many people were interested in the idea of finding scientific names in texts – we had 12 proposals, so competition was fierce this year! I think we got great students, and I am looking forward to Google Summer of Code 2015.

Gracious gift from Encyclopedia of Life

The Encyclopedia of Life folks made a truly glorious present to Global Names: several Dell 710 servers which were used for running EOL at Harvard. Now that the site has moved to the Smithsonian, EOL is donating some of these computers to GN, and 10 of them are already at the Marine Biological Laboratory, waiting to be plugged into the internet and electricity. Another truly amazing gift from EOL is fourteen 100 GB hard drives, which will run the GN databases.

I feel happy, warm and fuzzy – thank you, EOL! I hope that with this new hardware I will be able to increase GN capacity about 5x using the current code! Next step: installing Chef, Docker, and the GN applications to serve the biodiversity community.

GNA work continues

Last year we got a second round of funding for Global Names Architecture development. Our first grant was about exploring how to find scientific names in texts, how to crossmap different spelling variants of the same names to each other, how to connect names to literature collected at the Biodiversity Heritage Library, how to organize scientific name usages, and how to register new zoological scientific names electronically. Several interesting projects spun out of this effort, and you can read about them at the Global Names site.

It was a hard year for the Encyclopedia of Life, where I work, and for Global Names. I had to spend most of the first 8 months after our second NSF grant was funded helping EOL with system administration and transferring the EOL site from Harvard to the Smithsonian Museum of Natural History. That is done now, and I am happy to be able to work on the Global Names project again!

What kind of resources do we have now? 2 months of Paddy’s (David Patterson’s) time, 2 months of Rich Pyle’s time, about 1.5 years of mine, and 1 year of another developer. We also got 2 excellent participants for Google Summer of Code this year, so add 6 months of their time as well. And the quest for further funding continues as I write.

The Encyclopedia of Life kindly donated a lot of hardware, and the Marine Biological Laboratory provides us with a whole rack of space and a fast internet connection. So we are set for an exciting year ahead!

What are the plans? This grant covers work on name finding and name resolution. We will try to find major use cases (Arctos, EOL, iDigBio, Catalogue of Life, GBIF) and satisfy their needs. We expect this will cover the needs of 90% of other users, and the remaining 10% of functionality will trickle in through going over GitHub issues, fixing bugs, adding features, and thinking about new ideas.