On May 20th all our servers will be down because of power maintenance at our computer center. Sorry for the inconvenience! We will be back as soon as possible.
Global Names developers Dmitry Mozzherin and Richard Pyle were invited to attend a workshop called “Names in November”, organized by the Catalog of Life and GBIF and hosted in Leiden. The three-day meeting involved more than twenty people from key taxonomic and nomenclatural organizations, and focused on discussing ways that a global information system of taxonomy, including both names data and accepted species information, could be designed to more seamlessly interconnect biodiversity data through organism names. Although the theme of the meeting was certainly not new (many participants in this meeting had attended similar meetings going back decades discussing essentially the same idea), the tone of the discussion was refreshing in that it focused comparatively little on politics and technical details, and instead concentrated on identifying whether such a shared taxonomic infrastructure was even possible (given the political, financial, and technical circumstances currently existing within the main likely partners), and what conditions would need to be met.
Many of the points that participants agreed on in terms of needs and services very closely matched the fundamental goals and infrastructure we have developed (and continue to develop) within the context of Global Names. Now that GN is much more closely coordinating with the Catalog of Life, GN data indexes and services will likely play an important role in implementing the shared global taxonomy resource envisioned during the meeting. Following this meeting, we have a renewed sense of focus within GN development to finish harmonizing integration of GNI and GNUB services, and especially to rapidly increase the effort to bulk-populate GNUB from existing data.
add year’s range to AST node that is encoded with
parse names ending with a hybrid sign (#88)
support hybrid abbreviation expansions (#310)
support raw hybrid formula parsing (#311)
continuous build is moved to CircleCI
and many structural changes, bug fixes and quality improvements. They are described in the release documentation.
To avoid confusion – gnparser is a new project, different from the formerly released biodiversity parser.
Speed improvements. The parser is about 50% faster than the already quite fast 0.2.0 version. We were able to parse 30 million names per CPU per hour with this release.
Compatibility with Scala 2.10.6: it was important for us to make the parser backward compatible with this older version of Scala, because we wanted to support the Spark project.
Compatibility with Spark v. 1.6.1. The parser can now be used in BigData projects running on Spark, massively parallelizing the parsing process on the Spark platform. We added documentation describing how to use the parser with either Scala or Python natively on Spark.
Simplified parsing output in addition to the “Standard output”: it analyzes a name-string and returns its id, canonical form, canonical form with infraspecific ranks, authorship, and year.
Improved and stabilized JSON fields. You can find a complete description of the parser’s JSON output in its JSON schema. We based the names of the fields on TDWG’s Taxon Concept Schema, and we intend to keep the JSON format stable from now on.
There were many structural changes, bug fixes and quality improvements. They are described in the release documentation.
uBio and Nomenclator Zoologicus online experienced difficulties this year and have been down a lot lately, mostly because there is no longer a system administrator at the Marine Biological Laboratory to look after them.
I moved uBio from old hardware at Marine Biological Laboratory to Google Container Engine, and it is running again. Some functionality is not back yet, mostly due to some hard-coded configuration parameters in files. I hope problems with the code will be fixed eventually by interested parties (I do not plan to rewrite the code). I’ll coordinate my efforts with Dave Remsen and Patrick Leary, and hopefully together we will preserve uBio for the community.
Please note that being a system administrator for uBio is not part of my job. I like the project, I consider it a ‘precursor’ of GN, and I will try my best to keep it running in my spare time. The MBL/WHOI library pays for the cloud.
Docker containers to run uBio are located on Docker Hub. We use Docker and Kubernetes at Google Container Engine to keep it alive.
(Please note that gnparser is a new project, different from the formerly released biodiversity parser.)
We are happy to announce a public release of Global Names Parser
gnparser – the first project that marks transition of Global Names
reconciliation and resolution services from “prototype” to “production”. The
gnparser project is developed by @alexander-myltsev and @dimus in Scala
language. GNParser can be used as a library, a command line tool, a socket
server, a web-program and RESTful-service. It is easiest to try it at
Scientific names might be expressed by quite different name strings. Sometimes the difference is just one comma, sometimes authors are included or excluded, sometimes ranks are omitted. With all this variability “in the wild” we need to figure out how to group all these different spelling variants. Name parsing is an unexpectedly complex and absolutely necessary step for connecting biological information via scientific names.
In 2008 Global Names released the biodiversity gem – a scientific name parser written in Ruby for these purposes. The library in its 3 variants enjoyed significant success – about 150,000 downloads and a reputation as the most popular bio-library for the Ruby language. It could parse about 2-3 million names an hour, and it has been the basis of name reconciliation for many projects since its publication.
GNParser is a direct descendant of the biodiversity gem. It serves the same
purpose, and the input/output formats of both projects are similar. It also
marks the eventual discontinuation of the biodiversity gem project and the
migration of all Global Names code to the new
Why did we go through the pain of making a completely new parser from scratch? The short answer is scalability and portability. We want to remove the parsing step as a bottleneck for any number of names thrown at the resolution services. For example, finding all names in the Biodiversity Heritage Library took us 43 days three years ago, and the parsing step alone took more than a day. If we want to improve the algorithms for finding names in BHL, we cannot wait 40 days. We want to be able to do it within one day and improve the whole BHL index every time our algorithms are enhanced significantly.
We have an ambitious goal: the time spent sending names to resolution services over the internet, and the time spent transferring the answers back, should be the only bottlenecks of our name-matching services. For such speeds we need very fast parsing. Scala allows us to dramatically improve the speed and scalability of the parsing step.
Having a parser running in the Java Virtual Machine environment allows us to give the biodiversity community a much more portable parsing tool. Out of the box the parser library will work directly with Scala, Java, R, Jython and JRuby. We hope that it will speed up and simplify many biodiversity projects.
This is the first public release of the library. Please download it, run it, test it, and give us your feedback, so we can improve it further. Happy parsing!
WARNING: the JSON output format might change slightly or dramatically, as we are in the process of refining it. The JSON format should be finalized for version 0.3.0.
I decided to figure out how I can write scientific papers in a truly Open Source fashion. Here are the practical decisions that allowed me to do it:
To use the full power of a revision control system, a project should be mostly in a text format of some sort. We currently keep practically all our code on GitHub, so Git was a natural choice.
I decided to go with LaTeX, as it is a tried and powerful markup language, very well suited for scientific writing. It lets us work with plain text, so we can easily keep revisions in Git.
Vim is my editor of choice, but nothing prevents me or my co-authors from using any other modern text editor for LaTeX.
With LaTeX and Git it is easy to provide early access to work in progress, especially with two commercial products that give free access to open projects – GitHub and Overleaf. Overleaf supports Git, although not as well as GitHub does, so currently it is better to have GitHub as the main repository and keep Overleaf as a glorified viewer, using it only as a secondary repository. Another useful tool is Mendeley, for finding and organizing bibliography.
I am still learning the ropes, but I am excited about the progress. The paper about the Global Names Parser is now on GitHub and Overleaf! Overleaf allows anybody interested to see the paper in a user-friendly PDF format. It also simplifies submission of papers to a large variety of open access journals.
I created a post on my personal blog describing how I set up my system with LaTeX, Vim and tmux.
Does a culture exist out there which considers the process of writing scientific papers to be akin to writing open source code?
For the last 8 years I have been blessed to be paid for doing open source development. It means that for that long, almost everything I do has been almost instantly available publicly. This model fits my way of thinking and my values, and I see an advantage in making everything I do available for the public to see, comment on, and enhance.
Now I am writing a paper, and I feel thrown into the “dark ages”. The whole paper writing and publishing culture was one of the reasons I left molecular biology and went into programming. I assume the following is usually true when people write a scientific paper –
Obviously there is progress with the last point, but what about the other two?
Last week (October 5-9, 2015) @deepreef, @dimus and @alexander-myltsev had a workshop in Honolulu at the Bishop Museum to sync ideas, learn more about each other’s work, and design a new generation of services. The meeting was productive, and I think in the end our two GN groups got integrated. We are now moving all our code under one roof on GitHub.
We had an interesting meeting with @jar398 from Open Tree of Life, trying to figure out how we can connect OTOL with all the other resources on the web, and @deepreef suggested using his BioGUID project for these purposes. We moved BioGuid to GitHub and added all the GitHubbish bells and whistles to the project, like a blog and [gitter][bioguid-gitter]. I think it is pretty cool that we will have a downloadable csv file of all the IDs @deepreef collected, which can be used by all other projects in new exciting ways.
Another interesting conversation was with the Phylotastic project. We worked on an idea for an application that would convert pictures of scientific names, taken at museums or from pages of research papers, into text, extract the names that appear there, and build phylo-trees from those names using Open Tree. The app would also show pictures from the Encyclopedia of Life and pages from Wikipedia. Such an app would mash up the interfaces of Global Names to find and reconcile names, Open Tree to build trees, and EOL to get information about species.
As we announced previously, BioGUID.org has been incorporated into the Global Names suite of indexes and services. Within two days of this happening, we received some wonderful news: BioGUID.org won second place in the GBIF Ebbe Nielsen Challenge! We’re very excited about this recognition, and it reinforces our decision to incorporate BioGUID into the GNA system. You can follow continuing developments on the new BioGUID Blog.
BioGUID.org, an indexing service that cross-links identifiers assigned to data objects in the biodiversity information universe, has now been incorporated into the Global Names suite of indexes and services. BioGUID.org represents the third major data component of Global Names (alongside GNI and GNUB), and replaces a less robust identifier-linking service that had previously been included within GNUB. In addition to the crucial role of cross-linking identifiers within the general GN architecture itself, the broader function of BioGUID.org falls within the scope of Global Names in the sense that identifiers can be thought of as names, and names play the same functional role as identifiers.
Bishop Museum, in partnership with the Catalogue of Life, iDigBio, GBIF, WoRMS, PLAZI, BHL, the International Congresses of Dipterology, and Pensoft Publishers, submitted a proposal to the U.S. National Science Foundation’s Advances in Biological Informatics (ABI) program, to develop the Global Names Usage Bank. This proposed project will dramatically improve the core infrastructure behind GNUB in particular, and Global Names in general.
Moving Global Names to stable ground… On October 1st I signed a job offer from the Species File Group, and I am going to start at the new position on November 16th. The position is supported by a fund created by David Eades (thank you, David) and allows us to think of grants not as a vehicle for GN’s survival, but as a means of further enhancing the system. I am honored and touched that David and the Species File Group created this amazing opportunity for Global Names. This move should also lead to tight integration between the Catalogue of Life and Global Names, which in my view is a win/win for both projects and for biodiversity.
Our next steps are releasing a JVM-based scientific name parser, building a scalable, reliable and fast name resolution service, and integrating name resolution with the key Global Names project developed by Rich Pyle and Rob Whitton – the Global Names Usage Bank, for which Rich just submitted a grant proposal.
What would be the use of such a crossmap? It would allow quickly mashing up data from various projects in interesting ways – for example, showing images of species from the EOL API on phylo-trees generated using the Open Tree of Life API.
OpenTree Taxonomy has a mapping of OTT IDs to NCBI IDs. Encyclopedia of Life also has a mapping of NCBI IDs to EOL IDs. So if someone wants to map NCBI names to EOL using the same algorithm EOL used, they only need to query data about IDs. Even better, it would create a very fast connection from one aggregator (Open Tree) to another aggregator (EOL) through IDs of other sources, without doing explicit name resolution.
Such queries would be much faster, as they would just compare indexed columns in a table. However, the quality of the results from such an approach would depend on the quality of name resolution used by the aggregators.
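As a sketch of this join-on-IDs idea, the crossmap is just a hash join on the shared source IDs. All IDs below are invented for illustration; real data would come from the aggregators’ dumps:

```ruby
# Crossmapping two aggregators through a shared source's IDs,
# without any name resolution. All IDs here are invented examples.
ott_to_ncbi = { "ott:101" => "ncbi:9606", "ott:202" => "ncbi:9598" }
ncbi_to_eol = { "ncbi:9606" => "eol:327955", "ncbi:9598" => "eol:326449" }

# Compose the two mappings: OTT id -> NCBI id -> EOL id.
ott_to_eol = ott_to_ncbi.each_with_object({}) do |(ott, ncbi), map|
  eol = ncbi_to_eol[ncbi]
  map[ott] = eol if eol
end
# ott_to_eol now connects Open Tree IDs directly to EOL IDs
```

The same composition works in SQL as a join of two ID tables, which is why such queries reduce to comparisons of indexed columns.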
I am thinking about trying just that. As a pilot, we can generate Darwin Core Archive files from OTT and EOL which would contain information about IDs from other sources. Then we will need to add an API that makes it possible to run queries on this information.
Another good suggestion from @jhpoelen is to publish the data for this kind of crossmap as a csv file, which can easily be put into some kind of database and used separately on its own.
Global Names is happy to welcome a new member – Alexander Myltsev (@alexander-myltsev on GitHub). Alexander is of parboiled2 fame. Parboiled2 is a Parsing Expression Grammar parser for Scala, and it originated from Alex’s code, which he wrote as a Google Summer of Code participant in 2013.
Alexander lives in Moscow and currently works on a port of the biodiversity parser to Scala; the project is called gnparser. The new parser is compatible with Java, JRuby, Jython and everything else written for the Java virtual machine environment. When the parser is ready, it will be the basis of a new Scala-based collection of GN tools.
Alexander has been working with us for a few months now, but I had been holding off on the announcement until major paperwork hurdles were resolved.
Last week was the end of the Google Summer of Code season. Of the two projects we mentored, one was not really about biology. It was a project for system administrators: a visualization tool for statistics about CPU usage, memory, disk space, etc.
Everybody who runs complex biodiversity informatics projects knows how important it is to monitor your systems. There are several open source tools for that – Nagios, Sensu, Graphite, Systemd, Collectd…
Our monitoring system of choice is Sensu. It is a very flexible and powerful tool, well designed and suitable for a large number of tasks. One of these is collecting statistics from computers and storing them in almost any kind of database. As a result, Sensu can be used both for monitoring critical events and for collecting data about systems. The question, however, is how to visualize all the collected data.
We designed Sysopia to do exactly that. During the summer @vpowerrc expanded the original prototype and created a powerful and flexible visualization tool capable of giving a system administrator an understanding of what is happening with 2-20 computers at a glance, receiving live updates, and comparing today’s statistics with up to one year of data. We already use Sysopia in production, and we are going to deploy it for Global Names as soon as our new computers are in place.
You can read more about Sysopia on its help page.
For quite a while we had a Drupal-based site for Global Names. As we now have a Jekyll-based blog, it was logical to move our static site as well. And now it has happened – both of them are accessible via globalnames.org
This new site will continue to be the ‘official’ blog for news about GNA; we will publish information about new releases of software, documents, and discussions about scientific names here.
One great thing about this move is that anybody with a GitHub account can participate – if you want to add a document or a blog post, just fork the repository, add a post to the _posts directory, and send a pull request. At some point we will add detailed instructions on how to do that.
New version 3.4.1 of the GlobalNames Parser gem biodiversity is out.
It adds the ability to parse author names starting with
Cirsium creticum d'Urv. subsp. creticum
which is now parsed correctly
New version 3.4.0 of the GlobalNames Parser gem biodiversity is out. It adds a new method that allows adding infraspecific ranks to canonical forms after the fact.
It was previously possible to include ranks in canonical forms using the following code:
Now it is also possible to add ranks to canonical forms after the fact using
New version 3.3.0 of the GlobalNames Parser gem biodiversity is out. It adds a new
option to the socket server:
--host to change the default 127.0.0.1
host setting. To see the full set of options run
There was a short meeting about Catalogue of Life future directions organized at the Species File Group in Champaign/Urbana. Concerning Global Names, it was a very productive meeting. It was great to understand the current state of the Catalogue of Life, and to see that CoL is not losing momentum in spite of the financial problems of biodiversity informatics in general. There was definite interest in creating more bridges between various projects.
Yuri Roskov presented a ‘pilot’ project of cooperation between the Encyclopedia of Life species pages group and CoL. Data about ~2000 species of scorpions had been harvested from an html-based site to be used in both projects. I think it was a great exercise, and I hope it will be just the first example of such cooperation.
From the point of view of Global Names there was good news too. I think it was everybody’s feeling that Global Names resolution is an important complementary service for the Catalogue of Life. Cooperation between various biodiversity projects was brought up again and again – organizing biodiversity infrastructure as a mix of several projects where GBIF, EOL, CoL, GN, etc. work as modules of a bigger puzzle, complementing and enhancing each other.
One thing that was brought up is the lack of a nomenclatural component in GN. I talked about our plans to integrate the GN Usage Bank and the GN Resolver and to demonstrate the flow of nomenclatural data into the resolution/reconciliation process. We will try to make such a connection by November and demonstrate the workflow at the upcoming GBIF/CoL workshop.
This release has some backward compatibility issues with output.
In previous versions we stripped empty spaces and newline characters around the name to generate the “verbatim” field. Now the name stays the way it was entered into the parser.
Before:

“Homo sapiens “ -> …“verbatim”: “Homo sapiens”

“Homo sapiens\r\n” -> …“verbatim”: “Homo sapiens”

After:

“Homo sapiens “ -> …“verbatim”: “Homo sapiens “

“Homo sapiens\r\n” -> …“verbatim”: “Homo sapiens\r\n”
Read more about UUID v5 in another blog post
Such names are often used in representations of phylo-trees. The parser now substitutes underscores with spaces during the normalization phase
I am removing behavior introduced in v3.1.10 which preserved apostrophes in the normalized version of names like “Arca m’coyi Tenison-Woods”. Apostrophes are not code compliant.
We are releasing a new tool – gn_uuid – to simplify creation of UUID version 5 identifiers for scientific name strings. UUID v5 has features which are particularly useful for the biodiversity community.
Universally unique identifiers are very popular because, for all practical purposes, they guarantee globally unique IDs without any negotiation between different entities. There are several ways UUIDs can be created:
| UUID version | Uniqueness is achieved by |
| --- | --- |
| version 1 | Using computer’s MAC address and time |
| version 2 | Like v1, plus adding info about user and local domain |
| version 3 | Using MD5 hash of a string in combination with a name space |
| version 4 | Using pseudo-random algorithms |
| version 5 | Using SHA1 hash of a string in combination with a name space |
UUID v5 is generated using information from a string, so everyone who uses this method will generate exactly the same ID from the same string. Interested parties do need to agree on the generation of a name space, but after that, no matter which programming language they use, they will be able to exchange data about a string using its identifier.
The gem already has the DNS domain “globalnames.org” defined as a name space, so generation of UUID v5 becomes simpler.
I believe UUID v5 creates very exciting opportunities for biodiversity community. For example if one expert annotates a string or attaches data to it – this information can be linked globally and then harvested by anybody, without any preliminary negotiation.
Quite often researchers make the argument that a scientific name is an identifier on its own, and there is no need for another level of indirection like a UUID. They are right that a scientific name string can be an identifier; however, scientific names have severe shortcomings in such a role.
More often than not, identifiers end up in databases, where they are used as a primary index to sort, connect and search data. Scientific name strings vary from 2 bytes to more than 500 bytes in length. If used as database keys they are inefficient: they waste a lot of space and become less efficient for finding or sorting information, because an index’s key size is usually determined by the largest key.
UUIDs always have the same, rather small size: 16 bytes. Even when UUIDs are used in their “standard” string representation, they are still reasonably small – 36 characters. Storing them in a database as a number is obviously more efficient.
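As a quick sketch, converting between the 36-character string form and the raw 16-byte form is a simple hex pack/unpack. The UUID below is the RFC 4122 DNS name space, used here just as an example value:

```ruby
uuid = "6ba7b810-9dad-11d1-80b4-00c04fd430c8"

# 16 raw bytes, suitable for a compact binary column in a database.
raw = [uuid.delete('-')].pack('H*')

# And back: 32 hex characters with dashes re-inserted at the standard positions.
back = raw.unpack1('H*')
  .insert(8, '-').insert(13, '-').insert(18, '-').insert(23, '-')
```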
It is very hard for the human eye to spot the difference between strings like these:
Corchoropsis tomentosa var. psilocarpa (Harms & Loes.) C.Y.Wu & Y.Tang
Corchoropsis tomentosa var. psilocanpa (Harms & Loes.) C.Y.Wu & Y.Tang
It is much easier with their corresponding UUIDs
Currently Latin1, UTF-8 and UTF-16 are the most popular encodings used in biodiversity. If the authorship or the name itself has characters outside of the 128 characters of ASCII, identically looking names will be quite different for computers.
When names are moved from one database to another, or from one paper to another, sometimes they do not survive the trip. If you have spent any time looking at scientific names in electronic form, you have seen something like this:
Acacia ampliceps ? Acacia bivenosa
Absidia macrospora V�nov� 1968
Absidia sphaerosporangioides Man<acute>ka & Truszk., 1958
Cnemisus kaszabi Endr?di 1964
Usually names like these were submitted in a “wrong” encoding, and some characters in them were misinterpreted. A UUID, on the other hand, is just a hexadecimal number, which can be moved between various encodings more safely.
These two strings might look exactly the same on a screen or printed on paper, but in reality they are different. Here are their UUIDs:
The difference is that the second name has a Cyrillic а character, which in most cases will look exactly the same as the Latin a character. And when the names are printed on paper there is absolutely no way to tell the difference.
A UUID will tell us that these two name strings are not the same.
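The effect is easy to demonstrate: two strings that render identically can differ at the codepoint level. The name below is an arbitrary example, with `\u0430` being the Cyrillic а:

```ruby
latin = "Rosa canina"      # all Latin letters
mixed = "Ros\u0430 canina" # looks the same, but the 4th letter is Cyrillic а (U+0430)

latin.length == mixed.length         # same number of characters on screen
latin == mixed                       # false: different codepoints
latin.codepoints == mixed.codepoints # false: 97 (a) vs 1072 (а)
```

Since UUID v5 hashes the underlying bytes, the two strings yield different identifiers even though no reader could tell them apart.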
One argument that people often give – it is much easier for users to type
For most of us it is definitely true, and nothing prevents developers from creating links of the first type while still using UUIDs behind the scenes.
A new version of the gn_crossmap tool is out.
The main change in this version: the output file with crossmap data now contains all fields from the original input document, which allows filtering and sorting the data by any field from the input.
Other changes are
@dimus - #5 - All original fields are now preserved in the output file.
@dimus - #3 - If ingest has more than 10K rows – user will see logging events
@dimus - #4 Bug - Add error messages if headers don’t have necessary fields
@dimus - #2 - Header fields are now allowed to have trailing spaces
@dimus - #7 Bug - Empty rank does not break crossmapping anymore
@dimus - #1 Bug - Add missing rest-client gem
In a few weeks there will be an iDigBio API hackathon. As I mentioned earlier, we decided to add another API client written in Ruby before the hackathon starts. Greg Traub and I are releasing the iDigBio API Client written in Ruby today.
This is the very first release, so if you start using it and find something wrong or missing, please submit an issue. The gem uses the beta API, so sometimes it might get ‘stuck’. This problem will go away when the beta API moves to production.
Addressing most of Issue #7
If a name was not detected as a virus but contains the word RNA, it will not be parsed anymore. This is a problem for some surrogate names, like Candida albicans RNA_CTR0-3, but they are very rare.
| Name | Result |
| --- | --- |
| Candida albicans RNA_CTR0-3 | Not parsed |
| Alpha proteobacterium RNA12 | Not parsed |
| Ustilaginoidea virens RNA virus | Not parsed, marked as virus |
| Calathus (Lindrothius) KURNAKOV 1961 | Parsed as before |
*particle are marked as ‘viruses’ and
Gossypium mustilinum symptomless alphasatellite
Okra leaf curl Mali alphasatellites-Cameroon
Bemisia betasatellite LW-2014
Tomato leaf curl Bangladesh betasatellites [India/Patna/Chilli/2008]
Intracisternal A-particles
Saccharomyces cerevisiae killer particle M1
Uranotaenia sapphirina NPV
Spodoptera exigua nuclear polyhedrosis virus SeMNPV
Spodoptera frugiperda MNPV
Rachiplusia ou MNPV (strain R1)
Orgyia pseudotsugata nuclear polyhedrosis virus OpMNPV
Mamestra configurata NPV-A
Helicoverpa armigera SNPV NNg1
Zamilon virophage
Sputnik virophage 3
Names like those below are now parsed correctly. Their normalized/canonical forms preserve the apostrophe
Junellia o'donelli Moldenke, 1946
Trophon d'orbignyi Carcelles, 1946
Arca m'coyi Tenison-Woods, 1878
Nucula m'andrewii Hanley, 1860
Eristalis l'herminierii Macquart
Odynerus o'neili Cameron
Serjania meridionalis Cambess. var. o'donelli F.A. Barkley
I did some soul-searching, advice-gathering, thinking, planning, and crystal ball gazing. And it seems that moving the Global Names grant and myself to the Species File Group is the right decision. Why? Because the Marine Biological Laboratory is a hard-core research institute, which depends completely on grants and as such is not well suited for infrastructure projects. Global Names is definitely an infrastructure project, and I know very well how bad it is to be responsible for an infrastructure project without being able to work on it. It is just not a good way to do business.
It is my 8th year at MBL. I enjoy MBL, and I love living on Cape Cod. I love the immense energy of the MBL collective mind. I have met really amazing people and scientists here, and I worked with great people on the Encyclopedia of Life project. And yet I was never sure if I would be there next year, or sometimes next month. I had weeks and months when I could not move forward with Global Names, because it had no financial support at the time.
The Species File Group has long-term financing, allowing a long-term commitment. They are interested in Global Names, and they want me to continue to develop it and integrate it with the Catalogue of Life – these are my goals too. David Eades understands that Global Names will need long-term investment in hardware, and he provides a generous annual fund for that. It means no more 7-year-old computers running Global Names services. I also hope it will help to integrate the Global Names Usage Bank, a crucial GN component developed by Rich Pyle and Rob Whitton.
Another big factor is the ability to work closely with the programmers and taxonomists of the SFG. At MBL I am now the only one on the EOL project (Jeremy is remote), and I feel I am getting stale without nomenclators/taxonomists around.
Of course, we need to figure out how to move the current GN computers without shutting down services for a few weeks. I imagine I will have to rent an expensive cloud setup for a month or two and run GN from there while the machines are in transit. We will have to figure out how to transfer the grant, make a new hire for the project, etc. But all of these are good problems to solve. I believe GN suddenly has a brighter future ahead.
Yesterday I released a new command line tool for name resolution called gn_crossmap. It is designed for people who work with checklists of scientific names in spreadsheet software (MS Excel, Apple Numbers, OpenOffice, LibreOffice, Google Sheets, etc.) and want to compare their names with another reference source. The program takes a spreadsheet saved as a csv file as input and generates another csv-based spreadsheet with resolution data. Examples of input and output are included in the code. The README file describes how to use the project from the command line or as a Ruby library.
This program requires an internet connection and Ruby >= 2.1 installed on the machine.
Basic usage is:
$ gem install gn_crossmap
$ crossmap -i input.csv -o output.csv -d 1
| Option | Long form | Description |
| --- | --- | --- |
| -i | --input | checklist’s spreadsheet saved as a csv file |
| -o | --output | path to the output file; default is output.csv in the current directory |
| -d | --data-source-id | ID of one of the GN Resolver data sources; Catalogue of Life id (1) is the default |
A web interface to this program is also in the works
This project started at the Catalogue of Life workshop in Leiden, which happened in March 2015. The main focus of the hackathon was to figure out how to help national checklist teams create, maintain and compare the data in their checklists. We determined 3 main approaches
A hackathon group which worked on crossmapping produced
code which would compare checklists against the Catalogue of Life. The
gn_crossmap program I am releasing is based heavily on what we
learned during the hackathon. The crossmapping code is mostly based on use cases
from Rui Figueira and Wouter Koch. During the hackathon we
also determined ways to improve the quality of name resolution further by:
Yesterday Arlin Stoltzfus organized a kickoff meeting for the project that got funded by NSF this year – “Collaborative Research: ABI Development: An open infrastructure to disseminate phylogenetic knowledge”. Global Names is participating in the project, and I believe it will be an interesting ride.
The idea behind it is pretty cool. Imagine that someone works on a group of organisms. They submit the names of the organisms to a service, and the service builds a phylogenetic tree out of the names. When a tree is created, it starts a life of its own, similar to a repo on GitHub: people will be able to reuse it, annotate it for their own purposes, and create derivative trees. It would be a very nice feature for the Encyclopedia of Life to show how species belonging to a particular clade are related to each other through phylogeny. One problem with the creation of such trees is name normalization. Scientific names can have many alternative spellings, so to find phylo-information we need to be able to map names from a user's list to names recognized by the service.
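As a toy illustration of why normalization matters, here is a naive Ruby sketch that reduces a name string to a bare "Genus species" form. Real GN parsing is far more involved (authorships, ranks, hybrids, annotations); this only shows the simplest case.

```ruby
# Naive sketch: reduce a scientific name string to a canonical
# "Genus species" form so trivial spelling/authorship variants can match.
def canonical(name)
  words = name.strip.split(/\s+/)
  genus = words[0].to_s.capitalize
  # Keep only lowercase epithets; stop at authorship like "Linnaeus, 1758".
  epithets = words[1..-1].to_a.take_while { |w| w =~ /\A[a-z-]+\z/ }
  ([genus] + epithets).join(" ")
end

puts canonical("homo  sapiens Linnaeus, 1758") # => Homo sapiens
```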
I suspect that the crossmapping tool I am working on this week might be adapted for this particular task, but as usual the devil is in the details, and we will find out the requirements during the design process.
Bob Corrigan sent around an email pointing at a paper in PLOS which describes the new classification adopted by the Catalogue of Life. After looking through the paper, my understanding is that it is a step forward and at the same time business as usual for the Catalogue of Life. The Catalogue of Life needs a solid managerial classification for its data, and according to the article the goal is achieved:
Our goal, therefore, is to provide a hierarchical classification for the CoL and its contributors that (a) is ranked to encompass ordinal-level taxa to facilitate a seamless import of contributing databases; (b) serves the needs of the diverse public-domain user community, most of whom are familiar with the Linnaean conceptual system of ordering taxon relationships; and (c) is likely to be more or less stable for the next five years. Such a modern comprehensive hierarchy did not previously exist at this level of specificity.
Classifications are a dirty business, so, as usual:
These actual complexities of phylogenetic history emphasize that classification is a practical human enterprise where compromises must be made
Altogether, it looks like CoL is getting a new hierarchical face.
Our Google Summer of Code student, Viduranga Wijesooriya, and I had our first meeting today to start his Google Summer of Code project, Sysopia. The project is not about names; however, I consider it important for EOL and for GN, as it allows us to spend less time administering computers and more time writing code.
The idea behind the project is to create a dashboard that lets us see, at a glance, what is going on with all the computers in a system. The dashboard shows several metrics graphs, each of which displays information about all machines at the same time. By default it shows data for the last 24 hours, so if everything works well it is enough for a sysadmin to check Sysopia once a day to have a very good idea of what has been happening with the system since the moment Sysopia was installed. We did install it for EOL, and I find it very useful.
Not much functionality is there yet, but the graphs render well, and it is possible to get point data by hovering over a line, or to highlight a particular machine by hovering over its name in the dialog box.
Currently the only backend for Sysopia is Sensu, but we are going to support other backends after we nail down the user interface.
Trying to find a permanent home for GN I travelled to Champaign-Urbana on Thursday-Friday to visit Species File Group at the University of Illinois. I do know this group rather well, as Lisa Walley and I went there more than a year ago for a hackathon organized by Matt Yoder.
I am quite impressed with the work this group does, and when Matt suggested that I join them, my first thought was: this might be a way to make Global Names' financial situation more reliable!
Currently Global Names depends entirely on grants, and grants come and go. That is a pretty bad way to finance an infrastructure project: you do not want roads or electricity to depend on unstable funding. We always want to be able to drive to a store or a concert, and we always want to have lights in our homes. The same goes for projects like GN. If people start using them, they start to depend on them, and it is a really bad situation when funding dries up and the service deteriorates as a result.
The visit went well. I had a great opportunity to talk to David Eades, Matt, Dmitry Dmitriev, and Yuri Roscov. It was especially great to talk to Yuri, as he is the main person behind the Catalogue of Life content. I consider Catalogue of Life to be one of the most important use cases and partners for GN, and talking to Yuri for an extended amount of time was extremely helpful.
Originally the position was about helping Yuri automate his workflow; however, when Matt and I talked on Skype, the emphasis started to shift towards supporting Global Names. David Eades, Matt, and Yuri all believe that GN is a missing link in Catalogue of Life functionality, and so working closely with Yuri and figuring out what GN can do for the Catalogue of Life actually does help to automate some of the hard parts of Yuri's work.
The meeting was very encouraging and inspiring. Now I have to think hard and make a decision. I would love to be able to keep my house on the Cape, and I would love to be able to come back in the summer and work with MBL and RISD. And it seems nothing prevents me from spending three months on the Cape in the summer if I move. On Monday I am going to talk about my trip at my work at MBL.
I just had a great conversation with Jorrit Poelen from GloBI. Jorrit uses GN Resolver to clean up names for GloBI, and on top of knowing that a name exists in GN Resolver, he also needs to know whether the name is actually valid.
The Resolver by design contains ‘good’ and ‘bad’ names. We do need to know what kinds of misspellings exist in the wild and map information associated with them to good names. These misspellings and outright wrong names make Jorrit’s life much harder, as we do not have a tool that clearly marks ‘good’ names as good. There are ways to be more or less sure that a name is good:
But none of these approaches is universal, and none gives a clear answer. So what would a solution look like?
It seems that a good solution would be to write a classifier which takes into account all the relevant features and meta-features of a name, weighs them, and then puts the name into a ‘good’ or a ‘bad’ bucket. Every name has several features associated with it, and we can train a Bayes classifier to decide whether a name is ‘good’ or ‘bad’ using these features. When it has run through our ~20 million names, each of them will be marked as trusted or not.
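As a sketch of the idea, here is a toy naive Bayes classifier in Ruby over invented boolean name features; the real GN feature set and implementation would certainly differ.

```ruby
# Toy naive Bayes classifier over boolean name features -- a sketch of the
# idea only, not the GN implementation. Feature names are invented.
class NameClassifier
  def initialize
    # counts[category][feature] = number of training names with that feature
    @counts = { good: Hash.new(0), bad: Hash.new(0) }
    @totals = { good: 0, bad: 0 }
  end

  def train(category, features)
    @totals[category] += 1
    features.each { |f| @counts[category][f] += 1 }
  end

  def classify(features)
    scores = @totals.keys.map do |cat|
      # log-prior plus sum of log-likelihoods with add-one smoothing
      score = Math.log(@totals[cat] + 1.0)
      features.each do |f|
        score += Math.log((@counts[cat][f] + 1.0) / (@totals[cat] + 2.0))
      end
      [cat, score]
    end
    scores.max_by { |_, s| s }.first
  end
end

nc = NameClassifier.new
# Hypothetical features: parses cleanly, appears in several data sources, etc.
nc.train(:good, [:parses, :multiple_sources, :has_author])
nc.train(:good, [:parses, :multiple_sources])
nc.train(:bad,  [:single_source])
nc.train(:bad,  [:single_source, :odd_characters])

puts nc.classify([:parses, :multiple_sources]) # => good
```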
I am pretty sure that such a classifier, especially in its first iteration, will make mistakes. How can we deal with them? Here is an idea: when the API returns data to a user, the data will have two new fields: ‘trusted’ (yes/no) and a URL for reporting a disagreement with this decision.
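For illustration only, here is a sketch of how such an annotation could be attached to a result; the field names and the curation URL scheme are invented here, not the actual Resolver API.

```ruby
require "json"
require "uri"

# Hypothetical sketch: attach a trust flag and a "report a mistake" URL to
# each resolved name. The endpoint path and parameters are invented.
def annotate(name, trusted)
  flag = trusted ? "yes" : "no"
  {
    supplied_name_string: name,
    trusted: flag,
    curation_url: "http://resolver.globalnames.org/curations/new?" +
      URI.encode_www_form(name: name, trusted: flag)
  }
end

puts JSON.pretty_generate(annotate("Mus musclus", false))
```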
People can simply copy and paste this URL into a browser, or wire it to a “Report a mistake” button for every name in their results HTML. If this button is pushed, GN Resolver will register a human curation event, and data from these events will be used to improve the classifier algorithm. Human curation will trump the computer algorithm, and the events can be collected in a new data source for feedback…
Details of the interface can be decided later, when we build the classifier. I know that separating trusted names from untrusted ones is a task that just about everybody who actively uses the Resolver has asked me about at one time or another. So who can build it, and when? I am now thinking that our Google Summer of Code student might be interested in making it happen instead of improving NetiNeti. I personally think automatic curation of names is more important.
On June 3rd I am going to an iDigBio hackathon about finding ways to enhance their API. Today there was a pre-hackathon meeting where iDigBio folks explained how they implemented their API and its backend, and how they use their own API for their GUIs.
I was very impressed with what they have done. The backend is based on Elasticsearch; the API is RESTful and JSON-based. What surprised me is that the API calls often take pure JSON as arguments. It was also great to see how they simplified Elasticsearch queries for the API, keeping API queries simple and powerful at the same time.
They also made Python and R clients for the API, so I will try to make a Ruby version of the API client before the hackathon.
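As preparation, here is a minimal sketch of what such a Ruby client might look like. It targets iDigBio's v2 search endpoint, where the record query ("rq") is itself JSON, but the module name and interface are my own and untested against the live API.

```ruby
require "json"
require "net/http"
require "uri"

# Minimal sketch of an iDigBio search client in Ruby. The v2 search API
# takes an "rq" (record query) argument that is itself a JSON document.
# Interface and module name are invented; only loosely mirrors the
# official Python/R clients.
module IDigBioClient
  ENDPOINT = URI("https://search.idigbio.org/v2/search/records/")

  # Build the query parameters for a record query expressed as a Ruby hash.
  def self.query_params(rq, limit: 10)
    { rq: JSON.generate(rq), limit: limit }
  end

  # Perform the GET request (network call -- not exercised here).
  def self.search(rq, limit: 10)
    uri = ENDPOINT.dup
    uri.query = URI.encode_www_form(query_params(rq, limit: limit))
    JSON.parse(Net::HTTP.get(uri))
  end
end

params = IDigBioClient.query_params({ genus: "acer" }, limit: 5)
puts params[:rq] # the JSON that iDigBio receives as its "rq" argument
```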
Today we had our first meeting to start the NetiNeti enhancement project funded by Google Summer of Code. The student selected to do the job is Wencan Luo, a fourth-year graduate student from the University of Philadelphia.
The purpose of the project is to improve the performance of our NLP-based scientific name finding tool, NetiNeti, developed five years ago by Lakshmi Manohar Acella. Let’s see how it goes…
Official coding time starts on May 27th; for now we are going through a design phase: figuring out who the users of the application are. Then we will try to do an idealized design of its features, find implementation paths and existing limitations, and then Wencan is going to explore the features.
For the process we are going to use ZenHub to manage issues and, obviously, GitHub for the code and for the ability to have a project-related blog on GitHub.
Today is the official start of Google Summer of Code work. Up to now, organizations submitted their ideas, organizations were chosen by Google, students decided which ideas and organizations they liked and submitted their proposals, Google decided how many projects from each organization it was willing to fund, and finally the best proposals from students were matched to the funded ideas. Encyclopedia of Life and Global Names submitted 4 proposals and got 3 of them funded, so congratulations to us and to the students :)
I was very happy to see how many people were interested in the idea of finding scientific names in texts: we received 12 proposals, so competition was fierce this year! I think we got great students, and I am looking forward to Google Summer of Code 2015.
Encyclopedia of Life folks made a truly glorious present to Global Names: several Dell 710 servers which were used for running EOL at Harvard. Now that the site has moved to the Smithsonian, EOL is donating some of these computers to GN, and 10 of them are already at the Marine Biological Laboratory, waiting to be plugged into the internet and electricity. Another truly amazing gift from EOL is fourteen 100 GB hard drives which will run GN databases.
I feel happy, warm, and fuzzy. Thank you, EOL! I hope that with this new hardware I will be able to increase GN capacity about 5x using the current code! Next steps: installing Chef, Docker, and GN applications to serve the biodiversity community.
Last year we got a second round of funding for Global Names Architecture development. Our first grant was about exploring how to find scientific names in texts, how to crossmap different spelling variants of the same name to each other, how to connect names to the literature collected at the Biodiversity Heritage Library, how to organize scientific name usages, and how to register new zoological scientific names electronically. Several interesting projects spun out of this effort, and you can read about them at the Global Names site.
It was a hard year for the Encyclopedia of Life, where I work, and for Global Names. I had to spend most of the first 8 months after our second NSF grant was funded helping EOL with system administration and with transferring the EOL site from Harvard to the Smithsonian Museum of Natural History. That is done now, and I am happy to be able to work on the Global Names project again!
What kinds of resources do we have now? Two months of Paddy’s (David Patterson’s) time, two months of Rich Pyle’s time, about a year and a half of mine, and one year of another developer’s. We also got 2 excellent participants for Google Summer of Code this year, so that adds 6 months of their time as well. And the quest for further funding continues as I write.
What are the plans? This grant covers work on name finding and name resolution. We will try to find major use cases (Arctos, EOL, iDigBio, Catalogue of Life, GBIF) and satisfy their needs. We expect that this will cover the needs of 90% of other users, and the remaining 10% of functionality will trickle in through working over GitHub issues, fixing bugs, adding features, and thinking about new ideas.