We are releasing a new tool, gn_uuid, to simplify the creation of UUID
version 5 identifiers for scientific name strings. UUID v5 has features that
are particularly useful for the biodiversity community.
UUID version 5: Description
Universally unique identifiers are very popular because, for all
practical purposes, they guarantee globally unique IDs without any
negotiation between different entities. There are several ways UUIDs can be
generated:
Version | Uniqueness is achieved by
v1 | Using the computer’s MAC address and time
v2 | Like v1, plus adding info about user and local domain
v3 | Using an MD5 hash of a string in combination with a name space
v4 | Using pseudo-random algorithms
v5 | Using a SHA1 hash of a string in combination with a name space
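To make the difference between these versions concrete, here is a small sketch using Python's standard `uuid` module (our choice for illustration, not part of gn_uuid). The example string and the use of the standard DNS name space are assumptions; Python's module does not generate v2 at all.

```python
import uuid

# Hypothetical example string; any name string works the same way.
name = "Homo sapiens"

# v1 (host address + time) and v4 (pseudo-random) change on every call.
print(uuid.uuid1().version, uuid.uuid4().version)   # prints: 1 4
print(uuid.uuid4() == uuid.uuid4())                 # prints: False

# v3 (MD5) and v5 (SHA1) are derived from a name space plus a string,
# so repeated calls return the same identifier.
v3 = uuid.uuid3(uuid.NAMESPACE_DNS, name)
v5 = uuid.uuid5(uuid.NAMESPACE_DNS, name)
print(v3.version, v5.version)                       # prints: 3 5
print(v5 == uuid.uuid5(uuid.NAMESPACE_DNS, name))   # prints: True
```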
UUID v5 is generated using information from a string, so everyone who uses this
method will generate exactly the same ID from the same string. Interested parties
do need to agree on the generation of a name space, and after that, no matter which
programming language they use, they will be able to exchange data about a
string using their identifiers.
This gem already has the DNS domain “globalnames.org” defined as a name space, so
generating a UUID v5 becomes simpler.
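For readers who do not use Ruby, the same idea can be sketched with Python's standard `uuid` module. This is only an illustration: we assume here that the name space is the v5 UUID of the DNS name "globalnames.org", and `GN_NAMESPACE` and `name_uuid` are hypothetical names, not the gem's API.

```python
import uuid

# Assumption: the name space is the v5 UUID of the DNS name "globalnames.org",
# derived from the standard DNS name space.
GN_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "globalnames.org")


def name_uuid(name_string: str) -> uuid.UUID:
    """Return the UUID v5 of a name string within GN_NAMESPACE."""
    return uuid.uuid5(GN_NAMESPACE, name_string)


first = name_uuid("Homo sapiens")
second = name_uuid("Homo sapiens")
print(first)
print(first == second)   # prints: True
```

Two parties who agree on the name space and on the exact bytes of the string should arrive at the same identifier without ever talking to each other.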
I believe UUID v5 creates very exciting opportunities for the biodiversity
community. For example, if one expert annotates a string or attaches data to it,
this information can be linked globally and then harvested by anybody,
without any preliminary negotiation.
Quite often researchers make the argument that a scientific name is an identifier
on its own and there is no need for another level of indirection like a UUID.
They are right: a scientific name string can be an identifier. However,
scientific names have severe shortcomings in such a role.
Why Scientific Names are Bad Identifiers for Computers
Scientific name strings have different length
More often than not, identifiers end up in databases and are used as a primary index
to sort, connect and search data. Scientific name strings vary from 2 bytes to
more than 500 bytes in length. So if they are used as keys in a database, they
are inefficient: they waste a lot of space and become less efficient for finding
or sorting information, because index key size is usually determined by
the largest key.
UUIDs always have the same, rather small size: 16 bytes. Even when UUIDs are
used in their “standard” string representation, they are still reasonably
small: 36 characters. Storing them in a database as a number is obviously more
efficient.
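The sizes mentioned above are easy to check. The following sketch uses Python's standard `uuid` module; the input string is arbitrary and only serves to produce some UUID.

```python
import uuid

u = uuid.uuid5(uuid.NAMESPACE_DNS, "globalnames.org")

print(len(u.bytes))   # prints: 16  (raw binary form)
print(len(str(u)))    # prints: 36  (32 hex digits plus 4 hyphens)
print(len(u.hex))     # prints: 32  (hex digits only)

# The same value as an integer, e.g. for a numeric database column.
print(u.int.bit_length() <= 128)   # prints: True
```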
It is hard to spot differences in name strings
It is very hard for the human eye to spot the difference between two nearly identical name strings.
It is much easier to see the difference between their corresponding UUIDs.
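As an illustration, here is a sketch with two made-up name strings that differ in a single letter of the authorship. The strings are hypothetical, and the standard DNS name space is used only as a stand-in for whatever name space is agreed on.

```python
import uuid

# Hypothetical name strings: the last author is "Brown" in one, "Browm" in the other.
name_a = "Aus bus (Smith & Jones, 1901) Brown 1987"
name_b = "Aus bus (Smith & Jones, 1901) Browm 1987"

uuid_a = uuid.uuid5(uuid.NAMESPACE_DNS, name_a)
uuid_b = uuid.uuid5(uuid.NAMESPACE_DNS, name_b)

print(name_a, uuid_a)
print(name_b, uuid_b)
print(uuid_a == uuid_b)   # prints: False
```

Because the UUID is derived from a hash, a one-letter change in the string produces an identifier that looks entirely different.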
Name strings come in different encodings.
Currently Latin1, UTF-8 and UTF-16 are the most popular encodings used in
biodiversity. If the authorship or the name itself has characters outside of the
128 characters of ASCII, identical-looking names will be quite different for
a computer.
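A short sketch shows how one and the same name string turns into different byte sequences under these three encodings. The name string is hypothetical and was chosen only because it has one non-ASCII letter.

```python
# Hypothetical name string; "\u00fc" is the letter ü.
name = "Aus bus M\u00fcller 1900"

latin1 = name.encode("latin-1")
utf8 = name.encode("utf-8")
utf16 = name.encode("utf-16-le")

# Same 19 characters on screen, three different byte lengths.
print(len(latin1), len(utf8), len(utf16))   # prints: 19 20 38
print(latin1 == utf8)                       # prints: False
```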
Name strings are less stable because of their encoding
When names are moved from one database to another, or from one paper to another,
they sometimes do not survive the trip. If you have spent any time looking at
scientific names in electronic form, you have seen something like this:
Acacia ampliceps ? Acacia bivenosa
Absidia macrospora V�nov� 1968
Absidia sphaerosporangioides Man<acute>ka & Truszk., 1958
Cnemisus kaszabi Endr?di 1964
Usually names like these were submitted in a “wrong” encoding and some
characters in them were misinterpreted. A UUID, on the other hand, is just a
hexadecimal number, which can be transitioned between various encodings more
easily.
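The following sketch reproduces this kind of damage and shows why a UUID is not affected by it. The name string is hypothetical, and the standard DNS name space stands in for an agreed name space.

```python
import uuid

# Hypothetical name string; "\u00fc" is the letter ü.
name = "Aus bus M\u00fcller 1900"

# Written as UTF-8 but read back as Latin1: the non-ASCII letter is mangled.
garbled = name.encode("utf-8").decode("latin-1")
print(garbled == name)       # prints: False

# A UUID string contains only ASCII hex digits and hyphens ...
uuid_text = str(uuid.uuid5(uuid.NAMESPACE_DNS, name))
print(uuid_text.isascii())   # prints: True

# ... so the same encode/decode mix-up leaves it unchanged.
survived = uuid_text.encode("utf-8").decode("latin-1")
print(survived == uuid_text)   # prints: True
```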
Name strings might look the same in print or on screen, but be different
These two strings might look exactly the same on a screen or printed on paper,
but in reality they are different. Here are their UUIDs:
The difference is that the second name has a Cyrillic а character, which in
most cases will look exactly the same as a Latin a character. And when the
names are printed on paper there is absolutely no way to tell the difference.
The UUIDs will tell us that these two name strings are not the same.
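This situation can be sketched as follows. The two strings are our own illustration, built by swapping one Latin letter for its Cyrillic look-alike, and the standard DNS name space again stands in for an agreed name space.

```python
import unicodedata
import uuid

latin = "Homo sapiens"
mixed = "Homo s\u0430piens"   # U+0430, a Cyrillic letter, replaces the Latin "a"

print(latin == mixed)               # prints: False
print(unicodedata.name(latin[6]))   # prints: LATIN SMALL LETTER A
print(unicodedata.name(mixed[6]))   # prints: CYRILLIC SMALL LETTER A

uuid_latin = uuid.uuid5(uuid.NAMESPACE_DNS, latin)
uuid_mixed = uuid.uuid5(uuid.NAMESPACE_DNS, mixed)
print(uuid_latin == uuid_mixed)     # prints: False
```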
Nothing prevents us from continuing to use name strings for human interaction
One argument that people often give – it is much easier for users to type
For most of us this is definitely true, and nothing prevents developers
from creating links of the first type while still using UUIDs behind the scenes.
Why UUIDs v5 are better than any other UUIDs as a scientific name identifier
They can be generated independently by anybody and still be the same for the
same name string
They use the SHA1 algorithm, which is more resistant to collisions than the
MD5 algorithm used by UUID v3
The same ID can be generated in any popular language following