Biodiversity

Parses taxonomic scientific names and breaks them into semantic elements.

Important: Biodiversity parser >= 4.0.0 uses a binding to https://github.com/gnames/gnparser and is not backward compatible with older versions. However, it is much faster and better than previous versions.

This gem no longer provides a remote server or a command-line executable. For those features, use https://github.com/gnames/gnparser.

Installation

sudo gem install biodiversity

The gem should work on Linux, Mac, and Windows (64-bit) machines.

Benchmarks

The fastest way to process a large number of names is to call Biodiversity::Parser.parse_ary([big array], simple: true).

For example, to parse a large file with one name per line:

#!/usr/bin/env ruby

require 'biodiversity'

P = Biodiversity::Parser
count = 0
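# File.foreach closes the file once iteration finishes
File.foreach('all_names.txt').each_slice(50_000) do |sl|
  count += 1
  # parse the whole slice in one call; simple: true selects the lighter output
  res = P.parse_ary(sl, simple: true)
  puts count * 50_000 # approximate number of names read so far
  puts res[0]         # first parsed result of the slice, as a progress sample
end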

Here are comparative results of running the parsers against a file with 24 million names on a 4-CPU hyperthreaded laptop:

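| Program      | Version | Full/Simple | Names/min |
|--------------|---------|-------------|-----------|
| gnparser     | 0.12.0  | Simple      | 3,000,000 |
| biodiversity | 4.0.1   | Simple      | 2,000,000 |
| biodiversity | 4.0.1   | Full JSON   | 800,000   |
| biodiversity | 3.5.1   | n/a         | 40,000    |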
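
To put these rates in perspective, the following sketch converts the reported names-per-minute figures into approximate wall-clock time for the 24-million-name file. It only does the arithmetic on the numbers quoted above. The constant names and labels are illustrative, and real timings will depend on hardware and input.

```ruby
# Rough estimate of total parsing time for the benchmark file,
# based only on the throughput figures reported in the table.
TOTAL_NAMES = 24_000_000

# names per minute, as reported in the benchmark table
RATES = {
  'gnparser 0.12.0 (Simple)' => 3_000_000,
  'biodiversity 4.0.1 (Simple)' => 2_000_000,
  'biodiversity 4.0.1 (Full JSON)' => 800_000,
  'biodiversity 3.5.1' => 40_000
}.freeze

# minutes needed to parse the whole file at each rate
ESTIMATES = RATES.transform_values { |rate| TOTAL_NAMES.fdiv(rate) }

ESTIMATES.each do |label, minutes|
  puts format('%-32s %7.1f min', label, minutes)
end
```

By this estimate, the 3.5.1 parser would need about 10 hours for the same file that version 4.0.1 handles in 12 minutes in simple mode.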

Example usage

You can use it as a library in Ruby:

require 'biodiversity'

# to find the gem version number
Biodiversity.version

# Note that the version in parsed output will correspond to the version of
# gnparser.

# to parse a scientific name into a simple Ruby hash
Biodiversity::Parser.parse("Plantago major", simple: true)

# to parse many scientific names using all computer CPUs
Biodiversity::Parser.parse_ary(["Plantago major", ... ], simple: true)

# to parse a scientific name into a very detailed Ruby hash
Biodiversity::Parser.parse("Plantago major")

# to parse many scientific names with all details using all computer CPUs
Biodiversity::Parser.parse_ary(["Plantago major", ... ])

# to get a JSON representation
Biodiversity::Parser.parse("Plantago").to_json

# to clean up a name (normalize whitespace)
Biodiversity::Parser.parse("      Plantago       major    ")[:normalized]

# to get the canonical form with or without infraspecific ranks, as well as
# the stemmed version.
parsed = Biodiversity::Parser.parse("Seddera latifolia H. & S. var. latifolia")
parsed[:canonical][:full]
parsed[:canonical][:simple]
parsed[:canonical][:stem]

# to get detailed information about elements of the name
Biodiversity::Parser.parse("Pseudocercospora dendrobii (H.C. Burnett 1883) U. \
Braun & Crous 2003")[:details]

# to parse a botanical cultivar
Biodiversity::Parser.parse("Sarracenia flava 'Maxima'", with_cultivars: true)

'Surrogate' is a broad group that includes 'Barcode of Life' names and various undetermined names containing cf., sp., spp., or nr.:

Biodiversity::Parser.parse("Coleoptera BOLD:1234567")[:surrogate]

What is "nameStringID" in the parsed results?

The ID field contains a UUID v5 hexadecimal string. The ID is generated from the bytes of the name string itself, so an identical ID can be generated in any popular programming language. You can read more about UUID version 5 in a blog post.

For example, "Homo sapiens" should generate the UUID "16f235a0-e4a3-529c-9b83-bd15fe722110".
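
As an illustration of how such an ID can be reproduced outside the parser, here is a minimal sketch of UUID v5 generation (RFC 4122) using only the Ruby standard library. The helper names `uuid_v5` and `name_string_id` are hypothetical and not part of this gem. The namespace is an assumption: the text above does not state it, and the sketch assumes it is derived from the DNS name "globalnames.org". If that assumption holds, the result for "Homo sapiens" should match the UUID quoted above.

```ruby
require 'digest/sha1'

# Standard RFC 4122 namespace for DNS names.
DNS_NAMESPACE = '6ba7b810-9dad-11d1-80b4-00c04fd430c8'

# Hypothetical helper: builds a UUID v5 from a namespace UUID and a name.
def uuid_v5(namespace_uuid, name)
  ns_bytes = [namespace_uuid.delete('-')].pack('H*') # 16 raw namespace bytes
  hash = Digest::SHA1.digest(ns_bytes + name.b)      # SHA-1 of namespace + name
  bytes = hash.bytes.first(16)                       # keep the first 128 bits
  bytes[6] = (bytes[6] & 0x0f) | 0x50                # set version to 5
  bytes[8] = (bytes[8] & 0x3f) | 0x80                # set RFC 4122 variant
  hex = bytes.pack('C*').unpack1('H*')
  [hex[0, 8], hex[8, 4], hex[12, 4], hex[16, 4], hex[20, 12]].join('-')
end

# ASSUMPTION: the namespace is the UUID v5 of the DNS name "globalnames.org".
GN_NAMESPACE = uuid_v5(DNS_NAMESPACE, 'globalnames.org')

# Hypothetical helper: ID for a name string under the assumed namespace.
def name_string_id(name)
  uuid_v5(GN_NAMESPACE, name)
end

# Compare this with the "Homo sapiens" UUID quoted in the text.
puts name_string_id('Homo sapiens')
```

Because the ID depends only on the namespace and the bytes of the name string, the same input always yields the same ID, with no lookup table or central registry involved.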

Copyright

Authors: Dmitry Mozzherin, Hernán Lucas Pereira

Contributors: Patrick Leary

Copyright (c) 2008-2024 Dmitry Mozzherin. See LICENSE for further details.