Find scientific names on web pages, in PDFs, Microsoft Office documents, and images, or in freeform text. Encrypted or image-based PDFs and image files first pass through an OCR routine using Tesseract before their text is handed to the excellent TaxonFinder and NetiNeti name-discovery engines. The language of incoming content is determined using unsupervised language detection. If the content is found to be in a language other than English, TaxonFinder is used preferentially. Found names can optionally be resolved against a number of resources.
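The processing order described above can be sketched as a small planning function. This is only an illustration, not the service's actual code: the `Document` fields, the function names, and the step labels are all hypothetical. It also assumes that English content goes to both engines and that non-English content goes to TaxonFinder alone, which is one reading of "preferentially".

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical description of an incoming item (not the service's real model)."""
    kind: str                    # "html", "pdf", "office", "image", or "text"
    has_text_layer: bool = True  # False for image-based PDFs
    encrypted: bool = False
    language: str = "en"         # assumed result of language detection


def needs_ocr(doc: Document) -> bool:
    """Images, encrypted PDFs, and image-based PDFs go through OCR first."""
    if doc.kind == "image":
        return True
    if doc.kind == "pdf":
        return doc.encrypted or not doc.has_text_layer
    return False


def choose_engines(doc: Document) -> list:
    """Assumption: both engines for English, TaxonFinder only otherwise."""
    if doc.language == "en":
        return ["TaxonFinder", "NetiNeti"]
    return ["TaxonFinder"]


def plan(doc: Document) -> list:
    """Return the ordered processing steps for one document."""
    steps = []
    if needs_ocr(doc):
        steps.append("ocr:Tesseract")
    steps.append("detect_language")
    steps.extend("find_names:" + engine for engine in choose_engines(doc))
    return steps
```

For example, under these assumptions `plan(Document(kind="image", language="fr"))` yields an OCR step, a language-detection step, and a single TaxonFinder step, while plain English text skips OCR and is sent to both engines.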
Code on GitHub