The False Promise of the Keyword Search: Optical Character Recognition in Digital Collections

Abstract

The past ten years have seen a dramatic rise in library digitization efforts intended to support research. Despite this widespread interest, digitization is no small task; it requires considerable time and labor, and thus financial resources. Unfortunately, institutions are failing to realize the full potential of these investments, chiefly because the interfaces of digital collections feature keyword search as the primary discovery model. Materials in digital collections do not lend themselves well to keyword search: images have no textual content, so a keyword search for an image is wholly reliant on its descriptive metadata. Even text-based materials fall short on this front, due to the limitations of the optical character recognition (OCR) technologies that enable keyword searching. OCR accuracy rates can dip below 60% depending on the clarity of the image, the size of the font, the language of the text, and whether the text is handwritten. Currently, users have no clear understanding of what OCR technology is, how inconsistently it is applied across collections, or how it could affect their search results. This poster explores OCR, its “thinking machine” algorithms, and the implications for discovery in digital collections. It also considers alternatives to keyword search, with the ultimate goal of making digital collections more navigable and useful. The poster is based on a paper available at: https://savannahlake.github.io/mlisportfolio/documents/issue_paper_lake.pdf

Presenters

Savannah Lake
Information Studies, UCLA

Details

Presentation Type

Poster Session

Theme

2021 Special Focus - Research in the Age of Thinking Machines: Implications for Scholars, Libraries, and Publishers

Keywords

OCR, Optical character recognition, Keyword search, Digital collections, Scholarly research
