Library of Congress

With LC Labs and Digirati

United States ✧ 2023-2024

Prototyping and evaluation of artificial intelligence-assisted extraction methods for structuring historical copyright registries.

Corpus

11,000 registers comprising approximately 500,000 copyright forms

Processing

-> Automatic Text Recognition

-> Information Extraction

TEKLIA Library of Congress

Processing workflow

Testing three workflows that combine machine learning with human-in-the-loop (HITL) review to extract textual information from the textual and/or visual elements of digitized records.
Automatic extraction of the following fields: claimant (rights holder), type of work, author(s), title of the work, dates of receipt (copies, application, affidavit, fee), class and registration number, date of first publication, printer, volume, issue number, and date of publication.
Evaluation of the methods against a ground truth built from register books available online, in order to select the most effective method for producing a structured dataset covering all historical registers.
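
To make the extraction target and the evaluation step more concrete, here is a minimal Python sketch. It is not the project's actual schema or scoring method: the `CopyrightRecord` class, its field names (a subset of the fields listed above), the `normalize` rule, and the exact-match metric are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CopyrightRecord:
    """Hypothetical record holding a subset of the extracted fields."""
    claimant: Optional[str] = None
    work_type: Optional[str] = None
    authors: Optional[str] = None
    title: Optional[str] = None
    registration_class: Optional[str] = None
    registration_number: Optional[str] = None
    first_publication_date: Optional[str] = None


def normalize(value):
    """Lowercase and collapse whitespace; a missing value becomes ''."""
    return " ".join((value or "").lower().split())


def field_accuracy(predictions, ground_truth):
    """Share of records whose normalized value matches, per field.

    Assumed metric: exact match after normalization. Two empty
    values count as a match.
    """
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground truth must align")
    if not ground_truth:
        return {}
    names = [f.name for f in fields(CopyrightRecord)]
    hits = {name: 0 for name in names}
    for pred, truth in zip(predictions, ground_truth):
        for name in names:
            if normalize(getattr(pred, name)) == normalize(getattr(truth, name)):
                hits[name] += 1
    return {name: hits[name] / len(ground_truth) for name in names}


# Invented example values, not taken from the registers.
truth = [
    CopyrightRecord(claimant="Acme Co.", title="A Song"),
    CopyrightRecord(claimant="J. Doe", title="A Book"),
]
predicted = [
    CopyrightRecord(claimant="acme  co.", title="A Song"),
    CopyrightRecord(claimant="John Doe", title="A Book"),
]
scores = field_accuracy(predicted, truth)
```

Computing one score per field, rather than a single score per record, would let each workflow be compared on the fields where it is weakest, which is one plausible way to choose between them.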
