Skip to Content

Library of Congress

With LC Labs and Digirati

United-States ✧ 2023-2024


Prototyping and evaluation of artificial intelligence-assisted extraction methods for structuring historical copyright registries.


Corpus


11,000 registers comprising approximately

500,000 copyright forms

TEKLIA Library of Congress
TEKLIA Library of Congress

Processing workflow 


  • Testing three workflows integrating machine learning and human intervention (HITL) for extracting textual information from textual and/or visual elements of digitized records

  • Automatic extraction of the following fields: rights holder (claimant), type of work, author(s), title of the work, dates of receipt (copy, request, declaration of honor, fees), class and registration number, date of first publication, printer, volume, issue number and date of publication

  • Evaluation of methods with a ground truth of register books available online, for a selection of the most effective method of producing a structured dataset covering all historical registers.

TEKLIA Library of Congress