Library of Congress
With LC Labs and Digirati
United-States ✧ 2023-2024
Prototyping and evaluation of artificial intelligence-assisted extraction methods for structuring historical copyright registries.
Corpus
11,000 registers comprising approximately
500,000 copyright forms
Processing workflow
- Testing three workflows integrating machine learning and human intervention (HITL) for extracting textual information from textual and/or visual elements of digitized records
- Automatic extraction of the following fields: rights holder (claimant), type of work, author(s), title of the work, dates of receipt (copy, request, declaration of honor, fees), class and registration number, date of first publication, printer, volume, issue number and date of publication
- Evaluation of methods with a ground truth of register books available online, for a selection of the most effective method of producing a structured dataset covering all historical registers.
