
SocFace

Digital database compiling a century of French censuses


Project


Creation of a digital database giving researchers and the general public open access to demographic data on all individuals who lived in France between 1836 and 1936.


Corpus


400 million records, the complete nominal lists from the censuses of 1836 to 1936


Processing 


-> Automatic Text Recognition

-> Information Extraction 

Development of a digital database


SocFace objectives

The 19th century saw France undergo profound transformations, taking on a form of social organization that still shapes contemporary society. While these changes are well documented at the macroeconomic level, their impact on individual life trajectories is less well understood. Censuses provide a comprehensive snapshot of the French population about every five years from 1836 to 1936. Studying individuals' life trajectories requires not only fully and automatically transcribing the handwritten census lists from this period, but also identifying individuals and linking their entries across different censuses.

Socface is a collaborative research project conducted by the National Institute for Demographic Studies (INED), the Paris School of Economics, the French Interministerial Archives Service (SIAF) and TEKLIA, with financial support from the National Research Agency.

The Socface project aims to extract individual demographic and economic data across the entire French territory from nominal census lists digitized by archive services throughout France, covering the period from 1836 to 1936.




Collect, process, transcribe, organize and analyze all nominal lists from the censuses of 1836 to 1936 (20 censuses).

Produce a comprehensive database of individuals who lived in France between 1836 and 1936 and make it accessible online.

Provide tools for historical and demographic research to analyze long-term social changes.

The data from SocFace 

At the beginning of the 19th century, France was a predominantly rural country. The Industrial Revolution brought about unprecedented economic and social transformations. The industrialization and urbanization of the country led to major changes in the microstructures of society.

With the data extracted by SocFace, it will now be possible to produce large-scale statistical analyses, or to study individual trajectories with precision, to expand our knowledge of these historical developments.



An example of a French census book (Paris, late 19th century)

Reconstructing French history "from the ground up": a century of population censuses deciphered by the Socface project


This article was originally published in issue 144 of the journal Culture et Recherche, which focused on Open Science.

Authors:

  • Lionel Kesztenbaum is Director of Research at the National Institute for Demographic Studies (INED)
  • Manonmani Restif is the Project Manager for the FranceArchives portal at the Interministerial Archives Service of France (SIAF)
  • Christopher Kermorvant is the President of the company Teklia

Introduction

Socface is a research project, supported by the French National Research Agency (ANR), on the censuses of the French population from 1836 to 1936. It brings together researchers in the humanities and social sciences, engineers and archivists, and illustrates many aspects of open science as well as the contributions and challenges of automatic handwritten text recognition (HTR).

Project objective

The Socface project aims to automatically transcribe all the nominal lists from the censuses of 1836 to 1936 (twenty censuses) in order to produce, study, and disseminate a database of individuals who lived in France during this period. Supported by the French National Research Agency (ANR), this project illustrates many aspects of open science as well as the contributions and challenges of automatic handwriting recognition.

Importance of personal data

It also highlights the ever-increasing appetite of various archive users for personal data: today, the vast majority of searches conducted in archive services focus on this type of source. Every person has a right to be represented in the archives, as soon as their life has included some events, whether happy or, more often, unhappy.

Socface deserves a special place because of its scope: it covers a very large corpus made up of a single document type, spanning 100 years and held by nearly 100 institutions in metropolitan France and overseas.

The origins of the project

The growing interest in individual data, particularly personal data, is fueled by technical developments (ease of digitization, dissemination of images on the Web, improvements in automatic handwriting recognition techniques, etc.) as much as it feeds them: the demand from users (researchers, genealogists or informed amateurs) motivates digitization campaigns just as the appetite of quantitative research in social sciences for "micro" data stimulates the development of automatic handwriting recognition.

Virtuous circle of digitization

Socface perfectly illustrates this virtuous circle around a single source, the censuses, which are one of the few types of documents to have been almost entirely digitized by archive services. The resulting corpus should eventually exceed 10 million images, despite the intentional or accidental destruction of some records. This near-exhaustive digitization was a prerequisite for such a research project to be carried out.

Text extraction

Once this condition was met, historians' appetite for this mass of data wasn't enough; an efficient system still needed to be devised to extract the text contained within these millions of images. The considerable progress in automatic handwriting recognition in recent years, thanks to advances in artificial intelligence technologies, makes this extraction possible. Historical handwritten documents, from the Middle Ages to the present day, are now amenable to automatic transcription, allowing for direct use. This automatic recognition is particularly valuable for very large-scale processing where manual transcription, even collaborative, is not feasible.

The role of collaboration in handwriting recognition

However, handwriting recognition is not a self-contained, entirely autonomous process. Developing a high-performing handwriting recognition system requires a training phase for models on annotated data, using supervised machine learning techniques. The latest models, based on deep learning technologies, can be trained with a much simpler protocol than their predecessors. Today, it is no longer necessary to precisely transcribe documents, indicating the position and content of lines of text. It is possible to train models using data entered into a form, much like one would for archival research. This much faster and more natural protocol allows for the use of volunteers to create the annotations.

The Socface project has thus launched around ten collaborative annotation campaigns to create training data using Teklia’s Callico platform.

Using existing annotations

Furthermore, existing annotations, made by genealogical societies or in departmental archives, can also be used to train the machine. In fact, the quality of recognition is improved by a whole range of external information: from the list of surnames (and their frequency) to the names of localities in each municipality, including a rough estimate of age distributions over time, anything that can give the machine an idea, however vague, of the "universe of possibilities" is valuable.

In this sense, Socface is very directly a product of open science.
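To illustrate how external information such as a surname frequency list can narrow the "universe of possibilities", the sketch below snaps a raw recognized surname to the closest entry of a lexicon, using frequency to break ties. It is a minimal post-correction sketch under stated assumptions, not the Socface pipeline: the lexicon, its counts, the similarity cutoff and the `correct_surname` helper are all hypothetical, and the project's models may integrate such priors quite differently.

```python
import difflib

# Hypothetical surname lexicon with occurrence counts (illustrative values only).
SURNAME_COUNTS = {"Martin": 950, "Bernard": 610, "Durand": 480, "Marin": 120}


def correct_surname(raw, counts=SURNAME_COUNTS, cutoff=0.8):
    """Return the lexicon entry closest to `raw`, or `raw` itself if none is close."""
    if raw in counts:
        return raw
    candidates = difflib.get_close_matches(raw, list(counts), n=3, cutoff=cutoff)
    if not candidates:
        return raw  # nothing similar enough: keep the recognizer's output

    def score(name):
        # Rank by string similarity first, then by frequency in the lexicon.
        ratio = difflib.SequenceMatcher(None, raw, name).ratio()
        return (round(ratio, 2), counts[name])

    return max(candidates, key=score)
```

With this toy lexicon, a misread such as `"Durnad"` would be mapped to `"Durand"`, while a string with no close neighbor is returned unchanged.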

Processing, analyzing and distributing millions of images

The century of French history studied by Socface is marked by dramatic changes often summarized by a few broadly outlined concepts: urbanization, industrialization, and demographic transition. However, the spatial variation of these phenomena across metropolitan France, their mechanisms, and their consequences remain relatively poorly understood. Socface's contribution, particularly in matching individuals across censuses to reconstruct their life trajectories (migratory, professional, and familial), is to enable the study of this heterogeneity, to grasp how these trajectories intersect, or do not intersect, with "Grand History," how they are influenced by it, and how they, in turn, influence it.
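Matching individuals across censuses can be pictured with a toy record-linkage rule: two entries are candidate matches when their names are similar and their recorded ages are consistent with the time elapsed between the two censuses. The sketch below is only an illustration under stated assumptions; the `Entry` fields, the similarity threshold and the age tolerance are hypothetical and are not taken from the Socface project.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Entry:
    census_year: int
    surname: str
    first_name: str
    age: int  # age as recorded in the census list


def name_similarity(a: Entry, b: Entry) -> float:
    """Similarity in [0, 1] between the full names of two entries."""
    full_a = f"{a.surname} {a.first_name}".lower()
    full_b = f"{b.surname} {b.first_name}".lower()
    return SequenceMatcher(None, full_a, full_b).ratio()


def is_candidate_match(a: Entry, b: Entry, min_similarity=0.85, age_tolerance=2) -> bool:
    """True when names are close and recorded ages agree with the elapsed years."""
    elapsed = b.census_year - a.census_year
    age_gap = abs((b.age - a.age) - elapsed)
    return name_similarity(a, b) >= min_similarity and age_gap <= age_tolerance


# Hypothetical entries: the same child seen in two successive censuses.
marie_1836 = Entry(1836, "Durand", "Marie", 12)
marie_1841 = Entry(1841, "Durant", "Marie", 18)
print(is_candidate_match(marie_1836, marie_1841))
```

The tolerance on age reflects the fact that ages in historical lists are often approximate; a real linkage would also have to handle homonyms, name changes and migration between municipalities.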

Data dissemination

A second direct outcome of the project will be the free dissemination of this data, making it accessible to everyone. For archives, this availability of a large volume of data, both in the FranceArchives name database and on the websites of archival services, represents a tremendous opportunity to develop new services for their users interested in individual microhistory. It also opens up possibilities for pooling resources within the archival network to increase the stock of interoperable archival metadata.

Future impact of Socface

Ultimately, Socface will have a tremendous multiplier effect. On the one hand, it will inevitably encourage the digitization of missing censuses, and even their identification. On the other hand, it can provide a foundation for implementing other large-scale source analysis projects. More broadly, it should foster collaboration between archivists and the research community, with the former able to re-evaluate their digitization policies, for example by developing a national approach around comprehensive typologies, while the latter will need to be more diligent in sharing the data it produces with archival services.

Progress report - June 2022


At the Mnesys days at Naoned, Christopher Kermorvant presents a progress report on the project, 6 months after its start.


-> See on Vimeo

News from Socface - June 2023


Image collection

The first challenge of the Socface project is collecting all census images from archives. Progress on this collection is available on the Socface website. Participation from the various archives is generally very enthusiastic. Once the images and associated metadata are received, all this data must be integrated and organized into the Arkindex platform, requiring significant standardization work.


Collaborative transcriptions

At the heart of all artificial intelligence projects lies the data needed to train the machines. This data must be both abundant and of high quality. It is therefore always produced by humans. The Socface project is no exception, and 11 collaborative transcription campaigns were launched at the end of February on the Callico platform.

For each campaign, 100 pages were randomly selected from all the images in the relevant archive. The lines corresponding to each individual were then automatically detected and presented for data entry in Callico. As a result, between 2,500 and 3,000 lines per campaign must be transcribed, following very precise instructions.
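The sampling step can be sketched as a uniform draw without replacement. This is a minimal sketch, assuming pages are identified by simple strings; the identifiers, the archive size and the seed are hypothetical, and the arithmetic only restates the figures above (2,500 to 3,000 lines over 100 pages is about 25 to 30 individuals per page).

```python
import random


def sample_pages(page_ids, n_pages=100, seed=None):
    """Draw `n_pages` distinct pages uniformly at random from an archive's images."""
    rng = random.Random(seed)
    return rng.sample(list(page_ids), n_pages)


# Hypothetical archive with 50,000 digitized pages.
archive_pages = [f"page-{i:05d}" for i in range(50_000)]
campaign_pages = sample_pages(archive_pages, seed=1836)

# 2,500 to 3,000 lines over 100 pages means roughly 25 to 30 lines per page.
lines_per_page = (2_500 / len(campaign_pages), 3_000 / len(campaign_pages))
```

Fixing the seed makes the selection reproducible, which helps when the same sample has to be regenerated later.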

Progress of the campaigns

Around forty volunteers contribute to the various campaigns, along with about ten regular active members. The progress of the campaigns varies, depending on the number and activity of the volunteers.

Some campaigns are double-annotated, which explains why their number of annotations exceeds 3,000.
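Double annotation also makes it possible to measure how often two volunteers agree on the same line. The sketch below assumes annotations are stored as dictionaries keyed by a line identifier; this is a hypothetical format chosen for illustration, not Callico's export format.

```python
def agreement_rate(first_pass, second_pass):
    """Share of doubly annotated lines on which both annotators entered the same value."""
    shared = first_pass.keys() & second_pass.keys()
    if not shared:
        return None  # no line was annotated twice
    same = sum(
        first_pass[line].strip().lower() == second_pass[line].strip().lower()
        for line in shared
    )
    return same / len(shared)


# Hypothetical annotations of the "first name" or "profession" field.
annotator_a = {"line-1": "Marie", "line-2": "idem", "line-3": "Cultivateur"}
annotator_b = {"line-1": "marie", "line-2": "idem", "line-3": "Cultivatrice", "line-4": "Jean"}
rate = agreement_rate(annotator_a, annotator_b)
```

Here the comparison ignores case and surrounding whitespace; whether such differences should count as disagreements depends on the annotation instructions.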

Annotation duration

The median annotation time is 36 seconds. However, this varies considerably depending on the images and the annotator. Doing a thorough job often requires some research to verify a name or location.


Data entered

An analysis of the most frequently entered values confirms the quality of the data: the most frequent first names are the ones expected.

For surnames, the entry "idem" is very frequent, as expected, because the instructions require it to be entered as written rather than replaced by the value it refers to.


For professions, farmers are the most frequent, under several different forms that will need to be standardized before statistical analyses can be carried out.
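The standardization of professions can be sketched as a mapping from the forms entered to a canonical label, followed by a frequency count. The variants and the canonical label below are hypothetical examples, not the project's actual normalization table.

```python
from collections import Counter

# Hypothetical mapping from entered forms to a canonical label.
CANONICAL = {
    "cultivateur": "cultivateur",
    "cultivatrice": "cultivateur",
    "cult.": "cultivateur",
    "cultivr": "cultivateur",
}


def normalize_profession(raw):
    """Lowercase and trim the entry, then map known variants to their canonical label."""
    key = raw.strip().lower()
    return CANONICAL.get(key, key)  # unknown forms are kept, lowercased and trimmed


entered = ["Cultivateur", "cultivatrice", "Cult.", "journalier", "Cultivateur "]
counts = Counter(normalize_profession(p) for p in entered)
```

Merging masculine and feminine forms, as done here, is itself an analytical choice: a study of women's occupations would keep them apart.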

AI model training

The artificial intelligence model for automatically transcribing the lists has been developed. A first version has been trained on data from Paris provided by the POPP project, as well as on collaborative transcriptions carried out on the Loiret archives website. Transcriptions from the Callico campaigns will soon be added to the training dataset.


The first automatic transcription data on an entire departmental scale will be delivered to researchers this summer.

Thanks!

Thank you to all the participants in the Socface project!