National Archives — OCR and Named Entity Extraction with Python

Johan Louwers
4 min read · Dec 30, 2022

Declassified — TOP SECRET

By tradition, I work on a few coding projects during the Christmas season. The list of potential projects grows throughout the year, and I pick a couple of them for fun and for my own education. One of the projects on this year's list was born from reading a book.

The book in question described the work of a team of researchers who had to dig through piles of scanned documents retrieved from national archives around the world. Obtaining the documents, deciphering them, and putting them into context with each other was all part of finding the needle in the haystack.

The big time consumers and savers

The big time consumers in the process were finding and obtaining the records, deciphering the often poor-quality scans, and extracting information from them so the team could relate documents to each other and assess their value.

While finding and obtaining the documents is not something that is easily solved, or at least not easily solved by me personally, other problems are solvable.

A huge time saver for researchers who have to dig through national archives could be taking away the task of deciphering badly scanned documents, as well as supporting…

Written by Johan Louwers

Johan Louwers is a technology enthusiast with a long background in supporting enterprises and startups alike as CTO, Chief Enterprise Architect and developer.
