National Archives — OCR and Named Entity Extraction with Python
As a tradition, I tend to work on a few coding projects during the Christmas season. The list of potential projects grows throughout the year, and I pick a couple of them for fun and for my own education. One of the projects on this year's list was born from reading a book.
The book described the work of a team of researchers who needed to dig through piles of scanned documents retrieved from national archives around the world. Obtaining the documents, deciphering them and putting them into context with each other was part of finding the needle in the haystack.
The big time consumers and savers
The big time consumers in the process were finding and obtaining the records, deciphering the often poor-quality scans, and extracting information from them so the team could relate documents to each other and assess their value.
While finding and obtaining the documents is not something that is easily solved, or at least not easily solved by me personally, the other problems are solvable.
A huge time saver for researchers who have to dig through national archives could be taking away the task of deciphering badly scanned documents as well as supporting…