Graph Based Keyword Spotting in Handwritten Historical Documents

Time 2015 — 2018
Funding Hasler Foundation
Researchers Kaspar Riesen

Abstract: Many libraries around the world have started digitizing their most valuable old manuscripts in order to preserve the world’s cultural heritage. The large number of handwritten document images now available has created the need to make them amenable to searching and browsing. However, the automatic transcription of text images is still a largely unsolved problem, especially for degraded historical manuscripts. Keyword spotting (KWS) has therefore been proposed as an alternative to a complete transcription. KWS refers to the process of retrieving all instances of a given keyword or key phrase from a document. A large variety of algorithms have been developed for KWS during the last two decades, but graph-based representations and graph matching have very rarely been used for this specific task. Most probably this is due to well-known problems arising with graphs in the field of unconstrained handwriting recognition. KWS, however, is not necessarily based on handwriting recognition, and it turns out that the paradigm of graph matching is able to meet the requirements of KWS. That is, once isolated words are represented as graphs, search and retrieval in documents can be regarded as matching an input graph (the keyword) against a large set of graphs (the words of the document) or against one large graph (the whole document). The major objective of the present project is to develop specific graph representations and graph matching technologies, as well as novel graph embedding and kernel techniques, within the field of graph-based KWS. The main focus of the project is to gain a deep understanding of the advantages and limitations of graph-based methods in the field of KWS. For testing our novel algorithms, two historical documents will be used, viz. the George Washington and the Parcival Manuscript. These historical documents are well known in the document analysis community, and benchmark tests for KWS are available for both of them.
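To make the retrieval view concrete, the following minimal Python sketch ranks the word graphs of a document by their dissimilarity to a keyword graph. Everything in it is an illustrative assumption rather than the project's method: the graph representation (node coordinates plus node degrees), the helper names `make_graph`, `dissimilarity` and `spot`, the cost parameters, and in particular the greedy node assignment, which merely stands in for a proper graph matching procedure such as an approximation of graph edit distance.

```python
from math import dist


def make_graph(points, edges):
    """Build a toy word graph: node coordinates plus the degree of each node.

    Hypothetical representation, chosen only for this illustration.
    """
    degree = [0] * len(points)
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    return {"points": list(points), "degree": degree}


def dissimilarity(g, h, node_cost=1.0, degree_weight=0.5):
    """Greedy node assignment cost between two graphs (toy measure).

    Each node of g is matched to the cheapest unused node of h, or deleted
    at cost node_cost if no match is cheaper. Leftover nodes of h count as
    insertions. The result is normalised by the total number of nodes.
    """
    unused = set(range(len(h["points"])))
    total = 0.0
    for p, dp in zip(g["points"], g["degree"]):
        best, best_cost = None, node_cost  # default: delete this node
        for k in sorted(unused):  # sorted for deterministic tie-breaking
            cost = dist(p, h["points"][k]) + degree_weight * abs(dp - h["degree"][k])
            if cost < best_cost:
                best, best_cost = k, cost
        if best is not None:
            unused.discard(best)
        total += best_cost
    total += node_cost * len(unused)  # unmatched nodes of h are insertions
    return total / max(1, len(g["points"]) + len(h["points"]))


def spot(keyword, word_graphs):
    """Return word identifiers sorted by increasing dissimilarity to the keyword."""
    return sorted(word_graphs, key=lambda name: dissimilarity(keyword, word_graphs[name]))


# Usage with made-up data: one keyword graph and two candidate word graphs.
keyword = make_graph([(0, 0), (1, 0), (2, 0)], [(0, 1), (1, 2)])
words = {
    "far": make_graph([(0, 0), (0, 1), (0, 2), (0, 3)], [(0, 1), (1, 2), (2, 3)]),
    "near": make_graph([(0, 0.1), (1, 0.1), (2, 0.1)], [(0, 1), (1, 2)]),
}
ranking = spot(keyword, words)
print(ranking)
```

In a real system the top of such a ranking, or all words below a dissimilarity threshold, would be returned as the spotted instances of the keyword. The greedy assignment is order dependent and not optimal, so it should be read only as a placeholder for the matching step.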