A rule-based approach to the recognition of newspaper articles


Abstract of my MSc thesis in computer science

The layout of the modern daily press is complex: a single page usually contains several articles, images, charts, and advertisements. Each article has a title and is typeset over a series of text blocks, which must be read in the correct sequence. Once such a page has been digitized, an OCR system can identify the individual blocks on the page and extract the text from the text blocks. The layout, however, is still needed to understand the contents, because the arrangement and appearance of the blocks convey which block belongs to which article and what the correct reading order is. If this information is discarded, what remains is just an unordered list of article fragments.

In this research, we introduce NewsReader, an experimental system that reconstructs the original articles using both the layout information and the textual contents. One set of rules captures geometric features of the page, while another focuses on the textual similarities between blocks, an approach inspired by previous research on the lexical clustering of texts. The observations made by all the rules are then combined to produce the final result.
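As a rough illustration of how geometric and textual observations could be combined, here is a minimal Python sketch. It is not the NewsReader implementation: the `Block` record, the two rules, the equal weights, the `max_gap` value, and the threshold are all assumptions made up for this example.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Block:
    # Hypothetical block record: bounding box in page units plus OCR text.
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def word_counts(text):
    # Bag-of-words representation of a block's text.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two word-count vectors.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def geometric_rule(a, b, max_gap=20.0):
    # Illustrative rule: b sits directly below a and overlaps it horizontally.
    overlap = min(a.x1, b.x1) - max(a.x0, b.x0)
    gap = b.y0 - a.y1
    return 1.0 if overlap > 0 and 0 <= gap <= max_gap else 0.0


def textual_rule(a, b):
    # Illustrative rule: lexical similarity of the two blocks.
    return cosine(word_counts(a.text), word_counts(b.text))


def group_blocks(blocks, threshold=0.5):
    # Merge blocks whose combined rule score reaches the threshold (union-find).
    parent = list(range(len(blocks)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, a in enumerate(blocks):
        for j, b in enumerate(blocks):
            if i == j:
                continue
            score = 0.5 * geometric_rule(a, b) + 0.5 * textual_rule(a, b)
            if score >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(blocks)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


page = [
    Block(0, 0, 100, 50, "election results announced by the government today"),
    Block(0, 60, 100, 120, "the government said the election turnout was high"),
    Block(200, 0, 300, 50, "football team wins the championship final match"),
]
groups = group_blocks(page)
```

In this toy page the first two blocks are stacked in the same column and share vocabulary, so they end up in one group, while the third block stays on its own. A real system would need many more rules and a way to order the blocks within each article.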

One of the innovations of this system is its ability to gather evidence from various sources using a compact set of general rules, which are valid for most of the Western daily press. The system was rigorously tested on a set of real newspaper pages and performed very well, with average precision and recall above 97%.
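For readers unfamiliar with these metrics, the sketch below shows one common way to compute precision and recall for a grouping of blocks into articles, by comparing pairs of blocks placed in the same article. This pairwise definition is an assumption for illustration; the thesis may define the measures differently, and the function names and sample data are hypothetical.

```python
from itertools import combinations


def same_article_pairs(articles):
    # Every unordered pair of block ids that share an article.
    return {frozenset(pair) for blocks in articles for pair in combinations(blocks, 2)}


def precision_recall(predicted, truth):
    # Precision: share of predicted pairs that are correct.
    # Recall: share of true pairs that were predicted.
    pred = same_article_pairs(predicted)
    gold = same_article_pairs(truth)
    hits = len(pred & gold)
    precision = hits / len(pred) if pred else 1.0
    recall = hits / len(gold) if gold else 1.0
    return precision, recall


# Made-up example: block 3 is wrongly attached to the second article.
truth = [[1, 2, 3], [4, 5]]
predicted = [[1, 2], [3, 4, 5]]
precision, recall = precision_recall(predicted, truth)
```

Here two of the four predicted pairs are correct and two of the four true pairs are found, so both measures come out at 0.5.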

The thesis is structured as follows. The introduction presents the general context of the research and defines the main concepts. The second chapter reviews previous research on lexical clustering and on computing the reading order of generic publications. The third chapter presents the document image analysis approach, the NewsReader architecture, the rules employed, and some execution examples. The fourth chapter describes the experimental setup, then reports and discusses the full results. The fifth chapter draws the conclusions. The appendices give some details about the implementation and contain the listings cited in the text.
