Parsing general scientific journal articles in HTML format

Objectives of this document

Online scientific journal articles in HTML format are designed to be read by humans. While people usually read such articles without problems, parsing them into cleanly formatted plain text data structures is not easy. HTML markup almost always contains tags that change the style of the text (bold, italic, subscript, and so on), so simply extracting every text node in an HTML DOM tree is not a correct solution.
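To make the problem concrete, the following is a minimal sketch using only Python's standard `html.parser` module. It is not LimeSoup code; the names `TextNodeCollector`, `collect_text_nodes`, and the sample markup are made up for illustration. It collects every text node without looking at the tags, then joins the nodes in the two obvious ways.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects every text node, ignoring which tag it came from."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        self.nodes.append(data)


def collect_text_nodes(html):
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


# Hypothetical article fragment: two paragraphs, one with a subscript.
sample = "<p>Anatase TiO<sub>2</sub> was used.</p><p>It was calcined.</p>"
nodes = collect_text_nodes(sample)

glued = "".join(nodes)    # paragraph boundary is lost: "...used.It was..."
spaced = " ".join(nodes)  # the formula is broken apart: "TiO 2"
```

Neither joining strategy is right: joining with nothing fuses the two paragraphs, while joining with a separator splits the chemical formula at the style tag.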

This document goes through the basic rendering specifications of HTML and discusses a possible method for extracting plain text paragraphs accurately and efficiently. We first discuss HTML rendering behaviours, then list several considerations in implementing a robust HTML text extractor. Finally, we show the pseudocode for the HTML extractor implemented in LimeSoup.

HTML rendering behaviours

There are two essential types of elements in HTML DOM trees: block elements and inline elements. As the names suggest, block elements are displayed as separate blocks, while inline elements are laid out within lines of text, which in practice means inside paragraphs.
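The block/inline distinction suggests a simple extraction rule: start a new paragraph at every block element boundary, and concatenate text across inline elements. The sketch below illustrates that rule with Python's standard `html.parser` module. It is not the LimeSoup implementation; `BLOCK_TAGS` is an assumed, incomplete list of block-level tags, and `ParagraphExtractor` and `extract_paragraphs` are hypothetical names.

```python
from html.parser import HTMLParser

# Assumed, non-exhaustive set of block-level tags; a real extractor
# would need a fuller list and should account for CSS overrides.
BLOCK_TAGS = {
    "p", "div", "section", "article", "blockquote",
    "h1", "h2", "h3", "h4", "h5", "h6",
    "ul", "ol", "li", "table", "tr",
}


class ParagraphExtractor(HTMLParser):
    """Splits text at block element boundaries, not at inline ones."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = []

    def _flush(self):
        # Join buffered text nodes and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Inline tags such as <b> or <sub> never flush, so their text
        # stays attached to the surrounding paragraph.
        self._buffer.append(data)


def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text outside a block element
    return parser.paragraphs


paragraphs = extract_paragraphs(
    "<div><p>The <b>band</b> gap of TiO<sub>2</sub></p>"
    "<p>Second <i>one</i>.</p></div>"
)
```

Here `paragraphs` holds one string per `<p>` element, with the text inside `<b>`, `<sub>`, and `<i>` kept in place. The sketch classifies elements by tag name alone, so it ignores stylesheets that change an element's display type.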

(to be continued...)
