
Parsing general scientific journal articles in HTML format

Haoyan Huo edited this page Jul 29, 2019 · 1 revision

Objectives of this document

Online scientific journal articles in HTML format are designed to be read by humans. People usually read them without difficulty, but parsing them into cleanly formatted plain-text data structures is not easy. HTML almost always contains tags that change the style of the text, so simply extracting all the text in an HTML DOM tree is not the correct solution.
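To make the problem concrete, here is a minimal sketch of the naive approach using only Python's standard `html.parser` module. The class `NaiveTextCollector`, the function `naive_text`, and the sample markup are hypothetical names and data chosen for illustration; they are not taken from LimeSoup.

```python
from html.parser import HTMLParser


class NaiveTextCollector(HTMLParser):
    """Collects every text node and ignores all tags (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Every text node is appended, regardless of the tag it sits in.
        self.chunks.append(data)


def naive_text(html):
    """Concatenate all text nodes of an HTML fragment."""
    collector = NaiveTextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.chunks)


# Hypothetical sample: two paragraphs with inline style tags.
sample = "<p>Water is H<sub>2</sub>O.</p><p>It is <b>polar</b>.</p>"
print(naive_text(sample))  # prints "Water is H2O.It is polar."
```

The inline tags (`<sub>`, `<b>`) are handled acceptably here, but the boundary between the two paragraphs is lost: the sentences run together with no separator.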

This document goes through the basic rendering specifications of HTML and discusses a possible method of extracting plain-text paragraphs accurately and efficiently. We first discuss HTML rendering behaviours, then list several considerations in implementing a robust HTML text extractor. Finally, we show the pseudocode for the HTML extractor implemented in LimeSoup.

HTML rendering behaviours

There are two essential types of elements in HTML DOM trees: block elements and inline elements. As the names suggest, block elements are displayed as blocks, while inline elements are displayed within "lines" of text, which in practice correspond to paragraphs.
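The block/inline distinction suggests one possible extraction rule: start a new paragraph at every block element boundary and keep the text of inline elements in the current paragraph. The sketch below illustrates that rule with the standard `html.parser` module. It is not the LimeSoup implementation; `BLOCK_TAGS` is an assumed, incomplete list of block-level tags, and `BlockAwareExtractor` and `extract_paragraphs` are hypothetical names.

```python
from html.parser import HTMLParser

# Assumption: a small, incomplete set of block-level tags for illustration.
BLOCK_TAGS = {
    "p", "div", "section", "article", "blockquote",
    "h1", "h2", "h3", "h4", "h5", "h6",
    "ul", "ol", "li", "table", "tr",
}


class BlockAwareExtractor(HTMLParser):
    """Splits text into paragraphs at block element boundaries."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = []

    def _flush(self):
        # Join buffered text, collapse runs of whitespace, skip empty results.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # A block element starts a new paragraph; inline tags do nothing.
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        # Closing a block element also ends the current paragraph.
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # emit any trailing text outside block elements


def extract_paragraphs(html):
    """Return a list of plain-text paragraphs from an HTML fragment."""
    extractor = BlockAwareExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.paragraphs


# Hypothetical sample markup.
sample = "<div><p>Water is H<sub>2</sub>O.</p><p>It is <b>polar</b>.</p></div>"
print(extract_paragraphs(sample))  # prints ['Water is H2O.', 'It is polar.']
```

A real extractor would also need to account for CSS that changes an element's display type, and for elements such as `<script>` and `<style>` whose content is never rendered; this sketch ignores both.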

(to be continued...)