An overview of web page content extraction

Joy Bose
Jun 14, 2020


In this article, we survey the area of webpage content extraction, which has many useful applications, and look at a few popular extraction algorithms.

Introduction

Webpage content extraction refers to the process of extracting the relevant content from a webpage while leaving out the irrelevant (noisy) content such as ads, tables of contents, headers and footers. It is also known as boilerplate removal.

Reader Mode in a web browser

Some applications of webpage content extraction include:

  1. translation of the webpage content to a different language
  2. reading the webpage content (a voice reader)
  3. summarising the content
  4. presenting the content in a readable form (reader mode in web browsers).

At a high level, this is a classification problem: the objective is to label each HTML element in the webpage, including text, images, multimedia and dynamic content, as either relevant or noise. A more general version of the problem is to label the structure of the webpage, i.e. which part is the heading, which is the table of contents, which is the main body, which is a relevant image, which is the footer, and so on.

However, there are difficulties in developing an approach that solves this problem. The web is huge and full of unstructured data in many different formats, so it is understandably hard to develop an algorithm that works for all kinds of websites. HTML, the language in which webpages are written, is not a very structured language and is perhaps a little too flexible. The layout of the webpage, that is, how it appears visually, does not necessarily correspond to how the HTML is written. For example, what looks like an HTML table may merely be a formatting arrangement of div elements, and vice versa. This increases the difficulty of developing a solution to this problem.

Various approaches have been proposed to solve this problem. Below we examine each of the algorithms and approaches in brief.

Extraction of webpage content based on the Semantic Web

The Semantic Web was proposed in the early 2000s as an improved version of the World Wide Web. The goal was to create a framework in which web content is easy for machines to understand, using languages such as OWL (Web Ontology Language). If a webpage uses such markup, extraction becomes trivial, since the content is already tagged semantically.

Extraction of web content based on webpage templates

This approach can be used in cases where the webpages follow a common and generally known template, for example Wikipedia articles. It is then relatively easy to train a model to recognise the template.

Extraction of content based on heuristic rules

This method involves writing rules, for example based on regular expressions. It may work for simpler or older HTML pages, but not for newer pages or those with dynamically changing content.

Some example heuristic rules could be made using the following observations:

  • Normally, content blocks (HTML blocks labelled as relevant content) are clustered together in the webpage, and so are noisy blocks. So if a content block is identified, its neighbours are also likely to be content, and vice versa
  • Noisy blocks (such as ads) have more HTML tags and less text
  • Content blocks have more, and longer, text

So we can define parameters such as

  • Text density (text words per line in the HTML block)
  • Link density (HTML links per line in the HTML block)

and make some rules using such parameters. The exact rules may be based on experimentation on a dataset of webpages.
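
To make these parameters concrete, here is a minimal sketch in Python, assuming the page has already been saved to a file. The use of the BeautifulSoup library, the 80-character wrap width and the thresholds in the toy rule are my own choices for illustration, not part of any particular published algorithm.

```python
import textwrap
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def block_features(block, wrap_width=80):
    """Compute simple text-density and link-density features for one HTML block."""
    text = block.get_text(" ", strip=True)
    lines = textwrap.wrap(text, wrap_width) or [""]
    words = text.split()
    links = block.find_all("a")
    return {
        "text_density": len(words) / len(lines),   # words per wrapped line
        "link_density": len(links) / len(lines),   # links per wrapped line
    }

soup = BeautifulSoup(open("page.html", encoding="utf-8").read(), "html.parser")
for block in soup.find_all(["p", "div", "td", "li"]):
    f = block_features(block)
    # Toy rule: dense text with few links is probably content.
    label = "content" if f["text_density"] > 10 and f["link_density"] < 1 else "noise"
    print(label, round(f["text_density"], 1), round(f["link_density"], 1))
```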

Important algorithms using this technique are the body text extraction (BTE) and maximum subsequence segmentation (MSS) algorithms.

  • Body Text Extraction (BTE) algorithm (Finn et al., 2001): identifies the single continuous region of the page with the most text and the fewest tags. https://www.aidanf.net/posts/bte-body-text-extraction
  • Maximum Subsequence Segmentation (MSS): a WWW 2009 paper by Pasternack and Roth. In this approach, the webpage is treated as a sequence of tokens, each with a local score, which reduces content extraction to the classic maximum subsequence problem (finding the contiguous run of tokens with the highest total score). To extract an article, we identify this contiguous block of HTML and then remove any remaining noise from it, as sketched below. http://cogcomp.org/papers/PasternackRo09.pdf
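
To make the MSS idea concrete, here is a small sketch assuming each token has already been assigned a local score (positive if it looks like article text, negative otherwise). The token list and scores below are invented for the example; the maximisation step is the standard maximum subsequence (Kadane's) algorithm.

```python
def max_subsequence(scores):
    """Return (start, end) indices of the contiguous run with the highest total score."""
    best_sum, best_range = float("-inf"), (0, 0)
    current_sum, current_start = 0.0, 0
    for i, s in enumerate(scores):
        if current_sum <= 0:
            current_sum, current_start = 0.0, i
        current_sum += s
        if current_sum > best_sum:
            best_sum, best_range = current_sum, (current_start, i + 1)
    return best_range

# Hypothetical per-token scores produced by some local classifier:
# tags and boilerplate words score negative, article words score positive.
tokens = ["<div>", "Menu", "</div>", "<p>", "The", "article", "text", "...", "</p>", "<a>", "Ads", "</a>"]
scores = [-2, -1, -2, -0.5, 1, 1, 1, 0.5, -0.5, -2, -1, -2]

start, end = max_subsequence(scores)
print(tokens[start:end])   # the span most likely to be the article body
```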

Arc90 Readability service

This algorithm, called Arc90 Readability, was one of the earliest content extraction algorithms to be used in web browsers; modified versions of it were used in Mozilla Firefox and Apple Safari. It is heuristic based, with a number of hand-crafted rules. The steps include:

  • Remove script tags, CSS etc. from the webpage
  • Traverse the DOM tree of the webpage, assign a score to each node as per the rules, bubble the score up to the node's parents, and build a new DOM tree
  • Examine tags in the webpage to assign scores: H1 to H6, DIV, P, A, UL, OL, SCRIPT, INPUT, FORM, PRE, BLOCKQUOTE
  • Positive score (keywords in class/id attributes): body, content, article, column, entry, hentry, main, page, pagination, text, post, blog
  • Negative score (eliminate): hidden, banner, footer, masthead, footnote
  • Get the <p> paragraphs, add the parent of each to a list, initialise the parents' scores to 0, add or subtract points per the rules above, then find and render the parent with the most points (a sketch of this last step is given below)
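
Below is a rough sketch of that last scoring step, in the spirit of (but not identical to) the original Readability code; the keyword regexes and point values are simplified assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup  # pip install beautifulsoup4

POSITIVE = re.compile(r"article|body|content|entry|hentry|main|page|post|text|blog|column", re.I)
NEGATIVE = re.compile(r"hidden|banner|footer|masthead|footnote|comment|sidebar", re.I)

def extract_main_block(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):          # strip scripts and stylesheets first
        tag.decompose()

    scores, nodes = {}, {}                          # keyed by id() of the parent node
    for p in soup.find_all("p"):
        parent = p.parent
        key = id(parent)
        if key not in scores:
            nodes[key], scores[key] = parent, 0
            hints = " ".join(parent.get("class", [])) + " " + (parent.get("id") or "")
            if POSITIVE.search(hints):
                scores[key] += 25                   # class/id suggests real content
            if NEGATIVE.search(hints):
                scores[key] -= 25                   # class/id suggests boilerplate
        # Each reasonably long paragraph earns its parent some points.
        scores[key] += 1 + min(len(p.get_text(strip=True)) // 100, 3)

    return nodes[max(scores, key=scores.get)] if scores else None

main = extract_main_block(open("page.html", encoding="utf-8").read())
print(main.get_text(" ", strip=True)[:500] if main else "no candidate found")
```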


JusText algorithm

This algorithm is based on the PhD thesis of Jan Pomikálek. It segments the webpage and classifies the segments using grammatical and other rules. The idea is that relevant content is more likely to consist of grammatically correct, full sentences, whereas noisy content (such as ads) is more likely to consist of lists, enumerations and other short fragments with few function words. The algorithm therefore uses the density of stop words and similar cues to decide whether a block is content text or not; a usage example is shown below.
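
The algorithm is also available as a Python package, justext. A minimal usage sketch, following the example in the package's documentation (with a placeholder URL), looks like this:

```python
import requests
import justext  # pip install justext

response = requests.get("https://example.com/some-article")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))

for paragraph in paragraphs:
    if not paragraph.is_boilerplate:     # keep only blocks classified as content
        print(paragraph.text)
```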


Illustration of the labeling of HTML snippets in a webpage using JusText algorithm

Machine learning techniques to extract content

This involves using web scraping together with natural language processing (NLP) techniques such as POS tagging to create a labelled dataset of webpage content. For example, we can label one HTML snippet as representing an image, another as a heading, another as a table of contents, and so on. Once the dataset is created, we train a machine learning model (such as a support vector machine, decision tree, conditional random field or deep neural network) on it, and use the trained model to predict the labels for a new HTML page.
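
A minimal sketch of such a pipeline using scikit-learn is shown below. The per-block features and the tiny training set are invented purely for illustration; a real system would use a much larger labelled dataset.

```python
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical per-block features: [word_count, link_density, text_density, avg_word_length]
X_train = [
    [250, 0.02, 12.0, 5.1],   # long paragraph, few links  -> content
    [12,  0.80, 1.5,  6.0],   # short, link-heavy block    -> noise (navigation)
    [180, 0.05, 10.2, 4.8],
    [8,   0.90, 1.0,  7.2],
]
y_train = ["content", "noise", "content", "noise"]

model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)

# Predict labels for the blocks of a new page (features extracted the same way).
X_new = [[300, 0.01, 14.0, 5.0], [10, 0.85, 1.2, 6.5]]
print(model.predict(X_new))   # e.g. ['content' 'noise']
```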

Some papers and PhD thesis using machine learning techniques are:

  1. Pomikálek, J., 2011. Removing boilerplate and duplicate content from web corpora. Doctoral dissertation, Masarykova univerzita, Fakulta informatiky.
  2. Vogels, T., Ganea, O.E. and Eickhoff, C., 2018. Web2Text: Deep Structured Boilerplate Removal. arXiv preprint arXiv:1801.02607.

Boilerpipe algorithm

The Boilerpipe algorithm, or variants of it, is used in Chromium-based browsers such as Google Chrome, where the feature is called DOM Distiller. It is based on a 2010 paper ("Boilerplate detection using shallow text features") and the PhD thesis of Christian Kohlschütter.

In this algorithm, the following features are computed for the current, previous and next blocks of HTML:

  • Word count (number of words in the current HTML block)
  • Average word length (in characters)
  • Average sentence length (sentences delimited by full stops and separators such as ?!.:)
  • Link density (number of links in the block)
  • Text density (words per line after word wrapping at 80 characters per line)
  • Start-uppercase ratio (how many words start with an uppercase letter)
  • Densitometric features (block fusion: fuse text fragments of similar text density)
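
The original Boilerpipe library is written in Java. For a quick experiment from Python, a community port such as boilerpy3 can be used roughly as follows; the package and method names are based on its documentation and should be treated as an assumption rather than a guarantee.

```python
from boilerpy3 import extractors  # pip install boilerpy3

# ArticleExtractor implements the article-oriented Boilerpipe strategy.
extractor = extractors.ArticleExtractor()

# From raw HTML that has already been fetched elsewhere
html = open("page.html", encoding="utf-8").read()
main_text = extractor.get_content(html)

# Or directly from a URL
# main_text = extractor.get_content_from_url("https://example.com/some-article")

print(main_text[:500])
```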

References

  1. https://github.com/kohlschutter/boilerpipe
  2. http://www.l3s.de/~kohlschuetter/boilerplate/
  3. http://boilerpipe-web.appspot.com/
  4. Kohlschütter C, Fankhauser P, Nejdl W. 2010. Boilerplate detection using shallow text features. In Proceedings of the third ACM international conference on Web search and data mining WSDM ’10, ACM.

Content Extraction based on Tag Ratios (CETR) algorithm

CETR was described in a paper published at the WWW 2010 conference ("Content Extraction via Tag Ratios"). In this approach, we compute a tag ratio for each line of the HTML source (roughly, the amount of text on the line relative to the number of tags), giving a tag-ratio histogram over the lines. We then apply a Gaussian smoothing pass to obtain a smoothed histogram, and finally apply k-means clustering to separate content from noise, as sketched below.
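
A simplified sketch of this pipeline follows; it is not the exact algorithm from the paper, and the naive per-line tag counting, the smoothing width and the two-cluster choice are assumptions made for illustration.

```python
import re
import numpy as np
from scipy.ndimage import gaussian_filter1d   # pip install numpy scipy scikit-learn
from sklearn.cluster import KMeans

def tag_ratios(html: str) -> np.ndarray:
    """Per-line tag ratio: text characters divided by the number of tags (at least 1)."""
    ratios = []
    for line in html.splitlines():
        tags = re.findall(r"<[^>]*>", line)
        text_chars = len(re.sub(r"<[^>]*>", "", line).strip())
        ratios.append(text_chars / max(len(tags), 1))
    return np.array(ratios)

html_text = open("page.html", encoding="utf-8").read()
ratios = tag_ratios(html_text)
smoothed = gaussian_filter1d(ratios, sigma=2)               # Gaussian smoothing pass
labels = KMeans(n_clusters=2, n_init=10).fit_predict(smoothed.reshape(-1, 1))

content_cluster = labels[int(np.argmax(smoothed))]          # cluster of the densest line = content
content = [line for line, lab in zip(html_text.splitlines(), labels) if lab == content_cluster]
print("\n".join(content))
```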


Webpage extraction using Conditional Random Fields

This approach was described in a 2007 paper by Michal Marek, Pavel Pecina and colleagues, in which the problem is formulated as a sequence labelling problem. Their approach won the CleanEval 2007 competition.

The steps include the following :

  • Preprocessing: clean the page with tidy and remove script, style and embedded elements
  • Divide the page into content blocks
  • Manually label each block as header, paragraph, list, continuation or noise
  • Generate a feature vector for each block and run the learning algorithm
  • Features: markup based (container.p, container.a, split.br), content based (char.alpha-rel, token.num-abs, sentence.avg-length), document related (position, document.word-count)

Reference: https://ufal.mff.cuni.cz/~pecina/files/cleaneval-2007.pdf
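
As a sketch of how such a block-level sequence labeller could be set up today (not the authors' original implementation), one could use the third-party sklearn-crfsuite package, feeding it one feature dictionary per block; the feature values below are invented for illustration.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

# Each training example is one page: a sequence of blocks, each described by a feature dict.
X_train = [[
    {"container.p": True, "char.alpha-rel": 0.93, "sentence.avg-length": 18.0, "position": 0.10},
    {"container.a": True, "char.alpha-rel": 0.40, "sentence.avg-length": 2.0,  "position": 0.15},
    {"container.p": True, "char.alpha-rel": 0.95, "sentence.avg-length": 21.0, "position": 0.20},
]]
y_train = [["paragraph", "noise", "paragraph"]]   # one label per block, in order

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)

# Label the blocks of a new page (features extracted the same way).
X_new = [[
    {"container.p": True, "char.alpha-rel": 0.90, "sentence.avg-length": 17.0, "position": 0.30},
]]
print(crf.predict(X_new))   # e.g. [['paragraph']]
```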

Computer Vision based techniques

This approach takes the actual webpage layout into account. The HTML alone does not capture the CSS, and even HTML plus CSS does not fully determine the rendered layout, which also depends on factors such as screen size and the type of user agent (mobile phone, PC or tablet). Computer vision techniques are therefore used to segment the rendered webpage into regions and to train a model to label each region separately. To make the task easier, we can use rules that hold for most webpages, such as: content blocks are usually around the centre of the webpage, while noisy blocks are around the sides.

The idea behind using such techniques is that humans can instantly tell what is content and what is noise, even on a webpage in an unfamiliar language. So this approach is closer to how we ourselves identify the content of a webpage.

VIPS algorithm

VIPS is one of the techniques that simulates the layout of the webpage. It was first proposed by researchers at Microsoft Research in 2003. The approach builds a visual tree from the DOM, simulating how the webpage ‘looks’ and which elements form coherent blocks.

Partitioning of the webpage using the VIPS algorithm is done as follows:

  • Extract nodes from the DOM tree and find the horizontal and vertical separators
  • Identify the visual blocks (based on changes of font, size, foreground or background colour) from the DOM of the parent HTML
  • Assign a score to each block: the degree of coherence (DoC) of its elements. Repeat the previous steps on blocks whose coherence is below a threshold
  • Finally, construct the visual tree (a toy sketch of this recursive segmentation is given below)
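
The recursion at the heart of this can be illustrated with a toy sketch. The Node class, the coherence estimate and the threshold below are invented stand-ins (real VIPS works on rendered DOM nodes using visual cues such as font and colour), so treat this purely as a skeleton of the idea.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:                      # toy stand-in for a rendered DOM node
    tag: str
    children: List["Node"] = field(default_factory=list)

@dataclass
class VisualBlock:
    node: Node
    doc: float                   # degree of coherence of this block
    children: List["VisualBlock"] = field(default_factory=list)

def coherence(node: Node) -> float:
    """Toy DoC estimate: fraction of children sharing the most common tag (1.0 for leaves)."""
    if not node.children:
        return 1.0
    tags = [child.tag for child in node.children]
    return tags.count(max(set(tags), key=tags.count)) / len(tags)

def segment(node: Node, pdoc: float = 0.8) -> VisualBlock:
    """Recursively split blocks whose degree of coherence is below the permitted threshold (PDoC)."""
    block = VisualBlock(node, doc=coherence(node))
    if block.doc < pdoc:
        block.children = [segment(child, pdoc) for child in node.children]
    return block

page = Node("body", [Node("div", [Node("p"), Node("p"), Node("img")]), Node("nav")])
tree = segment(page)
print(tree.doc, [b.node.tag for b in tree.children])
```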

References:

  1. http://www.cad.zju.edu.cn/home/dengcai/VIPS/VIPS.html
  2. https://www.microsoft.com/en-us/research/publication/vips-a-vision-based-page-segmentation-algorithm/
  3. https://github.com/tpopela/vips_java
  4. Cai, D., Yu, S., Wen, J. R., & Ma, W. Y. 2003. VIPS: a vision-based page segmentation algorithm. Technical report MSR-TR-2003-79, Microsoft Research

Diffbot (a pure computer vision based technique)

Diffbot is a paid web service for automatic extraction of web data using computer vision techniques. Their approach works roughly as follows:

  • Render the web page fully, including images, CSS and Ajax content (using a WebKit-based renderer they describe as about 10x faster)
  • Analyse everything the user and browser see: the visual layout, using computer vision (object detection), together with layout, position and browser information
  • Using machine learning, estimate the likelihood of each component being part of the title, author, text or images
  • Return the result as a JSON response (as in the hypothetical API call sketched below)
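
Since the service returns JSON, a call from Python looks roughly like the following. The endpoint path, parameter names and response fields are based on my recollection of Diffbot's public Article API documentation, so treat them as assumptions and check the current docs before use.

```python
import requests

API_TOKEN = "YOUR_DIFFBOT_TOKEN"          # obtained from a Diffbot account
page_url = "https://example.com/some-article"

# Assumed endpoint of Diffbot's Article API (v3); verify against current documentation.
resp = requests.get(
    "https://api.diffbot.com/v3/article",
    params={"token": API_TOKEN, "url": page_url},
    timeout=30,
)
data = resp.json()

# The extracted article is typically returned under an "objects" list.
for obj in data.get("objects", []):
    print(obj.get("title"))
    print(obj.get("author"))
    print((obj.get("text") or "")[:300])
```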


Datasets to measure the accuracy of the extraction algorithm

  1. CleanEval: the standard dataset used for evaluating boilerplate removal in academia and industry. It was part of a competition in 2007 and has 58 English and 51 Chinese HTML pages, annotated by 23 students. It includes gold-standard files (plain text with simple markup for headers, text and paragraphs, without noise) and a scoring script to measure the accuracy of any extraction algorithm. https://github.com/ppke-nlpg/CleanPortalEval and https://cleaneval.sigwac.org.uk/devset.html
  2. Mozilla Readability test dataset: https://github.com/mozilla/readability
  3. Dragnet dataset https://github.com/dragnet-org/dragnet and https://github.com/seomoz/dragnet_data
  4. Google news dataset L3S-GN1 (news articles crawled from Google news): https://github.com/geodrome/page-signal

Available Github projects of content extraction algorithms with code

  1. Arc90 Readability https://github.com/masukomi/arc90-readability
  2. Boilerpipe https://github.com/kohlschutter/boilerpipe
  3. VIPS https://github.com/tpopela/vips_java
  4. Goldminer https://github.com/endredy/GoldMiner
  5. Mozilla readability https://github.com/mozilla/readability


Written by Joy Bose

Working as a software developer in machine learning projects. Interested in the intersection between technology, machine learning, society and well being.
