An overview of web page content extraction

Joy Bose
Jun 14, 2020


In this article, we survey the area of webpage content extraction, which has many useful applications, and look at a few popular extraction algorithms.

Introduction

Webpage content extraction refers to the process of extracting the relevant content from a webpage while leaving out the irrelevant (noisy) content such as ads, tables of contents, headers and footers. It is also known as boilerplate removal.

Reader Mode in a web browser

Some applications of webpage content extraction include:

  1. translation of the webpage content to a different language
  2. reading the webpage content (a voice reader)
  3. summarising the content
  4. presenting the content in a readable form (reader mode in web browsers).

At a high level, this is a classification problem: the objective is to label each HTML element in the webpage, including text, images, multimedia and dynamic content, as either relevant or noise. A more general version of the problem is to label the structure of the webpage, i.e. which part is the heading, which is the table of contents, which is the main body, which is a relevant image, which is the footer, and so on.

However, there are difficulties in developing an approach that solves this problem. The web is huge and full of unstructured data in many different formats, so it is understandably hard to develop an algorithm that works for all kinds of websites. HTML, the language in which webpages are written, is not a very structured language and is perhaps a little too flexible. The layout of the webpage, that is, how it appears visually, does not necessarily correspond to how the HTML is written. For example, what looks like an HTML table may merely be a formatting arrangement of div elements, and vice versa. This increases the difficulty of developing a solution to this problem.

Various approaches have been proposed to solve this problem. Below we examine each of the algorithms and approaches in brief.

Extraction of webpage content based on the Semantic Web

The Semantic Web was proposed in the early 2000s as an improved version of the World Wide Web. The goal was to create a framework in which web content is easy for machines to understand, using languages such as OWL (Web Ontology Language). If a webpage uses such markup, extraction becomes trivial, since the content is already tagged semantically.

Extraction of web content based on webpage templates

This approach can be used in cases where the webpages follow a common and generally known template, for example Wikipedia articles. It is then relatively easy to train a model to recognise the template.

Extraction of content based on heuristic rules

This method involves writing rules, for example based on regular expressions. It may work for simpler or older HTML pages, but not for newer pages or those with dynamically changing content.

Some example heuristic rules could be made using the following observations:

  • Normally, content blocks (HTML blocks labelled as relevant content) are clustered together in the webpage, and so are noisy blocks. So if a content block is identified, its neighbours are also likely to be content, and vice versa
  • Noisy blocks (such as ads) have more HTML tags and less text
  • Content blocks have more, and longer, text

So we can define parameters such as

  • Text density (text words per line in the HTML block)
  • Link density (HTML links per line in the HTML block)

and make some rules using such parameters. The exact rules may be based on experimentation on a dataset of webpages.
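
To make these parameters concrete, here is a minimal sketch in Python, assuming the page has already been saved to a file. The use of the BeautifulSoup library, the 80-character wrap width and the thresholds in the toy rule are my own choices for illustration, not part of any particular published algorithm.

```python
import textwrap
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def block_features(block, wrap_width=80):
    """Compute simple text-density and link-density features for one HTML block."""
    text = block.get_text(" ", strip=True)
    lines = textwrap.wrap(text, wrap_width) or [""]
    words = text.split()
    links = block.find_all("a")
    return {
        "text_density": len(words) / len(lines),   # words per wrapped line
        "link_density": len(links) / len(lines),   # links per wrapped line
    }

soup = BeautifulSoup(open("page.html", encoding="utf-8").read(), "html.parser")
for block in soup.find_all(["p", "div", "td", "li"]):
    f = block_features(block)
    # Toy rule: dense text with few links is probably content.
    label = "content" if f["text_density"] > 10 and f["link_density"] < 1 else "noise"
    print(label, round(f["text_density"], 1), round(f["link_density"], 1))
```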

Important algorithms using this technique are the body text extraction (BTE) and maximum subsequence segmentation (MSS) algorithms.

  • Body Text Extraction (BTE) algorithm (Finn et al., 2001): identifies the single continuous region of the page with the most text and the fewest tags. https://www.aidanf.net/posts/bte-body-text-extraction
  • Maximum Subsequence Segmentation (MSS): a WWW 2009 paper by Pasternack and Roth. In this approach, the webpage is treated as a sequence of tokens, each with a local score, which reduces content extraction to the classic maximum subsequence problem (finding the contiguous run of tokens with the highest total score). To extract an article, we identify this contiguous block of HTML and then remove any remaining noise from it, as sketched below. http://cogcomp.org/papers/PasternackRo09.pdf
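
To make the MSS idea concrete, here is a small sketch assuming each token has already been assigned a local score (positive if it looks like article text, negative otherwise). The token list and scores below are invented for the example; the maximisation step is the standard maximum subsequence (Kadane's) algorithm.

```python
def max_subsequence(scores):
    """Return (start, end) indices of the contiguous run with the highest total score."""
    best_sum, best_range = float("-inf"), (0, 0)
    current_sum, current_start = 0.0, 0
    for i, s in enumerate(scores):
        if current_sum <= 0:
            current_sum, current_start = 0.0, i
        current_sum += s
        if current_sum > best_sum:
            best_sum, best_range = current_sum, (current_start, i + 1)
    return best_range

# Hypothetical per-token scores produced by some local classifier:
# tags and boilerplate words score negative, article words score positive.
tokens = ["<div>", "Menu", "</div>", "<p>", "The", "article", "text", "...", "</p>", "<a>", "Ads", "</a>"]
scores = [-2, -1, -2, -0.5, 1, 1, 1, 0.5, -0.5, -2, -1, -2]

start, end = max_subsequence(scores)
print(tokens[start:end])   # the span most likely to be the article body
```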

Arc90 Readability service

This algorithm, called Arc90 Readability, was one of the earliest content extraction algorithms to be used in web browsers; modified versions of it were used in Mozilla Firefox and Apple Safari. It is heuristic based, with a number of hand-crafted rules. The steps include:

  • Remove script tags, CSS etc. from the webpage
  • Traverse the DOM tree of the webpage, assign a score to each node as per the rules, bubble the score up to the node's parents, and build a new DOM tree
  • Examine tags in the webpage to assign scores: H1 to H6, DIV, P, A, UL, OL, SCRIPT, INPUT, FORM, PRE, BLOCKQUOTE
  • Positive score (keywords in class/id attributes): body, content, article, column, entry, hentry, main, page, pagination, text, post, blog
  • Negative score (eliminate): hidden, banner, footer, masthead, footnote
  • Get the <p> paragraphs, add the parent of each to a list, initialise the parents' scores to 0, add or subtract points per the rules above, then find and render the parent with the most points (a sketch of this last step is given below)
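
Below is a rough sketch of that last scoring step, in the spirit of (but not identical to) the original Readability code; the keyword regexes and point values are simplified assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup  # pip install beautifulsoup4

POSITIVE = re.compile(r"article|body|content|entry|hentry|main|page|post|text|blog|column", re.I)
NEGATIVE = re.compile(r"hidden|banner|footer|masthead|footnote|comment|sidebar", re.I)

def extract_main_block(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):          # strip scripts and stylesheets first
        tag.decompose()

    scores, nodes = {}, {}                          # keyed by id() of the parent node
    for p in soup.find_all("p"):
        parent = p.parent
        key = id(parent)
        if key not in scores:
            nodes[key], scores[key] = parent, 0
            hints = " ".join(parent.get("class", [])) + " " + (parent.get("id") or "")
            if POSITIVE.search(hints):
                scores[key] += 25                   # class/id suggests real content
            if NEGATIVE.search(hints):
                scores[key] -= 25                   # class/id suggests boilerplate
        # Each reasonably long paragraph earns its parent some points.
        scores[key] += 1 + min(len(p.get_text(strip=True)) // 100, 3)

    return nodes[max(scores, key=scores.get)] if scores else None

main = extract_main_block(open("page.html", encoding="utf-8").read())
print(main.get_text(" ", strip=True)[:500] if main else "no candidate found")
```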


JusText algorithm

This algorithm is based on the PhD thesis of Jan Pomikálek. It segments the webpage and classifies the segments using grammatical and other rules. The idea is that relevant content is more likely to consist of grammatically correct, full sentences, whereas noisy content (such as ads) is more likely to consist of lists, enumerations and other short fragments with few function words. The algorithm therefore uses the density of stop words and similar cues to decide whether a block is content text or not; a usage example is shown below.
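
The algorithm is also available as a Python package, justext. A minimal usage sketch, following the example in the package's documentation (with a placeholder URL), looks like this:

```python
import requests
import justext  # pip install justext

response = requests.get("https://example.com/some-article")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))

for paragraph in paragraphs:
    if not paragraph.is_boilerplate:     # keep only blocks classified as content
        print(paragraph.text)
```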


Illustration of the labeling of HTML snippets in a webpage using JusText algorithm

Machine learning techniques to extract content

This involves using web scraping together with natural language processing (NLP) techniques such as POS tagging to create a labelled dataset of webpage content. For example, we can label one HTML snippet as representing an image, another as a heading, another as a table of contents, and so on. Once the dataset is created, we train a machine learning model (such as a support vector machine, decision tree, conditional random field or deep neural network) on it, and use the trained model to predict the labels for a new HTML page.
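
A minimal sketch of such a pipeline using scikit-learn is shown below. The per-block features and the tiny training set are invented purely for illustration; a real system would use a much larger labelled dataset.

```python
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical per-block features: [word_count, link_density, text_density, avg_word_length]
X_train = [
    [250, 0.02, 12.0, 5.1],   # long paragraph, few links  -> content
    [12,  0.80, 1.5,  6.0],   # short, link-heavy block    -> noise (navigation)
    [180, 0.05, 10.2, 4.8],
    [8,   0.90, 1.0,  7.2],
]
y_train = ["content", "noise", "content", "noise"]

model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)

# Predict labels for the blocks of a new page (features extracted the same way).
X_new = [[300, 0.01, 14.0, 5.0], [10, 0.85, 1.2, 6.5]]
print(model.predict(X_new))   # e.g. ['content' 'noise']
```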

Some papers and PhD thesis using machine learning techniques are:

  1. Pomikálek, J., 2011. Removing boilerplate and duplicate content from web corpora. Doctoral dissertation, Masarykova univerzita, Fakulta informatiky.
  2. Vogels, T., Ganea, O.E. and Eickhoff, C., 2018. Web2Text: Deep Structured Boilerplate Removal. arXiv preprint arXiv:1801.02607.

Boilerpipe algorithm

The Boilerpipe algorithm, or variants of it, is used in Chromium-based browsers such as Google Chrome, where the feature is called DOM Distiller. It is based on a 2010 paper ("Boilerplate detection using shallow text features") and the PhD thesis of Christian Kohlschütter.

In this algorithm, the following features are computed for the current, previous and next blocks of HTML:

  • Word count (number of words in the current HTML block)
  • Average word length (in characters)
  • Average sentence length (sentences delimited by full stops and separators such as ?!.:)
  • Link density (number of links in the block)
  • Text density (words per line after word wrapping at 80 characters per line)
  • Start-uppercase ratio (how many words start with an uppercase letter)
  • Densitometric features (block fusion: fuse text fragments of similar text density)
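
The original Boilerpipe library is written in Java. For a quick experiment from Python, a community port such as boilerpy3 can be used roughly as follows; the package and method names are based on its documentation and should be treated as an assumption rather than a guarantee.

```python
from boilerpy3 import extractors  # pip install boilerpy3

# ArticleExtractor implements the article-oriented Boilerpipe strategy.
extractor = extractors.ArticleExtractor()

# From raw HTML that has already been fetched elsewhere
html = open("page.html", encoding="utf-8").read()
main_text = extractor.get_content(html)

# Or directly from a URL
# main_text = extractor.get_content_from_url("https://example.com/some-article")

print(main_text[:500])
```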

References

  1. https://github.com/kohlschutter/boilerpipe
  2. http://www.l3s.de/~kohlschuetter/boilerplate/
  3. http://boilerpipe-web.appspot.com/
  4. Kohlschütter C, Fankhauser P, Nejdl W. 2010. Boilerplate detection using shallow text features. In Proceedings of the third ACM international conference on Web search and data mining WSDM ’10, ACM.

Content Extraction based on Tag Ratios (CETR) algorithm

CETR was described in a paper published at the WWW 2010 conference ("Content Extraction via Tag Ratios"). In this approach, we compute a tag ratio for each line of the HTML source (roughly, the amount of text on the line relative to the number of tags), giving a tag-ratio histogram over the lines. We then apply a Gaussian smoothing pass to obtain a smoothed histogram, and finally apply k-means clustering to separate content from noise, as sketched below.
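
A simplified sketch of this pipeline follows; it is not the exact algorithm from the paper, and the naive per-line tag counting, the smoothing width and the two-cluster choice are assumptions made for illustration.

```python
import re
import numpy as np
from scipy.ndimage import gaussian_filter1d   # pip install numpy scipy scikit-learn
from sklearn.cluster import KMeans

def tag_ratios(html: str) -> np.ndarray:
    """Per-line tag ratio: text characters divided by the number of tags (at least 1)."""
    ratios = []
    for line in html.splitlines():
        tags = re.findall(r"<[^>]*>", line)
        text_chars = len(re.sub(r"<[^>]*>", "", line).strip())
        ratios.append(text_chars / max(len(tags), 1))
    return np.array(ratios)

html_text = open("page.html", encoding="utf-8").read()
ratios = tag_ratios(html_text)
smoothed = gaussian_filter1d(ratios, sigma=2)               # Gaussian smoothing pass
labels = KMeans(n_clusters=2, n_init=10).fit_predict(smoothed.reshape(-1, 1))

content_cluster = labels[int(np.argmax(smoothed))]          # cluster of the densest line = content
content = [line for line, lab in zip(html_text.splitlines(), labels) if lab == content_cluster]
print("\n".join(content))
```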


Webpage extraction using Conditional Random Fields

This approach was described in a 2007 paper by Michal Marek, Pavel Pecina and colleagues, in which the problem is formulated as a sequence labelling problem. Their approach won the CleanEval 2007 competition.

The steps include the following :

  • Preprocessing: clean the page with tidy and remove script, style and embedded elements
  • Divide the page into content blocks
  • Manually label each block as header, paragraph, list, continuation or noise
  • Generate a feature vector for each block and run the learning algorithm
  • Features: markup based (container.p, container.a, split.br), content based (char.alpha-rel, token.num-abs, sentence.avg-length), document related (position, document.word-count)

Reference: https://ufal.mff.cuni.cz/~pecina/files/cleaneval-2007.pdf
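
As a sketch of how such a block-level sequence labeller could be set up today (not the authors' original implementation), one could use the third-party sklearn-crfsuite package, feeding it one feature dictionary per block; the feature values below are invented for illustration.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

# Each training example is one page: a sequence of blocks, each described by a feature dict.
X_train = [[
    {"container.p": True, "char.alpha-rel": 0.93, "sentence.avg-length": 18.0, "position": 0.10},
    {"container.a": True, "char.alpha-rel": 0.40, "sentence.avg-length": 2.0,  "position": 0.15},
    {"container.p": True, "char.alpha-rel": 0.95, "sentence.avg-length": 21.0, "position": 0.20},
]]
y_train = [["paragraph", "noise", "paragraph"]]   # one label per block, in order

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)

# Label the blocks of a new page (features extracted the same way).
X_new = [[
    {"container.p": True, "char.alpha-rel": 0.90, "sentence.avg-length": 17.0, "position": 0.30},
]]
print(crf.predict(X_new))   # e.g. [['paragraph']]
```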

Computer Vision based techniques

This approach takes the actual webpage layout into account. The HTML alone does not capture the CSS, and even HTML plus CSS does not fully determine the rendered layout, which also depends on factors such as screen size and the type of user agent (mobile phone, PC or tablet). Computer vision techniques are therefore used to segment the rendered webpage into regions and to train a model to label each region separately. To make the task easier, we can use rules that hold for most webpages, such as: content blocks are usually around the centre of the webpage, while noisy blocks are around the sides.

The idea behind using such techniques is that humans can instantly tell what is content and what is noise, even on a webpage in an unfamiliar language. So this approach is closer to how we ourselves identify the content of a webpage.

VIPS algorithm

VIPS is one of the techniques that simulates the layout of the webpage. It was first proposed by researchers at Microsoft Research in 2003. The approach builds a visual tree from the DOM, simulating how the webpage ‘looks’ and which elements form coherent blocks.

Partitioning of the webpage using the VIPS algorithm is done as follows:

  • Extract nodes from the DOM tree and find the horizontal and vertical separators
  • Identify the visual blocks (based on changes of font, size, foreground or background colour) from the DOM of the parent HTML
  • Assign a score to each block: the degree of coherence (DoC) of its elements. Repeat the previous steps on blocks whose coherence is below a threshold
  • Finally, construct the visual tree (a toy sketch of this recursive segmentation is given below)
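
The recursion at the heart of this can be illustrated with a toy sketch. The Node class, the coherence estimate and the threshold below are invented stand-ins (real VIPS works on rendered DOM nodes using visual cues such as font and colour), so treat this purely as a skeleton of the idea.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:                      # toy stand-in for a rendered DOM node
    tag: str
    children: List["Node"] = field(default_factory=list)

@dataclass
class VisualBlock:
    node: Node
    doc: float                   # degree of coherence of this block
    children: List["VisualBlock"] = field(default_factory=list)

def coherence(node: Node) -> float:
    """Toy DoC estimate: fraction of children sharing the most common tag (1.0 for leaves)."""
    if not node.children:
        return 1.0
    tags = [child.tag for child in node.children]
    return tags.count(max(set(tags), key=tags.count)) / len(tags)

def segment(node: Node, pdoc: float = 0.8) -> VisualBlock:
    """Recursively split blocks whose degree of coherence is below the permitted threshold (PDoC)."""
    block = VisualBlock(node, doc=coherence(node))
    if block.doc < pdoc:
        block.children = [segment(child, pdoc) for child in node.children]
    return block

page = Node("body", [Node("div", [Node("p"), Node("p"), Node("img")]), Node("nav")])
tree = segment(page)
print(tree.doc, [b.node.tag for b in tree.children])
```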

References:

  1. http://www.cad.zju.edu.cn/home/dengcai/VIPS/VIPS.html
  2. https://www.microsoft.com/en-us/research/publication/vips-a-vision-based-page-segmentation-algorithm/
  3. https://github.com/tpopela/vips_java
  4. Cai, D., Yu, S., Wen, J. R., & Ma, W. Y. 2003. VIPS: a vision-based page segmentation algorithm. Technical report MSR-TR-2003-79, Microsoft Research

Diffbot (a pure computer vision based technique)

Diffbot is a paid web service for automatic extraction of web data using computer vision techniques. Their approach works roughly as follows:

  • Render the web page fully, including images, CSS and Ajax content (using a WebKit-based renderer they describe as about 10x faster)
  • Analyse everything the user and browser see: the visual layout, using computer vision (object detection), together with layout, position and browser information
  • Using machine learning, estimate the likelihood of each component being part of the title, author, text or images
  • Return the result as a JSON response (as in the hypothetical API call sketched below)
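
Since the service returns JSON, a call from Python looks roughly like the following. The endpoint path, parameter names and response fields are based on my recollection of Diffbot's public Article API documentation, so treat them as assumptions and check the current docs before use.

```python
import requests

API_TOKEN = "YOUR_DIFFBOT_TOKEN"          # obtained from a Diffbot account
page_url = "https://example.com/some-article"

# Assumed endpoint of Diffbot's Article API (v3); verify against current documentation.
resp = requests.get(
    "https://api.diffbot.com/v3/article",
    params={"token": API_TOKEN, "url": page_url},
    timeout=30,
)
data = resp.json()

# The extracted article is typically returned under an "objects" list.
for obj in data.get("objects", []):
    print(obj.get("title"))
    print(obj.get("author"))
    print((obj.get("text") or "")[:300])
```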


Datasets to measure the accuracy of the extraction algorithm

  1. CleanEval: the standard dataset used for evaluating boilerplate removal in academia and industry. It was part of a competition in 2007 and has 58 English and 51 Chinese HTML pages, annotated by 23 students. It includes gold-standard files (plain text with simple markup for headers, text and paragraphs, without noise) and a scoring script to measure the accuracy of any extraction algorithm. https://github.com/ppke-nlpg/CleanPortalEval and https://cleaneval.sigwac.org.uk/devset.html
  2. Mozilla Readability test dataset: https://github.com/mozilla/readability
  3. Dragnet dataset https://github.com/dragnet-org/dragnet and https://github.com/seomoz/dragnet_data
  4. Google news dataset L3S-GN1 (news articles crawled from Google news): https://github.com/geodrome/page-signal

Available Github projects of content extraction algorithms with code

  1. Arc90 Readability https://github.com/masukomi/arc90-readability
  2. Boilerpipe https://github.com/kohlschutter/boilerpipe
  3. VIPS https://github.com/tpopela/vips_java
  4. Goldminer https://github.com/endredy/GoldMiner
  5. Mozilla readability https://github.com/mozilla/readability


Written by Joy Bose

Working as a software developer in machine learning projects. Interested in the intersection between technology, machine learning, society and well being.
