

How to unwrap unwanted tags

Overview

In this example we will use the unwrap_factory processor to unwrap unwanted tags.

This covers the case when we have a source file that has deeply nested structures that we would like to flatten.

This type of problem often appears when we clean documents that were exported as HTML from various text editors.

In our example, the useful information is buried deep inside nested <div> elements that have various classes. To make matters worse, some information is also buried inside <font> elements.

We would like to remove all superfluous <div> elements and all <font> elements while preserving the content inside them.

Let's say we have the following file that we want to clean up:

Input

<!DOCTYPE html>
<html>
<body>
<div>
    <div class="useless-1 bold">
        First line <font><font> inside double font</font></font> outside font.
        <div class="useless-2">
            <p>
                Second line
            </p>
        </div>
    </div>
    <div>
        <p>
            Third line
        </p>
    </div>
</div>
<div>
    Forth line <font> inside font </font> <span>end.</span>
</div>
</body>
</html>

In this file we want to unwrap every <font> element, together with every <div> that carries one of the marker classes. For that we can use the unwrap_factory processor.

To the processor we need to pass a predicate that checks whether an element is a <font>, or a <div> with a marker class that designates it as unnecessary.

Luckily, we can construct this predicate from building blocks that are already available:

from bs_processors import and_pf, has_name_pf, or_pf, has_class_pf

should_uwrap_p = or_pf(
    has_name_pf('font'),
    and_pf(
        has_name_pf('div'),
        has_class_pf(['useless-1', 'useless-2'])
    )
)
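To make the composition easier to follow, here is a minimal sketch of how predicate combinators of this kind can work in plain Python. It is not the bs_processors implementation: the names Element, has_name, has_any_class, all_of and any_of are hypothetical stand-ins. The "any of the listed classes" behaviour is an assumption inferred from the result, where both the useless-1 and the useless-2 <div> are unwrapped.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Element:
    """Hypothetical stand-in for a parsed tag: a name plus its CSS classes."""
    name: str
    classes: List[str] = field(default_factory=list)


Predicate = Callable[[Element], bool]


def has_name(name: str) -> Predicate:
    # True when the element's tag name equals `name`
    return lambda el: el.name == name


def has_any_class(classes: List[str]) -> Predicate:
    # Assumption: one matching class is enough
    wanted = set(classes)
    return lambda el: bool(wanted & set(el.classes))


def all_of(*preds: Predicate) -> Predicate:
    # Logical AND over predicates
    return lambda el: all(p(el) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    # Logical OR over predicates
    return lambda el: any(p(el) for p in preds)


# Same shape as the predicate above: <font>, or a <div> with a marker class
should_unwrap = any_of(
    has_name("font"),
    all_of(has_name("div"), has_any_class(["useless-1", "useless-2"])),
)

print(should_unwrap(Element("font")))
print(should_unwrap(Element("div", ["useless-1", "bold"])))
print(should_unwrap(Element("div")))
```

Because each combinator returns an ordinary function, predicates built this way can be nested to any depth and reused across processors.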

After constructing the predicate, all that remains is to use it to create our processor:

from bs_processors import unwrap_factory

remove_unnecessary_wrappers = unwrap_factory(should_uwrap_p)
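For comparison, the same effect can be sketched with plain BeautifulSoup calls. This is only an illustration of what unwrapping means, not a description of how unwrap_factory works internally. The should_unwrap function and the shortened HTML snippet are made up for this sketch, and it assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

USELESS_CLASSES = {"useless-1", "useless-2"}


def should_unwrap(tag) -> bool:
    # <font> is always unwrapped; <div> only when it carries a marker class
    if tag.name == "font":
        return True
    return tag.name == "div" and bool(USELESS_CLASSES & set(tag.get("class", [])))


html = (
    '<div class="useless-1 bold">First <font><font>inner</font></font> '
    '<div class="useless-2"><p>Second</p></div></div>'
    '<div><p>Third</p></div>'
)
soup = BeautifulSoup(html, "html.parser")

# find_all returns a complete list up front, so modifying the tree
# while looping over the matches is safe
for tag in soup.find_all(should_unwrap):
    tag.unwrap()  # replace the tag with its children, keeping the content

print(soup)
```

The plain <div> around the third paragraph has no marker class, so it stays in place, just as in the full example.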

Now we are ready to run the processor on our input file, and we are done:

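import util  # helper module used by this example for path handling
from bs_processors.utils.file_util import process_file

def main():
    # Resolve the input and output paths relative to this script
    doc_name = util.relative_to_absolute_path_name(__file__, "input/deeply_nested.html")
    result_name = util.relative_to_absolute_path_name(__file__, "output/deeply_nested_result.html")
    # Parse doc_name with 'html.parser', apply the processor, and write to result_name
    process_file(remove_unnecessary_wrappers, 'html.parser', doc_name, result_name)

if __name__ == "__main__":
    main()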

The result is:

<!DOCTYPE html>
<html>
 <body>
  <div>
   First line
   inside double font
   outside font.
   <p>
    Second line
   </p>
   <div>
    <p>
     Third line
    </p>
   </div>
  </div>
  <div>
   Forth line
   inside font
   <span>
    end.
   </span>
  </div>
 </body>
</html>