

How to use file utilities

Overview

In this example we will use the process_directory utility function to process a directory.

In order to simplify the application of filters bs-processor contains utilities for processing documents saved in files.

The file_util module has two main entry points: process_file and process_directory.

process_file takes as parameters the processor and the input and output file names.

process_directory is slightly more complex and it will be used in this example.

In this example we will use the input directory used by all other examples and we will output the result to a new directory output2 at the same level with input.

To keep the focus on the directory function we will keep things simple and use a filter processor that filters empty tags.

from bs_processors.predicate import is_empty_p
from bs_processors import filter_factory
filter_empty = filter_factory(is_empty_p)

In order to do our processing we'll use the following code:

import util
from bs_processors.utils.file_util import process_directory

def main():
    input_dir = util.relative_to_absolute_path_name(__file__, "input")
    output_dir = util.relative_to_absolute_path_name(__file__, "output2")
    process_directory(filter_empty,'html.parser', input_dir, output_dir,"*.html")

This will run all files that match *.html through the filter_empty processor and put the result in the output2 directory.

process_directory will recreate the structure of the input directory in the output directory ( subdirectories from input will become subdirectories in output).