Tuesday

Converting my HTML from one form to another

2:06 PM beautifulsoup, html, html-parsing, parsing, regex

Issue

I am going through some poor HTML markup on my old web pages, and I've noticed a few recurring mistakes in it.

I was hoping to fix these with a program, but I am not sure which API or language would help me accomplish this.

My HTML is of this form:

<td class="bulletPoint" align="right" valign="top" height="100%" width="100%">This is text</td>

which I want to replace with

<td class="bulletPoint" align="right" valign="top" height="100%" width="100%"><h2>This is text</h2></td>

I also have this kind of form (class/colspan/href can vary):

<td class='original' colspan=4><a id='id12345' class='content' href='#note'">This is the text</a>

And I want to convert it to this:

<font SIZE="3"  COLOR="#222222"  FACE="Verdana"  STYLE="background-color:#ffffff;font-weight: bold;"><h2>This is the text</h2></font>

What's the best way to programmatically do this when I have over 1,000 .html files to perform this operation on?

Solution

The best way to do this programmatically depends on which tools you know best. I would do it with Python and BeautifulSoup; other people may vouch for sed and regex. Here is my approach:

Create two separate directories: one holding a copy of your original .html files, and another where your modified files will go (not a subdirectory of the first).

Run the following Python 3 program, either once or in separate runs, depending on which cases your files contain. Because it reads from the copies and writes to a separate directory, your original files are never altered, and you can always delete the modified files and try again.

You can adjust the selection criteria (class_, colspan, href, and so on) as you see fit, or write several programs, one for each case you run into.
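To illustrate what adjusting the selection can look like, here is a small sketch that runs against an in-memory fragment rather than real files. The second table cell and its attribute values are made up for the example; the filters shown (True for "attribute is present", a list of values for class_, a compiled regex for href) are standard find_all options.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment: the first cell follows the question's second form,
# the second cell is invented to show a non-matching case.
html = """
<table><tr>
<td class="original" colspan="4"><a id="id12345" class="content" href="#note">This is the text</a></td>
<td class="other" colspan="2"><a class="content" href="page.html">Another link</a></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Any <td> that has a colspan attribute, whatever its value or class.
cells_with_colspan = soup.find_all("td", colspan=True)

# Only <td> cells whose class is one of several values.
selected_cells = soup.find_all("td", class_=["original", "other"])

# <a> tags whose href is an in-page anchor (starts with "#").
anchor_links = soup.find_all("a", href=re.compile(r"^#"))

print(len(cells_with_colspan), len(selected_cells), len(anchor_links))
```

Trying a selector on a small fragment like this first is a cheap way to confirm it matches what you expect before running it over every file.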

import os
from bs4 import BeautifulSoup

dir_with_original_files = '/path/to/your_original_files'
dir_with_modified_files = '/path/to/your_modified_files'

for root, dirs, files in os.walk(dir_with_original_files):
    for f in files:
        if f.endswith('~'):  # you don't want to process backups
            continue
        original_file = os.path.join(root, f)
        # Add a "_mod" suffix before the extension. The two directories are
        # separate anyway, so you can keep the same name instead by
        # replacing the next two lines with: modified_name = f
        base, ext = os.path.splitext(f)
        modified_name = base + '_mod' + ext
        modified_file = os.path.join(dir_with_modified_files, modified_name)
        with open(original_file, 'r') as orig_f, \
             open(modified_file, 'w') as modi_f:
            # Name the parser explicitly so the result does not depend on
            # which parsers happen to be installed.
            soup = BeautifulSoup(orig_f.read(), 'html.parser')
            for t in soup.find_all('td', class_='bulletPoint'):
                # t.string is None when the cell has more than one child.
                if t.string is not None:
                    t.string.wrap(soup.new_tag('h2'))
            # The following loop could belong to a separate Python program
            # which would follow the same general structure.
            for t in soup.find_all('td', class_='original'):
                if t.string is None:
                    continue
                font = soup.new_tag('font')
                font['size'] = '3'
                font['color'] = '#222222'
                font['face'] = 'Verdana'
                font['style'] = 'background-color:#ffffff;font-weight: bold;'
                # This wraps only the text; the enclosing <td> and <a>
                # tags stay in place.
                t.string.wrap(soup.new_tag('h2')).wrap(font)
            # This is where you create your new modified file.
            modi_f.write(soup.prettify())
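
For comparison, the sed/regex route mentioned above can be sketched with Python's built-in re module. This is only an illustration for the first case. It assumes the cell contains plain text with no nested tags and that the class attribute is written with double quotes; the names BULLET_TD and wrap_bullet_text are hypothetical. Anything less regular is probably better left to a parser.

```python
import re

# Matches a <td ... class="bulletPoint" ...> cell whose content is plain text.
# Group 1 = opening tag, group 2 = the text, group 3 = closing tag.
BULLET_TD = re.compile(
    r'(<td\b[^>]*\bclass="bulletPoint"[^>]*>)([^<]+)(</td>)',
    re.IGNORECASE,
)


def wrap_bullet_text(html: str) -> str:
    """Wrap the text of simple bulletPoint cells in <h2>...</h2>."""
    return BULLET_TD.sub(r'\1<h2>\2</h2>\3', html)


sample = ('<td class="bulletPoint" align="right" valign="top" '
          'height="100%" width="100%">This is text</td>')
print(wrap_bullet_text(sample))
```

Because the pattern requires text directly after the opening tag, cells whose text is already wrapped in <h2> are left unchanged, so running it twice over the same file should be harmless.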

Answered By - chapelo

This answer was collected from Stack Overflow and tested by AngularFix community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
