Issue
I am going through some poor HTML markup on my old web pages, and I've noticed a few recurring mistakes in it.
I was hoping to fix these with a program but am not sure what API or language would help me accomplish this.
My HTML is of this form :
<td class="bulletPoint" align="right" valign="top" height="100%" width="100%">This is text</td>
which I want to replace with
<td class="bulletPoint" align="right" valign="top" height="100%" width="100%"><h2>This is text</h2></td>
I also have this kind of form (class/colspan/href can vary) :
<td class='original' colspan=4><a id='id12345' class='content' href='#note'">This is the text</a>
And want to convert it to this :
<font SIZE="3" COLOR="#222222" FACE="Verdana" STYLE="background-color:#ffffff;font-weight: bold;"><h2>This is the text</h2></font>
What's the best way to programmatically do this when I have over 1,000 .html files to perform this operation on?
Solution
What "the best way to programmatically" do it is depends on which tools you know best. I would do it with Python and BeautifulSoup. Other people may vouch for sed and regex. Here is my approach:
Create two separate directories, one with a "copy" of your original .html files and another where your modified files will go (not a subdirectory of the original ones).
Run the following python3 program in one run or separate runs according to what you have. You are not altering the original files, and you can always erase the modified ones and try again.
You can alter the selections of class_, colspan, href, etc... as you see fit, as well as create several programs, one for every case you may run into.
import os
from bs4 import BeautifulSoup

do = dir_with_original_files = '/path/to/your_original_files'
dm = dir_with_modified_files = '/path/to/your_modified_files'

for root, dirs, files in os.walk(do):
    for f in files:
        if f.endswith('~'):  # you don't want to process backups
            continue
        original_file = os.path.join(root, f)
        # Append '_mod' to the base name. You can keep the same name
        # (mf = f) if you omit the next two lines, since the files
        # are in separate directories anyway.
        stem, ext = os.path.splitext(f)
        mf = stem + '_mod' + ext
        modified_file = os.path.join(dm, mf)
        with open(original_file, 'r') as orig_f, \
             open(modified_file, 'w') as modi_f:
            # Name the parser explicitly; bs4 warns if it has to guess.
            soup = BeautifulSoup(orig_f.read(), 'html.parser')
            for t in soup.find_all('td', class_='bulletPoint'):
                # t.string is None when the cell has more than one child.
                if t.string is not None:
                    t.string.wrap(soup.new_tag('h2'))
            # The following loop could belong to a separate python program
            # which would follow the same general structure.
            for t in soup.find_all('td', class_='original'):
                if t.string is None:
                    continue
                font = soup.new_tag('font')
                font['size'] = '3'
                font['color'] = '#222222'
                font['face'] = 'Verdana'
                font['style'] = 'background-color:#ffffff;font-weight: bold;'
                t.string.wrap(soup.new_tag('h2')).wrap(font)
            # This is where you create your new modified file.
            modi_f.write(soup.prettify())
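To check the approach on a single snippet before running it over a whole directory, a minimal sketch along these lines may help. The function names wrap_bullet_points and replace_link_cells are hypothetical. The second function assumes the goal is to swap the <a> element for the <font><h2> pair while keeping the surrounding <td>, which is only one reading of the question. Both use Python's built-in html.parser.

```python
from bs4 import BeautifulSoup


def wrap_bullet_points(html):
    """Wrap the text of every <td class="bulletPoint"> in an <h2>."""
    soup = BeautifulSoup(html, "html.parser")
    for td in soup.find_all("td", class_="bulletPoint"):
        # .string is None if the cell has several children; skip those.
        if td.string is not None:
            td.string.wrap(soup.new_tag("h2"))
    return str(soup)


def replace_link_cells(html):
    """Replace the first <a> inside each <td class="original"> with
    <font ...><h2>link text</h2></font> (assumed reading of the question)."""
    soup = BeautifulSoup(html, "html.parser")
    for td in soup.find_all("td", class_="original"):
        link = td.find("a")
        if link is None:
            continue
        h2 = soup.new_tag("h2")
        h2.string = link.get_text()
        font = soup.new_tag("font")
        font["size"] = "3"
        font["color"] = "#222222"
        font["face"] = "Verdana"
        font["style"] = "background-color:#ffffff;font-weight: bold;"
        font.append(h2)
        link.replace_with(font)  # the <a> and its attributes are dropped
    return str(soup)


if __name__ == "__main__":
    print(wrap_bullet_points(
        '<td class="bulletPoint" align="right">This is text</td>'))
    print(replace_link_cells(
        "<td class='original' colspan=4>"
        "<a id='id12345' class='content' href='#note'>This is the text</a>"
        "</td>"))
```

Returning str(soup) instead of soup.prettify() leaves the whitespace of the input alone, which makes a before/after diff easier to read when spot-checking a few files.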
Answered By - chapelo