Wednesday

Rendering HTML with Python while retaining indents and line breaks

Issue

I have HTML text and am trying to convert it to readable text with Python. I can convert it to plain text using BeautifulSoup, but I want to retain the formatting (line breaks and indents).

This is the code I have:

from bs4 import BeautifulSoup

html='<div class="ExternalClassAA9E2ABA21084151AA5D8D15E7F3F1D1"><ul><li>We need more details.&#160;</li><li><span style="font-size&#58;11pt;"><span><span><span><span>The target goal was met for this purpose. Comparing the result of multiple analysis, during 2021.</span></span></span></span></span><ul><li><span style="font-size&#58;11pt;"><span><span><span><span>Exceed on profit baseline for three Qs in rows</span></span></span></span></span></li><li><span style="font-size&#58;11pt;"><span><span><span><span>Meet the budget limit for all departments</span></span></span></span></span></li><li><span style="font-size&#58;11pt;"><span><span><span><span>Minimize attrition across the org</span></span></span></span></span></li></ul></li><li><span style="font-size&#58;11pt;"><span><span><span><span>Last Q was not target and we are analysing the root cause.</span></span></span></span></span></li></ul></div>'

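# Extract all text nodes; .text drops every tag, including the list structure.
cleantext = BeautifulSoup(html, 'html.parser').text
# Collapse all whitespace runs (including the non-breaking space) into single spaces.
cleantext = ' '.join(cleantext.split())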
print(cleantext)

It returns the plain text on a single line:

We need more details. The target goal was met for this purpose. Comparing the result of multiple analysis, during 2021.Exceed on profit baseline for three Qs in rowsMeet the budget limit for all departmentsMinimize attrition across the orgLast Q was not target and we are analysing the root cause.

However, my goal is to get something like the following, which keeps the formatting:

We need more details. 
The target goal was met for this purpose. Comparing the result of multiple analyses, during 2021.
        Exceed on profit baseline for three Qs in rows
        Meet the budget limit for all departments
        Minimize attrition across the org
Last Q was not target and we are analyzing the root cause.

That is, the line breaks and indents should be retained. Any help would be appreciated.

Solution

It seems that BeautifulSoup is primarily a parsing and scraping library, not a rendering library, so its .text output drops layout such as list structure. If you are not required to use BeautifulSoup, html2text, which converts HTML to Markdown-formatted text, may be better suited.

Example:

import html2text

# Read the HTML file and convert it to Markdown-style plain text.
with open("test_file.htm") as f:
    html = f.read()
print(html2text.html2text(html))

Output:

Heading 1

  * list item 1
  * list item 2
  * list item 3

Heading 2

  * list item A
  * list item B

Answered By - brobers

This answer was collected from Stack Overflow and tested by AngularFix community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
