Issue
I have HTML text that I am trying to convert to readable text with Python. I can convert it to plain text using BeautifulSoup, but I want to retain the formatting (line breaks and indentation).
This is the code I have:
from bs4 import BeautifulSoup
html='<div class="ExternalClassAA9E2ABA21084151AA5D8D15E7F3F1D1"><ul><li>We need more details. </li><li><span style="font-size:11pt;"><span><span><span><span>The target goal was met for this purpose. Comparing the result of multiple analysis, during 2021.</span></span></span></span></span><ul><li><span style="font-size:11pt;"><span><span><span><span>Exceed on profit baseline for three Qs in rows</span></span></span></span></span></li><li><span style="font-size:11pt;"><span><span><span><span>Meet the budget limit for all departments</span></span></span></span></span></li><li><span style="font-size:11pt;"><span><span><span><span>Minimize attrition across the org</span></span></span></span></span></li></ul></li><li><span style="font-size:11pt;"><span><span><span><span>Last Q was not target and we are analysing the roo cause.</span></span></span></span></span></li></ul></div>'
cleantext = BeautifulSoup(str(html), 'html.parser').text
cleantext = ' '.join(cleantext.split())
print(cleantext)
It returns the text flattened onto a single line:
We need more details. The target goal was met for this purpose. Comparing the result of multiple analysis, during 2021.Exceed on profit baseline for three Qs in rowsMeet the budget limit for all departmentsMinimize attrition across the orgLast Q was not target and we are analysing the roo cause.
However, my goal is to get something like the following, which keeps the formatting:
We need more details.
The target goal was met for this purpose. Comparing the result of multiple analyses, during 2021.
    Exceed on profit baseline for three Qs in rows
    Meet the budget limit for all departments
    Minimize attrition across the org
Last Q was not target and we are analyzing the root cause.
Here the line breaks and indentation are retained. Any help would be appreciated.
Solution
BeautifulSoup is primarily a parsing and scraping library, not a rendering library, so its .text output drops the list structure. If using BeautifulSoup is not mandatory, html2text, which converts HTML into Markdown-style plain text, might be better suited.
Example:
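import html2text
# Read the HTML file, closing it once the contents are loaded
with open("test_file.htm") as f:
    html = f.read()
# Convert the HTML to Markdown-style plain text and print it
print(html2text.html2text(html))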
Output:
Heading 1
* list item 1
* list item 2
* list item 3
Heading 2
* list item A
* list item B
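If adding html2text as a dependency is not an option, the same idea can be sketched with only the standard library's html.parser module. This is a minimal sketch, not how html2text works internally: the names ListTextRenderer and html_to_indented_text are hypothetical, and it assumes the only structure worth keeping is list nesting, as in the HTML from the question.

```python
from html.parser import HTMLParser


class ListTextRenderer(HTMLParser):
    """Hypothetical helper: renders nested <ul>/<ol> items as indented lines."""

    def __init__(self, indent="    "):
        super().__init__()
        self.indent = indent
        self.depth = 0    # current list nesting level
        self.lines = []   # finished output lines
        self.buffer = []  # text fragments of the line being built

    def _flush(self):
        # Collapse whitespace and emit the buffered text as one line.
        text = " ".join("".join(self.buffer).split())
        self.buffer = []
        if text:
            # Top-level items (depth 1) get no indent; clamp for text outside lists.
            self.lines.append(self.indent * max(self.depth - 1, 0) + text)

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._flush()  # text before a nested list belongs to the parent item
            self.depth += 1
        elif tag == "li":
            self._flush()

    def handle_endtag(self, tag):
        if tag == "li":
            self._flush()
        elif tag in ("ul", "ol"):
            self._flush()
            self.depth -= 1

    def handle_data(self, data):
        self.buffer.append(data)


def html_to_indented_text(html):
    """Return the text of `html`, one line per list item, indented by nesting."""
    parser = ListTextRenderer()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text after the last tag
    return "\n".join(parser.lines)


# Shortened stand-in for the question's HTML (assumption: same nesting shape).
sample = (
    "<div><ul><li>We need more details. </li>"
    "<li><span>The target goal was met.</span>"
    "<ul><li><span>Exceed on profit baseline</span></li>"
    "<li><span>Meet the budget limit</span></li></ul></li>"
    "<li><span>Last Q was not on target.</span></li></ul></div>"
)
print(html_to_indented_text(sample))
```

Each list item ends up on its own line, and items in the nested list are indented by four spaces per level. Other block elements such as headings or paragraphs are not treated specially here, so for general HTML a dedicated converter is still the safer choice.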
Answered By - brobers