Issue
I have the following string:
<td class="mytest" title="testfile" style="width:20%">0</td>
How do I get the value inside the td element using awk? In my case, it is 0.
I am very new to Linux, any help is appreciated!
Solution
If you are allowed to choose your tool, I would suggest using hxselect (from html-xml-utils). If you have file.txt holding
<td class="mytest" title="testfile" style="width:20%">0</td>
it would be as simple as
cat file.txt | hxselect -i -c td
output
0
Explanation: -i matches case-insensitively, -c prints the content only, and td is a CSS selector. Disclaimer: there is no newline after 0, as there is no newline inside the tag.
However, if you are restricted to what is already installed, and the Linux machine you are using has python installed (which, if I am not mistaken, recent Ubuntu versions do by default), you might harness html.parser as follows. Create a tdextract.py file with the following content:
import sys
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        self.insidetd = False
        super().__init__()

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.insidetd = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.insidetd = False

    def handle_data(self, data):
        if self.insidetd:
            sys.stdout.write(data)

parser = MyHTMLParser()
parser.feed(sys.stdin.read())
then do
cat file.txt | python tdextract.py
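which will give the same output as the hxselect command described earlier. The script needs Python 3, so if the interpreter on your system is installed as python3, run python3 tdextract.py instead. Be warned that Python uses indentation to mark blocks, so it is crucial to keep the number of leading spaces exactly as shown.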
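If the input may contain more than one td element, the script above writes all cell contents back to back with no separator. A minimal sketch of one way around that, assuming you want each cell on its own line: collect the text per cell in a list and print the entries afterwards. The names TdTextCollector and extract_td_text are made up for this illustration and are not part of html.parser.

```python
from html.parser import HTMLParser


class TdTextCollector(HTMLParser):
    """Collects the text of each <td> element as a separate string."""

    def __init__(self):
        super().__init__()
        self.inside_td = False
        self.cells = []  # one entry per <td> seen so far

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.inside_td = True
            self.cells.append("")  # start collecting a new cell

    def handle_endtag(self, tag):
        if tag == "td":
            self.inside_td = False

    def handle_data(self, data):
        if self.inside_td:
            self.cells[-1] += data  # append text to the current cell


def extract_td_text(html):
    """Return a list with the text content of every <td> in html."""
    parser = TdTextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.cells


if __name__ == "__main__":
    # Sample input taken from the question.
    sample = '<td class="mytest" title="testfile" style="width:20%">0</td>'
    for cell in extract_td_text(sample):
        print(cell)
```

To use it in a pipe like the script above, import sys and pass sys.stdin.read() to extract_td_text instead of the hard-coded sample.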
Answered By - Daweo