Saturday

How to parse the HTML of a website with PowerShell

8:31 PM dom, html, html-parsing, powershell

Issue

I am trying to retrieve some information from a website: I want to look for a specific tag/class and then return the text it contains (innerHTML). This is what I have so far:

$request = Invoke-WebRequest -Uri $url -UseBasicParsing
$HTML = New-Object -Com "HTMLFile"
$src = $request.RawContent
$HTML.write($src)


foreach ($obj in $HTML.all) { 
    $obj.getElementsByClassName('some-class-name') 
}

I think there is a problem with converting the HTML into the HTML object, since I see a lot of undefined properties and empty results when I'm trying to "Select-Object" them.

So, after spending two days on this: how am I supposed to parse HTML with PowerShell?

I can't use the IHTMLDocument2 methods, since I don't have Office installed (see "Unable to use IHTMLDocument2").
I can't use Invoke-WebRequest without -UseBasicParsing, since PowerShell hangs and spawns additional windows when accessing the ParsedHtml property (see "parsedhtml doesnt respond anymore" and "Using Invoke-Webrequest in PowerShell 3.0 spawns a Windows Security Warning").

So since parsing HTML with regex is such a big no-no, how do I do it otherwise? Nothing seems to work.

Solution

If installing a third-party module is an option:

The PSParseHTML module wraps both the HTML Agility Pack and the AngleSharp .NET libraries (NuGet packages), and you can use either one for HTML parsing. Their respective DOMs (object models) differ:
- The HTML Agility Pack, which is used by default, provides an object model that is similar to the XML DOM provided by the standard System.Xml.XmlDocument .NET type ([xml]). See this answer for an example of its use.
- AngleSharp, which requires opt-in via -Engine AngleSharp, is built upon the official W3C specification and therefore provides an HTML DOM like the one available in web browsers. Notably, this means that its .QuerySelector() and .QuerySelectorAll() methods can be used with the usual CSS selectors, as in the example below.
An added advantage of using this module is that it is not just cross-edition, but also cross-platform; that is, you can use it in Windows PowerShell as well as in PowerShell (Core) 7+, and via the latter also on Unix-like platforms.

A self-contained example based on the AngleSharp engine that parses the home page of the English Wikipedia and extracts all HTML elements whose class attribute value is vector-menu-content-list:

# Install the PSParseHTML module on demand
if (-not (Get-Module -ErrorAction Ignore -ListAvailable PSParseHTML)) {
  Write-Verbose "Installing PSParseHTML module for the current user..."
  Install-Module -Scope CurrentUser PSParseHTML -ErrorAction Stop
}

# Using the AngleSharp engine, parse the home page of the English Wikipedia
# into an HTML DOM.
$htmlDom = ConvertFrom-Html -Engine AngleSharp -Url https://en.wikipedia.org

# Extract all HTML elements with a 'class' attribute value of 'vector-menu-content-list'
# and output their text content (.TextContent)
$htmlDom.QuerySelectorAll('.vector-menu-content-list').TextContent
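Applied to the scenario in the question, the same technique might look like the sketch below, which parses an in-memory HTML string rather than a URL. The sample markup and the variable names are hypothetical, and the sketch assumes that ConvertFrom-Html accepts an HTML string via a -Content parameter and that the PSParseHTML module is already installed; verify both against the module version you have.

```powershell
# Assumes the PSParseHTML module is installed (see the on-demand installation above).
Import-Module PSParseHTML -ErrorAction Stop

# Hypothetical sample markup standing in for a downloaded page.
$html = @'
<html><body>
  <div class="some-class-name">First</div>
  <p>Ignored</p>
  <div class="some-class-name">Second</div>
</body></html>
'@

# Assumption: -Content takes the HTML source as a string.
$dom = ConvertFrom-Html -Engine AngleSharp -Content $html

# All elements carrying the class, via a CSS selector; member-access
# enumeration collects the .TextContent value of each match.
$texts = @($dom.QuerySelectorAll('.some-class-name').TextContent)

# Only the first match (or $null if there is none).
$first = $dom.QuerySelector('.some-class-name')

$texts
$first.TextContent
```

If the page has to be downloaded first, passing the .Content property of the Invoke-WebRequest result (rather than .RawContent, which also contains the response headers) to -Content is probably the closer match to what the question attempts.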

Answered By - mklement0

This answer was collected from Stack Overflow and tested by AngularFix community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
