Issue
I want to extract URLs from a webpage that contains multiple URLs and save the extracted URLs to a txt file.
The URLs in the webpage start with '127.0.0.1', but I want to remove '127.0.0.1' and extract only the URLs. When I run the PowerShell script below, it only saves '127.0.0.1'. Any help to fix this, please?
$threatFeedUrl = "https://raw.githubusercontent.com/DandelionSprout/adfilt/master/Alternate versions Anti-Malware List/AntiMalwareHosts.txt"
# Download the threat feed data
$threatFeedData = Invoke-WebRequest -Uri $threatFeedUrl
# Define a regular expression pattern to match URLs starting with '127.0.0.1'
$pattern = '127\.0\.0\.1(?:[^\s]*)'
# Use the regular expression to find matches in the threat feed data
$matches = [regex]::Matches($threatFeedData.Content, $pattern)
# Create a list to store the matched URLs
$urlList = @()
# Populate the list with matched URLs
foreach ($match in $matches) {
    $urlList += $match.Value
}
# Specify the output file path
$outputFilePath = "output.txt"
# Save the URLs to the output file
$urlList | Out-File -FilePath $outputFilePath
Write-Host "URLs starting with '127.0.0.1' extracted from threat feed have been saved to $outputFilePath."
Solution
Preface:
The target URL happens to be a (semi-structured) plain-text resource, so regex-based processing is appropriate.
In general, however, with HTML content, using a dedicated parser is preferable, given that regexes aren't capable of parsing HTML robustly.[1] See this answer for an example of extracting links from an HTML document.
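As a rough sketch of the HTML case (not part of the original answer; $htmlUrl is a placeholder, and .Links is the collection that Invoke-WebRequest populates via its built-in HTML parsing):
# Extract links from an actual HTML page without regex (sketch; $htmlUrl is hypothetical).
$response = Invoke-WebRequest -Uri $htmlUrl
$response.Links.href |
  Where-Object { $_ } |              # drop anchors without an href value
  Set-Content -Path 'links.txt'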
'127\.0\.0\.1(?:[^\s]*)'
- You're mistakenly using a non-capturing group ((?:…)) rather than a capturing one ((…)).
- In the downloaded content, there is a space after 127.0.0.1.
- Therefore, use the following regex instead (\S is the simpler equivalent of [^\s], and + matches only a non-empty run of non-whitespace characters):
'127\.0\.0\.1 (\S+)'
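To see why the original pattern only ever yields '127.0.0.1', you can test both patterns against a sample line (the host name below is made up for illustration):
$sample = '127.0.0.1 malicious.example.com'   # made-up sample line

# Original pattern: [^\s]* stops at the space, so the match is just '127.0.0.1'.
[regex]::Match($sample, '127\.0\.0\.1(?:[^\s]*)').Value   # -> 127.0.0.1

# Revised pattern: the literal space plus (\S+) also covers the host name that follows.
[regex]::Match($sample, '127\.0\.0\.1 (\S+)').Value       # -> 127.0.0.1 malicious.example.com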
$matches = …
- While it technically doesn't cause a problem here, $matches is the name of the automatic $Matches variable, and therefore shouldn't be used for custom purposes.
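For illustration, the automatic $Matches variable is what the -match operator populates behind the scenes (same made-up sample line as above):
'127.0.0.1 malicious.example.com' -match '127\.0\.0\.1 (\S+)'   # returns $true and fills $Matches
$Matches[0]   # whole match:     127.0.0.1 malicious.example.com
$Matches[1]   # capture group 1: malicious.example.com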
$match.Value
- $match.Value is the whole text that your regex matched, whereas you only want the text of the capture group. Use $match.Groups[1].Value instead.
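A minimal sketch of the difference, again using a made-up sample line:
$m = [regex]::Match('127.0.0.1 malicious.example.com', '127\.0\.0\.1 (\S+)')
$m.Value            # whole match:     127.0.0.1 malicious.example.com
$m.Groups[1].Value  # capture group 1: malicious.example.com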
$urlList +=
- Building an array iteratively with += is inefficient, because a new array must be allocated behind the scenes in every iteration; simply use the foreach statement as an expression, and let PowerShell collect the results for you. See this answer for more information.
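A small sketch contrasting the two approaches (the numbers are used only for illustration):
# Inefficient: += allocates a new array and copies all elements on every iteration.
$viaPlusEquals = @()
foreach ($i in 1..5) { $viaPlusEquals += $i * 2 }

# Preferred: use the foreach statement as an expression; PowerShell collects the loop output into an array.
$viaExpression = foreach ($i in 1..5) { $i * 2 }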
Invoke-WebRequest -Uri $threatFeedUrl
- Since you're only interested in the text content of the response, it is simpler to use Invoke-RestMethod rather than Invoke-WebRequest; the former returns the content directly (no need to access a .Content property).
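For comparison, a sketch using the same $threatFeedUrl:
# Invoke-WebRequest returns a response object; the body text is in its .Content property.
$viaWebRequest = (Invoke-WebRequest -Uri $threatFeedUrl).Content

# Invoke-RestMethod returns the body directly (a string for this plain-text resource).
$viaRestMethod = Invoke-RestMethod -Uri $threatFeedUrl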
To put it all together:
$threatFeedUrl = 'https://raw.githubusercontent.com/DandelionSprout/adfilt/master/Alternate versions Anti-Malware List/AntiMalwareHosts.txt'
# Download the threat feed data
$threatFeedData = Invoke-RestMethod -Uri $threatFeedUrl
# Define a regular expression pattern to match URLs starting with '127.0.0.1'
$pattern = '127\.0\.0\.1 (\S+)'
# Use the regular expression to find matches in the threat feed data
$matchList = [regex]::Matches($threatFeedData, $pattern)
# Create and populate the list with matched URLs
$urlList =
  foreach ($match in $matchList) {
    $match.Groups[1].Value
  }
# Specify the output file path
$outputFilePath = 'output.txt'
# Save the URLs to the output file
$urlList | Out-File -FilePath $outputFilePath
Write-Host "URLs starting with '127.0.0.1' extracted from threat feed have been saved to $outputFilePath."
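If you prefer a more pipeline-oriented variant, the following sketch (not part of the original answer; it reuses the same pattern and the $threatFeedUrl and $outputFilePath values defined above) does the same extraction with Select-String:
# Alternative sketch: let Select-String find all matches, then extract capture group 1 from each.
(Invoke-RestMethod -Uri $threatFeedUrl |
  Select-String -Pattern '127\.0\.0\.1 (\S+)' -AllMatches).Matches |
    ForEach-Object { $_.Groups[1].Value } |
      Set-Content -Path $outputFilePath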
[1] See this blog post for background information.
Answered By - mklement0