Issue
I want to extract URLs from a webpage that contains multiple URLs and save them to a .txt file.
The URLs in the webpage start with '127.0.0.1', but I want to remove '127.0.0.1' from them and extract only the URLs. When I run the PowerShell script below, it only saves '127.0.0.1'. Any help fixing this would be appreciated.
$threatFeedUrl = "https://raw.githubusercontent.com/DandelionSprout/adfilt/master/Alternate versions Anti-Malware List/AntiMalwareHosts.txt"
    
# Download the threat feed data
$threatFeedData = Invoke-WebRequest -Uri $threatFeedUrl

# Define a regular expression pattern to match URLs starting with '127.0.0.1'
$pattern = '127\.0\.0\.1(?:[^\s]*)'

# Use the regular expression to find matches in the threat feed data
$matches = [regex]::Matches($threatFeedData.Content, $pattern)

# Create a list to store the matched URLs
$urlList = @()

# Populate the list with matched URLs
foreach ($match in $matches) {
    $urlList += $match.Value
}

# Specify the output file path
$outputFilePath = "output.txt"

# Save the URLs to the output file
$urlList | Out-File -FilePath $outputFilePath

Write-Host "URLs starting with '127.0.0.1' extracted from threat feed have been saved to $outputFilePath."
Solution
Preface:
- The target URL happens to be a (semi-structured) plain-text resource, so regex-based processing is appropriate. 
- In general, however, with HTML content, using a dedicated parser is preferable, given that regexes aren't capable of parsing HTML robustly.[1] See this answer for an example of extracting links from an HTML document, or the brief sketch below.
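For HTML pages, a minimal sketch of the parser-based approach (the page URL below is a hypothetical placeholder) could use the .Links collection that Invoke-WebRequest exposes on its response object:

# Hypothetical HTML page; not the plain-text threat feed above.
$htmlPageUrl = 'https://example.org/some-page.html'

# Invoke-WebRequest parses the HTML and surfaces anchor elements via .Links,
# so their href values can be extracted without any regex.
$response = Invoke-WebRequest -Uri $htmlPageUrl
$response.Links.href | Out-File -FilePath 'links.txt'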
'127\.0\.0\.1(?:[^\s]*)'
- You're mistakenly using a non-capturing group, (?:…), rather than a capturing one, (…).
- In the downloaded content, there is a space after 127.0.0.1.
- Therefore, use the following regex instead (\S is the simpler equivalent of [^\s], and + matches only a non-empty run of non-whitespace characters): '127\.0\.0\.1 (\S+)'
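For illustration, here is a quick test of both patterns against a single sample line in the feed's "127.0.0.1 <host>" format (the host name is made up):

$sampleLine = '127.0.0.1 malicious.example.com'   # hypothetical host name

# Original pattern: [^\s]* also matches the empty string, and because the character
# after '127.0.0.1' is a space, the match ends right there.
[regex]::Match($sampleLine, '127\.0\.0\.1(?:[^\s]*)').Value   # -> 127.0.0.1

# Corrected pattern: the literal space plus (\S+) forces the host name to be matched as well.
[regex]::Match($sampleLine, '127\.0\.0\.1 (\S+)').Value       # -> 127.0.0.1 malicious.example.com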
$matches = …
- While it technically doesn't cause a problem here, $matches is the name of the automatic $Matches variable and therefore shouldn't be used for custom purposes.
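For context, this is roughly how PowerShell itself uses $Matches: the -match operator populates it automatically on a successful match, which is why overwriting it with your own data is best avoided (same hypothetical host name as above):

'127.0.0.1 malicious.example.com' -match '127\.0\.0\.1 (\S+)'   # -> True; fills $Matches
$Matches[0]   # whole match:     127.0.0.1 malicious.example.com
$Matches[1]   # capture group 1: malicious.example.com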
$match.Value
- $match.Value is the whole text that your regex matched, whereas you only want the text of the capture group.
- Use $match.Groups[1].Value instead.
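With the corrected pattern, the difference looks like this (same hypothetical sample line as above):

$m = [regex]::Match('127.0.0.1 malicious.example.com', '127\.0\.0\.1 (\S+)')
$m.Value             # whole match, including the prefix: 127.0.0.1 malicious.example.com
$m.Groups[1].Value   # capture group 1 only:              malicious.example.com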
$urlList +=
- Building an array iteratively with += is inefficient, because a new array must be allocated behind the scenes in every iteration; simply use the foreach statement as an expression and let PowerShell collect the results for you, as sketched below. See this answer for more information.
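A small sketch of the two approaches, using arbitrary sample data:

# Inefficient: each += allocates a new array and copies all previous elements.
$viaPlusEquals = @()
foreach ($i in 1..100) { $viaPlusEquals += $i * 2 }

# Preferred: use foreach as an expression and let PowerShell collect the output.
$viaExpression = foreach ($i in 1..100) { $i * 2 }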
Invoke-WebRequest -Uri $threatFeedUrl
- Since you're only interested in the text content of the response, it is simpler to use Invoke-RestMethod rather than Invoke-WebRequest; the former returns the content directly (no need to access a .Content property).
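Side by side, with the same feed URL (both calls download the full feed, so this is for comparison only):

# Invoke-WebRequest returns a response object; the text lives in its .Content property.
$contentViaWebRequest = (Invoke-WebRequest -Uri $threatFeedUrl).Content

# Invoke-RestMethod returns the plain-text content directly as a string.
$contentViaRestMethod = Invoke-RestMethod -Uri $threatFeedUrl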
To put it all together:
$threatFeedUrl = 'https://raw.githubusercontent.com/DandelionSprout/adfilt/master/Alternate versions Anti-Malware List/AntiMalwareHosts.txt'
    
# Download the threat feed data
$threatFeedData = Invoke-RestMethod -Uri $threatFeedUrl
    
# Define a regular expression pattern to match URLs starting with '127.0.0.1'
$pattern = '127\.0\.0\.1 (\S+)'
    
# Use the regular expression to find matches in the threat feed data
$matchList = [regex]::Matches($threatFeedData, $pattern)
    
# Create and populate the list with matched URLs
$urlList = 
  foreach ($match in $matchList) {
    $match.Groups[1].Value
  }
    
# Specify the output file path
$outputFilePath = 'output.txt'
    
# Save the URLs to the output file
$urlList | Out-File -FilePath $outputFilePath
    
Write-Host "URLs starting with '127.0.0.1' extracted from threat feed have been saved to $outputFilePath."
[1] See this blog post for background information.
Answered By - mklement0
 