I am trying to write some web scrapers in PowerShell, since I recently discovered that this can be done without much trouble.
A good starting point is simply to fetch the HTML, pipe it to Get-Member, and see what I can do from there, for example:
$html = Invoke-WebRequest "https://www.google.com"
$html.ParsedHtml | Get-Member
The methods available to me for getting certain elements are as follows:
getElementById()
getElementsByName()
getElementsByTagName()
For example, I can get the first IMG tag in a document as follows:
$html.ParsedHtml.getElementsByTagName("img")[0]
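To go beyond the first element, the collection returned by getElementsByTagName() can be piped like any other PowerShell collection. The following is a minimal sketch, assuming Windows PowerShell 3.0-5.1 (ParsedHtml is not available in PowerShell 6 and later); the function name Get-ImageSource and its -Document parameter are hypothetical names chosen for illustration.

```powershell
# Sketch: return the src attribute of every IMG element in a parsed document.
# Get-ImageSource and -Document are illustrative names, not part of any module.
function Get-ImageSource {
    param(
        [Parameter(Mandatory = $true)]
        $Document   # e.g. the ParsedHtml property of an Invoke-WebRequest result
    )

    # Pipe the element collection and emit one src value per IMG element.
    $Document.getElementsByTagName("img") |
        ForEach-Object { $_.src }
}
```

With the $html variable from above, this would be called as `Get-ImageSource -Document $html.ParsedHtml`.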
However, after researching whether I could use CSS selectors or XPath, I found that methods not listed by Get-Member are also available, since this is just the HTML document object:
querySelector()
querySelectorAll()
So instead of:
$html.ParsedHtml.getElementsByTagName("img")[0]
I can do:
$html.ParsedHtml.querySelector("img")
And, to get every match:
$html.ParsedHtml.querySelectorAll("img")
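...I expect to get all of the IMG elements. Instead, whether the page is Google or something else, it fails with error code (0xc0000374).
I am using PowerShell 5 on Windows 10 x64. I also tried a Win10 x64 VM, and Win7 x64 with PowerShell 5. [...]
Has anyone else run into this with querySelectorAll? [...]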
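One possible workaround, offered as an unverified sketch: it assumes the failure is triggered when PowerShell enumerates the node list returned by querySelectorAll(), and that reading the list's length property and calling its item() method by index avoids that enumeration. The function name Get-QuerySelectorAllItem and its parameters are hypothetical.

```powershell
# Sketch: read querySelectorAll() results by index instead of enumerating them.
# Assumption: the node list exposes .length and .item(i), as DOM node lists do.
function Get-QuerySelectorAllItem {
    param(
        [Parameter(Mandatory = $true)]
        $Document,              # e.g. $html.ParsedHtml or $ie.document

        [Parameter(Mandatory = $true)]
        [string] $Selector      # CSS selector such as "img"
    )

    # Keep the node list in a variable; never pipe it or use it in foreach.
    $nodeList = $Document.querySelectorAll($Selector)

    for ($i = 0; $i -lt $nodeList.length; $i++) {
        # Emit each node individually.
        $nodeList.item($i)
    }
}
```

It would be called as `Get-QuerySelectorAllItem -Document $html.ParsedHtml -Selector "img"`. If the assumption about enumeration is wrong, this will fail the same way as the direct call.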
P.S. I also tried the InternetExplorer.Application COM object from PowerShell and got the same result, except that the problem hits Internet Explorer rather than PowerShell. For example:
$ie = New-Object -ComObject InternetExplorer.Application
$ie.Visible = $true
$ie.Navigate("https://www.google.com")
Do { Start-Sleep -m 100 } Until (!$ie.Busy)
$ie.document.getElementsByTagName("img")[0]
$ie.document.querySelector("img")
$ie.document.querySelectorAll("img")
$ie.Quit()
UPDATE
I also tested other PowerShell versions: v2-v4 with the InternetExplorer.Application COM object, and v3-4 with Invoke-WebRequest, since v2 does not have it.