Using querySelectorAll in mshtml.HTMLDocumentClass in PowerShell Fails

Question

I am trying to make some web scrapers using PowerShell since I recently discovered that this can be done without much trouble.

A good starting point is simply to fetch the HTML, pipe it to Get-Member, and see what I can do from there, for example:

$html = Invoke-WebRequest "https://www.google.com"
$html.ParsedHtml | Get-Member

The methods available to me for getting certain elements are as follows:

getElementById()
getElementsByName()
getElementsByTagName()

For example, I can get the first IMG tag in a document as follows:

$html.ParsedHtml.getElementsByTagName("img")[0]

However, while researching whether I could use CSS selectors or XPath, I found that there are additional methods that Get-Member does not list, since this is just a standard HTML document object:

querySelector()
querySelectorAll()

So instead of:

$html.ParsedHtml.getElementsByTagName("img")[0]

I can do:

$html.ParsedHtml.querySelector("img")

And, naturally, I expected this to work too:

$html.ParsedHtml.querySelectorAll("img")

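...to get all of the IMG elements. As far as I can tell from the documentation and from searching Google, this should work. Instead, the call crashes PowerShell (0xc0000374).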

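I tested this with PowerShell 5 on Windows 10 x64, in a Win10 x64 VM, and on Win7 x64 upgraded to PowerShell 5. Every test used PowerShell 5, so I cannot yet say whether the problem is specific to that version.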

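Has anyone run into this before? Are there alternatives to querySelectorAll? I need a way to select a set of elements with standard selectors, for cases where tag name/id/name is not enough.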

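P.S. I know the InternetExplorer.Application COM object can be used from PowerShell to do much the same thing, but that makes the PowerShell script depend on Internet Explorer. For completeness, here is the equivalent code, which crashes in the same way: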

# create browser object
$ie = New-Object -ComObject InternetExplorer.Application

# make browser visible for debugging, otherwise this isn't necessary for function
$ie.Visible = $true

# browse to page
$ie.Navigate("https://www.google.com")
# wait till browser is not busy
Do { Start-Sleep -Milliseconds 100 } Until (!$ie.Busy)

# this works
$ie.document.getElementsByTagName("img")[0]

# this works as well
$ie.document.querySelector("img")

# blow it up
$ie.document.querySelectorAll("img")

# we wanna quit the process, but since we blew it up we don't really make it here
$ie.Quit()

UPDATE

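I have now tested other versions of PowerShell. v2 through v4 crash with the InternetExplorer.Application COM object. v3 and v4 crash with Invoke-WebRequest; v2 does not have that cmdlet.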

mshtml powershell selectors-api com powershell-v5.0

TheKojukinator 12 '16 20:12

midnightfreddie · Answer 1 · 2016-06-06T17:44:55+0000

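I ran into this as well and asked about it on reddit. As far as I can tell, the crash happens when Powershell tries to enumerate the HTML DOM NodeList returned by querySelectorAll(). childNodes() returns the same kind of object and PS can enumerate it, so I suspect there is glue code for .ParsedHtml.childNodes that is missing for .ParsedHtml.querySelectorAll(). Intellisense can trigger the crash too, because it inspects the object.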

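There is a workaround, though! Call the native DOM members .item() and .length directly and copy each node into a PowerShell array. The code below fetches the newest posts from /r/Powershell, selects the post links with querySelectorAll(), and then enumerates them manually through the DOM methods into a native Powershell array.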

$Result = Invoke-WebRequest -Uri "https://www.reddit.com/r/PowerShell/new/"

$NodeList = $Result.ParsedHtml.querySelectorAll("#siteTable div div p.title a")

$PsNodeList = @()
for ($i = 0; $i -lt $NodeList.Length; $i++) { 
    $PsNodeList += $NodeList.item($i)
}

$PsNodeList | ForEach-Object {
    $_.InnerHtml
}

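Only .length and .item() are used here, so PowerShell never enumerates the DOM collection itself. The CSS selector may also match the domain links shown next to each title (self.PowerShell and so on); any valid CSS selector can be passed to querySelectorAll(). The items that come back are ordinary DOM nodes, so this example simply prints each one's .InnerHtml.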

Edit 2: the same workaround as a wrapper function:

function Get-FixedQuerySelectorAll {
    param (
        $HtmlWro,
        $CssSelector
    )
    # After assignment, $NodeList will crash powershell if enumerated in any way including Intellisense-completion while coding!
    $NodeList = $HtmlWro.ParsedHtml.querySelectorAll($CssSelector)

    for ($i = 0; $i -lt $NodeList.length; $i++) {
        Write-Output $NodeList.item($i)
    }
}

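$HtmlWro is the HTML web response object returned by Invoke-WebRequest. Pass the response object itself, not its .ParsedHtml property; the function reads .ParsedHtml internally. The nodes are emitted as ordinary Powershell output.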
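The index-based loop used in this answer can be tried without a browser or a network connection. The sketch below is only an illustration: $fakeNodeList is a hypothetical stand-in that exposes the same two members (.length and .item()), and Copy-NodeListToArray is a made-up helper name, not part of PowerShell or mshtml. It does not show how a real COM NodeList behaves.

```powershell
# Hypothetical stand-in for a DOM NodeList: exposes only .length and .item(i).
$fakeNodeList = [pscustomobject]@{ length = 3 }
$fakeNodeList | Add-Member -MemberType ScriptMethod -Name item -Value {
    param($index)
    "node$index"
}

# Hypothetical helper: copy the items by index instead of letting
# PowerShell enumerate the collection object itself.
function Copy-NodeListToArray {
    param($NodeList)

    $items = New-Object System.Collections.Generic.List[object]
    for ($i = 0; $i -lt $NodeList.length; $i++) {
        $items.Add($NodeList.item($i))
    }
    # The leading comma keeps the array intact even with 0 or 1 items.
    return ,$items.ToArray()
}

$nodes = Copy-NodeListToArray -NodeList $fakeNodeList
$nodes | ForEach-Object { $_ }
```

With a real response object, the function from the answer would be called as Get-FixedQuerySelectorAll -HtmlWro $Result -CssSelector "img", where $Result comes from Invoke-WebRequest.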

Dark Daskin · Answer 2 · 2016-12-06T18:30:58+0000

With @midnightfreddie's solution I get Exception from HRESULT: 0x80020101 when calling $NodeList.item($i).

I ended up with the following workaround:

function Invoke-QuerySelectorAll($node, [string] $selector)
{
    $nodeList = $node.querySelectorAll($selector)
    $nodeListType = $nodeList.GetType()
    $result = @()
    for ($i = 0; $i -lt $nodeList.length; $i++)
    {
        $result += $nodeListType.InvokeMember("item", [System.Reflection.BindingFlags]::InvokeMethod, $null, $nodeList, $i)
    }
    return $result
}

This also works with a document obtained through New-Object -ComObject InternetExplorer.Application.
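The difference from the first answer is that item() is called through Type.InvokeMember rather than through PowerShell's own member lookup. A minimal sketch of that call pattern, using a hypothetical FakeNodeList .NET class in place of the COM object (so it shows the reflection call only, not the COM behavior):

```powershell
# Hypothetical stand-in type; a real NodeList would be a COM object instead.
Add-Type -TypeDefinition @"
public class FakeNodeList
{
    public int length { get { return 2; } }
    public string item(int index) { return "node" + index; }
}
"@

$nodeList = New-Object FakeNodeList
$nodeListType = $nodeList.GetType()
# Explicit lookup flags for the .NET stand-in (an assumption of this sketch).
$flags = [System.Reflection.BindingFlags]'InvokeMethod, Public, Instance'

$result = @()
for ($i = 0; $i -lt $nodeList.length; $i++) {
    # Call item() through reflection instead of PowerShell's member binder.
    $result += $nodeListType.InvokeMember('item', $flags, $null, $nodeList, @($i))
}

$result
```

Against a live page, the answer's function would be called as Invoke-QuerySelectorAll $ie.document "img" once the browser is no longer busy.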
