Get text in HTML using powershell

Question

Get text in HTML using powershell

In this html code:

<div id="ajaxWarningRegion" class="infoFont"></div>
<span id="ajaxStatusRegion"></span>
<form enctype="multipart/form-data" method="post" name="confIPBackupForm" action="/cgi-bin/utilserv/confIPBackup/w_confIPBackup" id="confIPBackupForm" >
<pre> Creating a new ZIP of IP Phone files from HTTP/PhoneBackup and HTTPS/PhoneBackup </pre>
<pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre>
<pre>Reports Success</pre>
<pre></pre>
<a href = /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip> Download the new ZIP of IP Phone files </a>
</div>

I want to get the text IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip, or just the date and time between IP_PHONE_BACKUP- and .zip.

How can I do this?


html regex powershell

Littlefish Jul 25 '12 at 13:58


4 answers

Michael sorens · Answer 1 · 2012-07-25T18:40:43+0000

What makes this question so interesting is that HTML looks and smells just like XML, the latter being far more programmable because of its well-defined, orderly structure. In an ideal world, HTML would be a subset of XML, but HTML in the real world is emphatically not XML. If you feed the example in the question to any XML parser, it will reject it for a variety of violations. Even so, the desired result can be achieved with a single line of PowerShell. This returns the whole text of the href:

 Select-NodeContent $doc.DocumentNode "//a/@href"

And this extracts the desired substring:

 Select-NodeContent $doc.DocumentNode "//a/@href" "IP_PHONE_BACKUP-(.*)\.zip"

The catch, however, is the setup required before you can run that single line of code. You need to:

Install HtmlAgilityPack to make HTML parsing look like XML parsing.
Install PowerShell Community Extensions if you want to analyze a live web page.
Understand XPath in order to be able to build a navigation path to your node target.
Understand regular expressions in order to be able to extract a substring from your target node.

Once these requirements are met, you can add the HtmlAgilityPack type to your environment and define the Select-NodeContent function, as shown below. The very end of the code shows how to assign a value to the $doc variable used in the one-liners above. It shows how to load HTML either from a file or from the web, depending on your needs.

 Set-StrictMode -Version Latest
 $HtmlAgilityPackPath = [System.IO.Path]::Combine((Get-Item $PROFILE).DirectoryName, "bin\HtmlAgilityPack.dll")
 Add-Type -Path $HtmlAgilityPackPath

 function Select-NodeContent(
     [HtmlAgilityPack.HtmlNode]$node,
     [string] $xpath,
     [string] $regex,
     [Object] $default = "")
 {
     if ($xpath -match "(.*)/@(\w+)$") {
         # If standard XPath to retrieve an attribute is given,
         # map to supported operations to retrieve the attribute text.
         ($xpath, $attribute) = $matches[1], $matches[2]
         $resultNode = $node.SelectSingleNode($xpath)
         $text = ?: { $resultNode } { $resultNode.Attributes[$attribute].Value } { $default }
     }
     else {
         # retrieve an element text
         $resultNode = $node.SelectSingleNode($xpath)
         $text = ?: { $resultNode } { $resultNode.InnerText } { $default }
     }
     # If a regex is given, use it to extract a substring from the text
     if ($regex) {
         if ($text -match $regex) { $text = $matches[1] }
         else { $text = $default }
     }
     return $text
 }

 $doc = New-Object HtmlAgilityPack.HtmlDocument
 $result = $doc.Load("tmp\temp.html") # Use this to load a file
 #$result = $doc.LoadHtml((Get-HttpResource $url)) # Use this PSCX cmdlet to load a live web page
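
The remark that real-world HTML is not XML can be checked without any extra libraries. The following is a minimal sketch, assuming a shortened sample string shaped like the question's markup (the `$html` and `$parsedAsXml` names are illustrative only). It tries the built-in `[xml]` cast, which is a strict XML parser, and records whether the markup was accepted:

```powershell
# Shortened sample shaped like the question's HTML (an assumption for this sketch):
# several top-level elements and an unquoted href value, both illegal in XML.
$html = '<div id="ajaxWarningRegion" class="infoFont"></div> <span id="ajaxStatusRegion"></span> <a href = /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip> Download </a>'

$parsedAsXml = $true
try {
    # The [xml] cast loads the string into a System.Xml.XmlDocument.
    $null = [xml]$html
}
catch {
    # A failed cast raises a terminating error, so we end up here.
    $parsedAsXml = $false
    Write-Host "XML parser rejected the markup: $($_.Exception.Message)"
}

$parsedAsXml
```

When the cast fails like this, a tolerant HTML parser such as HtmlAgilityPack, or a plain text approach, is the practical alternative.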

Joey · Answer 2 · 2012-07-26T05:12:55+0000

Actually, the HTML surrounding your file name doesn't matter here. You can easily extract the date with the following regular expression (which doesn't even care whether you are extracting it from an e-mail, an HTML page, or a CSV file):

 (?<=/tmp/IP_PHONE_BACKUP-)[^.]+(?=\.zip)

Quick test:

 PS> [regex]::Match($html, '(?<=/tmp/IP_PHONE_BACKUP-)[^.]+(?=\.zip)')

 Groups   : {2012-Jul-25_15:47:47}
 Success  : True
 Captures : {2012-Jul-25_15:47:47}
 Index    : 391
 Length   : 20
 Value    : 2012-Jul-25_15:47:47
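
If the extracted value is needed as an actual date rather than as text, it can be parsed afterwards. This is a sketch only: the sample `$html` string is a shortened stand-in for the page, and the format string `yyyy-MMM-dd_HH:mm:ss` is an assumption derived from the single file name shown in the question.

```powershell
# Shortened stand-in for the page content (assumption for this sketch).
$html = '<pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre> <a href = /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip> Download </a>'

$stamp = $null
$when  = $null

# Same lookbehind/lookahead pattern as in the answer above.
$match = [regex]::Match($html, '(?<=/tmp/IP_PHONE_BACKUP-)[^.]+(?=\.zip)')

if ($match.Success) {
    $stamp = $match.Value

    # Convert the text to a DateTime; the invariant culture makes "Jul"
    # parse the same way regardless of the machine's regional settings.
    $when = [datetime]::ParseExact(
        $stamp,
        'yyyy-MMM-dd_HH:mm:ss',
        [cultureinfo]::InvariantCulture)
}

$stamp
$when
```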

poussma · Answer 3 · 2012-07-25T14:08:33+0000

Groups 2 and 3 of the following regular expression capture the date and the time, respectively:

 /IP_PHONE_BACKUP-((.*)_(.*))\.zip/

The following question shows how to extract values from a regular expression match in PowerShell:

Is there a shorter way to infer groups from a Powershell regex?

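
In PowerShell, the groups can be read from the automatic `$Matches` variable after a successful `-match`. The sketch below deliberately swaps in a tightened variant of the pattern (`[^_]+` and `[^.]+` instead of `.*`), which is an adjustment made for this example rather than part of the answer: the file name appears twice in the page, so greedy `.*` groups can run past the first occurrence when the whole HTML string is searched. The sample `$html` is a shortened stand-in for the page.

```powershell
# Shortened stand-in for the page content (assumption for this sketch).
$html = '<pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre> <a href = /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip> Download </a>'

# Tightened variant: group 2 stops at the first underscore,
# group 3 stops at the first dot.
$pattern = 'IP_PHONE_BACKUP-(([^_]+)_([^.]+))\.zip'

$stamp = $null
$date  = $null
$time  = $null

if ($html -match $pattern) {
    $stamp = $Matches[1]   # date and time together
    $date  = $Matches[2]   # date part only
    $time  = $Matches[3]   # time part only
}

"$date / $time"
```

The surrounding slashes in the pattern above are delimiters from other languages and are not needed in a PowerShell string.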

Ocaso protal · Answer 4 · 2012-07-25T14:12:56+0000

Without regex:

 $a = '<div id="ajaxWarningRegion" class="infoFont"></div><span id="ajaxStatusRegion"></span><form enctype="multipart/form-data" method="post" name="confIPBackupForm" action="/cgi-bin/utilserv/confIPBackup/w_confIPBackup" id="confIPBackupForm" ><pre>Creating a new ZIP of IP Phone files from HTTP/PhoneBackup and HTTPS/PhoneBackup</pre><pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre><pre>Reports Success</pre><pre></pre><a href = /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip>Download the new ZIP of IP Phone files</a></div>'
 $a.Substring($a.IndexOf("IP_PHONE_BACKUP")+"IP_PHONE_BACKUP".length+1, $a.IndexOf(".zip")-$a.IndexOf("IP_PHONE_BACKUP")-"IP_PHONE_BACKUP".length-1)

Substring returns part of the original string. The first parameter is the starting position of the substring, and the second is the length of the desired substring. So all you need to do is calculate the start and the length with a little IndexOf and Length magic.
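
The same index arithmetic can be wrapped in a small helper so the marker strings are written only once. This is a sketch under stated assumptions: `Get-TextBetween` is a hypothetical function name, not an existing cmdlet, and the sample `$a` is shortened. Unlike the one-liner above, it searches for the suffix only after the prefix and returns `$null` when either marker is missing.

```powershell
# Hypothetical helper (not a built-in cmdlet): returns the text found
# between the first $Prefix and the next $Suffix, or $null if absent.
function Get-TextBetween {
    param(
        [string] $Text,
        [string] $Prefix,
        [string] $Suffix
    )

    $prefixIndex = $Text.IndexOf($Prefix)
    if ($prefixIndex -lt 0) { return $null }

    # The substring starts right after the prefix.
    $start = $prefixIndex + $Prefix.Length

    # Look for the suffix only from that position onwards.
    $end = $Text.IndexOf($Suffix, $start)
    if ($end -lt 0) { return $null }

    return $Text.Substring($start, $end - $start)
}

# Shortened sample input (assumption for this sketch).
$a = '<pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre>'
$stamp = Get-TextBetween -Text $a -Prefix 'IP_PHONE_BACKUP-' -Suffix '.zip'
$stamp
```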
