Convert HTML to plain text in VBA

I have an Excel sheet with cells containing HTML. How can I convert them to plain text? At the moment the cells are full of useless tags and styles. I want to rewrite the content from scratch, but that will be much easier if I can get the plain text first.

I can write a script to convert HTML to plain text in PHP, so if you cannot come up with a solution in VBA, perhaps you can tell me how to send the cell data to a website and get the result back.

+9
html vba parsing html-parsing
5 answers

Set a reference to the Microsoft HTML Object Library (in the VBA editor: Tools > References).

    Function HtmlToText(sHTML) As String
        Dim oDoc As HTMLDocument
        Set oDoc = New HTMLDocument
        oDoc.body.innerHTML = sHTML
        HtmlToText = oDoc.body.innerText
    End Function
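
As a rough sketch of how this could be applied to a whole sheet, the code below converts every selected cell and writes the plain text into the cell to its right. It uses a late-bound variant of the same idea (CreateObject("htmlfile")), so it does not depend on the library reference being set. The names HtmlToTextLateBound and ConvertSelectionToText are made up for this example, and it assumes the column to the right of the selection is free to overwrite.

```vba
Option Explicit

' Late-bound variant of the function above: the MSHTML document is created
' at run time, so no reference to the Microsoft HTML Object Library is needed.
' HtmlToTextLateBound is a hypothetical name chosen for this sketch.
Function HtmlToTextLateBound(ByVal sHTML As String) As String
    Dim oDoc As Object
    Set oDoc = CreateObject("htmlfile")
    oDoc.body.innerHTML = sHTML
    HtmlToTextLateBound = oDoc.body.innerText
End Function

' Writes the plain text of each selected cell into the cell to its right.
' Assumption: the column to the right of the selection may be overwritten.
Sub ConvertSelectionToText()
    Dim cell As Range
    For Each cell In Selection.Cells
        ' Skip empty cells and cells holding error values such as #N/A
        If Not IsEmpty(cell.Value) And Not IsError(cell.Value) Then
            cell.Offset(0, 1).Value = HtmlToTextLateBound(CStr(cell.Value))
        End If
    Next cell
End Sub
```

Select the cells that contain HTML, run ConvertSelectionToText, and compare the two columns before deleting the originals.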

Tim

+16

A very simple way to extract the text is to scan the HTML character by character and copy the characters outside the angle brackets to a new string.

    Function StripTags(ByVal html As String) As String
        Dim text As String
        Dim accumulating As Boolean   ' True while we are outside a tag
        Dim n As Long                 ' Long avoids overflow on very long strings
        Dim c As String

        text = ""
        accumulating = True

        n = 1
        Do While n <= Len(html)
            c = Mid(html, n, 1)
            If c = "<" Then
                accumulating = False
            ElseIf c = ">" Then
                accumulating = True
            ElseIf accumulating Then
                text = text & c
            End If
            n = n + 1
        Loop

        StripTags = text
    End Function

This can leave a lot of extraneous whitespace, but it does remove the tags.
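
One possible way to tidy up that leftover whitespace is to run the result through a small helper like the one sketched below. CollapseWhitespace is a hypothetical name, not part of the answer above; it simply turns line breaks and tabs into spaces and then squeezes runs of spaces down to one.

```vba
Option Explicit

' Hypothetical helper: replaces line breaks and tabs with spaces,
' collapses runs of spaces into a single space, and trims both ends.
Function CollapseWhitespace(ByVal s As String) As String
    s = Replace(s, vbCr, " ")
    s = Replace(s, vbLf, " ")
    s = Replace(s, vbTab, " ")

    ' Repeat until no double spaces remain
    Do While InStr(s, "  ") > 0
        s = Replace(s, "  ", " ")
    Loop

    CollapseWhitespace = Trim$(s)
End Function
```

It could then be combined with the function above as CollapseWhitespace(StripTags(html)). Note that this also removes intentional line breaks, so it is only suitable when single-line output is acceptable.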

+4

Tim's solution was great; it worked like a charm.

I would like to contribute: use this code to add the reference to the Microsoft HTML Object Library at runtime:

    Set ID = ThisWorkbook.VBProject.References
    ID.AddFromGuid "{3050F1C5-98B5-11CF-BB82-00AA00BDCE0B}", 2, 5

It worked on Windows XP and Windows 7.
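
If this runs more than once, adding a reference that is already present raises an error, so it may help to check first. The following is only a sketch under a few assumptions: EnsureHtmlLibraryReference is a made-up name, the GUID and version numbers are copied from the snippet above, and "Trust access to the VBA project object model" must be enabled in the Trust Center for any of this to work.

```vba
Option Explicit

' Adds the Microsoft HTML Object Library reference only if it is missing.
' Hypothetical helper; needs "Trust access to the VBA project object model".
Sub EnsureHtmlLibraryReference()
    Const MSHTML_GUID As String = "{3050F1C5-98B5-11CF-BB82-00AA00BDCE0B}"
    Const REF_KIND_TYPELIB As Long = 0   ' value of vbext_rk_TypeLib
    Dim ref As Object

    For Each ref In ThisWorkbook.VBProject.References
        ' Only type-library references are compared by GUID
        If ref.Type = REF_KIND_TYPELIB Then
            If StrComp(ref.GUID, MSHTML_GUID, vbTextCompare) = 0 Then
                Exit Sub   ' already referenced, nothing to do
            End If
        End If
    Next ref

    ' Same call as in the snippet above (GUID, major, minor)
    ThisWorkbook.VBProject.References.AddFromGuid MSHTML_GUID, 2, 5
End Sub
```

Calling it from Workbook_Open would be one option, although code that relies on early-bound HTMLDocument types still has to compile before the reference exists.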

+3

The answer above is excellent. However, a small adjustment can be added to avoid a predictable error when the input is Null.

    Function HtmlToText(sHTML) As String
        Dim oDoc As HTMLDocument

        ' Return an empty string for Null input instead of raising an error
        If IsNull(sHTML) Then
            HtmlToText = ""
            Exit Function
        End If

        Set oDoc = New HTMLDocument
        oDoc.body.innerHTML = sHTML
        HtmlToText = oDoc.body.innerText
    End Function

0

Yes! I managed to solve my problem. Thanks to everyone!

In my case, I had this type of input:

 <p>Lorem ipsum dolor sit amet.</p> <p>Ut enim ad minim veniam.</p> <p>Duis aute irure dolor in reprehenderit.</p> 

And I did not want the result to be glued together without line breaks.

So I first split my input on each <p> tag into an array of "paragraphs", then for each element I used Tim's answer to get the text from the HTML (a very nice answer, by the way).

In addition, I joined the cleaned paragraphs with Chr(10), the line-feed character that Excel uses for a line break inside a cell.

Final code:

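    Public Function HtmlToText(ByVal sHTML As Variant) As String
        Dim oDoc As HTMLDocument
        Dim result As String
        Dim paragraphs() As String
        Dim paragraph As Variant   ' For Each over an array needs a Variant

        ' sHTML is a Variant so that a Null value can reach this check;
        ' a String parameter would fail before IsNull is evaluated
        If IsNull(sHTML) Then
            HtmlToText = ""
            Exit Function
        End If

        result = ""
        paragraphs = Split(sHTML, "<p>")
        For Each paragraph In paragraphs
            Set oDoc = New HTMLDocument
            oDoc.body.innerHTML = paragraph
            ' Two line feeds before each paragraph's text
            result = result & Chr(10) & Chr(10) & oDoc.body.innerText
        Next paragraph

        HtmlToText = result
    End Function

0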
