0

I have an Excel sheet that I want to remove some HTML tags. My problem is that some of the tags are not just simple <div> tags, but have extra characters like <div class="ExternalClassEA74AB3F178E48EDAD3BDE4FC90B1182"> Replace from <div until we reach the ending > of the tag. How do I replace string parts like this with "". Thanks.

2 Answers 2

2
Sub RemoveDivs()
    Dim html$
    html = "Some other text<div class=""ExternalClassEA74AB3F178E48EDAD3BDE4FC90B1182""> and here too"
    With CreateObject("VBScript.RegExp")
        .Pattern = "<div.*?>": .Global = True
        html = .Replace(html, "")
    End With
    MsgBox html
End Sub
Sign up to request clarification or add additional context in comments.

Comments

0

This is not a perfect solution, and most likely fail to do what you expect it if you have < or > as an actual text you want to keep.

It should give you a starting point, and should be able to make some changes to achieve the "perfection".

See the comments for more details:

Sub htmlStrip()

Dim ws As Worksheet: Set ws = ActiveWorkbook.Sheets("Sheet Name Here") '<- set sheet name
Dim lRow As Long: lRow = ws.Cells(Rows.Count, 1).End(xlUp).Row 'get last row
Dim lCol As Long: lCol = ws.Cells(1, Columns.Count).End(xlToLeft).Column 'get last column

Dim arrData As Variant: arrData = ws.Range("A1").Resize(lRow, lCol) 'get the data into an array
Dim R As Long, C As Long, X As Long, Z As Long

Const sS As String = "<" 'html tag start
Const sE As String = ">" 'html tag end

Dim lS As Long, lE As Long, lCnt As Long, lECnt As Long
Dim strTmpS As String, strTmpE As String, strTemp As String, strReplace As String


For R = LBound(arrData) To UBound(arrData) 'iterate through all rows of data
    For C = LBound(arrData, 2) To UBound(arrData, 2) 'iterate through all columns of data
        lS = InStr(1, arrData(R, C), sS) 'get the location of the first tag
        lE = InStr(1, arrData(R, C), sE) 'get the location of the last tag

        If lS > 0 And lE > 0 Then 'if at least one of each found
            If lE < lS Then lE = InStr(lS + 1, arrData(R, C), sE) 'prevent a case when the first tag is the ending one

            lSCnt = Len(arrData(R, C)) - Len(Replace(arrData(R, C), sS, "")) 'check how many times we have the first tag
            lECnt = Len(arrData(R, C)) - Len(Replace(arrData(R, C), sE, "")) 'check how many times we have the last tag

            Z = WorksheetFunction.Min(lSCnt, lECnt) 'avoid a situation when we have some opening or closing tags, but not the matching ones

            For X = 1 To Z 'iterate through the number of times at least both tags are in
                strReplace = Mid(arrData(R, C), lS, lE - lS + 1) 'get the string to replace
                arrData(R, C) = Replace(arrData(R, C), strReplace, "") 'remove the tag found

                lS = InStr(1, arrData(R, C), sS) 'get the location of the first tag (again)
                lE = InStr(1, arrData(R, C), sE) 'get the location of the last tag (again)

                If lS = 0 Or lE = 0 Then Exit For 'either we reached the end of the loop, or did a multi replace - so exit here
            Next X
        End If
    Next C
Next R

ws.Range("A1").Resize(lRow, lCol).Offset(0, lCol) = arrData 'put the data back on the spreadsheet, at the right of the original data

End Sub

10 Comments

Hi FAB. Thanks for your help. I'm getting there, it is finding the strings and the tags, but I'm getting an error on the line "strReplace = Mid(arrData(R, C), lS, lE - lS + 1) 'get the string to replace" The error is "Invalid procedure call or argument". I believe it is the arguments. When I debug I see IE=367 and IS=437. I guess that is causing a negative number in the argument? Any ideas?
I can imagine the that could only happen if you have > before <... I'll check when I get to a computer.
@L33TLE I've updated the answer to check for such scenarios, it should be better now.
Hi FAB. Can you tell why the code is failing on this html string? <li>A defined change (bullet# e, g; Pg.3)</li><li>Managers (bullet# h, Pg.3). Also, respective (section 5.0, Pg.3)</li></ul><li><strong style="text-decoration&#58;underline;">Document </strong><span style="text-decoration&#58;underline;"><strong>'Sample Report' </strong></span>&#160;also shows (a training platform)</li></ul><p>The following&#58;</p>
it seems like it does not like parenthesizes. If I take them out it works. And ideas how to avoid this? ~<li></li><li></li></ul><li><strong></strong><span><strong>(a training platform)</strong></span></li></ul><p></p>~
|

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.