I have an Excel sheet that I want to remove some HTML tags. My problem is that some of the tags are not just simple <div> tags, but have extra characters like <div class="ExternalClassEA74AB3F178E48EDAD3BDE4FC90B1182"> Replace from <div until we reach the ending > of the tag. How do I replace string parts like this with "". Thanks.
Add a comment
|
2 Answers
This is not a perfect solution, and most likely fail to do what you expect it if you have < or > as an actual text you want to keep.
It should give you a starting point, and should be able to make some changes to achieve the "perfection".
See the comments for more details:
Sub htmlStrip()
Dim ws As Worksheet: Set ws = ActiveWorkbook.Sheets("Sheet Name Here") '<- set sheet name
Dim lRow As Long: lRow = ws.Cells(Rows.Count, 1).End(xlUp).Row 'get last row
Dim lCol As Long: lCol = ws.Cells(1, Columns.Count).End(xlToLeft).Column 'get last column
Dim arrData As Variant: arrData = ws.Range("A1").Resize(lRow, lCol) 'get the data into an array
Dim R As Long, C As Long, X As Long, Z As Long
Const sS As String = "<" 'html tag start
Const sE As String = ">" 'html tag end
Dim lS As Long, lE As Long, lCnt As Long, lECnt As Long
Dim strTmpS As String, strTmpE As String, strTemp As String, strReplace As String
For R = LBound(arrData) To UBound(arrData) 'iterate through all rows of data
For C = LBound(arrData, 2) To UBound(arrData, 2) 'iterate through all columns of data
lS = InStr(1, arrData(R, C), sS) 'get the location of the first tag
lE = InStr(1, arrData(R, C), sE) 'get the location of the last tag
If lS > 0 And lE > 0 Then 'if at least one of each found
If lE < lS Then lE = InStr(lS + 1, arrData(R, C), sE) 'prevent a case when the first tag is the ending one
lSCnt = Len(arrData(R, C)) - Len(Replace(arrData(R, C), sS, "")) 'check how many times we have the first tag
lECnt = Len(arrData(R, C)) - Len(Replace(arrData(R, C), sE, "")) 'check how many times we have the last tag
Z = WorksheetFunction.Min(lSCnt, lECnt) 'avoid a situation when we have some opening or closing tags, but not the matching ones
For X = 1 To Z 'iterate through the number of times at least both tags are in
strReplace = Mid(arrData(R, C), lS, lE - lS + 1) 'get the string to replace
arrData(R, C) = Replace(arrData(R, C), strReplace, "") 'remove the tag found
lS = InStr(1, arrData(R, C), sS) 'get the location of the first tag (again)
lE = InStr(1, arrData(R, C), sE) 'get the location of the last tag (again)
If lS = 0 Or lE = 0 Then Exit For 'either we reached the end of the loop, or did a multi replace - so exit here
Next X
End If
Next C
Next R
ws.Range("A1").Resize(lRow, lCol).Offset(0, lCol) = arrData 'put the data back on the spreadsheet, at the right of the original data
End Sub
10 Comments
L33TLE
Hi FAB. Thanks for your help. I'm getting there, it is finding the strings and the tags, but I'm getting an error on the line "strReplace = Mid(arrData(R, C), lS, lE - lS + 1) 'get the string to replace" The error is "Invalid procedure call or argument". I believe it is the arguments. When I debug I see IE=367 and IS=437. I guess that is causing a negative number in the argument? Any ideas?
FAB
I can imagine the that could only happen if you have > before <... I'll check when I get to a computer.
FAB
@L33TLE I've updated the answer to check for such scenarios, it should be better now.
L33TLE
Hi FAB. Can you tell why the code is failing on this html string?
<li>A defined change (bullet# e, g; Pg.3)</li><li>Managers (bullet# h, Pg.3). Also, respective (section 5.0, Pg.3)</li></ul><li><strong style="text-decoration:underline;">Document </strong><span style="text-decoration:underline;"><strong>'Sample Report' </strong></span> also shows (a training platform)</li></ul><p>The following:</p>L33TLE
it seems like it does not like parenthesizes. If I take them out it works. And ideas how to avoid this? ~<li></li><li></li></ul><li><strong></strong><span><strong>(a training platform)</strong></span></li></ul><p></p>~
|