I have a long string in one attribute. I want to be able to extract a small portion of it into a new attribute. This is an example of the attribute:

"< html xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:msxsl="urn:schemas-microsoft-com:xslt"> < meta http-equiv="content-type" content="text/html; charset=UTF-8" > < /head > < body style="margin:0px 0px 0px 0px;overflow:auto;background:#FFFFFF;" > < table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-collapse:collapse;padding:3px 3px 3px 3px" > < tr style="text-align:center;font-weight:bold;background:#9CBCE2" > < td >Fly500< /td > < /tr > < tr > < td > < table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-spacing:0px; padding:3px 3px 3px 3px" > < tr > < td >Name< /td > < td >Fly500< /td > < /tr > < tr bgcolor="#D4E4F3" > < td >Notes< /td > < td >< /td > < /tr > < tr > < td >Source< /td > < td >< /td > < /tr > < tr bgcolor="#D4E4F3" > < td >Duplicate< /td > < td > < Null >< /td > < /tr > < tr > < td > Type < /td > < td > Wildlife Sensitive Area< /td > < /tr > < tr bgcolor="#D4E4F3" > < td >Start_Date< /td > < td >May 1< /td > < /tr > < tr > < td >End_Date< /td > < td

What I want to extract is the value that follows Type, here "Wildlife Sensitive Area". I tried splitting with Right and Left, but the value is not always the same number of characters from either end. What I need to do is take everything to the right of "< td > Type < /td >" and extract what sits between the next "< td >" and "< /td >".

  • Is all that text in one field? Commented Apr 27, 2018 at 18:40
  • Yes, and there is more I just cut it so it was faster to go through. Commented Apr 27, 2018 at 18:43
  • Looks like a parsing HTML question with a HTMLParser module. Commented Apr 28, 2018 at 3:19

1 Answer

This will work if the field looks exactly like your example. Use the Python parser:

def extract(textfield):
    return textfield.split('Type')[-1].split('< td >')[1].split('<')[0].strip()

Call the function on your new field with:

extract(!Textfield!)

Change !Textfield! to match the name of your field.

You will very likely need to adapt the code; take a look at Common string operations, for example split and strip. The [0], [1] and [-1] are list indexes; see Python Lists.
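To make adapting it easier, here is a rough breakdown of what each step in the chain does. The `sample` string is a shortened stand-in for the field value (an assumption for illustration, not the full attribute), and the intermediate variable names are made up.

```python
# Shortened stand-in for the field value (assumed for illustration).
sample = ('< tr > < td > Type < /td > < td > Wildlife Sensitive Area< /td > < /tr > '
          '< tr > < td >Start_Date< /td > < td >May 1< /td > < /tr >')

# 1. Keep everything after the LAST occurrence of 'Type' ([-1] is the last piece).
after_label = sample.split('Type')[-1]

# 2. Split on the opening cell tag; piece [0] is the end of the label cell,
#    piece [1] starts with the value cell.
value_and_rest = after_label.split('< td >')[1]

# 3. Cut at the next '<' (the closing tag) and trim the surrounding spaces.
value = value_and_rest.split('<')[0].strip()

print(value)
```

Two things to watch for: `split` is case-sensitive, and because of `[-1]` any later occurrence of 'Type' in the text (for example inside another value) would change the result.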

Example in python console:

text = '"< html xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:msxsl="urn:schemas-microsoft-com:xslt"> < meta http-equiv="content-type" content="text/html; charset=UTF-8" > < /head > < body style="margin:0px 0px 0px 0px;overflow:auto;background:#FFFFFF;" > < table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-collapse:collapse;padding:3px 3px 3px 3px" > < tr style="text-align:center;font-weight:bold;background:#9CBCE2" > < td >Fly500< /td > < /tr > < tr > < td > < table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-spacing:0px; padding:3px 3px 3px 3px" > < tr > < td >Name< /td > < td >Fly500< /td > < /tr > < tr bgcolor="#D4E4F3" > < td >Notes< /td > < td >< /td > < /tr > < tr > < td >Source< /td > < td >< /td > < /tr > < tr bgcolor="#D4E4F3" > < td >Duplicate< /td > < td > < Null >< /td > < /tr > < tr > < td > Type < /td > < td > Wildlife Sensitive Area< /td > < /tr > < tr bgcolor="#D4E4F3" > < td >Start_Date< /td > < td >May 1< /td > < /tr > < tr > < td >End_Date< /td > < td'
def extract(textfield):
    return textfield.split('Type')[-1].split('< td >')[1].split('<')[0].strip()

extract(text)
'Wildlife Sensitive Area'
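If the spacing inside the tags varies between records, a regular expression may be more forgiving than chained splits. This is only a sketch under the assumption that the value cell always directly follows the label cell; `extract_value` and its `label` parameter are hypothetical names, not part of the code above.

```python
import re

def extract_value(textfield, label='Type'):
    """Return the text of the cell that follows the cell containing `label`,
    or None if the label is not found or the field is empty."""
    # Tolerate any amount of whitespace inside and between the tags.
    pattern = (r'<\s*td\s*>\s*' + re.escape(label) + r'\s*<\s*/td\s*>'
               r'\s*<\s*td\s*>(.*?)<\s*/td\s*>')
    match = re.search(pattern, textfield or '', re.DOTALL)
    return match.group(1).strip() if match else None

# Shortened stand-in for the field value (assumed for illustration).
sample = ('< tr > < td > Type < /td > < td > Wildlife Sensitive Area< /td > < /tr > '
          '< tr bgcolor="#D4E4F3" > < td >Start_Date< /td > < td >May 1< /td > < /tr >')

print(extract_value(sample))
print(extract_value(sample, 'Start_Date'))
```

The same function can pull other rows by passing a different label, and it returns None instead of raising an error when the label is missing.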
  • I ran into a problem just like yours and decided to create a usable library (still in development). It extracts from a stream, so it can handle data that is too big to fit in memory. So, if you ever need an alternative: github.com/Mandelag/util/tree/master/StreamExtractor Commented Apr 28, 2018 at 11:44
