Revisions to Extract strings from between tags and save to new text file

deleted 24 characters in body

edited Dec 8, 2018 at 15:00

I added your example content to a disk file named file and then showed its contents using the cat command. Then, using awk on file, I stripped out the parts you're calling tags and printed the two pieces of data in a tab-delimited format. Is this what you're looking for?

$ cat file
1731 0 obj
<</Page 250/Type/Annot/Subtype/Highlight/Rotate 0/Rect[ 95.4715 347.644 337.068 362.041]/NM(929cd95c-f962-4fa3-b734-2e0e67d7b321)/T(iPad)/CreationDate(D:20160818145053Z00'00')/M(D:20160818145204Z00'00')/C[ 0.454902 0.501961 0.988235]/CA 1/QuadPoints[ 95.4715 362.041 337.068 362.041 95.4715 347.644 337.068 347.644]/Contents(EXAMPLE OF TEXT TO BE EXTRACTED)/F 4/Subj(Highlight)>>
endobj
$ awk '{sub(/\<\<\//, "")};{sub(/\/Type.*\/Contents\(/, "\t")};{sub(/\)\/F.*$/, "")};/Page [0-9]/{print}' file
Page 250    EXAMPLE OF TEXT TO BE EXTRACTED
$

What the awk program is doing:

/Page [0-9]/ selects lines that contain 'Page', a space, and then a digit, e.g., Page 250. I'm assuming that EXAMPLE OF TEXT TO BE EXTRACTED will not contain that pattern. I don't think it matters, and the code can easily be modified to accommodate it if it does.
sub(/\<\<\//, "") strips the leading <</
sub(/\/Type.*\/Contents\(/, "\t") replaces everything from /Type through /Contents( with a tab.
sub(/\)\/F.*$/, "") strips everything from )/F to the end of the line.
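
The same steps can be laid out one per line with comments, which may be easier to follow. This is only a sketch: extract_highlights is a hypothetical wrapper name, the sample line is shortened from the one shown above, and the < characters are written unescaped because some awk implementations, GNU awk among them, treat \< as a word-boundary anchor rather than a literal <.

```shell
# extract_highlights is a hypothetical wrapper name, not part of the original answer.
# It reads the named files, or standard input when no file is given.
extract_highlights() {
    awk '
    {
        sub(/<<\//, "")                    # strip the leading <</
        sub(/\/Type.*\/Contents\(/, "\t")  # replace /Type through /Contents( with a tab
        sub(/\)\/F.*$/, "")                # strip )/F through the end of the line
    }
    /Page [0-9]/ { print }                 # print lines containing Page, a space, and a digit
    ' "$@"
}

# Demonstration on a shortened copy of the sample object (made-up trimming).
sample=$(mktemp)
printf '%s\n' \
    '1731 0 obj' \
    '<</Page 250/Type/Annot/Subtype/Highlight/Contents(EXAMPLE OF TEXT TO BE EXTRACTED)/F 4/Subj(Highlight)>>' \
    'endobj' > "$sample"
extract_highlights "$sample"
rm -f "$sample"
```

Run as written, this should print the page label and the extracted text separated by a tab, and nothing for the obj and endobj lines.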

What's left, and what gets printed, is the two wanted pieces of data separated by a tab.

I know this doesn't cover all the aspects you mentioned; however, your other requirements are not clear enough. Is it just one file you need to process, or multiple files? In either case, do you want all the extracted data in a single file, and how exactly should the data be sorted?

So if you could clarify things I could probably write a bash script to cover it.

Obviously, with the awk program I've provided, you can just redirect the output to an outfile and continue to process it with the sort command. awk can do sorting too; however, getting the target data in one command line was what I could offer with awk at this point.
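
As a rough sketch of that follow-up step: once the awk output has been redirected to an outfile, sort can order it. Sorting numerically by page number is purely an assumption here, since the desired order has not been specified, and sort_by_page is a hypothetical helper name. The demonstration uses made-up lines in the same Page-number, tab, text format.

```shell
# sort_by_page is a hypothetical helper name. It expects lines such as
# "Page 250<TAB>text" and orders them numerically by the page number,
# which is the second blank-separated field.
sort_by_page() {
    sort -k2,2n "$@"
}

# Made-up lines standing in for redirected awk output.
outfile=$(mktemp)
printf 'Page 250\tthird\nPage 9\tfirst\nPage 31\tsecond\n' > "$outfile"
sort_by_page "$outfile"
rm -f "$outfile"
```

The n modifier matters: without it, sort compares the page numbers as text and would place Page 250 before Page 31.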

added 663 characters in body

edited Sep 4, 2016 at 7:09

user3439894

I added your example content to a disk file named file and then showed its contents using the cat command. Then, using awk on file, I stripped out the parts you're calling tags and printed the two pieces of data in a tab-delimited format. Is this what you're looking for?

$ cat file
1731 0 obj
<</Page 250/Type/Annot/Subtype/Highlight/Rotate 0/Rect[ 95.4715 347.644 337.068 362.041]/NM(929cd95c-f962-4fa3-b734-2e0e67d7b321)/T(iPad)/CreationDate(D:20160818145053Z00'00')/M(D:20160818145204Z00'00')/C[ 0.454902 0.501961 0.988235]/CA 1/QuadPoints[ 95.4715 362.041 337.068 362.041 95.4715 347.644 337.068 347.644]/Contents(EXAMPLE OF TEXT TO BE EXTRACTED)/F 4/Subj(Highlight)>>
endobj
$ awk '{sub(/\<\<\//, "")};{sub(/\/Type.*\/Contents\(/, "\t")};{sub(/\)\/F.*$/, "")};/Page [0-9]/{print}' file
Page 250    EXAMPLE OF TEXT TO BE EXTRACTED
$

What the awk program is doing:

/Page [0-9]/ selects lines that contain 'Page', a space, and then a digit, e.g., Page 250. I'm assuming that EXAMPLE OF TEXT TO BE EXTRACTED will not contain that pattern. I don't think it matters, and the code can easily be modified to accommodate it if it does.

sub(/\<\<\//, "") strips the leading <</

sub(/\/Type.*\/Contents\(/, "\t") replaces everything from /Type through /Contents( with a tab.

sub(/\)\/F.*$/, "") strips everything from )/F to the end of the line.

What's left, and what gets printed, is the two wanted pieces of data separated by a tab.

I know this doesn't cover all the aspects you mentioned; however, your other requirements are not clear enough. Is it just one file you need to process, or multiple files? In either case, do you want all the extracted data in a single file, and how exactly should the data be sorted?

So if you could clarify things I could probably write a bash script to cover it.

Obviously, with the awk program I've provided, you can just redirect the output to an outfile and continue to process it with the sort command. awk can do sorting too; however, I'm an awk noob, so getting the target data in one command line was what I could offer with awk at this point.

Post Migrated Here from apple.stackexchange.com (revisions)

occurred Sep 4, 2016 at 7:02

answered Sep 4, 2016 at 6:42

user3439894

I added your example content to a disk file named file and then showed its contents using the cat command. Then, using awk on file, I stripped out the parts you're calling tags and printed the two pieces of data in a tab-delimited format. Is this what you're looking for?

$ cat file
1731 0 obj
<</Page 250/Type/Annot/Subtype/Highlight/Rotate 0/Rect[ 95.4715 347.644 337.068 362.041]/NM(929cd95c-f962-4fa3-b734-2e0e67d7b321)/T(iPad)/CreationDate(D:20160818145053Z00'00')/M(D:20160818145204Z00'00')/C[ 0.454902 0.501961 0.988235]/CA 1/QuadPoints[ 95.4715 362.041 337.068 362.041 95.4715 347.644 337.068 347.644]/Contents(EXAMPLE OF TEXT TO BE EXTRACTED)/F 4/Subj(Highlight)>>
endobj
$ awk '{sub(/\<\<\//, "")};{sub(/\/Type.*\/Contents\(/, "\t")};{sub(/\)\/F.*$/, "")};/Page [0-9]/{print}' file
Page 250    EXAMPLE OF TEXT TO BE EXTRACTED
$

I know this doesn't cover all the aspects you mentioned; however, your other requirements are not clear enough. Is it just one file you need to process, or multiple files? In either case, do you want all the extracted data in a single file, and how exactly should the data be sorted?

So if you could clarify things I could probably write a bash script to cover it.

Obviously, with the awk program I've provided, you can just redirect the output to an outfile and continue to process it with the sort command. awk can do sorting too; however, I'm an awk noob, so getting the target data in one command line was what I could offer with it at this point.
