Rui F Ribeiro

I added your example content to a file named file and then showed its contents using the cat command. Then, using awk on file, I essentially stripped out the parts you're calling tags and printed the two pieces of data in a tab-delimited format. Is this what you're looking for?

$ cat file
1731 0 obj
<</Page 250/Type/Annot/Subtype/Highlight/Rotate 0/Rect[ 95.4715 347.644 337.068 362.041]/NM(929cd95c-f962-4fa3-b734-2e0e67d7b321)/T(iPad)/CreationDate(D:20160818145053Z00'00')/M(D:20160818145204Z00'00')/C[ 0.454902 0.501961 0.988235]/CA 1/QuadPoints[ 95.4715 362.041 337.068 362.041 95.4715 347.644 337.068 347.644]/Contents(EXAMPLE OF TEXT TO BE EXTRACTED)/F 4/Subj(Highlight)>>
endobj
$ awk '{sub(/\<\<\//, "")};{sub(/\/Type.*\/Contents\(/, "\t")};{sub(/\)\/F.*$/, "")};/Page [0-9]/{print}' file
Page 250    EXAMPLE OF TEXT TO BE EXTRACTED
$

What the awk program is doing:

  • /Page [0-9]/ selects lines that contain 'Page', a space, and a digit, e.g., Page 250. I'm assuming that EXAMPLE OF TEXT TO BE EXTRACTED will not contain that pattern. I don't think it matters, and the code can easily be modified to accommodate it.

  • sub(/\<\<\//, "") strips the leading <</. The backslashes before < are not needed with the awk I used; in GNU awk, \< is a word-boundary operator, so write the pattern as /<<\// there.

  • sub(/\/Type.*\/Contents\(/, "\t") replaces everything from /Type through /Contents( with a tab.

  • sub(/\)\/F.*$/, "") strips everything from )/F to the end of the line.

What's left, and what gets printed, is the two wanted pieces of data separated by a tab.
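The steps above can also be laid out one per line. This is only a sketch under a few assumptions: sample.txt and extract_highlights are hypothetical names, the sample line is a shortened version of the annotation, and the first pattern is written as /<<\// without backslashes before <. Unlike the one-liner, it runs the substitutions only on lines that match /Page [0-9]/.

```shell
# Hypothetical sample input: a shortened version of the annotation object.
cat > sample.txt <<'EOF'
1731 0 obj
<</Page 250/Type/Annot/Subtype/Highlight/Contents(EXAMPLE OF TEXT TO BE EXTRACTED)/F 4/Subj(Highlight)>>
endobj
EOF

# Hypothetical helper: prints "Page N<TAB>text" for each annotation line
# found in the files given as arguments.
extract_highlights() {
  awk '
    /Page [0-9]/ {
      sub(/<<\//, "")                      # drop the leading "<</"
      sub(/\/Type.*\/Contents\(/, "\t")    # collapse "/Type ... /Contents(" to a tab
      sub(/\)\/F.*$/, "")                  # drop ")/F" through the end of the line
      print
    }
  ' "$@"
}

extract_highlights sample.txt
```

Run against the sample, this should print Page 250, a tab, and EXAMPLE OF TEXT TO BE EXTRACTED.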

I know this doesn't cover all the aspects you mentioned; however, your other requirements are not clear enough. Is it just one file you need to process, or multiple files? In either case, do you want all the extracted data in a single file, and how exactly should the data be sorted?

So if you could clarify things, I could probably write a bash script to cover it.

Obviously, with the awk program I've provided, you can just redirect the output to an outfile and continue to process it with the sort command. awk can do sorting too; however, getting the target data in one command line was what I could offer with awk at this point.
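As a rough sketch of that redirect-then-sort step, assuming you want the results from several files ordered by page number: a.txt, b.txt, outfile and sorted are hypothetical names, the two input lines are made up, and the first pattern is written as /<<\// so it does not depend on how a given awk treats \<.

```shell
# Hypothetical inputs: two files, each holding one made-up annotation line.
printf '%s\n' '<</Page 250/Type/Annot/Contents(LATER NOTE)/F 4>>' > a.txt
printf '%s\n' '<</Page 12/Type/Annot/Contents(EARLIER NOTE)/F 4>>' > b.txt

# awk reads every file named on the command line; collect all results in one file.
awk '{sub(/<<\//, "")};{sub(/\/Type.*\/Contents\(/, "\t")};{sub(/\)\/F.*$/, "")};/Page [0-9]/{print}' a.txt b.txt > outfile

# Sort numerically on the second blank-separated field (the page number).
sort -k2,2n outfile > sorted
cat sorted
```

With these two inputs, the Page 12 line should come out before the Page 250 line. Other orderings (by text, by file) would need a different sort key.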

Post Migrated Here from apple.stackexchange.com (revisions)