0

We extract data from a database to a flat file, we use a pipe for column delimiter and double quotes for text delimiter.

Data file looks Something like this:

"164829" | "collection 1" | "wood plank 2" x 4" long" | "23.5" 
"2017"|"S"|"221318"|"WE"|"20170118"|"Someones name"|"20170215"|"1785"|"009"|"20170215"|"182339"|"99536"|"00090"|"LOCAL00"|"930N"|"2017"|"6100"|"0000880"|1.000|0.000|"EA"|" "|" "|" "|" "|"005"|"00000000"|" "|" "|"1785"|"50228"|"R"|"2017"|"NMT CAUTION| 5" X 3" NAT ON BLK"|" "|" "|"USD"|"7444"|" "|"000"|"COIN"|"04"|35.00|"00"

What command/script could find every case where the data (not including the text delimiter) can be escaped with a second double quote?

The end result should look like this:

"164829" | "collection 1" | "wood plank 2"" x 4"" long" | "23.5"
"2017"|"S"|"221318"|"WE"|"20170118"|"Someones name"|"20170215"|"1785"|"009"|"20170215"|"182339"|"99536"|"00090"|"LOCAL00"|"930N"|"2017"|"6100"|"0000880"|1.000|0.000|"EA"|" "|" "|" "|" "|"005"|"00000000"|" "|" "|"1785"|"50228"|"R"|"2017"|"NMT CAUTION| 5"" X 3"" NAT ON BLK"|" "|" "|"USD"|"7444"|" "|"000"|"COIN"|"04"|35.00|"00"
11
  • What do you mean by escaping the data? Why not just use the pipe by itself as a delimiter as that would separate the text into each column just as a space, tab, or comma would? You are complicating things where it isn't necessary by using double quotes when you already have the pipe as a delimiter. If it's just text within a file then you don't need to escape it unless you are using a command such as sed to alter the file and need to work with the text on the command line. If that's the case, you can just escape it with backslashes or single quotes. Commented Apr 4, 2018 at 2:48
  • The pipe can be found within data as well ie “2433” | “Concatenate the fields with | in your summary” | “5’6” tall” Commented Apr 4, 2018 at 3:43
  • Hence we have to use column and text delimiters. I apolologize if my statement was t clear when I said escaping the data... I meant to say escaping the data which includes “ within the data. Commented Apr 4, 2018 at 3:45
  • We’re after a way to escape the text delimiter which in our this case is “ (our only choice for text delimiter) that is found in the data ie 5’6” Commented Apr 4, 2018 at 3:47
  • The pipe is already a delimiter for both of them. By its very definition, it separates all of the text into different columns. I can understand why you need double quotes to specify height, width, and length but it really isn't necessary for the rest. Even when using double quotes to specify height, width, and length, the pipe acts as a delimiter for each field no matter what value or characters are contained inside which makes the other double quotes unnecessary and redundant. It's only a suggestion on my part as it would make your database a lot easier to manage. Commented Apr 4, 2018 at 11:52

4 Answers 4

0

This is a little tricky with just regular expressions, but it can be done in several steps. Here's the perl script that I used for this (sed can't be used because I used lookaheads):

perl -pe 's/(?:(?:\||^)[ ]*"(.*?)\"[ ]*(?=\||$))/~~\1~~/gm;s/"/""/g;s/~~(.*?)~~/~~"\1"~~/g;s/~~~~/|/g;s/~~//g' inputfile.txt

(use perl -pi -e if you want to edit the input file)

The script does the following steps:

  1. Look for everything that either is inside of |"{...}"|, (start of line)"{...}"|, or |"{...}"(end of line), ignoring spaces outside of the text. Replace the outside bits with ~~ (I used something that is known not to be inside of the text)
  2. Replace all remaining quotation marks with doubled-up ones
  3. Replace all internal ~~{...}~~ sequences with ~~"{...}"~~
  4. Replace all ~~~~ sequences (which will all be internal) to |
  5. Remove all remaining ~~ sequences (which will be at the beginning and end of the lines)

Running each step given the following test text:

"164829" | "collection 1" | "wood plank 2" x 4" long" | "23.5"
"939485"|"collect "name""|"more items with | " and ""|"294.5""

We get the following output after each step:

$ perl -pe 's/(?:(?:\||^)[ ]*"(.*?)\"[ ]*(?=\||$))/~~\1~~/gm;' testinput.txt                       
~~164829~~~~collection 1~~~~wood plank 2" x 4" long~~~~23.5~~
~~939485~~~~collect "name"~~~~more items with | " and "~~~~294.5"~~

$ perl -pe 's/(?:(?:\||^)[ ]*"(.*?)\"[ ]*(?=\||$))/~~\1~~/gm;s/"/""/g;' testinput.txt
~~164829~~~~collection 1~~~~wood plank 2"" x 4"" long~~~~23.5~~
~~939485~~~~collect ""name""~~~~more items with | "" and ""~~~~294.5""~~

$ perl -pe 's/(?:(?:\||^)[ ]*"(.*?)\"[ ]*(?=\||$))/~~\1~~/gm;s/"/""/g;s/~~(.*?)~~/~~"\1"~~/g;' testinput.txt
~~"164829"~~~~"collection 1"~~~~"wood plank 2"" x 4"" long"~~~~"23.5"~~
~~"939485"~~~~"collect ""name"""~~~~"more items with | "" and """~~~~"294.5"""~~

$ perl -pe 's/(?:(?:\||^)[ ]*"(.*?)\"[ ]*(?=\||$))/~~\1~~/gm;s/"/""/g;s/~~(.*?)~~/~~"\1"~~/g;s/~~~~/|/g;' testinput.txt
~~"164829"|"collection 1"|"wood plank 2"" x 4"" long"|"23.5"~~
~~"939485"|"collect ""name"""|"more items with | "" and """|"294.5"""~~

$ perl -pe 's/(?:(?:\||^)[ ]*"(.*?)\"[ ]*(?=\||$))/~~\1~~/gm;s/"/""/g;s/~~(.*?)~~/~~"\1"~~/g;s/~~~~/|/g;s/~~//g' testpipe.txt
"164829"|"collection 1"|"wood plank 2"" x 4"" long"|"23.5"
"939485"|"collect ""name"""|"more items with | "" and """|"294.5"""
0

Your double quote are not real double quote (“ vs ").
With real double quote " , You can try this sed (assume there is no @ in your data)

sed 's/" | "/@/g;s/"/""/g;s/^"//;s/"$//;s/@/" | "/g' infile
0

Bahrat from this forum posted an answer that was very close to resolving the issue, somehow it was removed from this post (code below). The only issue is it cannot handle a pipe | within a string (ie "25"|"to concatenate data a | must be used"|"addendum 1"|20|"last entry") He was supposed to post today but the thread was deleted ??

awk -v FS='|' -v OFS='|' '{for(i=1;i<=NF;i++){gsub(/"/,"\"\"",$i);sub(/"/,"",$i);sub(/"[^"]*$/,"",$i)}print}' myfile > myfile3
0

Input -

 "2017"|"S"|"221318"|"WE"|"20170118"|"Someones name"|"20170215"|"1785"|"009"|"20170215"|"182339"|"99536"|"00090"|"LOCAL00"|"930N"|"2017"|"6100"|"0000880"|1.000|0.000|"EA"|" "|" "|" "|" "|"005"|"00000000"|" "|" "|"1785"|"50228"|"R"|"2017"|"NMT CAUTION| 5" X 3" NAT ON BLK"|" "|" "|"USD"|"7444"|" "|"000"|"COIN"|"04"|35.00|"00" 

CMD:

  awk -v RS='[[:blank:]]*[[:blank:]]*[|][[:blank:]]*|[[:blank:]]*[\n][[:blank:]]*' '{ if ($0 !~ /(^"([^"]|"")*"$)/) { gsub(/\"/,"\"\"");sub(/^"/,"");sub(/"$/,"") } printf "%s%s", $0, RT}' file.txt

Output -

 "2017"|"S"|"221318"|"WE"|"20170118"|"Someones name"|"20170215"|"1785"|"009"|"20170215"|"182339"|"99536"|"00090"|"LOCAL00"|"930N"|"2017"|"6100"|"0000880"|1.000|0.000|"EA"|" "|" "|" "|" "|"005"|"00000000"|" "|" "|"1785"|"50228"|"R"|"2017"|"NMT CAUTION| 5"" X 3"" NAT ON BLK"|" "|" "|"USD"|"7444"|" "|"000"|"COIN"|"04"|35.00|"00" 

10
  • Moved follow-up discussion to chat. Commented Apr 4, 2018 at 20:50
  • Bharat, it's working great for escaping pipes within text except for some unexpected quotes, here is a before: "2017"|"S"|"221318"|"WE"|"20170118"|"Someones name"|"20170215"|"1785"|"009"|"20170215"|"182339"|"99536"|"00090"|"LOCAL00"|"930N"|"2017"|"6100"|"0000880"|1.000|0.000|"EA"|" "|" "|" "|" "|"005"|"00000000"|" "|" "|"1785"|"50228"|"R"|"2017"|"NMT CAUTION| 5" X 3" NAT ON BLK"|" "|" "|"USD"|"7444"|" "|"000"|"COIN"|"04"|35.00|"00" Commented Apr 9, 2018 at 14:36
  • Here' s the after: "2017"|"S"|"221318"|"WE"|"20170118"|"Someones name"|"20170215"|"1785"|"009"|"20170215"|"182339"|"99536"|00090"|"LOCAL00"|"930N"|"2017"|"6100"|"00000880""|1.000|0.000|""EA"|" "|" "|" "|" "|"005" |"00000000"|" "|" "|"1785"|"50228"|"R"|"2017"|"NMT CAUTION| 5"" X 3"" NAT ON BLK"|" "|" "|"USD"|"7444"|" "|"000"| COIN"|"04""|35.00|""00" Commented Apr 9, 2018 at 14:36
  • Please notice the missing quote on |00090"| Commented Apr 9, 2018 at 14:36
  • please notice extra quote on |""EA"| and towards the end of the string at "04""|35.00|""00" Commented Apr 9, 2018 at 14:37

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.