
I have a Garmin Nuvi which uses OpenStreetMap maps. Garmin does supply postcodes, but they are usually 2-3 years out of date for Scotland. OSM does not include British postcodes, but the Post Office does, and the data can be downloaded for free. The file is just under 1 GB and has 16 columns, of which I only want the first 3.

I used cut to remove the extraneous columns, so I now have Postcode, Latitude and Longitude. Unfortunately the POI file has to be Latitude, Longitude and Postcode, i.e. column 1 has to become column 3. To add to the problem, the postcode must be in quotes, e.g. EH9 1QG and SW12 1AB have to become "EH9 1QG" and "SW12 1AB".

I used awk rather awkwardly (see what I did there?) with:

awk 'BEGIN {FS="\t"; OFS=","} {print $2, $3, $1}' pc0.csv > pc.csv

and all it did was add 2 empty columns to the front.
It would be nice to be able to work on it in a spreadsheet, but there are over 3 million rows.
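For reference, the two leading commas can be reproduced on a single line (assuming pc0.csv is comma-separated, as in the sample further down): with FS="\t" on a line containing no tabs, awk treats the whole line as $1, so $2 and $3 are empty.

```shell
# With FS="\t" and no tabs in the input, $1 is the whole line and
# $2/$3 are empty, so two leading commas appear in the output.
printf 'AB1 0AA,57.101474,-2.242851\n' |
  awk 'BEGIN {FS="\t"; OFS=","} {print $2, $3, $1}'
# prints: ,,AB1 0AA,57.101474,-2.242851
```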

Any ideas?


This is what I get from the output of cut - pc0.csv.
The awk command gives the same but with two commas at the front of each row to give 2 empty columns.

Postcode Latitude Longitude
AB1 0AA,57.101474,-2.242851  
AB1 0AB,57.102554,-2.246308  
AB1 0AD,57.100556,-2.248342  
AB1 0AE,57.084444,-2.255708  
AB1 0AF,57.096656,-2.258102  
AB1 0AG,57.097085,-2.267513  
AB1 0AJ,57.099011,-2.252854  
AB1 0AL,57.101765,-2.254688

So using the "cut" file above, which is now only 73MB, I need to convert it to:

Latitude,Longitude,Postcode
57.101474,-2.242851,"AB1 0AA"  
57.102554,-2.246308,"AB1 0AB"  
57.100556,-2.248342,"AB1 0AD"  
57.084444,-2.255708,"AB1 0AE"  
57.096656,-2.258102,"AB1 0AF"  
57.097085,-2.267513,"AB1 0AG"  
57.099011,-2.252854,"AB1 0AJ"  
57.101765,-2.254688,"AB1 0AL"

Note that I had to remove the tabs to display these lines, so that is another problem: the file may contain only commas and nothing else, not even spaces unless they are inside quotes.
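As a sketch of how to verify that requirement (pc1.csv is an assumed name for the converted file), stray tabs or spaces outside the quoted postcode can be flagged with grep:

```shell
# Hypothetical sanity checks on the converted file; each grep prints
# offending lines, if any.
grep -n "$(printf '\t')" pc1.csv   # any line still containing a tab
grep -nE '^[^"]* ' pc1.csv         # a space appearing before the first quote
```

The second pattern works because on a valid line every space sits after the opening quote of the postcode field.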

P.S. Linux (Ubuntu Mate) 22.04 LTS

  • edit your post to include a sample of the input data and how you want it to look. Just a couple of lines is enough, as long as they're representative of the data. Commented Aug 29, 2024 at 19:57
  • also update the question to show the (wrong) output generated by your code. Commented Aug 29, 2024 at 20:18
  • using this dummy record - SW12_1AB long1 lat1 (tabs between columns) - your script generates - long1,lat1,SW12_1AB - when the file has unix line endings (\n); when the file has windows line endings (\r\n) your script generates - SW12_1AB1; I'm guessing your file may have windows line endings in which case you'll either want to remove them (eg, dos2unix pc0.csv) or modify the awk script to factor in the trailing \r character; providing us with sample inputs and (expected, wrong) outputs will help us to determine the issue(s) and how to proceed Commented Aug 29, 2024 at 20:25
  • Several solutions suggest themselves, using perl, awk, or even bash, but without a fuller description of your system, and a short example of your input (are there TABs?) and the desired output, I can't pick a solution. Commented Aug 29, 2024 at 20:42
  • Please add to your post some columns of the input file. In your first awk command you write {FS="\t"; OFS=","}, so the input file is not a CSV, it's a TSV. Am I right? Commented Aug 30, 2024 at 13:53

3 Answers


This, using any awk, will convert the first sample file you show to the second sample file you show:

$ awk '
    BEGIN { FS=OFS="," }
    { sub(/ +$/, "") }
    NR == 1 { gsub(/ /, FS) }
    NR > 1  { $1 = "\"" $1 "\"" }
    { print $2, $3, $1 }
' file

e.g. given this input:

$ cat file
Postcode Latitude Longitude
AB1 0AA,57.101474,-2.242851
AB1 0AB,57.102554,-2.246308
AB1 0AD,57.100556,-2.248342
AB1 0AE,57.084444,-2.255708
AB1 0AF,57.096656,-2.258102
AB1 0AG,57.097085,-2.267513
AB1 0AJ,57.099011,-2.252854
AB1 0AL,57.101765,-2.254688

it produces this output:

$ awk 'BEGIN{FS=OFS=","} {sub(/ +$/,"")} NR==1{gsub(/ /,FS)} NR>1{$1="\""$1"\""} {print $2, $3, $1}' file
Latitude,Longitude,Postcode
57.101474,-2.242851,"AB1 0AA"
57.102554,-2.246308,"AB1 0AB"
57.100556,-2.248342,"AB1 0AD"
57.084444,-2.255708,"AB1 0AE"
57.096656,-2.258102,"AB1 0AF"
57.097085,-2.267513,"AB1 0AG"
57.099011,-2.252854,"AB1 0AJ"
57.101765,-2.254688,"AB1 0AL"

If that's not what you need then clarify your question.

  • Huh, looking at this made me notice something: the trailing spaces you're removing, which I thought were just an artifact of the OP's example, may actually be the "tabs" I now notice they mention at the end of the question. +1 as this is the correct solution given the proposed input; however, the OP should clarify this. Commented Aug 30, 2024 at 5:12
  • 1
    Thank you. awk 'BEGIN{FS=OFS=","} {sub(/ +$/,"")} NR==1{gsub(/ /,FS)} NR>1{$1="\""$1"\""} {print $2, $3, $1}' pc0.csv > pc1.csv worked perfectly. Commented Aug 30, 2024 at 17:54

This might work for your Ubuntu box:

gawk -v OFS=, '{print $2,$3,q $1 q}NR==1{FS=","; q="\""}' pc0.csv

The idea is to reset FS="," and specify the quotes after the first line is processed.
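For instance, running the same one-liner on a two-line sample shows the effect: the header is split on the default whitespace FS, and the comma FS and quote character only take hold from the second record onward.

```shell
# The header record is split on default whitespace; FS="," and q="\""
# are set after it is printed, so they apply to the data records.
printf 'Postcode Latitude Longitude\nAB1 0AA,57.101474,-2.242851\n' |
  awk -v OFS=, '{print $2,$3,q $1 q}NR==1{FS=","; q="\""}'
# prints:
# Latitude,Longitude,Postcode
# 57.101474,-2.242851,"AB1 0AA"
```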

  • Nice use of the first-line delay in setting FS and q. FYI you don't need gawk or Ubuntu for that, it'll behave the same way in any awk in any shell on every Unix box. Commented Aug 31, 2024 at 12:39

The DuckDB CLI is a great tool for running this kind of task on large files.

If, for example, the input file is this TSV

Postcode    Latitude    Longitude   Lorem   Ipsum
AB1 0AA 57.101474   -2.242851   e   2
AB1 0AB 57.102554   -2.246308   r   4
AB1 0AD 57.100556   -2.248342   g   5
AB1 0AE 57.084444   -2.255708   r   7
AB1 0AF 57.096656   -2.258102   x   4
AB1 0AG 57.097085   -2.267513   c   3
AB1 0AJ 57.099011   -2.252854   g   2
AB1 0AL 57.101765   -2.254688   g   5

you can run

duckdb --csv -c "select #2,#3,#1 from read_csv('input.tsv',delim='\t', header=True)"

to get this CSV output

Latitude,Longitude,Postcode
57.101474,-2.242851,"AB1 0AA"
57.102554,-2.246308,"AB1 0AB"
57.100556,-2.248342,"AB1 0AD"
57.084444,-2.255708,"AB1 0AE"
57.096656,-2.258102,"AB1 0AF"
57.097085,-2.267513,"AB1 0AG"
57.099011,-2.252854,"AB1 0AJ"
57.101765,-2.254688,"AB1 0AL"
  • Thank you. All done. It was eventually a waste of time as GPSBabel wasn't able to load such a large file, and even if it did, apparently I would have to scroll down the 3.7 million postcodes as the "Favourites" page on the Garmin does not allow search. Back to square 1... Commented Sep 30, 2024 at 17:38
  • @user256787 ok, all done. But what tool did you use? Commented Sep 30, 2024 at 21:02
  • I used the awk answer awk 'BEGIN{FS=OFS=","} {sub(/ +$/,"")} NR==1{gsub(/ /,FS)} NR>1{$1="\""$1"\""} {print $2, $3, $1}' pc0.csv > pc1.csv Commented Oct 1, 2024 at 23:44
