
I have a Garmin Nuvi which uses OpenStreetMap maps. Garmin does supply postcodes, but they are usually 2-3 years out of date for Scotland. OSM does not include British postcodes, but the Post Office does, and the data can be downloaded for free. The file is just under 1 GB and has 16 columns, of which I only want the first 3.

I used cut to remove the extraneous columns, so I now have Postcode, Latitude and Longitude. Unfortunately the POI file has to be Latitude, Longitude and Postcode, i.e. column 1 has to become column 3. To add to the problem, the postcode must be in quotes, e.g. EH9 1QG and SW12 1AB have to become "EH9 1QG" and "SW12 1AB".

I used awk rather awkwardly (see what I did there?) with:

awk 'BEGIN {FS="\t"; OFS=","} {print $2, $3, $1}' pc0.csv > pc.csv

and all it did was add 2 empty columns to the front.
It would be nice to be able to work on it in a spreadsheet, but there are over 3 million rows.
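For reference, the two leading commas can be reproduced on a single line (assuming pc0.csv is comma-separated, as in the sample further down): with FS="\t" on a line containing no tabs, awk treats the whole line as $1, so $2 and $3 are empty.

```shell
# With FS="\t" and no tabs in the input, $1 is the whole line and
# $2/$3 are empty, so two leading commas appear in the output.
printf 'AB1 0AA,57.101474,-2.242851\n' |
  awk 'BEGIN {FS="\t"; OFS=","} {print $2, $3, $1}'
# prints: ,,AB1 0AA,57.101474,-2.242851
```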

Any ideas?


This is what I get from the output of cut - pc0.csv.
The awk command gives the same but with two commas at the front of each row to give 2 empty columns.

Postcode Latitude Longitude
AB1 0AA,57.101474,-2.242851  
AB1 0AB,57.102554,-2.246308  
AB1 0AD,57.100556,-2.248342  
AB1 0AE,57.084444,-2.255708  
AB1 0AF,57.096656,-2.258102  
AB1 0AG,57.097085,-2.267513  
AB1 0AJ,57.099011,-2.252854  
AB1 0AL,57.101765,-2.254688

So using the "cut" file above, which is now only 73MB, I need to convert it to:

Latitude,Longitude,Postcode
57.101474,-2.242851,"AB1 0AA"  
57.102554,-2.246308,"AB1 0AB"  
57.100556,-2.248342,"AB1 0AD"  
57.084444,-2.255708,"AB1 0AE"  
57.096656,-2.258102,"AB1 0AF"  
57.097085,-2.267513,"AB1 0AG"  
57.099011,-2.252854,"AB1 0AJ"  
57.101765,-2.254688,"AB1 0AL"

Note that I had to remove the tabs to display these lines, so that is another problem: the file may contain only commas and nothing else, not even spaces unless they are inside quotes.
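As a sketch of how to verify that requirement (pc1.csv is an assumed name for the converted file), stray tabs or spaces outside the quoted postcode can be flagged with grep:

```shell
# Hypothetical sanity checks on the converted file; each grep prints
# offending lines, if any.
grep -n "$(printf '\t')" pc1.csv   # any line still containing a tab
grep -nE '^[^"]* ' pc1.csv         # a space appearing before the first quote
```

The second pattern works because on a valid line every space sits after the opening quote of the postcode field.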

P.S. Linux (Ubuntu Mate) 22.04 LTS

  • edit your post to include a sample of the input data and how you want it to look. Just a couple of lines is enough, as long as they're representative of the data. Commented Aug 29, 2024 at 19:57
  • also update the question to show the (wrong) output generated by your code. Commented Aug 29, 2024 at 20:18
  • using this dummy record - SW12_1AB long1 lat1 (tabs between columns) - your script generates - long1,lat1,SW12_1AB - when the file has unix line endings (\n); when the file has windows line endings (\r\n) your script generates - SW12_1AB1; I'm guessing your file may have windows line endings in which case you'll either want to remove them (eg, dos2unix pc0.csv) or modify the awk script to factor in the trailing \r character; providing us with sample inputs and (expected, wrong) outputs will help us to determine the issue(s) and how to proceed Commented Aug 29, 2024 at 20:25
  • Several solutions suggest themselves, using perl, awk, or even bash, but without a fuller description of your system, and a short example of your input (are there TABs?) and the desired output, I can't pick a solution. Commented Aug 29, 2024 at 20:42
  • Please add to your post some columns of the input file. In your first awk command you write {FS="\t"; OFS=","}, so the input file is not a CSV, it's a TSV. Am I right? Commented Aug 30, 2024 at 13:53

3 Answers


This, using any awk, will convert the first sample file you show to the second sample file you show:

$ awk '
    BEGIN { FS=OFS="," }
    { sub(/ +$/, "") }
    NR == 1 { gsub(/ /, FS) }
    NR > 1  { $1 = "\"" $1 "\"" }
    { print $2, $3, $1 }
' file

e.g. given this input:

$ cat file
Postcode Latitude Longitude
AB1 0AA,57.101474,-2.242851
AB1 0AB,57.102554,-2.246308
AB1 0AD,57.100556,-2.248342
AB1 0AE,57.084444,-2.255708
AB1 0AF,57.096656,-2.258102
AB1 0AG,57.097085,-2.267513
AB1 0AJ,57.099011,-2.252854
AB1 0AL,57.101765,-2.254688

it produces this output:

$ awk 'BEGIN{FS=OFS=","} {sub(/ +$/,"")} NR==1{gsub(/ /,FS)} NR>1{$1="\""$1"\""} {print $2, $3, $1}' file
Latitude,Longitude,Postcode
57.101474,-2.242851,"AB1 0AA"
57.102554,-2.246308,"AB1 0AB"
57.100556,-2.248342,"AB1 0AD"
57.084444,-2.255708,"AB1 0AE"
57.096656,-2.258102,"AB1 0AF"
57.097085,-2.267513,"AB1 0AG"
57.099011,-2.252854,"AB1 0AJ"
57.101765,-2.254688,"AB1 0AL"

If that's not what you need then clarify your question.

  • Huh, looking at this made me notice something: the trailing spaces you're removing, which I thought were just an artifact of the OP's example, may actually be the "tabs" I now notice they mention at the end of the question. +1 as this is the correct solution given the proposed input; however, the OP should clarify this. Commented Aug 30, 2024 at 5:12
  • 1
    Thank you. awk 'BEGIN{FS=OFS=","} {sub(/ +$/,"")} NR==1{gsub(/ /,FS)} NR>1{$1="\""$1"\""} {print $2, $3, $1}' pc0.csv > pc1.csv worked perfectly. Commented Aug 30, 2024 at 17:54

This might work for your Ubuntu box:

gawk -v OFS=, '{print $2,$3,q $1 q}NR==1{FS=","; q="\""}' pc0.csv

The idea is to reset FS="," and specify the quotes after the first line is processed.
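For instance, running the same one-liner on a two-line sample shows the effect: the header is split on the default whitespace FS, and the comma FS and quote character only take hold from the second record onward.

```shell
# The header record is split on default whitespace; FS="," and q="\""
# are set after it is printed, so they apply to the data records.
printf 'Postcode Latitude Longitude\nAB1 0AA,57.101474,-2.242851\n' |
  awk -v OFS=, '{print $2,$3,q $1 q}NR==1{FS=","; q="\""}'
# prints:
# Latitude,Longitude,Postcode
# 57.101474,-2.242851,"AB1 0AA"
```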

  • Nice use of the first-line delay in setting FS and q. FYI you don't need gawk or Ubuntu for that, it'll behave the same way in any awk in any shell on every Unix box. Commented Aug 31, 2024 at 12:39

The DuckDB CLI is a great tool for running this kind of task on large files.

If, for example, the input file is this TSV

Postcode    Latitude    Longitude   Lorem   Ipsum
AB1 0AA 57.101474   -2.242851   e   2
AB1 0AB 57.102554   -2.246308   r   4
AB1 0AD 57.100556   -2.248342   g   5
AB1 0AE 57.084444   -2.255708   r   7
AB1 0AF 57.096656   -2.258102   x   4
AB1 0AG 57.097085   -2.267513   c   3
AB1 0AJ 57.099011   -2.252854   g   2
AB1 0AL 57.101765   -2.254688   g   5

you can run

duckdb --csv -c "select #2,#3,#1 from read_csv('input.tsv',delim='\t', header=True)"

to get this CSV output

Latitude,Longitude,Postcode
57.101474,-2.242851,"AB1 0AA"
57.102554,-2.246308,"AB1 0AB"
57.100556,-2.248342,"AB1 0AD"
57.084444,-2.255708,"AB1 0AE"
57.096656,-2.258102,"AB1 0AF"
57.097085,-2.267513,"AB1 0AG"
57.099011,-2.252854,"AB1 0AJ"
57.101765,-2.254688,"AB1 0AL"
  • Thank you. All done. It was eventually a waste of time as GPSBabel wasn't able to load such a large file, and even if it did, apparently I would have to scroll down the 3.7 million postcodes as the "Favourites" page on the Garmin does not allow search. Back to square 1... Commented Sep 30, 2024 at 17:38
  • @user256787 ok, all done. But what tool did you use? Commented Sep 30, 2024 at 21:02
  • I used the awk answer awk 'BEGIN{FS=OFS=","} {sub(/ +$/,"")} NR==1{gsub(/ /,FS)} NR>1{$1="\""$1"\""} {print $2, $3, $1}' pc0.csv > pc1.csv Commented Oct 1, 2024 at 23:44
