How to split one column into two based on a certain character?

Question

I'm looking to split one column in a dataset into 2 columns, whilst still preserving all other columns/data in the dataset.

For example, my data looks like this (... represents many more columns; the dataset is very large):

Gene   qval    ...  Chromosome Position
ACE    0.3748  ...    1:234689650
NOS    0.2     ...    2:374896578
BRCA   0.345   ...    12:897655323

I want to split the Chromosome Position column at the : so that it becomes:

Gene   qval    ...   Chromosome    Position
ACE    0.3748  ...    1            234689650
NOS    0.2     ...    2            374896578
BRCA   0.345   ...    12           897655323

What I've tried so far either doesn't create a new column or disrupts the rest of the dataset so that columns end up jumbled or out of place. In one case the Chromosome column appeared with just its chromosome number, but the larger Position number (the second column I'm trying to create) disappeared.

For example I've tried 3 ways:

awk 'sub(/\:/," "){$1=$1}1' OFS="\t" file1.txt > file2.txt #displaces columns and removes position column

tr ':' $' ' < file1.txt > file2.txt  #removes : but doesn't divide into 2 columns

sed 's/:/ /g' < file1.txt > file2.txt  #removes : but doesn't divide into 2 columns

I've tried code like this based on similar questions, but most of those questions want to cut one column in half and move the bottom half into a new column, rather than split one column at a dividing character.

My data is in a tab-delimited file. I'm new to Linux, so I may be wrong, but do my sed and tr commands also need something that tells them to treat the space newly placed between the numbers as a tab delimiter, so that the two numbers are considered separate columns?
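
One way to check this hunch is sketched below. It uses a single made-up row modelled on the sample data (not the real file) and counts tab-separated fields after replacing the colon with a space, then after replacing it with a tab. The variable names are illustrative only.

```shell
# One tab-delimited row modelled on the sample data (hypothetical, not the real file).
row=$(printf 'ACE\t0.3748\t1:234689650')

# Replace the colon with a space (as in the sed attempt) and with a tab.
with_space=$(printf '%s\n' "$row" | sed 's/:/ /')
with_tab=$(printf '%s\n' "$row" | tr ':' '\t')

# Count fields using a tab as the only field separator.
fields_space=$(printf '%s\n' "$with_space" | awk -F'\t' '{ print NF }')
fields_tab=$(printf '%s\n' "$with_tab" | awk -F'\t' '{ print NF }')

echo "space: $fields_space fields, tab: $fields_tab fields"
```

With a space, a tool that splits on tabs still sees "1 234689650" as one field; only a tab creates a new column.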

Is the Chromosome Position field the only one that may contain a : ?
– steeldriver, Commented Feb 3, 2020 at 14:05

aborruso · Accepted Answer · 2020-02-03 14:14:22Z

Using Miller (https://github.com/johnkerl/miller) and running

mlr --tsv nest --explode --values --across-fields --nested-fs ":" -f "Chromosome Position" \
then rename "Chromosome Position_1",Chromosome,"Chromosome Position_2",Position input.tsv >output.tsv

output.tsv will contain the following (shown here as a formatted table):

+------+--------+------------+-----------+
| Gene | qval   | Chromosome | Position  |
+------+--------+------------+-----------+
| ACE  | 0.3748 | 1          | 234689650 |
| NOS  | 0.2    | 2          | 374896578 |
| BRCA | 0.345  | 12         | 897655323 |
+------+--------+------------+-----------+

glenn jackman · Accepted Answer · 2020-02-03 14:05:05Z

If your data is tab-delimited, then replace colon with tab:

tr : $'\t' < file

That uses bash's ANSI-C quoting, in which $'\t' expands to a literal tab character.
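
A minimal sketch of this approach on two made-up lines modelled on the question's data (the variable names are illustrative). It assumes, as the comment under the question asks, that no other column contains a colon, because tr replaces every colon in the input. It also assumes the header holds "Chromosome Position" as a single tab-delimited field, which tr leaves untouched. Since tr interprets the \t escape itself, the sketch does not depend on bash.

```shell
# Header plus one data row, tab-delimited (hypothetical sample, not the real file).
sample=$(printf 'Gene\tqval\tChromosome Position\nACE\t0.3748\t1:234689650')

# Replace every colon with a tab; tr understands the \t escape on its own.
result=$(printf '%s\n' "$sample" | tr ':' '\t')
printf '%s\n' "$result"

# Compare the number of tab-separated fields in the header and in the data row.
header_fields=$(printf '%s\n' "$result" | awk -F'\t' 'NR == 1 { print NF }')
data_fields=$(printf '%s\n' "$result" | awk -F'\t' 'NR == 2 { print NF }')
echo "header: $header_fields fields, data: $data_fields fields"
```

Under these assumptions the data rows gain a column but the header does not, so the header line may need a separate fix.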

RudiC · Accepted Answer · 2020-02-03 14:43:44Z

What's wrong with your first approach (except that you replace the colon with a space instead of a <TAB>)? Try an adaptation:

awk '{sub (/:/, OFS)} 1' OFS="\t" file
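
If other columns might contain a colon, or the header needs splitting too, a variant along these lines could restrict the substitution to one column. This is only a sketch under stated assumptions: the header stores "Chromosome Position" as a single tab-delimited field, the sample rows are made up from the question's data, and the names sample, result and col are illustrative.

```shell
# Tab-delimited sample modelled on the question (hypothetical, not the real file).
sample=$(printf 'Gene\tqval\tChromosome Position\nACE\t0.3748\t1:234689650\nBRCA\t0.345\t12:897655323')

result=$(printf '%s\n' "$sample" | awk 'BEGIN { FS = OFS = "\t" }
NR == 1 {
    # Find the target column by its header text (assumed to be a single field).
    for (i = 1; i <= NF; i++) if ($i == "Chromosome Position") col = i
    # Split the header into two names so it lines up with the data rows.
    if (col) $col = "Chromosome" OFS "Position"
    print
    next
}
col { sub(/:/, OFS, $col) }   # replace only the first colon, only in that column
{ print }')

printf '%s\n' "$result"
```

To run it on a file, the awk program could be given the file name as an argument in place of the printf pipeline.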
