The challenge here is that the data is not a simple CSV-style file, where the first line holds the column names and the remaining lines hold the column data row by row.
Here you have column_name=column_data pairs, delimited by ; characters. My solution would be to use a language like Python to read the file in line by line. I would create a dict() from each line, with a K:V pair for each field. Then I would append that dict to a list() of all the lines.
Once I had that, I could process the list. If I'm on the first entry, I'll print the column names and then the values; otherwise I'll only print the values.
I think the method would be similar whatever language you're using, but it's definitely doable.
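To make the per-line step concrete, here is a minimal sketch of turning one line into a dict. The helper name parse_line and the sample values are made up for illustration, and it assumes Python 3.7+, where a plain dict keeps insertion order.

```python
def parse_line(line):
    """Turn 'A=1;B=2' into {'A': '1', 'B': '2'} (hypothetical helper)."""
    record = {}
    for field in line.strip().split(';'):
        if '=' in field:
            # Split on the first '=' only, so a value may itself contain '='.
            name, value = field.split('=', 1)
            record[name] = value
    return record

# Sample line invented for illustration; real data will differ.
rows = [parse_line('AC=3;AC_AFR=0;AC_AMR=1')]
print(rows[0])
```

Fields without an '=' are silently skipped, which matches the approach described above.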
Here's a quick example in Python that uses OrderedDicts to preserve "column" order:
#!/usr/bin/env python3
''' a quick example of a script to parse '=' delimited fields in
    ';' delimited columns of a text file.
    prints tab delimited columnar data with headers to STDOUT
'''
from collections import OrderedDict

DATA = []
with open('data') as infile:
    # read one record per line; splitting the whole file on whitespace
    # would break any value that contains a space
    for line in infile:
        if not line.strip():
            continue  # skip blank lines
        d = OrderedDict()
        for field in line.strip().split(';'):
            if '=' in field:
                # split on the first '=' only, so values may contain '='
                col, value = field.split('=', 1)
                d[col] = value
        DATA.append(d)

for L, D in enumerate(DATA):
    # print the column names once, before the first row of values
    if L == 0:
        print('\t'.join(D.keys()))
    print('\t'.join(D.values()))
- This example assumes that all your lines will have the same columns, because it will only print the col_names for the first entry it gets out of the list.
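If the lines do not all share the same columns, one possible variation is to collect every column name first and leave missing values empty. The sketch below uses csv.DictWriter from the standard library with its restval parameter; the function name write_table and the sample records are assumptions for illustration only.

```python
import csv
import sys

def write_table(records, out=sys.stdout):
    """Write dicts as tab-separated rows, leaving missing fields empty."""
    # Collect column names in first-seen order across all records.
    columns = []
    for record in records:
        for name in record:
            if name not in columns:
                columns.append(name)
    writer = csv.DictWriter(out, fieldnames=columns, delimiter='\t',
                            restval='', lineterminator='\n')
    writer.writeheader()
    writer.writerows(records)

# Invented sample records; the second one lacks AC_AFR.
write_table([{'AC': '3', 'AC_AFR': '0'}, {'AC': '5', 'AC_AMR': '2'}])
```

Each row then has the same number of tab-separated cells, so the columns stay aligned even when a field is absent from a line.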
A shell one-liner also works: echo -e "AC\tAC_AFR\tAC_AMR\tAC_Adj\tAC_EAS\tAC_FIN\tAC_Het" && sed -e 's/[A-Za-z _=]*//g' datacolum | sed -e 's/;/\t/g', but it requires that all the lines in your original file have the same fields, and it also strips any letters that appear inside the values. Since I read in your comments that some lines lack fields that are present in others, I have edited your question to include that relevant info.