script to translate xml to json

Question

I have 5000 questions in a text file, like this:

<quiz>
        <que>The question her</que>
        <ca>text</ca>
        <ia>text</ia>
        <ia>text</ia>
        <ia>text</ia>
    </quiz>

I want to write a script on Ubuntu that converts every question to this form:

  {
   "text":"The question her",
   "answer1":"text",
   "answer2":"text",
   "answer3":"text",
   "answer4":"text"
  },

This is better done in Python, Perl, etc. than in pure bash, so the OS is irrelevant; it is more of a Stack Overflow question.
– Denis, Commented Mar 7, 2019 at 10:11
What kind of assumptions can you make about the input? Are questions and answers always on separate lines? Are the <quiz> and </quiz> tags always on their own line? Could the text have some encoding (CDATA, {, &eacute;...)? Could it contain double quotes or backslashes? Is the XML file encoded in UTF-8?
– Stéphane Chazelas, Commented Mar 7, 2019 at 10:14
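
The escaping concerns raised in the comment above can be illustrated with a minimal Python sketch. The sample string is hypothetical (it is not taken from the asker's file); it assumes well-formed XML and shows that a real XML parser decodes character references while json.dumps escapes double quotes and backslashes on output.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample: a numeric character reference, a double quote,
# a backslash and an escaped ampersand inside the question text.
sample = '<quiz><que>Caf&#233; "A\\B" &amp; more</que><ca>x</ca></quiz>'

root = ET.fromstring(sample)
question = root.find('que').text  # the parser decodes &#233; and &amp;

# json.dumps escapes the quote and the backslash; ensure_ascii=False
# keeps the accented character readable instead of emitting \u00e9.
encoded = json.dumps({"text": question}, ensure_ascii=False)
print(encoded)
```

A named HTML entity such as &eacute; is not predefined in XML, so a parser would normally reject it unless the document declares it; that case would need separate handling.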

Dmitry · Accepted Answer · 2019-03-07 15:20:59Z

In fact, you can do this without any Python programming, using just two Unix utilities:

jtm - that one allows xml <-> json lossless conversion
jtc - that one allows manipulating JSONs

Thus, assuming your xml is in file.xml, jtm would convert it to the following json:

bash $ jtm file.xml 
[
   {
      "quiz": [
         {
            "que": "The question her"
         },
         {
            "ca": "text"
         },
         {
            "ia": "text"
         },
         {
            "ia": "text"
         },
         {
            "ia": "text"
         }
      ]
   }
]
bash $

and then, by applying a series of JSON transformations, you can arrive at the desired result:

bash $ jtm file.xml | jtc -w'<quiz>l:[1:][-2]' -ei echo { '"answer[-]"': {} }\; -i'<quiz>l:[1:]' | jtc -w'<quiz>l:[-1][:][0]' -w'<quiz>l:[-1][:]' -s | jtc -w'<quiz>l:' -w'<quiz>l:[0]' -s | jtc -w'<quiz>l: <>v' -u'"text"'
[
   {
      "answer1": "text",
      "answer2": "text",
      "answer3": "text",
      "answer4": "text",
      "text": "The question her"
   }
]
bash $

However, because of the shell scripting involved (the echo command), it will be slower than the Python solution: for 5000 questions, I'd expect it to run for around a minute. (In a future version of jtc I plan to allow interpolation even in statically specified JSON, so that templating needs no external shell scripting; the operations will then be blazing fast.)

If you're curious about the jtc syntax, you can find a user guide here: https://github.com/ldn-softdev/jtc/blob/master/User%20Guide.md

Your solution looks really good. I didn't have time to check your concern regarding performance. But I'm curious: isn't it possible to replace the multiple echo invocations with piping through sed?
– Victor Yarema, Commented Mar 27, 2020 at 13:04

Kusalananda · Accepted Answer · 2021-07-12 10:30:25Z

The xq tool from https://kislyuk.github.io/yq/ turns your XML into

{
  "quiz": {
    "que": "The question her",
    "ca": "text",
    "ia": [
      "text",
      "text",
      "text"
    ]
  }
}

by just using the identity filter (xq . file.xml).

We can massage this into something closer to the form you want using

xq '.quiz | { text: .que, answers: .ia }' file.xml

which outputs

{
  "text": "The question her",
  "answers": [
    "text",
    "text",
    "text"
  ]
}

To fix up the answers bit so that you get your enumerated keys:

xq '.quiz |
    { text: .que } +
    (
        [
            range(.ia|length) as $i | { key: "answer\($i+1)", value: .ia[$i] }
        ] | from_entries
    )' file.xml

This adds the enumerated answer keys with the values from the ia nodes by iterating over the ia nodes and manually producing a set of keys and values. These are then turned into real key-value pairs using from_entries and added to the proto-object we've created ({ text: .que }).

The output:

{
  "text": "The question her",
  "answer1": "text",
  "answer2": "text",
  "answer3": "text"
}
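
For comparison, the enumeration step can be sketched in Python. This is only an illustration, not part of the xq solution: the quiz dictionary below is a hand-written stand-in for the parsed structure that xq prints, and enumerate(..., start=1) plays the role of range(.ia|length) together with $i+1.

```python
import json

# Hand-written stand-in for the structure shown in the xq output above.
quiz = {"que": "The question her", "ca": "text", "ia": ["text", "text", "text"]}

# Start from the proto-object, then add one numbered key per ia value.
result = {"text": quiz["que"]}
result.update(
    {f"answer{i}": answer for i, answer in enumerate(quiz["ia"], start=1)}
)

print(json.dumps(result, indent=2))
```

Enumerating [quiz["ca"]] + quiz["ia"] instead would also include the ca value and give four numbered answers, as in the format requested in the question.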

If your XML document contains multiple quiz nodes under some root node, then change .quiz in the jq expression above to .[].quiz[] to transform each of them. You may also want to collect the resulting objects into a single array, which is done by wrapping the whole expression in [ ... ]:

xq '[ .[].quiz[] |
    { text: .que } +
    (
        [
            range(.ia|length) as $i | { key: "answer\($i+1)", value: .ia[$i] }
        ] | from_entries
    ) ]' file.xml

Denis · Accepted Answer · 2019-03-07 11:03:18Z

I assume your Ubuntu has Python installed.

#!/usr/bin/python3
import io
import json
import xml.etree.ElementTree

d = """<quiz>
        <que>The question her</que>
        <ca>text</ca>
        <ia>text</ia>
        <ia>text</ia>
        <ia>text</ia>
    </quiz>
"""

s = io.StringIO(d)
# root = xml.etree.ElementTree.parse("filename_here").getroot()
root = xml.etree.ElementTree.parse(s).getroot()
out = {}
i = 1
for child in root:
    name, value = child.tag, child.text
    if name == 'que':
        name = 'question'
    else:
        name = 'answer%s' % i
        i += 1
    out[name] = value

print(json.dumps(out))

Save it and make it executable with chmod. You can easily modify it to take a file as input instead of the inline text.

EDIT: Okay, this is a more complete script:

#!/usr/bin/python3
import json
import sys
import xml.etree.ElementTree


def read_file(filename):
    root = xml.etree.ElementTree.parse(filename).getroot()
    return root


# assume we have a list of <quiz>, contained in some other element
def parse_quiz(quiz_element, out):
    i = 1
    tmp = {}
    for child in quiz_element:

        name, value = child.tag, child.text
        if name == 'que':
            name = 'question'
        else:
            name = 'answer%s' % i
            i += 1
        tmp[name] = value
    out.append(tmp)


def parse_root(root_element, out):
    for child in root_element:
        if child.tag == 'quiz':
            parse_quiz(child, out)


def convert_xml_to_json(filename):
    root = read_file(filename)
    out = []
    parse_root(root, out)
    print(json.dumps(out))


if __name__ == '__main__':
    if len(sys.argv) > 1:
        convert_xml_to_json(sys.argv[1])
    else:
        print("Usage: script <filename_with_xml>")

I made a file with the following content, which I named xmltest:

<questions>
    <quiz>
        <que>The question her</que>
        <ca>text</ca>
        <ia>text</ia>
        <ia>text</ia>
        <ia>text</ia>
    </quiz>
     <quiz>
            <que>Question number 1</que>
            <ca>blabla</ca>
            <ia>stuff</ia>
    </quiz>
</questions>

So you have a list of <quiz> elements inside some other container.

Now, I launch it like this: $ chmod u+x scratch4.py, then ./scratch4.py xmltest

This gives me the answer:

$ ./scratch4.py xmltest
[{"answer3": "text", "answer2": "text", "question": "The question her", "answer4": "text", "answer1": "text"}, {"answer2": "stuff", "question": "Question number 1", "answer1": "blabla"}]
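
The script above numbers the answers in document order, so the correct answer only becomes answer1 when <ca> comes first. A variant sketch that looks elements up by tag and uses the key text, as in the question, might look like the following. The helper name quiz_to_dict and the inline sample are illustrative assumptions; root.iter('quiz') is used so that a file whose root is a single <quiz> is handled too.

```python
import json
import xml.etree.ElementTree as ET


def quiz_to_dict(quiz):
    """Map one <quiz> element to the flat dict requested in the question."""
    # Correct answer first, then the incorrect ones in document order.
    answers = [quiz.findtext('ca')] + [ia.text for ia in quiz.findall('ia')]
    result = {'text': quiz.findtext('que')}
    for number, answer in enumerate(answers, start=1):
        result['answer%d' % number] = answer
    return result


# Inline sample (an assumption for the demo): <ca> deliberately placed last.
xml_text = """<questions>
  <quiz><que>Question number 1</que><ia>stuff</ia><ca>blabla</ca></quiz>
</questions>"""

root = ET.fromstring(xml_text)
# iter() also yields the root itself when the root element is a <quiz>.
converted = [quiz_to_dict(q) for q in root.iter('quiz')]
print(json.dumps(converted, indent=2))
```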

That only works if the input has just one <quiz>
– Stéphane Chazelas, Commented Mar 7, 2019 at 10:36

Fumisky Wells · Accepted Answer · 2022-03-24 02:02:17Z

Here is a Ruby answer.

Script xml2json.rb:

require 'nokogiri'
require 'json'
    
in_doc = File.open(ARGV[0]) do |f|
  Nokogiri::XML(f)
end

for quiz in in_doc.xpath('/contents/quiz') do
  result = {
    text:     quiz.xpath('que').text,
    answer1:  quiz.xpath('ca').text
  }
  quiz.xpath('ia').each_with_index do |ia, ix|
    result["answer#{ix + 2}"] = ia.text
  end
  print result.to_json, ",\n"
end

input quiz.xml:

<contents>
  <quiz>
    <que>The question her</que>
    <ca>text a1</ca>
    <ia>text a2</ia>
    <ia>text a3</ia>
    <ia>text a4</ia>
  </quiz>
  <quiz>
    <que>The question foo</que>
    <ca>text b1</ca>
    <ia>text b2</ia>
    <ia>text b3</ia>
    <ia>text b4</ia>
  </quiz>
</contents>

execution:

$ ruby xml2json.rb quiz.xml

silver · Accepted Answer · 2019-03-07 16:11:40Z

Thank you dgan, but your code prints the output to the screen rather than to a JSON file, and it does not support UTF-8 encoding, so I changed it:

#!/usr/bin/python3
import json, codecs
import sys
import xml.etree.ElementTree


def read_file(filename):
    root = xml.etree.ElementTree.parse(filename).getroot()
    return root


# assume we have a list of <quiz>, contained in some other element
def parse_quiz(quiz_element, out):
    i = 1
    tmp = {}
    for child in quiz_element:

        name, value = child.tag, child.text
        if name == 'que':
            name = 'question'
        else:
            name = 'answer%s' % i
            i += 1
        tmp[name] = value
    out.append(tmp)


def parse_root(root_element, out):
    for child in root_element:
        if child.tag == 'quiz':
            parse_quiz(child, out)


def convert_xml_to_json(filename):
    root = read_file(filename)
    out = []
    parse_root(root, out)
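    # In Python 3, open the file in text mode with an explicit encoding;
    # wrapping a text-mode file in codecs.getwriter() would write bytes to it and fail.
    with open('data.json', 'w', encoding='utf-8') as outfile:
        json.dump(out, outfile, sort_keys=True, ensure_ascii=False)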


if __name__ == '__main__':
    if len(sys.argv) > 1:
        convert_xml_to_json(sys.argv[1])
    else:
        print("Usage: script <filename_with_xml>")
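
As a small illustration of the encoding point, here is a hedged sketch of writing non-ASCII text as UTF-8 JSON in Python 3. The sample data and the temporary directory are assumptions made for the demo; the script above writes to data.json in the current directory instead.

```python
import json
import os
import tempfile

# Hypothetical sample containing non-ASCII characters.
data = [{"text": "Où est le café ?", "answer1": "ici"}]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "data.json")
    # Text mode plus an explicit encoding; ensure_ascii=False keeps the
    # accented characters as-is instead of \uXXXX escapes.
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(data, outfile, sort_keys=True, ensure_ascii=False, indent=2)
    # Read the raw bytes back to confirm they are valid UTF-8.
    with open(path, "rb") as infile:
        raw_bytes = infile.read()

text = raw_bytes.decode("utf-8")
print(text)
```

Without ensure_ascii=False the file would still be valid JSON, but every non-ASCII character would appear as a \uXXXX escape.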


MUY Belgium · Accepted Answer · 2023-02-08 11:55:21Z

Simply use Perl. You can replace q{-} (STDIN) with q{my_xml_filepath}.

cat some.xml | perl -MXML::Simple -MJSON -e 'print encode_json XMLin q{-}'

Can't locate XML/Simple.pm in @INC...
– Richard Barraclough, Commented Mar 17, 2025 at 18:15
Then you can possibly install a package that provides it, such as perl-XML-Simple on RHEL.
– MUY Belgium, Commented Mar 18, 2025 at 9:32
