2

I have 5000 questions in a text file like this:

<quiz>
        <que>The question her</que>
        <ca>text</ca>
        <ia>text</ia>
        <ia>text</ia>
        <ia>text</ia>
    </quiz>

I want to write a script in Ubuntu to convert all questions like this:

  {
   "text":"The question her",
   "answer1":"text",
   "answer2":"text",
   "answer3":"text",
   "answer4":"text"
  },
9
  • This is better done in Python, Perl, etc. than in pure bash, so the OS is irrelevant; it's more of a Stack Overflow question. Commented Mar 7, 2019 at 10:11
  • Is this XML to JSON? Commented Mar 7, 2019 at 10:13
  • yes it is - Ark Commented Mar 7, 2019 at 10:14
  • What kind of assumptions can you make about the input? Are questions and answers always on separate lines? Are the <quiz> and </quiz> tags always on a line of their own? Could the text have some encoding (CDATA, &#123;, &eacute;...)? Could it contain double quotes or backslashes? Is the XML file encoded in UTF-8? Commented Mar 7, 2019 at 10:14
  • encoding="UTF-8" - all Questions yes - Stephane. Commented Mar 7, 2019 at 10:16

6 Answers

6

In fact, you could get away without any Python programming here, using just two Unix utilities:

  1. jtm - allows lossless XML <-> JSON conversion
  2. jtc - allows manipulating JSON

Thus, assuming your xml is in file.xml, jtm would convert it to the following json:

bash $ jtm file.xml 
[
   {
      "quiz": [
         {
            "que": "The question her"
         },
         {
            "ca": "text"
         },
         {
            "ia": "text"
         },
         {
            "ia": "text"
         },
         {
            "ia": "text"
         }
      ]
   }
]
bash $ 

and then, by applying a series of JSON transformations, you can arrive at the desired result:

bash $ jtm file.xml | jtc -w'<quiz>l:[1:][-2]' -ei echo { '"answer[-]"': {} }\; -i'<quiz>l:[1:]' | jtc -w'<quiz>l:[-1][:][0]' -w'<quiz>l:[-1][:]' -s | jtc -w'<quiz>l:' -w'<quiz>l:[0]' -s | jtc -w'<quiz>l: <>v' -u'"text"'
[
   {
      "answer1": "text",
      "answer2": "text",
      "answer3": "text",
      "answer4": "text",
      "text": "The question her"
   }
]
bash $ 

However, because of the shell scripting involved (the echo command), it will be slower than a Python solution: for 5000 questions, I'd expect it to run for around a minute. (In a future version of jtc I plan to allow interpolation even in statically specified JSONs, so that templating requires no external shell scripting; the operations will then be much faster.)

If you're curious about the jtc syntax, you can find a user guide here: https://github.com/ldn-softdev/jtc/blob/master/User%20Guide.md
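For readers who would rather not decode the walk-paths, the net effect of the pipeline above can be sketched in Python. This only illustrates the transformation and says nothing about how jtc works internally; `flatten_quiz` is a hypothetical helper, and the input is assumed to have exactly the shape that jtm printed above.

```python
import json


def flatten_quiz(entries):
    """Merge jtm-style single-key objects into one flat question object."""
    result = {}
    answer_no = 0
    for entry in entries:
        # Each entry is assumed to be a single-key object: {tag: text}
        for tag, text in entry.items():
            if tag == "que":
                result["text"] = text
            else:
                # <ca> and <ia> are numbered in document order
                answer_no += 1
                result[f"answer{answer_no}"] = text
    return result


# Sample data copied from the jtm output shown above
jtm_output = [
    {
        "quiz": [
            {"que": "The question her"},
            {"ca": "text"},
            {"ia": "text"},
            {"ia": "text"},
            {"ia": "text"},
        ]
    }
]

converted = [flatten_quiz(item["quiz"]) for item in jtm_output]
print(json.dumps(converted, indent=3))
```

Unlike the jtc output, the keys here come out in insertion order ("text" first) rather than sorted.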

1
  • Your solution looks really good. I didn't have time to check your concern regarding performance. But I'm curious: isn't it possible to replace multiple echo invocations with piping through sed? Commented Mar 27, 2020 at 13:04
5

The xq tool from https://kislyuk.github.io/yq/ turns your XML into

{
  "quiz": {
    "que": "The question her",
    "ca": "text",
    "ia": [
      "text",
      "text",
      "text"
    ]
  }
}

by just using the identity filter (xq . file.xml).

We can massage this into something closer to the form you want using

xq '.quiz | { text: .que, answers: .ia }' file.xml

which outputs

{
  "text": "The question her",
  "answers": [
    "text",
    "text",
    "text"
  ]
}

To fix up the answers bit so that you get your enumerated keys:

xq '.quiz |
    { text: .que } +
    (
        [
            range(.ia|length) as $i | { key: "answer\($i+1)", value: .ia[$i] }
        ] | from_entries
    )' file.xml

This adds the enumerated answer keys with the values from the ia nodes by iterating over the ia nodes and manually producing a set of keys and values. These are then turned into real key-value pairs using from_entries and added to the proto-object we've created ({ text: .que }).

The output:

{
  "text": "The question her",
  "answer1": "text",
  "answer2": "text",
  "answer3": "text"
}
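If the jq expression is hard to read, the same three steps (build key/value entries, turn them into an object, merge with the question text) can be sketched in Python. The `doc` dictionary is assumed to match what `xq . file.xml` printed earlier; like the filter above, this sketch only numbers the ia nodes and leaves ca out.

```python
import json

# Assumed input: the structure printed by `xq . file.xml` above
doc = {
    "quiz": {
        "que": "The question her",
        "ca": "text",
        "ia": ["text", "text", "text"],
    }
}

quiz = doc["quiz"]

# Counterpart of: [ range(.ia|length) as $i | { key: ..., value: ... } ]
entries = [
    {"key": f"answer{i + 1}", "value": ia}
    for i, ia in enumerate(quiz["ia"])
]

# Counterpart of: from_entries
answers = {entry["key"]: entry["value"] for entry in entries}

# Counterpart of: { text: .que } + ( ... )
result = {"text": quiz["que"], **answers}

print(json.dumps(result, indent=2))
```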

If your XML document contains multiple quiz nodes under some root node, change .quiz in the jq expression above to .[].quiz[] to transform each of them. You may also want to collect the resulting objects into a single array, which is done by wrapping the whole expression in [ ... ]:

xq '[ .[].quiz[] |
    { text: .que } +
    (
        [
            range(.ia|length) as $i | { key: "answer\($i+1)", value: .ia[$i] }
        ] | from_entries
    ) ]' file.xml
3

I assume your Ubuntu has Python 3 installed.

#!/usr/bin/python3
import io
import json
import xml.etree.ElementTree

d = """<quiz>
        <que>The question her</que>
        <ca>text</ca>
        <ia>text</ia>
        <ia>text</ia>
        <ia>text</ia>
    </quiz>
"""

s = io.StringIO(d)
# root = xml.etree.ElementTree.parse("filename_here").getroot()
root = xml.etree.ElementTree.parse(s).getroot()
out = {}
i = 1
for child in root:
    name, value = child.tag, child.text
    if name == 'que':
        name = 'question'
    else:
        name = 'answer%s' % i
        i += 1
    out[name] = value

print(json.dumps(out))

Save it and make it executable with chmod. You can easily modify it to take a file as input instead of the inline text.

EDIT: Okay, this is a more complete script:

#!/usr/bin/python3
import json
import sys
import xml.etree.ElementTree


def read_file(filename):
    root = xml.etree.ElementTree.parse(filename).getroot()
    return root


# assume we have a list of <quiz>, contained in some other element
def parse_quiz(quiz_element, out):
    i = 1
    tmp = {}
    for child in quiz_element:

        name, value = child.tag, child.text
        if name == 'que':
            name = 'question'
        else:
            name = 'answer%s' % i
            i += 1
        tmp[name] = value
    out.append(tmp)


def parse_root(root_element, out):
    for child in root_element:
        if child.tag == 'quiz':
            parse_quiz(child, out)


def convert_xml_to_json(filename):
    root = read_file(filename)
    out = []
    parse_root(root, out)
    print(json.dumps(out))


if __name__ == '__main__':
    if len(sys.argv) > 1:
        convert_xml_to_json(sys.argv[1])
    else:
        print("Usage: script <filename_with_xml>")

I made a file with the following content, which I named xmltest:

<questions>
    <quiz>
        <que>The question her</que>
        <ca>text</ca>
        <ia>text</ia>
        <ia>text</ia>
        <ia>text</ia>
    </quiz>
     <quiz>
            <que>Question number 1</que>
            <ca>blabla</ca>
            <ia>stuff</ia>
    </quiz>
</questions>

So you have a list of quiz elements inside some other container.

Now I make the script executable with $ chmod u+x scratch.py and then run it as scratch.py filenamewithxml

This gives me the answer:

$ ./scratch4.py xmltest
[{"answer3": "text", "answer2": "text", "question": "The question her", "answer4": "text", "answer1": "text"}, {"answer2": "stuff", "question": "Question number 1", "answer1": "blabla"}]
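The script above names the question key "question", whereas the format asked for uses "text". A minimal variant that emits "text" might look like the sketch below. It assumes Python 3.7+ (so dictionary keys keep insertion order) and parses from a string for the sake of a self-contained demo; `quizzes_to_dicts` is a hypothetical name, not part of the script above.

```python
import json
import xml.etree.ElementTree as ET


def quizzes_to_dicts(xml_text):
    """Convert every <quiz> element in xml_text to a flat dict."""
    root = ET.fromstring(xml_text)
    out = []
    # iter() also yields the root itself when it matches,
    # so a document that is just one bare <quiz> works as well
    for quiz in root.iter("quiz"):
        item = {}
        n = 0
        for child in quiz:
            if child.tag == "que":
                item["text"] = child.text
            else:
                n += 1
                item[f"answer{n}"] = child.text
        out.append(item)
    return out


sample = """<questions>
    <quiz>
        <que>The question her</que>
        <ca>text</ca>
        <ia>text</ia>
    </quiz>
    <quiz>
        <que>Question number 1</que>
        <ca>blabla</ca>
        <ia>stuff</ia>
    </quiz>
</questions>"""

print(json.dumps(quizzes_to_dicts(sample), indent=2))
```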
1
  • That only works if the input has just one <quiz> Commented Mar 7, 2019 at 10:36
1

Here is a Ruby answer.

script xml2json.rb:

require 'nokogiri'
require 'json'
    
in_doc = File.open(ARGV[0]) do |f|
  Nokogiri::XML(f)
end

for quiz in in_doc.xpath('/contents/quiz') do
  result = {
    text:     quiz.xpath('que').text,
    answer1:  quiz.xpath('ca').text
  }
  quiz.xpath('ia').each_with_index do |ia, ix|
    result["answer#{ix + 2}"] = ia.text
  end
  print result.to_json, ",\n"
end

input quiz.xml:

<contents>
  <quiz>
    <que>The question her</que>
    <ca>text a1</ca>
    <ia>text a2</ia>
    <ia>text a3</ia>
    <ia>text a4</ia>
  </quiz>
  <quiz>
    <que>The question foo</que>
    <ca>text b1</ca>
    <ia>text b2</ia>
    <ia>text b3</ia>
    <ia>text b4</ia>
  </quiz>
</contents>

execution:

$ ruby xml2json.rb quiz.xml
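Note that the script prints each object followed by a comma, so the output is a sequence of fragments rather than one JSON document. For comparison, here is a rough Python counterpart of the same tag-by-tag lookup that collects everything into a list and serializes it once. It assumes the <contents> root from the sample input; `convert` is a hypothetical name, and only the standard library is used.

```python
import json
import xml.etree.ElementTree as ET


def convert(xml_text):
    """Map each <quiz> to text/answer1..N, with <ca> always as answer1."""
    root = ET.fromstring(xml_text)  # root is assumed to be <contents>
    results = []
    for quiz in root.findall("quiz"):
        item = {
            "text": quiz.findtext("que"),
            "answer1": quiz.findtext("ca"),
        }
        # <ia> elements continue the numbering from 2
        for ix, ia in enumerate(quiz.findall("ia")):
            item[f"answer{ix + 2}"] = ia.text
        results.append(item)
    return results


quiz_xml = """<contents>
  <quiz>
    <que>The question her</que>
    <ca>text a1</ca>
    <ia>text a2</ia>
    <ia>text a3</ia>
    <ia>text a4</ia>
  </quiz>
</contents>"""

# One json.dumps call yields a single well-formed JSON array
print(json.dumps(convert(quiz_xml), indent=2))
```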

0

Thank you dgan, but your code prints the output to the screen rather than to a JSON file, and it does not support UTF-8 encoding, so I changed it:

#!/usr/bin/python3
import json, codecs
import sys
import xml.etree.ElementTree


def read_file(filename):
    root = xml.etree.ElementTree.parse(filename).getroot()
    return root


# assume we have a list of <quiz>, contained in some other element
def parse_quiz(quiz_element, out):
    i = 1
    tmp = {}
    for child in quiz_element:

        name, value = child.tag, child.text
        if name == 'que':
            name = 'question'
        else:
            name = 'answer%s' % i
            i += 1
        tmp[name] = value
    out.append(tmp)


def parse_root(root_element, out):
    for child in root_element:
        if child.tag == 'quiz':
            parse_quiz(child, out)


def convert_xml_to_json(filename):
    root = read_file(filename)
    out = []
    parse_root(root, out)
    # open in text mode with an explicit encoding; json.dump writes str, not bytes
    with open('data.json', 'w', encoding='utf-8') as outfile:
        json.dump(out, outfile, sort_keys=True, ensure_ascii=False)


if __name__ == '__main__':
    if len(sys.argv) > 1:
        convert_xml_to_json(sys.argv[1])
    else:
        print("Usage: script <filename_with_xml>")
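To check that non-ASCII text survives the trip to disk, a small round-trip sketch can help. It assumes Python 3; the sample strings and the temporary file location are made up for the demonstration and are not part of the script above.

```python
import json
import os
import tempfile

# Made-up sample data containing non-ASCII characters
data = [{"question": "Où est la bibliothèque ?", "answer1": "à gauche"}]

path = os.path.join(tempfile.mkdtemp(), "data.json")

# Text mode plus an explicit encoding; ensure_ascii=False keeps
# the characters as-is instead of writing \uXXXX escapes
with open(path, "w", encoding="utf-8") as outfile:
    json.dump(data, outfile, sort_keys=True, ensure_ascii=False)

# Read the raw bytes back to see what actually landed on disk
with open(path, "rb") as infile:
    raw = infile.read()

restored = json.loads(raw.decode("utf-8"))
print(restored == data)
```

With ensure_ascii left at its default of True the file would still be valid JSON, but every accented character would be written as an escape sequence.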


-1

Simply use Perl. You can replace q{-} (STDIN) with q{my_xml_filepath}.

cat some.xml | perl -MXML::Simple -MJSON -e 'print encode_json XMLin q{-}'
2
  • Can't locate XML/Simple.pm in @INC... Commented Mar 17 at 18:15
  • Then you probably need to install the corresponding package, for example perl-XML-Simple on RHEL. Commented Mar 18 at 9:32
