I feel like a basic use of a dataframe is looking up the equivalent of a 'key' to return a 'value', but I've been searching and trying things for days with no success. So I suspect I'm not trying the right things and would appreciate any help.
I've tried .to_dict() and couldn't figure out how to shape the values into something I could look up. It also seemed inefficient to make a dictionary of a dataframe made from XML. So I'm back to trying .loc[].
Python:
# -*- coding: utf-8 -*-
# Importing the required libraries
import pandas as pd
# Define thing lookup dataframe columns and rows
thing_cols = ["Thing Name", "Thing ID"]
thing_rows = []
# Append rows, create and index the dataframe
thing_rows.append({"Thing Name": "thing 1 name",
                   "Thing ID": "thing_1_id"})
thing_rows.append({"Thing Name": "thing 2 name",
                   "Thing ID": "thing_2_id"})
thing_df = pd.DataFrame(thing_rows, columns=thing_cols)
thing_df = thing_df.set_index(list(thing_df.keys())[0])
print(thing_df.loc["thing 1 name"])
Output:
Thing ID    thing_1_id
Name: thing 1 name, dtype: object
Desired output:
thing_1_id
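
For reference, a minimal sketch of getting the bare value rather than a one-row Series: pass both the row label and the column label to .loc (or .at, which only returns scalars). This assumes the "Thing Name" index is unique; the variable names one_id, same_id, joined and lookup are just illustrative. It also shows that a list of names can be looked up in one go, and that a single column converts to a plain dict if that is easier to work with.

```python
import pandas as pd

# Same two-row lookup table as above, indexed by "Thing Name"
thing_df = pd.DataFrame(
    [
        {"Thing Name": "thing 1 name", "Thing ID": "thing_1_id"},
        {"Thing Name": "thing 2 name", "Thing ID": "thing_2_id"},
    ]
).set_index("Thing Name")

# Row label AND column label together return a single value
one_id = thing_df.loc["thing 1 name", "Thing ID"]
print(one_id)  # thing_1_id

# .at is the scalar-only accessor and gives the same result
same_id = thing_df.at["thing 1 name", "Thing ID"]

# A list of row labels returns a Series of IDs, which can be joined
names = ["thing 1 name", "thing 2 name"]
joined = ", ".join(thing_df.loc[names, "Thing ID"])
print(joined)  # thing_1_id, thing_2_id

# One column converted to a plain {name: id} dict
lookup = thing_df["Thing ID"].to_dict()
print(lookup["thing 2 name"])  # thing_2_id
```

Note that .loc raises a KeyError if a name is missing from the index, so names coming from real data may need checking first.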
While the above focuses on just this one issue, below is a little bigger picture of what I'm trying to do in case you see a simpler or better way to get my related thing IDs out of the XML.
Ultimate desired output:
,Collection item,RELATED-THING-IDs
0,name of Item 1,"thing_1_id, thing_2_id"
Python:
# -*- coding: utf-8 -*-
# Importing the required libraries
import lxml.etree as Xet
import pandas as pd
# Define main collection columns and rows for dataframe
coll_cols = ["Collection item", "RELATED-THING-IDs"]
coll_rows = []
# Define thing lookup dataframe columns and rows
thing_cols = ["Thing Name", "Thing ID"]
thing_rows = []
# Parsing the XML file
xmlparse = Xet.parse('sample.xml')
root = xmlparse.getroot()
for row in root:
    # Create thing lookup dataframe
    if (row.findtext('type') == "THING"):
        thing_id = row.findtext("THING-ID")
        thing_name = row.findtext("name")
        thing_rows.append({"Thing Name": thing_name,
                           "Thing ID": thing_id})
        thing_df = pd.DataFrame(thing_rows, columns=thing_cols)
        thing_df = thing_df.set_index(list(thing_df.keys())[0])
    # Find only collection items
    if row.findtext('type') != "COLLECTION-ITEM":
        continue
    # Define values for collection item dataframe
    name = row.findtext("name", "Missing name")
    relat_thing_items = thing_df.loc[[row.xpath(
        "./RELATED-THING/result/row/name/text()")],["THING-ID"]]
    if len(relat_thing_items) > 0:
        relat_thing_id = ', '.join(relat_thing_items)
    else:
        relat_thing_id = ""
    coll_rows.append({"Collection item": name,
                      "RELATED-THING-IDs": relat_thing_id
                      })
coll_df = pd.DataFrame(coll_rows, columns=coll_cols)
# Writing dataframe to csv
coll_df.to_csv('output.csv')
XML:
<?xml version="1.0" encoding="UTF-8"?>
<result size="4321">
  <row>
    <type>CONTEXT</type>
    <name>collections</name>
  </row>
  <row>
    <type>COLLECTION-ITEM</type>
    <name>name of Item 1</name>
    <ITEM-ID>item_000001</ITEM-ID>
    <RELATED-THING>
      <result size="2">
        <row>
          <type>THING</type>
          <name>thing name 1</name>
          <no>1</no>
        </row>
        <row>
          <type>THING</type>
          <name>thing name 2</name>
          <no>1</no>
        </row>
      </result>
    </RELATED-THING>
  </row>
  <row>
    <type>THING</type>
    <name>thing name 1</name>
    <THING-ID>thing_000783</THING-ID>
  </row>
  <row>
    <type>THING</type>
    <name>thing name 2</name>
    <THING-ID>thing_000803</THING-ID>
  </row>
</result>
index = ["Thing Name"] is not the correct way to define an index.
... csv module. If you need to do some level of aggregation, or to filter based on values from the entirety of the data (such as a percentile), then importing to pandas makes sense. The post you link to does a good job of describing how to parse XML, and Karina's answer is an excellent demonstration of how to select values from a dataframe.
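
To make the csv module suggestion concrete, here is a rough sketch under a few assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml, inlines a trimmed copy of the sample XML as SAMPLE_XML so it runs on its own, and writes to an in-memory buffer rather than output.csv. The names thing_ids, related_names and related_ids are illustrative. It makes two passes over the rows, because in the sample the THING rows come after the collection item that refers to them, so a lookup built in the same pass would still be empty when it is needed.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed copy of the sample XML from the question (assumption: same structure)
SAMPLE_XML = """<result size="4321">
  <row>
    <type>COLLECTION-ITEM</type>
    <name>name of Item 1</name>
    <RELATED-THING>
      <result size="2">
        <row><type>THING</type><name>thing name 1</name></row>
        <row><type>THING</type><name>thing name 2</name></row>
      </result>
    </RELATED-THING>
  </row>
  <row>
    <type>THING</type>
    <name>thing name 1</name>
    <THING-ID>thing_000783</THING-ID>
  </row>
  <row>
    <type>THING</type>
    <name>thing name 2</name>
    <THING-ID>thing_000803</THING-ID>
  </row>
</result>"""

root = ET.fromstring(SAMPLE_XML)

# Pass 1: build a name -> ID dict from the top-level THING rows
thing_ids = {
    row.findtext("name"): row.findtext("THING-ID")
    for row in root
    if row.findtext("type") == "THING"
}

# Pass 2: resolve each collection item's related thing names to IDs
coll_rows = []
for row in root:
    if row.findtext("type") != "COLLECTION-ITEM":
        continue
    related_names = [
        el.text for el in row.findall("./RELATED-THING/result/row/name")
    ]
    # Names without a matching THING row are skipped
    related_ids = [thing_ids[n] for n in related_names if n in thing_ids]
    coll_rows.append({
        "Collection item": row.findtext("name", "Missing name"),
        "RELATED-THING-IDs": ", ".join(related_ids),
    })

# Write the rows as CSV; swap the buffer for open("output.csv", "w", newline="")
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Collection item", "RELATED-THING-IDs"])
writer.writeheader()
writer.writerows(coll_rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Unlike DataFrame.to_csv, this does not write a leading index column, so the 0 at the start of the desired output row would be absent.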