0

I feel like a basic dataframe use is looking up the equivalent of a 'key' to return a 'value', but I've been searching and trying so many things for days with no success. So I think I'm not trying the right things and would appreciate any help.

I've tried .to_dict() and couldn't figure out how to shape the values into something I could look up. It also seemed inefficient to make a dictionary of a dataframe made from XML. So I'm back to trying .loc[].

Python:

# -*- coding: utf-8 -*-

# Importing the required libraries
import pandas as pd

#Define thing lookup dataframe columns and rows
thing_cols = ["Thing Name", "Thing ID"]
thing_rows = []

# Append rows, create and index the dataframe
thing_rows.append({"Thing Name": "thing 1 name",
                   "Thing ID": "thing_1_id"})
thing_rows.append({"Thing Name": "thing 2 name",
                   "Thing ID": "thing_2_id"})
thing_df = pd.DataFrame(thing_rows, columns=thing_cols)
thing_df = thing_df.set_index(list(thing_df.keys())[0])
        
print(thing_df.loc["thing 1 name"])

Output:

Thing ID    thing_1_id
Name: thing 1 name, dtype: object

Desired output:

thing_1_id

While the above focuses on just this one issue, below is a little bigger picture of what I'm trying to do in case you see a simpler or better way to get my related thing IDs out of the XML.

Ultimate desired output:

,Collection item,RELATED-THING-IDs
0,name of Item 1,"thing_1_id, thing_2_id"

Python:

# -*- coding: utf-8 -*-

# Importing the required libraries
import lxml.etree as Xet
import pandas as pd

#Define main collection columns and rows for dataframe
coll_cols = ["Collection item", "RELATED-THING-IDs"]
coll_rows = []
#Define thing lookup dataframe columns and rows
thing_cols = ["Thing Name", "Thing ID"]
thing_rows = []

# Parsing the XML file
xmlparse = Xet.parse('sample.xml')
root = xmlparse.getroot()
for row in root:
    # Create thing lookup dataframe
    if (row.findtext('type') == "THING"):
        thing_id = row.findtext("THING-ID")
        thing_name = row.findtext("name")
        thing_rows.append({"Thing Name": thing_name,
                           "Thing ID": thing_id})
        thing_df = pd.DataFrame(thing_rows, columns=thing_cols)
        thing_df = thing_df.set_index(list(thing_df.keys())[0])
    # Find only collection items
    if row.findtext('type') != "COLLECTION-ITEM":
        continue
    # Define values for collection item dataframe
    name = row.findtext("name", "Missing name")
    relat_thing_items = thing_df.loc[[row.xpath(
        "./RELATED-THING/result/row/name/text()")],["THING-ID"]]
    if len(relat_thing_items) > 0:
        relat_thing_id = ', '.join(relat_thing_items)
    else:
        relat_thing_id = ""

    coll_rows.append({"Collection item": name,
                 "RELATED-THING-IDs": relat_thing_id
})

coll_df = pd.DataFrame(coll_rows, columns=coll_cols)

# Writing dataframe to csv
coll_df.to_csv('output.csv')

XML:

<?xml version="1.0" encoding="UTF-8"?>

<result size="4321">
  <row>
    <type>CONTEXT</type>
    <name>collections</name>
  </row>
  <row>
    <type>COLLECTION-ITEM</type>
    <name>name of Item 1</name>
    <ITEM-ID>item_000001</ITEM-ID>
    <RELATED-THING>
      <result size="2">
        <row>
          <type>THING</type>
          <name>thing name 1</name>
          <no>1</no>
        </row>
        <row>
          <type>THING</type>
          <name>thing name 2</name>
          <no>1</no>
        </row>
      </result>
    </RELATED-THING>
  </row>
  <row>
    <type>THING</type>
    <name>thing name 1</name>
    <THING-ID>thing_000783</THING-ID>
  </row>
  <row>
    <type>THING</type>
    <name>thing name 2</name>
    <THING-ID>thing_000803</THING-ID>
  </row>
</result>
6
  • I think I know how to Looking up values in dataframe , but I don't quiet understand your question. can you clarify with a simple example and some random data? Commented Sep 17, 2021 at 9:33
  • index = ["Thing Name"] is not the correct way to define index Commented Sep 17, 2021 at 9:35
  • Do you need a dataframe? Or do you just want to turn your XML into a dictionary? Commented Sep 18, 2021 at 4:26
  • I appreciate the question, and I don't know if the solution needs a dataframe. I need to get XML to CSV and am open to ways to get it there. This post (stackoverflow.com/questions/69098095/…) has more info on the whole project. I recognize signs that I may be taking a more complicated than necessary path, and I am open to simpler solutions. What information would help identify whether a dataframe is an efficient solution? Commented Sep 18, 2021 at 16:58
  • If you need to turn XML into CSV making only simple calculations that depend on constants or the current row then you don't need a dataframe, you should be able to parse the XML and write the CSV row by row using the csv module. If you need to do some level of aggregation or filter based on values from the entirety of the data (such as percentile) then importing to pandas makes sense. The post you link to does a good job of describing how to parse XML, and Karina's answer is an excellent demonstration of how to select values from a dataframe. Commented Sep 18, 2021 at 18:44

2 Answers 2

1

Create dataframe with customized index

A simplified version

# define columns
col = []
for i in range(10):
    col.append(f'col{i}')

# define rows
row = []
for i in range(20):
    row.append(f'row{i}')
    
# create dataframe
df = pd.DataFrame(index=row, columns=col)

# fill dataframe with datas
for i in range(len(col)):
    df[f'{col[i]}'] = np.random.rand(len(row))

# this is how it looks like
print(f'this is df with row as index\n{df}')

# define another column as index, [0] in the code line below means 'col0'
df = df.set_index(list(df.keys())[0])

# this is how it looks like
print(f'\nthis is df with col0 as index\n{df}')

Output:

this is df with row as index
           col0      col1      col2      col3      col4      col5      col6      col7      col8      col9
row0   0.767971  0.469264  0.755685  0.103732  0.615639  0.257989  0.573995  0.776728  0.447322  0.004053
row1   0.261831  0.471375  0.795415  0.205076  0.803030  0.281501  0.876232  0.629721  0.713122  0.485636
row2   0.509729  0.545062  0.834564  0.673355  0.729216  0.207496  0.564352  0.429800  0.277141  0.658879
row3   0.179367  0.233000  0.653200  0.953253  0.895205  0.848281  0.597804  0.187801  0.052220  0.940323
row4   0.390854  0.455636  0.587465  0.672617  0.996579  0.877511  0.536964  0.069080  0.396630  0.992556
row5   0.521829  0.136424  0.506181  0.674257  0.011684  0.885795  0.378833  0.545459  0.101595  0.368110
row6   0.316271  0.775069  0.999092  0.021254  0.369629  0.244191  0.496379  0.000377  0.593364  0.603435
row7   0.884000  0.927754  0.254314  0.904324  0.638690  0.740589  0.149640  0.751099  0.831088  0.092892
row8   0.992638  0.368222  0.341631  0.721588  0.410636  0.690506  0.155053  0.217033  0.945478  0.799279
row9   0.964336  0.443232  0.563750  0.420321  0.088734  0.197364  0.490591  0.850209  0.960568  0.974885
row10  0.183250  0.634418  0.150319  0.689473  0.835240  0.591784  0.783965  0.895664  0.025180  0.459332
row11  0.904488  0.695560  0.556103  0.883556  0.115751  0.321207  0.904646  0.028804  0.814442  0.267377
row12  0.930931  0.660349  0.821123  0.146736  0.539237  0.030024  0.252803  0.547877  0.463775  0.773859
row13  0.709280  0.666561  0.245371  0.726635  0.764280  0.817297  0.504265  0.105712  0.344707  0.592827
row14  0.450645  0.990164  0.811939  0.770661  0.640772  0.929582  0.541931  0.904480  0.195679  0.582443
row15  0.319354  0.193135  0.902301  0.508479  0.332627  0.383111  0.282576  0.326669  0.632635  0.319938
row16  0.831707  0.668559  0.734607  0.829461  0.906679  0.239202  0.648362  0.932863  0.882962  0.893804
row17  0.320345  0.208685  0.571314  0.529310  0.256773  0.569330  0.330747  0.207749  0.968724  0.323970
row18  0.809379  0.780139  0.527647  0.079957  0.762411  0.789256  0.158718  0.029256  0.812138  0.669460
row19  0.299014  0.960852  0.758356  0.222588  0.061735  0.304292  0.893981  0.945643  0.477456  0.807782

this is df with col0 as index
              col1      col2      col3      col4      col5      col6      col7      col8      col9
col0                                                                                              
0.767971  0.469264  0.755685  0.103732  0.615639  0.257989  0.573995  0.776728  0.447322  0.004053
0.261831  0.471375  0.795415  0.205076  0.803030  0.281501  0.876232  0.629721  0.713122  0.485636
0.509729  0.545062  0.834564  0.673355  0.729216  0.207496  0.564352  0.429800  0.277141  0.658879
0.179367  0.233000  0.653200  0.953253  0.895205  0.848281  0.597804  0.187801  0.052220  0.940323
0.390854  0.455636  0.587465  0.672617  0.996579  0.877511  0.536964  0.069080  0.396630  0.992556
0.521829  0.136424  0.506181  0.674257  0.011684  0.885795  0.378833  0.545459  0.101595  0.368110
0.316271  0.775069  0.999092  0.021254  0.369629  0.244191  0.496379  0.000377  0.593364  0.603435
0.884000  0.927754  0.254314  0.904324  0.638690  0.740589  0.149640  0.751099  0.831088  0.092892
0.992638  0.368222  0.341631  0.721588  0.410636  0.690506  0.155053  0.217033  0.945478  0.799279
0.964336  0.443232  0.563750  0.420321  0.088734  0.197364  0.490591  0.850209  0.960568  0.974885
0.183250  0.634418  0.150319  0.689473  0.835240  0.591784  0.783965  0.895664  0.025180  0.459332
0.904488  0.695560  0.556103  0.883556  0.115751  0.321207  0.904646  0.028804  0.814442  0.267377
0.930931  0.660349  0.821123  0.146736  0.539237  0.030024  0.252803  0.547877  0.463775  0.773859
0.709280  0.666561  0.245371  0.726635  0.764280  0.817297  0.504265  0.105712  0.344707  0.592827
0.450645  0.990164  0.811939  0.770661  0.640772  0.929582  0.541931  0.904480  0.195679  0.582443
0.319354  0.193135  0.902301  0.508479  0.332627  0.383111  0.282576  0.326669  0.632635  0.319938
0.831707  0.668559  0.734607  0.829461  0.906679  0.239202  0.648362  0.932863  0.882962  0.893804
0.320345  0.208685  0.571314  0.529310  0.256773  0.569330  0.330747  0.207749  0.968724  0.323970
0.809379  0.780139  0.527647  0.079957  0.762411  0.789256  0.158718  0.029256  0.812138  0.669460
0.299014  0.960852  0.758356  0.222588  0.061735  0.304292  0.893981  0.945643  0.477456  0.807782

Hope it helps you understanding more about pandas DataFrame

--- edit --- In df with row as index:

print(df['col1']['row0']) will give you 0.469264.

In df with 'col0' as index:

print(df['col1'][0.767971]) will give you 0.469264.

So basically it is df[column][row]. If your row name is string, put it as string, if it is float or int, put it as float or int.

--- edit ---

How to look up dataframe

I'll try to make it clearer with another example.

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c', 'd', 'e'],
                   'col2': ['f', 'g', 'h', 'i', 'j'],
                   'col3': ['k', 'l', 'm', 'n', 'o'],
                   'col4': ['p', 'q', 'r', 's', 't'],
                   'col5': ['u', 'v', 'w', 'x', 'y']})
print('1. df is like this :\n', df)
print('-----------------------------------')
print("2. df['col1'][0] =", df['col1'][0])
print('-----------------------------------')
print("3. df['col1'] :\n", df['col1'])
print('-----------------------------------')
print("4. list(df['col1']) =", list(df['col1']))
print('-----------------------------------')
print("5. df.loc[0] :\n", df.loc[0])
print('-----------------------------------')
print("6. df.loc[[0]] :\n", df.loc[[0]])
print('-----------------------------------')
print("7. list(df.loc[0]) = ", list(df.loc[0]))
print('-----------------------------------')
print("8. df.iloc[3, 3] =", df.iloc[3, 3])
print('-----------------------------------')
print("9. df.iloc[3, :] :\n", df.iloc[3, :])
print('-----------------------------------')
print("10. list(df.iloc[3, :]) =", list(df.iloc[3, :]))
print('-----------------------------------')
print("Bonus : *list(df.iloc[3, :]) =", *list(df.iloc[3, :]))

This is the output:

1. df is like this :
   col1 col2 col3 col4 col5
0    a    f    k    p    u
1    b    g    l    q    v
2    c    h    m    r    w
3    d    i    n    s    x
4    e    j    o    t    y
-----------------------------------
2. df['col1'][0] = a
-----------------------------------
3. df['col1'] :
 0    a
1    b
2    c
3    d
4    e
Name: col1, dtype: object
-----------------------------------
4. list(df['col1']) = ['a', 'b', 'c', 'd', 'e']
-----------------------------------
5. df.loc[0] :
 col1    a
col2    f
col3    k
col4    p
col5    u
Name: 0, dtype: object
-----------------------------------
6. df.loc[[0]] :
   col1 col2 col3 col4 col5
0    a    f    k    p    u
-----------------------------------
7. list(df.loc[0]) =  ['a', 'f', 'k', 'p', 'u']
-----------------------------------
8. df.iloc[3, 3] = s
-----------------------------------
9. df.iloc[3, :] :
 col1    d
col2    i
col3    n
col4    s
col5    x
Name: 3, dtype: object
-----------------------------------
10. list(df.iloc[3, :]) = ['d', 'i', 'n', 's', 'x']
-----------------------------------
Bonus : *list(df.iloc[3, :]) = d i n s x

While there are certainly more ways on how to look up dataframe, the examples I show above should be enough to answer your question (correct me if I'm wrong).

Sign up to request clarification or add additional context in comments.

6 Comments

Thank you. That helps a lot with the index and with helping me isolate the next thing I need to figure out. I added a simplified example and desired output above.
I saw your edit, that's actually not what I meant by simplified example. I have zero knowledge in xml, and as far as I understand from your question, it is unrelated anyway. I meant for example from this line xmlparse = Xet.parse('sample.xml'), I can't go on, since I don't have the file sample.xml. I don't know and can't imagine how your dataframe looks like. Can you provide a snippet of your data in any form possible (other than screenshot). If you can make a copyable dataframe that's ideal, otherwise just as text is also helpful.
You are quite right! Sorry I left the XML bits in there. I've now edited them out. If we look at your example, what I am trying to do is look up 0.767971 (row0 value of col0) and get back 0.469264 (roow0 value of col1)
I add a lot of examples in my answer. Let me know if you have further questions.
Thank you for sticking with this and helping me. Your examples are excellent.
|
0

While I still want to know if there's a simpler way to do this, this get me just the single value that I'm looking for.

So, the solution is to create a dataframe and index it on key 0. That gets rid of the automatic numerical row ids (0, 1, 2, etc). Then use .loc to look up the row of the value ("thing 1 name") in the key 0 column and return the value in the specified column ("Thing ID").

print(thing_df.loc["thing 1 name","Thing ID"])

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.