
I am trying to extract data from thousands of identically structured Excel 2007/2010 files. I would prefer to do this using scraping techniques. Is it possible to scrape an Excel file, given that, as far as I know, the format is basically some sort of XML?

So, is it possible to convert an Excel file to XML or some other markup format?

  • What environment and programming language are you using? Commented Oct 15, 2010 at 18:53
  • In the past, I have used HTML Agility Pack and C# (in an SSIS script task) to scrape XML data, so I was hoping to convert the Excel files to XML and scrape the data from the various tags. Commented Oct 15, 2010 at 18:56
  • So using Excel with VBA is out of the question? It is a native way of doing things. Commented Oct 15, 2010 at 18:58
  • I prefer to stick with SSIS to load this data into the DB. And I am not a VBA fan. Commented Oct 15, 2010 at 19:03

2 Answers


The XLSX format is actually a ZIP archive with a different extension. If you unzip it using your favorite zip program, you'll find the worksheet data inside xl/worksheets, where each worksheet is saved as a separate XML document. Note that cells containing text usually hold an index into xl/sharedStrings.xml rather than the text itself. You should be able to use XSLT, as Michael suggested, to extract the data you require.
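
To illustrate the unzip-and-read approach, here is a minimal Python sketch using only the standard library. It is not the SSIS/C# setup from the question, and it makes assumptions: the helper names `read_shared_strings` and `read_sheet` are hypothetical, the worksheet part is assumed to be named `xl/worksheets/sheet1.xml`, and inline strings, dates, and formulas without cached values are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by SpreadsheetML parts inside an .xlsx package.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def read_shared_strings(archive):
    """Return the shared-string table as a list (empty if the part is absent)."""
    if "xl/sharedStrings.xml" not in archive.namelist():
        return []
    root = ET.fromstring(archive.read("xl/sharedStrings.xml"))
    # Each <si> holds one string, possibly split across several <t> elements.
    return ["".join(t.text or "" for t in si.iter("{%s}t" % NS["m"]))
            for si in root.findall("m:si", NS)]


def read_sheet(path, sheet_part="xl/worksheets/sheet1.xml"):
    """Yield (cell_reference, value) pairs from one worksheet part.

    sheet_part is an assumed default; real workbooks may name parts differently.
    """
    with zipfile.ZipFile(path) as archive:
        strings = read_shared_strings(archive)
        root = ET.fromstring(archive.read(sheet_part))
        for cell in root.iterfind(".//m:c", NS):
            value = cell.find("m:v", NS)
            if value is None or value.text is None:
                continue  # empty cell or inline string; skipped in this sketch
            text = value.text
            if cell.get("t") == "s":  # "s" marks an index into the shared strings
                text = strings[int(text)]
            yield cell.get("r"), text
```

Calling `list(read_sheet("report.xlsx"))` (with `report.xlsx` standing in for one of your files) would give pairs such as a cell reference and its stored value, which you could then load into the database.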



Excel 2010 saves files in an XML-based format by default. So what file format are your Excel files currently in (i.e., what extension do they have)? Your question is somewhat ambiguous on this point. If they are already in XML, you can use XSLT to scrape them.

1 Comment

They are in XLSX, so I am just asking how I would convert them from the common worksheet format to XML markup. A few years ago, I remember clicking a button in Excel that let me see the markup instead of the regular interface.
