I am working on a web front end plus its front-end services.
I receive good-sized CSV files (around 10k lines each). My service processes them and condenses them into one larger CSV file (up to 300k lines).
This larger file is then turned into an HTML/PDF report after some extrapolation.
My questions are:
Taking 17,000 files and turning them into one takes forever (18 hours the last time I tried it). The current process is to take a line from a CSV, parse it to see whether it already exists in my master array, and either create a new entry or add the data to the existing entry. Is there a better way to do this? Since every line triggers a scan of the master array, the later lines take far longer to process than the first ones, so the total cost grows roughly quadratically rather than linearly. Would keying the merged data on a dict (hash lookup instead of an array scan) be the right direction? A minimal sketch of what I mean is below; `id` and `value` are placeholder column names, my real files have more columns.
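```python
import csv
from pathlib import Path

# Sketch of keying the merged data on a dict instead of scanning the master
# array for every incoming row. "id" and "value" are placeholder column names.
def merge_files(input_dir: str) -> dict[str, float]:
    merged: dict[str, float] = {}  # row key -> running total
    for path in sorted(Path(input_dir).glob("*.csv")):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                key = row["id"]
                # O(1) dict lookup replaces the per-line scan of the array
                merged[key] = merged.get(key, 0.0) + float(row["value"])
    return merged

def write_merged(merged: dict[str, float], out_path: str) -> None:
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "value"])
        writer.writerows(merged.items())
```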
Once this large file is created, parsing it also takes quite a while. Should I move away from writing CSV output and switch to JSON for faster data massaging, or even a lightweight database?
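If a lightweight database turns out to be the right call, this is roughly what I had in mind with SQLite; a minimal sketch, assuming the same placeholder `id`/`value` columns as above:

```python
import csv
import sqlite3

# Sketch of the lightweight-DB route: load the condensed rows into SQLite
# once, then do the report extrapolation with queries instead of re-parsing
# a 300k-line CSV each time. Table and column names are placeholders.
def load_into_sqlite(csv_path: str, db_path: str = "report.db") -> None:
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS merged (id TEXT PRIMARY KEY, value REAL)")
    with open(csv_path, newline="") as f:
        rows = ((r["id"], float(r["value"])) for r in csv.DictReader(f))
        con.executemany("INSERT INTO merged (id, value) VALUES (?, ?)", rows)
    con.commit()
    con.close()

# The report step could then pull whatever slice it needs, e.g.:
#   con.execute("SELECT id, value FROM merged WHERE value > ?", (threshold,))
```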
(For what it's worth, I can `cat` and `uniq` the raw files together in seconds, so reading the data itself doesn't seem to be the slow part.)