Comparing large MySQL data sets with PHP

Question

I have a set of approximately 1.1 million unique IDs, and I need to determine which ones do not have a corresponding record in my application's database. The set of IDs also comes from a database, but not the same one. I am using PHP and MySQL and have plenty of memory: PHP runs on a server with 15 GB of RAM, and MySQL runs on its own server with 7.5 GB of RAM.

Normally I'd simply load all the IDs in one query and then use them with the IN clause of a SELECT query to do the comparison in one shot.

So far my attempts have resulted in scripts that either take an unbearably long time or that spike the CPU to 100%.

What's the best way to load such a large data set and do this comparison?

You should configure your MySQL instance so it can load the dataset into memory (1.1 million IDs should easily fit in 7.5 GB of RAM) and do what nick said: use a LEFT JOIN instead of NOT IN. It's much more efficient, and the query should be extremely fast.
– N.B., Commented Apr 20, 2011 at 20:48

Mark Baker · Accepted Answer · 2011-04-20 20:41:34Z

3

Generate a dump of the IDs from the first database into a file, then load it into a temporary table on the second database, and join that temporary table against the second database's table to identify the IDs that don't have a matching record. Once you've generated that list, you can drop the temporary table.

That way, you're not trying to work with large volumes of data in PHP itself, so you shouldn't have any memory issues.
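The steps above might look roughly like the following. This is only a sketch under several assumptions: the IDs are integers dumped one per line into a file, the function name `findMissingIds`, the temporary table name `tmp_source_ids`, and the `$table`/`$column` parameters are all hypothetical, and the connection is assumed to throw exceptions on errors. It loads the file with batched INSERT statements through PDO rather than a bulk file load, which is a substitution made to keep the example free of server-side file-loading settings; a bulk load would likely be faster for 1.1 million rows.

```php
<?php
/**
 * Returns the IDs listed in $idFile (one integer per line) that have no
 * matching row in $table.$column on the connection $db.
 *
 * Assumptions: $table and $column are hypothetical names supplied by trusted
 * code (they are interpolated into SQL), and $db uses PDO::ERRMODE_EXCEPTION.
 */
function findMissingIds(PDO $db, string $idFile, string $table, string $column, int $batchSize = 500): array
{
    // Temporary table holding the IDs from the other database.
    $db->exec('CREATE TEMPORARY TABLE tmp_source_ids (id INTEGER PRIMARY KEY)');

    $handle = fopen($idFile, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $idFile");
    }

    // Inserts one batch of IDs with a single multi-row INSERT.
    $insertBatch = function (array $ids) use ($db): void {
        if ($ids === []) {
            return;
        }
        $placeholders = implode(',', array_fill(0, count($ids), '(?)'));
        $db->prepare("INSERT INTO tmp_source_ids (id) VALUES $placeholders")->execute($ids);
    };

    // Read the file line by line so PHP never holds the whole set at once.
    $batch = [];
    $db->beginTransaction();
    while (($line = fgets($handle)) !== false) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        $batch[] = (int) $line;
        if (count($batch) >= $batchSize) {
            $insertBatch($batch);
            $batch = [];
        }
    }
    $insertBatch($batch);
    $db->commit();
    fclose($handle);

    // LEFT JOIN keeps every loaded ID; rows with no match have NULL on the right side.
    $sql = "SELECT t.id
            FROM tmp_source_ids AS t
            LEFT JOIN $table AS a ON a.$column = t.id
            WHERE a.$column IS NULL
            ORDER BY t.id";
    $missing = $db->query($sql)->fetchAll(PDO::FETCH_COLUMN);

    $db->exec('DROP TABLE tmp_source_ids');

    return array_map('intval', $missing);
}
```

For the join to be fast, the column on the application side presumably needs an index (a primary key would do); without one, the database may end up scanning the table for every ID.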

answered Apr 20, 2011 at 20:41

Mark Baker


Community · Accepted Answer · 2017-05-23 12:29:04Z

1

Assuming you can't join the tables since they are not on the same DB server, and that your server can handle this, I would populate an array with all the IDs from one DB, then loop over the IDs from the other and use in_array to see if each one exists in the array.

BTW - according to this, you can make the in_array more efficient.
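The linked suggestion is not preserved here, so the following is not necessarily what it described. One common way to avoid the cost of `in_array` (which scans the array linearly on every call) is to flip the array so the IDs become keys and then test membership with `isset`. A minimal sketch, assuming both ID sets are already loaded as PHP arrays of integers or strings; the function name `missingIds` is hypothetical:

```php
<?php
/**
 * Returns the IDs in $sourceIds that do not appear in $existingIds.
 * Both arrays are assumed to contain integer or string IDs.
 */
function missingIds(array $sourceIds, array $existingIds): array
{
    // array_flip turns the values into keys, so each check below is a
    // hash lookup instead of a linear scan through the whole array.
    $lookup = array_flip($existingIds);

    $missing = [];
    foreach ($sourceIds as $id) {
        if (!isset($lookup[$id])) {
            $missing[] = $id;
        }
    }

    return $missing;
}

$missing = missingIds([1, 2, 3, 4, 5], [1, 3, 5]); // [2, 4]
```

With 1.1 million IDs on one side, the difference between a linear scan per lookup and a keyed lookup is likely what separates a script that never finishes from one that completes quickly.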

edited May 23, 2017 at 12:29

CommunityBot

answered Apr 20, 2011 at 20:26

Galz
