
Currently, I am trying to create a 2-SUM algorithm that, given a set of around 1 million integers, finds the number of target values t (-10,000 <= t <= 10,000) that can be formed as the sum of two values x, y in the set.

I have no problem with 2-SUM for a single value of t: using a hash table, I check for each entry x in the table whether another entry t-x exists. This runs in O(N) time.
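
For reference, here is roughly what I mean for a single target (just a minimal sketch; the names are mine):

#include <unordered_set>

// Single-target 2-SUM: does any pair x + y == t with x != y exist in S?
bool hasPair(const std::unordered_set<long long>& S, long long t)
{
    for (long long x : S)                      // one pass over the set
    {
        long long y = t - x;
        if (y != x && S.find(y) != S.end())    // O(1) average hash lookup
            return true;
    }
    return false;
}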

But now I have to find pairs for many values of t, from -10000 to 10000. If I just wrap that lookup in a plain for-loop over every target, the runtime jumps to roughly 20,001 * O(N). I have tried the code below, which brute-forces through all t from -10000 to 10000, but it simply runs too slowly (~1 hr to execute).

So, my question is: are there any hints for a better way to handle the ~20,001 targets than brute-forcing through each one?

Here is the code I used for my brute-force solution:

// S is the unordered_set<long long> holding the input values;
// values counts how many targets t have at least one valid pair.
for (long long t = -10000; t <= 10000; t++)
{
    for (long long value : S)                                // try every x in the set
    {
        // look for the partner y = t - x with a hash lookup, requiring y != x
        if (S.find(t - value) != S.end() && t - value != value)
        {
            values++;
            //cout << "Found pair for target " << t << ": " << value << " " << t - value << '\n';
            break;                                           // count each t at most once
        }
    }
}
  • What is S in line 3? Commented Nov 26, 2015 at 20:04
  • "are there any hints for better ways" Yes, ensure your dataset is sorted first. Commented Nov 26, 2015 at 20:04
  • @user3437460 I have tried using an ordered set (cplusplus.com/reference/set/set), but it has O(log N) access time. Do you know of any better hash tables? Commented Nov 26, 2015 at 20:08
  • @erip S is the dataset (the hash table) I used. Commented Nov 26, 2015 at 20:08
  • @PatrickYu If you are allowed to sort your dataset first, then sort it with an efficient algorithm like merge sort/quick sort which is O(n log n). Commented Nov 26, 2015 at 20:11

3 Answers


A better approach would be to use an ordered set (if the values are unique), or an ordered array/list if you care about duplicates.

Then, you search for a matching pair for your values using the following method (a rough C++ sketch is included at the end of this answer):

  1. For each Val (-10000, -9999, ...)
  2. Let iS be 0
  3. Let iE be length - 1
  4. While (S[iS] + S[iE]) != Val
    4.1 If (S[iS] + S[iE]) > Val: binary search in (iS -> iE - 1) for the maximum value less than or equal to (Val - S[iS]) and set iE to match.
    4.2 If (S[iS] + S[iE]) < Val: binary search in (iS + 1 -> iE) for the minimum value greater than or equal to (Val - S[iE]) and set iS to match.
    4.3 If iS > iE, Val doesn't exist.

This gives you O(n log(n)) for sorting and O(m n) for searching (m is 20001 for -10000 -> 10000), although realistically the searching will perform much better than O(m n). The entire solution is O(m n), since m > log(n).

It can be further optimized by keeping a map of already-matched values: on each iteration, after a match is found, advance iE until (S[iS] + S[iE]) > maxValue (10000) and mark every sum along the way as found, so the outer loop needs fewer iterations.
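
A rough, untested sketch of the method above in C++ (assuming the values live in a sorted std::vector with no duplicates; all names are just for illustration):

#include <algorithm>
#include <cstddef>
#include <vector>

// Is there a pair S[iS] + S[iE] == val (iS < iE) in the sorted, duplicate-free vector S?
bool hasPairWithSum(const std::vector<long long>& S, long long val)
{
    std::ptrdiff_t iS = 0;
    std::ptrdiff_t iE = static_cast<std::ptrdiff_t>(S.size()) - 1;

    while (iS < iE)
    {
        long long sum = S[iS] + S[iE];
        if (sum == val)
            return true;

        if (sum > val)
        {
            // Step 4.1: largest element <= val - S[iS], searched in S[iS .. iE-1]
            auto it = std::upper_bound(S.begin() + iS, S.begin() + iE, val - S[iS]);
            if (it == S.begin() + iS)
                return false;                  // every remaining candidate is too big
            iE = (it - S.begin()) - 1;
        }
        else
        {
            // Step 4.2: smallest element >= val - S[iE], searched in S[iS+1 .. iE]
            auto it = std::lower_bound(S.begin() + iS + 1, S.begin() + iE + 1, val - S[iE]);
            if (it == S.begin() + iE + 1)
                return false;                  // every remaining candidate is too small
            iS = it - S.begin();
        }
    }
    return false;                              // Step 4.3: indices crossed, no pair for val
}

// Count how many targets in [-10000, 10000] have at least one matching pair.
int countTargets(std::vector<long long> values)
{
    std::sort(values.begin(), values.end());   // O(n log n), done once
    int found = 0;
    for (long long t = -10000; t <= 10000; ++t)
        if (hasPairWithSum(values, t))
            ++found;
    return found;
}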


2 Comments

Isn't the while loop O(N), because in the worst case iS and iE will traverse the list until they meet at the midpoint?
@PatrickYu - Yes, I was just thinking about this. The answer is inaccurate - will revise.

As other people have already suggested, if you want a "best effort" approach (meaning that it may not be the best, but still good enough), you can sort your data and use std::lower_bound for searching.

The std::lower_bound function is implemented as a binary search, which means that in the worst case, for 1000000 integers, you'll need about 20 comparisons to find a match. If you do this inside a -10000..10000 loop you get 20000*20 = 400000 comparisons, which should take far less than an hour (my guess is a few minutes, depending on CPU power).
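
For concreteness, a minimal sketch of that per-target check (assuming the data has been copied into a sorted std::vector; the function name is made up):

#include <algorithm>
#include <vector>

// For one target t: for each x, binary-search the sorted data for y = t - x.
bool hasPairSorted(const std::vector<long long>& sorted, long long t)
{
    for (long long x : sorted)
    {
        long long y = t - x;
        auto it = std::lower_bound(sorted.begin(), sorted.end(), y);   // ~20 comparisons
        if (it != sorted.end() && *it == y && y != x)
            return true;
    }
    return false;
}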

A find on an unordered_set, on the other hand, can degenerate into a linear search, which means that in the worst case you're looking at 20000*1000000 = 20000000000 comparisons, which is 50000 times worse.

You could improve on a binary search (e.g. by checking how "close" you are to your target and switching to a linear search once you're under a specific difference in value), but I don't think that would speed up the search much.

There are other ways, probably faster (for example, you could discard duplicates by using 15625 integers with 64-bit precision and setting the bit matching each value in your dataset, giving you O(n) time for the setup and O(1) for the lookup, but you're going to need two such bitsets, one for positive values and one for negative), but they're going to be much more difficult to implement.
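
A rough sketch of that bit trick, under the assumption that every value's magnitude fits below 1,000,000 (15625 * 64 bits per sign); all names here are made up:

#include <cstdint>
#include <vector>

// One presence bit per possible magnitude, split into positive and negative halves.
struct SignBitset {
    std::vector<std::uint64_t> pos, neg;

    SignBitset() : pos(15625, 0), neg(15625, 0) {}   // 15625 * 64 = 1,000,000 bits each

    void insert(long long v) {                       // assumes |v| < 1,000,000
        std::uint64_t m = static_cast<std::uint64_t>(v < 0 ? -v : v);
        auto& half = (v < 0) ? neg : pos;
        half[m / 64] |= (std::uint64_t{1} << (m % 64));
    }

    bool contains(long long v) const {
        std::uint64_t m = static_cast<std::uint64_t>(v < 0 ? -v : v);
        const auto& half = (v < 0) ? neg : pos;
        if (m / 64 >= half.size())
            return false;                            // outside the covered range
        return (half[m / 64] >> (m % 64)) & 1u;
    }
};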



Thanks to everyone who has helped!

I solved the problem by partitioning the input into multiple "buckets": I sort the dataset and then split it into buckets of 10,000 values each, so the smallest 10k numbers go into the 1st bucket, the next 10k into the 2nd, and so forth. That way, when I have to search for the entry t-x, I only search among ~10,000 numbers rather than all 1,000,000.
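
Here is roughly what the bucketing looks like (a loose sketch rather than my exact code; all names are made up):

#include <algorithm>
#include <cstddef>
#include <vector>

// Sort the input, chop it into buckets of 10,000 values, and remember each
// bucket's smallest value so a lookup only has to search one bucket.
struct Buckets {
    std::vector<std::vector<long long>> buckets;
    std::vector<long long> firstValue;               // smallest value in each bucket

    explicit Buckets(std::vector<long long> values, std::size_t bucketSize = 10000) {
        std::sort(values.begin(), values.end());
        for (std::size_t i = 0; i < values.size(); i += bucketSize) {
            std::size_t end = std::min(i + bucketSize, values.size());
            buckets.emplace_back(values.begin() + i, values.begin() + end);
            firstValue.push_back(values[i]);
        }
    }

    bool contains(long long key) const {
        // Find the only bucket whose range could hold key ...
        auto it = std::upper_bound(firstValue.begin(), firstValue.end(), key);
        if (it == firstValue.begin())
            return false;                            // key is smaller than everything
        const auto& b = buckets[static_cast<std::size_t>(it - firstValue.begin()) - 1];
        // ... and search only those ~10,000 values.
        return std::binary_search(b.begin(), b.end(), key);
    }
};

// Usage: for each value x and each target t, check contains(t - x) (and t - x != x).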

