Timeline for Filtering a large (50GB+) JSON lines file matching CIDRs
Current License: CC BY-SA 3.0
18 events
| when | what | action | by | license | comment |
|---|---|---|---|---|---|
| Jun 15, 2017 at 11:20 | comment | added | Matthieu M. | | I think you can simplify the build-up of your tree by pre-sorting the CIDRs by mask; if you start by putting the largest subnets first, then you never have to remove subnets when adding a larger one. |
| Jun 15, 2017 at 11:10 | comment | added | Matthieu M. | | Naively, I would have used a more basic structure: I'm thinking of a simple list of 33 (frozen) sets, where index n corresponds to the number of bits in the mask (the /n part of the CIDR). Stashing the CIDRs in the sets is trivial enough (check the /n part, put the network into the appropriate set), and verifying inclusion is also simple: start from index 32 and go toward 0, masking the IP and checking whether it's in the set. I'm not sure it would be faster per se, but I would be more confident in the correctness. |
| Jun 14, 2017 at 23:15 | comment | added | TemporalWolf | | Unless I'm misreading your code, your log n is 32, not n. But fair enough, you can eat the constant in big-O notation. |
| Jun 14, 2017 at 21:51 | comment | added | rolfl | | @TemporalWolf - that's not strictly accurate. The n in the log n is a constant, 32 (the number of bits in an IP address), not the number of CIDRs; being a constant, the complexity is just \$O(m)\$. |
| Jun 14, 2017 at 20:42 | comment | added | TemporalWolf | | Nitpick for the explanation: a binary search tree reduces lookup to O(log n), so the end complexity would be O(m log n). |
| Jun 14, 2017 at 19:35 | comment | added | rolfl | | Just FYI, I crunched some numbers, and it's processing about 75,000 records per second now, or 190 MB/s, which I feel is in the right sort of ball-park for your system. SSD should be a bit faster, but there are some overheads, and I expect that if you had a concurrent IO/processing system (computing matches in one thread while blocking on IO in another), you could reduce the time a bit... but it's probably not worth it. There's still room for improvement, but it will be hard work. |
| Jun 14, 2017 at 18:31 | history | edited | rolfl | CC BY-SA 3.0 | sum the IP parts into a single integer |
| S Jun 14, 2017 at 18:27 | history | suggested | enedil | CC BY-SA 3.0 | code is now more Pythonic |
| Jun 14, 2017 at 18:15 | review | Suggested edits | | | S Jun 14, 2017 at 18:27 |
| Jun 14, 2017 at 15:01 | comment | added | JMK09 | | I've run it on the complete set of 20,344,457 rows: real 4m27.661s, user 3m55.171s, sys 0m26.793s. That's even faster than my initial attempt with 100,000 records :) |
| Jun 14, 2017 at 14:58 | comment | added | rolfl | | I am pleased it's working for you. Make sure you test the results carefully, and adjust the code to use best practices and other library functions in Python (library functions will likely help you with portability from IP4 to IP6, and other benefits). |
| Jun 14, 2017 at 14:56 | comment | added | JMK09 | | WOW, many thanks. It is insanely fast: 100,000 rows now take only 1.3 seconds instead of 7 minutes: real 0m1.350s, user 0m1.174s, sys 0m0.146s. |
| Jun 14, 2017 at 14:54 | vote | accept | JMK09 | | |
| Jun 14, 2017 at 14:31 | comment | added | rolfl | | @JMK09 - See the edit I made to include a custom lookup for IP addresses in CIDR ranges using a much faster preprocessed algorithm. I have not tested the performance compared to your existing code, obviously. |
| Jun 14, 2017 at 14:29 | history | edited | rolfl | CC BY-SA 3.0 | Add code example for tree. |
| Jun 14, 2017 at 13:01 | comment | added | JMK09 | | Thanks for your extensive answer @rolfl, much appreciated! Can you point me in the direction of some packages/code to use? After reading your answer I tried implementing Pytricia but couldn't get it to work. Would you recommend, for example, PySubnetTree? |
| Jun 14, 2017 at 11:51 | history | edited | rolfl | CC BY-SA 3.0 | typos |
| Jun 14, 2017 at 11:15 | history | answered | rolfl | CC BY-SA 3.0 | |
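
The answer's tree-based matcher is not reproduced in this timeline, but the alternative structure Matthieu M. describes in the Jun 15 comment (a list of 33 sets indexed by prefix length, probed from /32 down to /0) is compact enough to sketch. Below is a minimal, hedged Python illustration of that idea; the function names and the use of the `ipaddress` module are my own choices, not anything from the original answer. It also shows the "sum the IP parts into a single integer" step mentioned in the Jun 14 edit summary, via `int()` on an `ipaddress` object.

```python
import ipaddress

def build_tables(cidrs):
    """Build 33 frozensets; bucket n holds the network addresses
    (as plain integers) of every /n CIDR in the input."""
    buckets = [set() for _ in range(33)]
    for cidr in cidrs:
        net = ipaddress.ip_network(cidr)
        # int(net.network_address) collapses the dotted quad into one integer
        buckets[net.prefixlen].add(int(net.network_address))
    return [frozenset(b) for b in buckets]

def in_any_cidr(ip, tables):
    """Probe each prefix length, longest (/32) first: mask the address
    down to n bits and check membership in that bucket."""
    addr = int(ipaddress.ip_address(ip))
    for n in range(32, -1, -1):
        if not tables[n]:
            continue  # no CIDRs of this length, skip the probe
        mask = ~((1 << (32 - n)) - 1) & 0xFFFFFFFF
        if (addr & mask) in tables[n]:
            return True
    return False
```

Lookup cost is bounded by 33 set probes per address regardless of how many CIDRs are loaded, which lines up with the constant-factor point made in the complexity comments above.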