0

Looking into solving a very large data set problem by using bit manipulation to reduce compute

Imagine billions of records that look like this(but more than 20k in length):

  • Questions : A,B,C,D
  • Answers : 0 1 0 1

The length of questions and answers is always the same, the answers is stored in true/false as bits if the answer was correct

I get inputs like this (A & B) and I need to find the records that have A and B as true. it can get into the weeds with for ex (( A & B ) ^ C ) | D

My idea in theory would be to store the questions as the bits that would match up with the answer equivalent

  • A 1000
  • B 0100
  • C 0010
  • D 0001

so possibly A & B would be 1100 as required bits to be flipped for an input to be considered true but this also includes these cases as being true

  • 1100
  • 1110
  • 1101
  • 1111

where I start doubting this is when I get into the ors and nots. Cant wrap my head around how to evaluate those.

I am also open to other suggestions. Right now the best idea I have that is more sound would be processing the questions as key:question val:index 1 time and then turn the answers into a similar state and evaluate from there

4
  • SO is not a discussion forum, general design questions like this are not appropriate. Try to implement what you want, and if you have a problem post a minimal reproducible example that demonstrates it. Commented May 29 at 23:24
  • And in general, avoid storing comma-separated lists in relational database columns. See stackoverflow.com/questions/3653462/… Commented May 29 at 23:25
  • I've read this 10 times and I still don't know what you want in the middle+bottom section. So you want to find all sets where A and B are true, don't care about C and D. Then the logic check should be entry & (A | B) == A & B, meaning entry & 1100 == 1100. Just mask the ignored bits. What other logic do you need? Also, nothing here tells how you store this, what database, how do you query? Commented May 30 at 9:01
  • There is nothing to stop you using a language that offers long bitset support and the representation that you describe. If certain questions are likely to be used as keys into the big data then the 64 most useful ones probably want to be placed first. Commented May 30 at 17:02

1 Answer 1

0

You need to rotate answer matrix by 90 degrees, and convert your billions of 20K records to 20K bitcontainers, each has length billion bits. Each bitcontainer related to some question, and contains 1s in the position, related to ID of your answer. Thereafter, to lookup, you need perform boolean operation over bit containers from your request, and resulting bit container will contains IDs of the documents, contains answers for your questions.

Example:

Let you have 5 possible questions: A B C D E

And you have your "documents" - bit containers, contains "1" as "YES" for appropriate questuion.


   A B C D E // Questions
0: 0 1 0 1 0 // This is your  "answer" form your question, ID=0 (o before :)
1: 1 1 1 0 0 // This "document" contains YES for ABC and NO for DE
2: 1 0 0 0 1 //   // This "document" contains YES for AE and NO for BCD

Turn this matrix by 90 degrees:

   0 1 2 // DocID
A: 0 1 1 
B: 1 1 0
C: 0 1 0
D: 1 0 0
E: 0 0 1

For example, your request: Provide me list of DocIDs, contains YES for (A & B).

From database, you extract containers A and B, and perform bot operation AND over bits in the same position in the containers:

   0 1 2 // DocID
A: 0 1 1 
B: 1 1 0
&: 0 1 0

Result: Only document 2 (your answer 2) contains both A&B.

More complex request: (( A & B ) ^ C ) | D

   0 1 2 // DocID
A: 0 1 1 
B: 1 1 0
&: 0 1 0
C: 0 1 0
^: 0 0 0
D: 1 0 0
|: 1 0 0

You can implement bitcontainers yourself, or use opensource library BitMagic.

Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.