1

Suppose I have two arrays:

a = np.array(
[[0, 1],
 [2, 3],
 [4, 5],
 [6, 7]])

b = np.array(
[[2, 3],
 [6, 7],
 [0, 1],
 [4, 5]])

As you can see, one array is simply a shuffle of the other. I need to combine these two arrays to form a third array, c, such that:

  • the first part of array c (until a random index i) consists of elements from the first part of array a (until index i). Therefore, c[:i] == a[:i] must return True.
  • the rest of array c is filled with the values from array b that are not already in c, in the same order in which they appear in b.

Given that index i is set to 2, the desired output for the arrays a and b above should be:

> c
[[0, 1],
 [2, 3],
 [6, 7],
 [4, 5]]

Array c must be the same length as arrays a and b, and it is possible that two elements within either a or b are the same. Array c must also consist of the same elements that are in a and b (i.e. it behaves somewhat like a shuffle).

I've tried multiple solutions, but none give the desired result. The closest was this:

import copy

import numpy as np

a = np.arange(10).reshape(5, 2)
np.random.shuffle(a)

b = np.arange(10).reshape(5, 2)
b_part = b[:4]

temp = []

for part in a:
    if part in b_part:
        continue
    else:
        temp.append(part)

temp = np.array(temp)

c = copy.deepcopy(np.vstack((b_part, temp)))

However, it sometimes results in array c being smaller than arrays a and b, because the elements in either list can sometimes repeat.

  • If elements are not unique, your rules imply that c can be shorter. Example: a = [(0,1),(2,3),(2,3),(4,5)], b = [(2,3),(4,5),(2,3),(0,1)], i = 2. You'd pick a[:i], which is [(0,1),(2,3)], and from b whatever has not occurred yet, which is (4,5). So c would be [(0,1),(2,3),(4,5)], which is shorter. Commented Mar 29, 2019 at 21:39
  • @PaulPanzer I understand that is what's causing the issue, but I don't know how I can address it myself (which is why I'm asking the question) Commented Mar 29, 2019 at 21:43
  • The first thing would be to decide what your desired answer would be in this case. Commented Mar 29, 2019 at 21:46
  • IMHO your code does work (just create a_part instead of b_part, swap the roles of the two arrays in the loop, and use vstack((a_part, temp))). The size of c is a consequence of your problem definition when a, which is the basis for c, contains duplicates: if the part taken from a already holds one copy of a duplicated row, the other copy is skipped when it is found in b, so c cannot have the same size as a or b. Commented Mar 29, 2019 at 22:07
  • @KomronAripov Using your pic example, try index [:4] instead of [:2] and give us the result. Commented Mar 29, 2019 at 22:37

4 Answers

2

The following should handle duplicates alright.

def mix(a, b, i):                                             
    sa, sb = map(np.lexsort, (a.T, b.T))                      
    mb = np.empty(len(a), '?')                                
    mb[sb] = np.arange(2, dtype='?').repeat((i, len(a)-i))[sa]
    return np.concatenate([a[:i], b[mb]], 0)                             

It

  • indirectly sorts a and b
  • creates a mask which is True at the positions not taken from a, i.e. has i Falses and then len(a)-i Trues.
  • uses the sort orders to map that mask to b
  • filters b with the mask and appends to a[:i]

Example (transposed to save space):

a.T
# array([[2, 2, 0, 2, 3, 0, 2, 0, 0, 1],
#        [0, 1, 2, 0, 1, 0, 3, 0, 0, 0]])
b.T
# array([[0, 0, 2, 1, 0, 0, 2, 2, 2, 3],
#        [0, 0, 0, 0, 2, 0, 1, 3, 0, 1]])
mix(a, b, 6).T
# array([[2, 2, 0, 2, 3, 0, 0, 1, 0, 2],
#        [0, 1, 2, 0, 1, 0, 0, 0, 0, 3]])
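
For readers who find the compact version hard to follow, here is a minimal sketch of the same four steps with one named intermediate per step. The name `mix_verbose` and the variable names are made up for illustration, and the sketch assumes, as the question does, that b is a permutation of the rows of a.

```python
import numpy as np

def mix_verbose(a, b, i):
    # Step 1: indirect sort of the rows of each array.
    order_a = np.lexsort(a.T)
    order_b = np.lexsort(b.T)

    # Step 2: mask over a, False for the first i rows (taken from a),
    # True for the remaining len(a) - i rows.
    from_b = np.arange(len(a)) >= i

    # Step 3: when b is a permutation of a, a[order_a[k]] and b[order_b[k]]
    # are equal rows, so the mask value of a's k-th sorted row is copied
    # to the position of b's k-th sorted row.
    mask_b = np.empty(len(b), dtype=bool)
    mask_b[order_b] = from_b[order_a]

    # Step 4: keep a[:i], then the rows of b selected by the mask, in b's order.
    return np.concatenate([a[:i], b[mask_b]], axis=0)
```

Because the mask always contains exactly len(a) - i True values, the result has the same length as a even when rows repeat.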

3 Comments

If a = np.array([[0, 1], [2, 3], [0, 0], [2, 3], [4, 5], [6, 7]]) and b = np.array([[2, 3], [6, 7], [0, 1], [4, 5], [2, 3], [0, 0]]), then for i=3 this solution gives c = [[0, 1], [2, 3], [0, 0], [6, 7], [4, 5], [2, 3]]. The last [2, 3] probably violates the OP's requirement, since it has already been picked up from a and must not be picked up again from b.
@fountainhead No, that's the other [2, 3] ;-] -- More seriously, OP's clarifications as to how to handle dupes are spread out across several comments.
Several comments and several external links too. :-)
2

Here's one solution:

full_len = len(a)

b_not_in_a_part = ~np.all(np.isin(b,a[:i+1]),axis=1)         # Get boolean mask, to apply on b
b_part_len = full_len-i-1                                    # Length of b part of c

c = np.concatenate((a[:i+1], b[b_not_in_a_part,:]), axis=0)  # Construct c, using the mask for the b part.

Testing it out:

import numpy as np
a = np.array(
[[0, 1],
 [2, 3],
 [0, 0],
 [2, 3],
 [4, 5],
 [6, 7]])
b = np.array(
[[2, 3],
 [6, 7],
 [0, 1],
 [4, 5],
 [2, 3],
 [0, 0]])

i = 2

print ("a is:\n", a)
print ("b is:\n", b)

full_len = len(a)

b_not_in_a_part = ~np.all(np.isin(b,a[:i+1]),axis=1)         # Get boolean mask, to apply on b
b_part_len = full_len-i-1                                    # Length of b part of c

c = np.concatenate((a[:i+1], b[b_not_in_a_part,:]), axis=0)  # Construct c, using the mask for the b part.
print ("c is:\n", c)

Output:

a is:
 [[0 1]
 [2 3]
 [0 0]
 [2 3]
 [4 5]
 [6 7]]
b is:
 [[2 3]
 [6 7]
 [0 1]
 [4 5]
 [2 3]
 [0 0]]
c is:
 [[0 1]
 [2 3]
 [0 0]
 [6 7]
 [4 5]]

Note: For this example, c has a length of only 5, even though a and b have a length of 6. This is because, due to high duplication in b, there aren't enough values left in b that are eligible to be used for c.
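
One caveat about the mask: np.isin tests individual elements, not whole rows, so a row of b counts as "already in a" whenever each of its values occurs somewhere in a[:i+1]. The sketch below contrasts that with a row-wise comparison done through broadcasting. The helper name `rows_in` and the small arrays are invented for illustration only.

```python
import numpy as np

def rows_in(rows, pool):
    """Boolean mask: True where a row of `rows` equals some row of `pool`."""
    # Broadcast to shape (len(rows), len(pool), n_cols), require all columns
    # to match, then ask whether any row of `pool` matched.
    return (rows[:, None, :] == pool[None, :, :]).all(axis=-1).any(axis=1)

head = np.array([[0, 1], [2, 3]])
b = np.array([[0, 3], [2, 3], [6, 7]])

# Element-wise: [0, 3] passes because 0 and 3 both occur somewhere in head.
elementwise = np.all(np.isin(b, head), axis=1)
# Row-wise: [0, 3] fails because no row of head equals it.
rowwise = rows_in(b, head)
```

For the test data in this answer both approaches happen to select the same rows; they differ only when a row of b mixes values taken from different rows of a.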

3 Comments

@KomronAripov: This error occurs because, due to high duplication in b, there aren't enough values in b that are not already used from a. I can fix it if you tell me what your requirement is for this scenario.
In such a scenario, is it ok if c has smaller length than a or b? (Since there are not enough values to match the full length)
@KomronAripov, I fixed the error, assuming that it's OK for c to have a smaller length if high duplication in b leaves too few values eligible to go into c.
0

Just use numpy.concatenate(), and remember that a NumPy slice goes up to, but does not include, its end index, so a[0:i] takes the first i rows (see below). (Edit: it seems you modified your a, b and c arrays, so I'll change my code below to accommodate.)

import numpy as np

a = np.array(
[[0, 1],
 [2, 3],
 [4, 5],
 [6, 7]])

b = np.array(
[[2, 3],
 [6, 7],
 [0, 1],
 [4, 5]])


i = 2
c = a[0:i]
for k in b:
    # Row-wise test: `k not in c` would also match on a single equal element.
    if not (c == k).all(axis=1).any():
        c = np.concatenate((c, [k]))

print(c)

Output:

[[0 1]
 [2 3]
 [6 7]
 [4 5]]
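
If duplicates should be treated as separate items (one possible reading of the question, not something the OP confirmed in this answer's comments), the loop can count copies instead of testing membership. The following is only a sketch under that assumption; the function name `combine` is hypothetical.

```python
from collections import Counter

import numpy as np

def combine(a, b, i):
    # How many copies of each row have already been taken from a[:i].
    taken = Counter(map(tuple, a[:i]))
    rest = []
    for row in b:
        key = tuple(row)
        if taken[key] > 0:
            taken[key] -= 1   # this copy is already in c, so skip it
        else:
            rest.append(row)  # not used yet: keep it, preserving b's order
    if not rest:              # i == len(a): nothing left to append
        return a[:i].copy()
    return np.concatenate([a[:i], np.array(rest)], axis=0)
```

When b is a permutation of a, every row of a[:i] cancels exactly one row of b, so c always has the same length as a. For the arrays in the question and i = 2 this gives the same c as the output above.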

5 Comments

after a slight modification to array b (now updated in the answer), the output is not as expected.
this results in duplicates (even though there were none to begin with), which is undesired...
Ok I get ya, how about now?
this still does not address the issue where there is a possibility that either array a or array b can contain duplicates.
can you provide an example of what c should look like if a and b contains the duplicates as you have mentioned?
0
  1. For i=2, get your first part of the result:

    c = a[:i]
    
  2. Get "uncommon" elements between b and c:

    diff = np.array([x for x in b if not (c == x).all(axis=1).any()])  # compare whole rows, not single elements
    
  3. Select random elements from diff and concatenate to the original array:

    s = len(a) - i
    np.concatenate([c, diff[np.random.choice(diff.shape[0], size=s, replace=False), :]], axis=0)
    

OUTPUT:

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])

6 Comments

You can add replace=False when you select random rows from b. Updated the code.
TypeError: randint() got an unexpected keyword argument 'replace'
@KomronAripov Sorry, my bad. You are also supposed to change randint to choice. Try now?
ValueError: Cannot take a larger sample than population when 'replace=False'
This is probably your data issue where you are trying to select a larger number of elements than your parent set. It works with the data you have provided in the question.