
Disclaimer: I'm VERY new to Spark and Scala. I am working on a document similarity project in Scala with Spark, and I have a DataFrame that looks like this:

+--------+--------------------+------------------+
|    text|            shingles|   hashed_shingles|
+--------+--------------------+------------------+
|  qwerty|[qwe, wer, ert, rty]|  [-4, -6, -1, -9]|
|qwerasfg|[qwe, wer, era, r...|[-4, -6, 6, -2, 2]|
+--------+--------------------+------------------+

Here I split each document's text into shingles and computed a hash value for each shingle.
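
For context, the shingles above appear to be 3-character sliding windows over the text, which plain Scala produces like this:

// 3-character shingles via a sliding window, matching the table above
val shingles = "qwerty".sliding(3).toSeq // Seq("qwe", "wer", "ert", "rty")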

Imagine I have a hash_function(integer, seed) -> integer. Now I want to apply n different hash functions of this form to the hashed_shingles arrays, i.e. obtain an array of n arrays, where the i-th array is hash_function applied element-wise to hashed_shingles with seed i, for seeds 1 to n.
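
In plain Scala terms (hash_function and hashedShingles are placeholders for the hypothetical function and one row's hashed_shingles value above), the per-row result I want is:

// Desired per-row shape: one re-hashed copy of the shingle hashes per seed
val perSeed: Seq[Seq[Int]] =
  (1 to n).map(seed => hashedShingles.map(x => hash_function(x, seed)))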

I'm trying something like this, but I cannot get it to work:

val n = 3
// Make n copies of the hashed_shingles array, one per hash function
df = df.withColumn("tmp", array_repeat($"hashed_shingles", n))
// Intended: hash the i-th copy with seed i. This fails because
// hash_function is not a SQL function, so expr() cannot resolve it
// (and each x here is a whole array, not a single element)
val minhash_expr = "transform(tmp, (x, i) -> hash_function(x, i))"
df = df.withColumn("tmp", expr(minhash_expr))

I know how to do it with a UDF, but as I understand it, UDFs are opaque to the Catalyst optimizer and should be avoided where possible, so I am trying to do everything with org.apache.spark.sql.functions.

Any ideas on how to approach this without a UDF?

The UDF that achieves the same goal is this:

// Family of hash functions of the form h(x) = ((a*x + b) % p) % max_val,
// where p = 104729 is a large prime and a, b are drawn from the seed
class Hasher(seed: Int, max_val: Int, p: Int = 104729) {
  private val random_generator = new scala.util.Random(seed)
  val a = 1 + 2 * random_generator.nextInt((p - 2) / 2) // odd a in [1, p-1]
  val b = 1 + random_generator.nextInt(p - 2)           // b in [1, p-1]
  def getHash(x: Int): Int = ((a * x + b) % p) % max_val
}

// Compute a list of minhashes from a list of hashers given a set of ids
class MinHasher(hashes : List[Hasher]) {
  def getMinHash(set : Seq[Int])(hasher : Hasher) : Int = set.map(hasher.getHash).min
  def getMinHashes(set: Seq[Int]) : Seq[Int] = hashes.map(getMinHash(set))
}
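
For example, a small demo family (parameters illustrative only) can be exercised like this:

// Illustrative call with 3 demo hashers (the real code uses 100 below);
// getMinHashes returns one minimum per Hasher in the family
val demo = new MinHasher(List.tabulate(3)(s => new Hasher(s, 16)))
val demoMins = demo.getMinHashes(Seq(-4, -6, -1, -9)) // Seq of 3 minhashes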

// Minhasher: minhash_len hash functions with seeds 0..minhash_len-1
val minhash_len = 100
// shingle_bins (the number of hash buckets) is defined elsewhere in my code
val hashes = List.tabulate(minhash_len)(n => new Hasher(n, shingle_bins))
val minhasher = new MinHasher(hashes)

// Compute Minhashes
val minhasherUDF = udf[Seq[Int], Seq[Int]](minhasher.getMinHashes)
df = df.withColumn("minhashes", minhasherUDF('hashed_shingles))
4
  • It would help to understand the exact problem/requirement if you could provide the failing result/error along with the expected result. Commented Nov 5, 2020 at 20:18
  • You are right; I have added the UDF which achieves the same goal. Commented Nov 6, 2020 at 17:05
  • Higher-order functions like transform (or aggregate, as in this SO answer) are for transforming data of a complex type (e.g. Array) element-wise with a user-provided function. In your use case the entire Array is consumed as a whole by your custom function, so transform isn't suitable here. I would go with your UDF approach. Commented Nov 6, 2020 at 18:43
  • Thank you, good to know. Commented Nov 6, 2020 at 19:44
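
Following up on the comment above: since getHash is pure arithmetic, a UDF-free variant is at least possible by inlining each hasher's (a, b) coefficients into a SQL expression, using the built-in transform and array_min higher-order functions (Spark 2.4+). This is an untested sketch, with the same Int overflow behavior as the UDF:

import org.apache.spark.sql.functions.{array, expr}

// Untested sketch: inline each hasher's coefficients into a SQL string;
// transform hashes every shingle, array_min takes the per-hasher minimum
val p = 104729
val minhashCols = hashes.map { h =>
  expr(s"array_min(transform(hashed_shingles, x -> ((${h.a} * x + ${h.b}) % $p) % $shingle_bins))")
}
df = df.withColumn("minhashes", array(minhashCols: _*))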
