Regex in Hive QL (RLIKE)

Question

I'm wondering how/if I can improve the regex I'm using in a query. I have a set of identifiers for certain user groups. They can be in two main formats:

X123 or XY12, and
Any two-letter combo

A user can belong to multiple groups, in which case different groups are separated by a pound symbol (#). Here's an example:

groups     user     age
X124       john     23
XY22#AB    mike     33
AB         peter    21
X122#XY01  francis  43

I want to count rows in which at least one group in the second format appears, i.e. where a user is not exclusively a member of groups in the first format. I'm currently doing it like this:

select 
  count(*)
from 
  users
where
  groups not rlike '^(X[Y1-9][0-9]{2,2})(#X[Y1-9][0-9]{2,2})*$'

Is this bad performance-wise? And how should I approach fixing it?

J_H · Accepted Answer · 2017-08-12 20:28:59Z

Looks good to me.

Your specification, "at least one group in second format appears", is stated in the positive, yet your not rlike is stated in the negative. You might consider phrasing it as where groups rlike '(^|#)[A-Z][A-Z](#|$)'.
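As a sketch of that suggestion, here is the original query with the predicate stated positively. It assumes group codes use uppercase ASCII letters, as in the sample data. For the four sample rows both versions count 2 (mike and peter). The two predicates are not equivalent for malformed input, though: a value such as X1 fits neither format, so the not rlike version counts it and this one does not.

```sql
-- Count users that belong to at least one two-letter (second-format) group.
-- A two-letter group is delimited by the start of the string or '#' on the
-- left, and by '#' or the end of the string on the right.
select
  count(*)
from
  users
where
  groups rlike '(^|#)[A-Z][A-Z](#|$)'
```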

You asked about performance. Either query will certainly have to read all rows, and that scan costs more than the CPU cycles spent evaluating the regexes. You might consider adding derived boolean column(s) that indicate whether "format1" and "format2" groups are present. Then you could index on such a column.
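A minimal sketch of the derived-column idea, assuming it is acceptable to materialize a copy of the table. The table name users_flagged and the column names has_format1 and has_format2 are hypothetical, chosen only for illustration. Indexing is left out because its availability depends on the Hive version in use.

```sql
-- Materialize one boolean flag per format (hypothetical table and column names).
create table users_flagged as
select
  u.*,
  u.groups rlike '(^|#)X[Y1-9][0-9]{2}(#|$)' as has_format1,
  u.groups rlike '(^|#)[A-Z][A-Z](#|$)'      as has_format2
from
  users u;

-- The count then filters on the precomputed flag instead of re-running the regex.
select
  count(*)
from
  users_flagged
where
  has_format2;
```

The regexes are evaluated once when the table is built, so the flags have to be refreshed whenever users changes.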

edited Aug 12, 2017 at 20:28

answered Aug 12, 2017 at 5:23
