\$\begingroup\$

I'm wondering how, or whether, I can improve the regex I'm using in a query. I have a set of identifiers for certain user groups. They come in two main formats:

  • X123 or XY12, and
  • Any two-letter combination

A user can belong to multiple groups, in which case different groups are separated by a pound symbol (#). Here's an example:

groups     user     age
X124       john     23
XY22#AB    mike     33
AB         peter    21
X122#XY01  francis  43

I want to count the rows in which at least one group in the second format appears, i.e. where a user is not exclusively a member of groups in the first format. I'm currently doing it like this:

select 
  count(*)
from 
  users
where
  groups not rlike '^(X[Y1-9][0-9]{2,2})(#X[Y1-9][0-9]{2,2})*$'

Is this bad performance-wise? And how should I approach fixing it?

\$\endgroup\$

1 Answer

\$\begingroup\$

Looks good to me.

Your specification, "at least one group in second format appears", is stated in the positive, yet your not rlike is stated in the negative. You might consider phrasing it as where groups rlike '(^|#)[A-Z][A-Z](#|$)'.
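To see how the two phrasings behave, here is a minimal sketch that runs both patterns over the sample rows from the question. It uses Python's re module as a stand-in for the database's rlike, so it only illustrates the matching logic; the variable names are made up for the example.

```python
import re

# Sample rows from the question: (groups, user)
rows = [
    ("X124", "john"),
    ("XY22#AB", "mike"),
    ("AB", "peter"),
    ("X122#XY01", "francis"),
]

# Negative form from the question: the whole value is made of format-1 groups only.
only_format1 = re.compile(r"^(X[Y1-9][0-9]{2})(#X[Y1-9][0-9]{2})*$")

# Positive form: some '#'-delimited group is exactly two uppercase letters.
has_format2 = re.compile(r"(^|#)[A-Z][A-Z](#|$)")

# Users selected by "not rlike <only_format1>"
negative = [user for groups, user in rows if not only_format1.search(groups)]

# Users selected by "rlike <has_format2>"
positive = [user for groups, user in rows if has_format2.search(groups)]

print(negative)
print(positive)
```

Both predicates pick the same rows on this sample. They are only interchangeable if every group is in one of the two formats: a malformed value such as X12 is counted by the not rlike version but not by the positive one.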

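You asked about performance. Either query will have to read every row, since a regex predicate generally cannot be answered from an ordinary index, and reading the rows costs more than the CPU cycles spent evaluating the regexes. You might consider adding derived boolean column(s) that indicate whether "format1" and "format2" groups are present. Then you could index on such a column and filter on it instead of running the regex at query time.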
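The derived-column idea could look roughly like the sketch below. It uses an in-memory SQLite database from Python purely as a stand-in, since the question does not say which database is in use. The column name has_format2 and the index name are hypothetical, and the flag is computed in application code when the row is written.

```python
import re
import sqlite3

# Positive "format 2" pattern: a '#'-delimited group of exactly two uppercase letters.
FORMAT2 = re.compile(r"(^|#)[A-Z][A-Z](#|$)")

rows = [
    ("X124", "john", 23),
    ("XY22#AB", "mike", 33),
    ("AB", "peter", 21),
    ("X122#XY01", "francis", 43),
]

conn = sqlite3.connect(":memory:")

# has_format2 is a hypothetical derived column holding 0 or 1.
conn.execute(
    'CREATE TABLE users ("groups" TEXT, "user" TEXT, age INTEGER, has_format2 INTEGER)'
)

# Compute the flag once, at write time, instead of on every query.
conn.executemany(
    'INSERT INTO users ("groups", "user", age, has_format2) VALUES (?, ?, ?, ?)',
    [(g, u, a, int(bool(FORMAT2.search(g)))) for g, u, a in rows],
)

# Index the flag so the count can be answered without evaluating a regex per row.
conn.execute("CREATE INDEX idx_users_has_format2 ON users (has_format2)")

(count,) = conn.execute(
    "SELECT COUNT(*) FROM users WHERE has_format2 = 1"
).fetchone()
print(count)
```

Whether the index pays off depends on the database and on how selective the flag is. The trade-off is that the flag has to be kept in sync whenever groups changes.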

\$\endgroup\$
