Regex to extract portions of string about shipment

Question

The Problem

I'll be extracting up to 4 parts of a string that comes from user input. All 4 parts must appear in order, i.e. items 1-4 below, in that order.

The first capturing group is required and the last few are not. The named capturing groups are described below

1. KEY: one of CONF|ESD|TRACKING, which must be followed by [:;'\s]\s*. This group is required.
2. DATA: any text except the two patterns described below. Assume that multiple whitespace characters can precede it, following KEY.
3. LINE_DATA: a string in a format like "1,2,3" or "1(2),3(4),5(6)", which should allow for spaces between characters, i.e. " 1 ( 2 ) , 3(4 ), 5 6) ". This capture can only come after a match of "L[:';\s]\s*", but I don't want to capture that part. I just want to capture the "1(2),3(4)..." part (excluding any leading and trailing whitespace). LINE_DATA is optional.
4. INITIALS: the last part of the string, which comes before a \s*$. Its pattern is \*[a-zA-Z]+, i.e. "*sm", "*jdm", "*pL" should all match. Again, this group is optional and I don't want any leading/trailing whitespace.

Note: this is all case-insensitive too.

Examples with expectations

INPUT: "CONF: FEDEX 12345 L: 12(2),2(9),32 *SM"
MATCHES: [KEY => 'CONF', DATA => 'FEDEX 12345', LINE_DATA => '12(2),2(9),32', INITIALS => '*SM']

INPUT: "ESD: 12/12/92"
MATCHES: [KEY => 'ESD', DATA => '12/12/92']

INPUT: "tRacking' my data L: 1,2,3(4) "
MATCHES: [KEY => 'tRacking', DATA => 'my data', LINE_DATA => '1,2,3(4)']

My regex is below:

/^(?<KEY>CONF|ESD|TRACKING)[:;'\s]\s*(?<DATA>.*?)\s*(?:L[:;'\s]\s*\K(?<LINE_DATA>[\d\s,]+?))?\s*(?<INITIALS>\*[a-zA-Z]+)?\s*\K$/i

https://regex101.com/r/Sw8UXC/1 is an interactive playground with the regex.

What I'm looking for in review

Are there any flaws with this regex and are there foreseeable issues where it would fail with what I need?
Is there a way to simplify it further?
Any other assumptions I should be making because this comes from user-input?

I'm a little unsure about the \K's, but they made it so that I only got the 4 parts I needed (i.e. the whitespace between groups was not included in the match, regardless of whether it was captured).
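For readers unsure what \K does, here is a minimal sketch on a made-up string (the 'foo bar' input and variable names are illustrative assumptions, not part of my real data). It shows that \K only moves the start of the overall match; groups captured before it are still returned.

```php
<?php
// Without \K: element 0 holds everything the pattern consumed.
preg_match('/(foo)\s*bar/', 'foo   bar', $plain);

// With \K: the reported start of the overall match moves forward,
// but the group captured before \K is still available.
preg_match('/(foo)\s*\Kbar/', 'foo   bar', $reset);

var_export(['plain' => $plain, 'reset' => $reset]);
```

Here $plain[0] is 'foo   bar' while $reset[0] is just 'bar'; element 1 is 'foo' in both cases. So \K only matters if the full match in element 0 is actually used.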

regex101.com/r/Sw8UXC/4 Don't be fooled by \K restarting the fullstring match. If you don't have any need for the fullstring match, then don't access the [0] element in the generated array of matches. You only need to use \K if you want to "forget/release" previously matched characters -- which you have no need for here. — mickmackusa
– mickmackusa, Commented Feb 10, 2022 at 23:34
Can you please clarify (for me) if the optional state of the 2nd, 3rd, and 4th capture groups are like this: [$1, ?[$2, ?[$3, ?[$4]]] or ?$1, ?$2, ?$3, ?$4? In other words, can the 3rd capture group be satisfied without the 2nd? Can a valid/qualifying string contain only $1 and $4? I'm trying to determine if I have maintained your pattern logic with regex101.com/r/Q8BvaP/1 — mickmackusa
– mickmackusa, Commented Feb 11, 2022 at 7:52
Thanks for clarifying \K. Makes sense. I put it in because it made the result 'cleaner' on the regex101 page, but it won't make much difference in actual code. So: $3 and $4 can only exist if $2 does. We can have the following permutations: ($1 $2), ($1 $2 $3), ($1 $2 $3 $4), ($1 $2 $4). — user2402616
– user2402616, Commented Feb 11, 2022 at 15:32
I have rolled back Rev 5 → 3. Please see What to do when someone answers. — Sᴀᴍ Onᴇᴌᴀ
– Sᴀᴍ Onᴇᴌᴀ ♦, Commented Feb 11, 2022 at 17:01

mickmackusa · Accepted Answer · 2022-02-12 21:24:53Z

I've gone back to the drawing board with your pattern.

1. I personally never use named capture groups in PHP because they only bloat the pattern and the output array. If you need named keys, just assign them from the matches array.
2. When using the pattern modifier i, you don't need to list both upper and lower case letters in your character class.
3. There is no benefit to your pattern from inserting \K to restart the fullstring match, so just omit those.
4. Use non-capturing groups and the zero-or-one quantifier to make subsequent capture groups optional.
5. Instead of using .*? to lazily match the LINE data (the DATA group in your question), match non-whitespace characters delimited by one or more whitespaces. This will improve pattern performance by reducing the amount of backtracking that is necessary.
6. Instead of loosely validating a predictable LINE_DATA substring pattern with [\d\s,]+?, explicitly validate each delimited segment of that group. This improves the validation strength of your pattern.
7. Admittedly, my linebreaks, subpattern tabbing, and inline commenting are excessively long and wide, certainly violating PSR-12 guidelines. This is a sacrifice that I am making to explain in great detail how the pattern works. Few development teams are 100% comprised of regex gurus, so it is important that you aim to inform the weakest regex user who might read your script. I often include a link to a regex101.com demo with a battery of test cases in my professional projects because I want my team to be very sure about how it works and how extensively it was tested.
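To make the points about numbered groups and optional groups concrete, here is a minimal sketch. The pattern is a deliberately simplified, hypothetical one (keyword, one data word, initials) and parseNote() is an illustrative helper name, not part of the full solution. It nests non-capturing groups with ? so that later groups can only match when earlier ones did, then assigns named keys from the indexed matches.

```php
<?php
// Hypothetical, simplified pattern: a required keyword, then an optional
// data word, then optional initials that may only follow the data word.
$pattern = '~^(CONF|ESD|TRACKING)(?:\h+(\w+)(?:\h+(\*[a-z]+))?)?\h*$~i';

// Illustrative helper: maps numbered groups to named keys.
function parseNote(string $pattern, string $input): ?array
{
    if (!preg_match($pattern, $input, $m, PREG_UNMATCHED_AS_NULL)) {
        return null; // input does not fit the expected shape
    }
    return [
        'KEY'      => $m[1],
        'DATA'     => $m[2] ?? null,
        'INITIALS' => $m[3] ?? null,
    ];
}

var_export(parseNote($pattern, 'conf fedex *SM'));
var_export(parseNote($pattern, 'esd'));
```

Because the initials group is nested inside the data group, 'conf *sm' is rejected by this simplified pattern: the initials cannot appear without a data word before them.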

Working code:

$regex = <<<REGEX
~
^                                  # start of string anchor
(CONF|ESD|TRACKING)                # start capture group 1 KEY, three literal words
(?:                                # start non-capturing group 1
  \h*[:;'\h]\h*                    # require a listed punctuation or space with optional leading or trailing spaces 
  (\S+(?:\h+\S+)*?)                # start capture group 2 LINE, require one or more non-whitespace characters then lazily match zero or more repetitions of whitespace then non-whitespace substrings
  (?:                              # start non-capturing group 2
    \h*L\h*[:;'\h]\h*              # require literal L then a listed punctuation or space with optional leading or trailing spaces
    (                              # start capture group 3 LINE_DATA
      (?:\d+(?:\(\d+\))?)          # require a number optionally followed by another number in parentheses
      (?:\h*,\h*\d+(?:\(\d+\))?)*  # optionally match zero or more repetitions of the previous expression if separated by an optionally space-padded comma
    )                              # end capture group 3
  )?                               # end non-capturing group 2 and make it optional
  (?:                              # start non-capturing group 3
    \h*                            # match zero or more whitespaces
    (                              # start capture group 4 INITIALS
      \*[.a-z]+                    # match literal asterisk, then one or more dots or letters
    )                              # end capture group 4
  )?                               # end non-capturing group 3 and make it optional
)?                                 # end non-capturing group 1 and make it optional
\h*                                # allow trailing whitespaces 
$                                  # end of string anchor
~ix
REGEX;

$tests = [
    "esd  hedf L:1,2,3   *sm   ",
    "CONF: FEDEX 12345 L: 12(2),2(9),32 *SM",
    "Tracking *cool",
    "ESD: 12/12/92 L: ",
    "tRacking' my data L: 1,2,3(4) ",
    "conf something *asterisk",
    "tracking",
    "ConF''' something '' L: 6",
    "esd test 24(7)",
];

foreach ($tests as $i => $test) {
    if (preg_match($regex, $test, $m, PREG_UNMATCHED_AS_NULL)) {
        var_export([
            "test index" => $i,
            "KEY" => $m[1],
            "LINE" => $m[2] ?? null,
            "LINE_DATA" => $m[3] ?? null,
            "INITIALS" => $m[4] ?? null
        ]);
        echo "\n";
    }
}

Output:

array (
  'test index' => 0,
  'KEY' => 'esd',
  'LINE' => 'hedf',
  'LINE_DATA' => '1,2,3',
  'INITIALS' => '*sm',
)
array (
  'test index' => 1,
  'KEY' => 'CONF',
  'LINE' => 'FEDEX 12345',
  'LINE_DATA' => '12(2),2(9),32',
  'INITIALS' => '*SM',
)
array (
  'test index' => 2,
  'KEY' => 'Tracking',
  'LINE' => '*cool',
  'LINE_DATA' => NULL,
  'INITIALS' => NULL,
)
array (
  'test index' => 3,
  'KEY' => 'ESD',
  'LINE' => '12/12/92 L:',
  'LINE_DATA' => NULL,
  'INITIALS' => NULL,
)
array (
  'test index' => 4,
  'KEY' => 'tRacking',
  'LINE' => 'my data',
  'LINE_DATA' => '1,2,3(4)',
  'INITIALS' => NULL,
)
array (
  'test index' => 5,
  'KEY' => 'conf',
  'LINE' => 'something',
  'LINE_DATA' => NULL,
  'INITIALS' => '*asterisk',
)
array (
  'test index' => 6,
  'KEY' => 'tracking',
  'LINE' => NULL,
  'LINE_DATA' => NULL,
  'INITIALS' => NULL,
)
array (
  'test index' => 7,
  'KEY' => 'ConF',
  'LINE' => '\'\' something \'\'',
  'LINE_DATA' => '6',
  'INITIALS' => NULL,
)
array (
  'test index' => 8,
  'KEY' => 'esd',
  'LINE' => 'test 24(7)',
  'LINE_DATA' => NULL,
  'INITIALS' => NULL,
)

Appreciate all the effort and insights. You gave me what I was looking for. I'm keeping the named capture groups for now but may change upon brushing up on how your #4 suggestion works. I love all your other points, especially #7. Thanks for providing so many tests as well. You covered some I didn't think of on regex101.com/r/9PJkfd/8 . My actual tests in PHP are more of what you have where the assertions are for the full data. Did you have fun on this problem ? — user2402616
– user2402616, Commented Feb 11, 2022 at 20:21
I always have fun with regex challenges... I'm a regex junkie. — mickmackusa
– mickmackusa, Commented Feb 11, 2022 at 23:00

Sᴀᴍ Onᴇᴌᴀ · Accepted Answer · 2022-02-10 19:19:59Z

Are there any flaws with this regex and are there foreseeable issues where it would fail with what I need?

Should the INITIALS have a minimum or maximum number of characters, e.g. 2-3? or more? Also, while it likely doesn't happen very often, a person could include a digit in their initials - e.g. RG3 - A.K.A. Robert Griffin III
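As a rough sketch of what a bounded INITIALS subpattern could look like: the {2,4} bounds and the decision to allow digits are illustrative assumptions on my part, not requirements taken from the question.

```php
<?php
// Hypothetical variant: an asterisk followed by 2 to 4 letters or digits,
// then optional trailing horizontal whitespace at the end of the string.
$initials = '/\*[a-z\d]{2,4}\h*$/i';

$results = [];
foreach (['*sm', '*RG3', '*x', '*abcdef'] as $candidate) {
    $results[$candidate] = (bool) preg_match($initials, 'note ' . $candidate);
}
var_export($results);
```

With these bounds '*sm' and '*RG3' are accepted, while '*x' (too short) and '*abcdef' (too long) are rejected.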

Is there a way to simplify it further?

(?<INITIALS>\*[a-zA-Z]+)

Because the /i modifier is used, this can be simplified to:

 (?<INITIALS>\*[a-z]+)

This is demonstrated in this playground example.
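A quick way to convince yourself of this in PHP is sketched below; the sample strings come from the question, and the variable names are just for illustration.

```php
<?php
$inputs  = ['*sm', '*SM', '*pL'];
$verbose = '/^\*[a-zA-Z]+$/i'; // character class lists both cases
$concise = '/^\*[a-z]+$/i';    // relies on the i modifier instead

$agree = true;
foreach ($inputs as $input) {
    $agree = $agree && preg_match($verbose, $input) === preg_match($concise, $input);
}

// Without the i modifier, the lower-case-only class no longer matches '*SM'.
$withoutFlag = preg_match('/^\*[a-z]+$/', '*SM');

var_dump($agree, $withoutFlag);
```

Both patterns agree on every input while the i modifier is present; dropping the modifier is what makes the shorter class fail on upper-case initials.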

edited Feb 10, 2022 at 19:19

answered Feb 10, 2022 at 17:55

mm... I thought about adding {2,3} as a quantifier for the letters in INITIALS. I left it as + because you never know. {2,} might be better, as just 1 letter won't work unless you're Bono or Madonna haha. Excellent point about RG3. I'll have to think about that one. — user2402616
– user2402616, Commented Feb 10, 2022 at 20:06
I'm gonna alter it to [\.A-Z]+ to account for periods, as certain users may enter that. — user2402616
– user2402616, Commented Feb 10, 2022 at 20:20
Ha! The DB is showing some people entering INITIALS with 3 chars. I'm finding a certain user is entering "AB 1" or "AB1" too. — user2402616
– user2402616, Commented Feb 10, 2022 at 20:54

Reinderien · Accepted Answer · 2022-02-10 16:18:55Z

The biggest change I see as necessary here: your regex is complex enough to effectively be a "subroutine unto itself", and as such needs better whitespace for legibility. Replace your /i with /ix and add # comments so that you can write:

^
(?<KEY>CONF|ESD|TRACKING)
[:;'\s]\s*
(?<DATA>.*?)
\s*
(?:
    L[:;'\s]\s*\K
    (?<LINE_DATA>[\d\s,\(\)]+?)
)?
\s*
(?<INITIALS>\*[a-zA-Z]+)?
\s*\K
$

I've not shown inline comments above because your playground doesn't support them, but PHP should.
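A rough sketch of how that could look in PHP follows. It uses a nowdoc so that nothing in the pattern is interpolated, and the comment wording is my own. As an assumption for this sketch, the two \K tokens are left out and [a-zA-Z] is shortened to [a-z]; neither change affects the named groups.

```php
<?php
$regex = <<<'REGEX'
/
^
(?<KEY>CONF|ESD|TRACKING)        # required keyword
[:;'\s]\s*                       # separator after the keyword
(?<DATA>.*?)                     # free text, matched lazily
\s*
(?:
    L[:;'\s]\s*                  # line marker, not captured
    (?<LINE_DATA>[\d\s,()]+?)    # e.g. 1,2,3 or 1(2),3(4)
)?
\s*
(?<INITIALS>\*[a-z]+)?           # optional initials such as *sm
\s*
$
/ix
REGEX;

preg_match($regex, 'CONF: FEDEX 12345 L: 12(2),2(9),32 *SM', $m);
var_export([
    'KEY'       => $m['KEY'],
    'DATA'      => $m['DATA'],
    'LINE_DATA' => $m['LINE_DATA'],
    'INITIALS'  => $m['INITIALS'],
]);
```

With the x modifier, unescaped whitespace outside character classes is ignored and everything from # to the end of the line is treated as a comment, so the layout costs nothing at match time.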

answered Feb 10, 2022 at 16:18

Can you explain why you've chosen to snuff out the fullstring match with \K? I mean PHP is already going to be bloating the output array with both named and indexed keys. Is there a benefit to using \K in this case? — mickmackusa
– mickmackusa, Commented Feb 10, 2022 at 23:23
@mickmackusa Ask OP, not me. This is all their content, reformatted. — Reinderien
– Reinderien, Commented Feb 10, 2022 at 23:24
Ah, the copy-pasta cycle continues. One person uses a pattern. A high rep user copies it. A new reader sees two people using it and assumes it is best practice. I now see that \K is strangely used twice and item #2 in the question asks for clarification. — mickmackusa
– mickmackusa, Commented Feb 10, 2022 at 23:25
@mickmackusa You're more than welcome to submit an answer. I didn't vouch for the quality of the regex itself, only that it needed to be whitespace-untangled. — Reinderien
– Reinderien, Commented Feb 11, 2022 at 1:34
