Given these bash variable settings:
 $ declare -g bs=$'\\' bsbs=$'\\\\' q="'";
This regular expression will correctly match a run of single-quoted (') text, where such text may contain escaped single quotes:
 "[${bs}${q}]((([^${bsbs}]?[^${bs}${q}])|(${bsbs}${bs}${q}))+)[${bs}${q}]"
 $ echo "[${bs}${q}]((([^${bsbs}]?[^${bs}${q}])|(${bsbs}${bs}${q}))+)[${bs}${q}]"
 [\']((([^\\]?[^\'])|(\\\'))+)[\']
(the backslash in "[\']" is not strictly required, but is included for clarity and in case one is trying to encode this value in a single-quoted string).
The problem lies in how best to generalize this for any escaped quoting character, and how to handle runs of multiple escape characters. ONLY if the run of input escape characters is of ODD size ((n&1)==1, counted in bytes) is the last escape ACTIVE, which makes the character after it INACTIVE (part of the string). Otherwise, when the number of escapes is EVEN ((n&1)==0), the string contains HALF that number of escapes (n>>1) and the character after the run is ACTIVE (ie. not escaped).
Also, in sed and grep / egrep this has some issues:
o The matching sub-groups occupy group numbers from '\1' upward, which raises the number of every subsequent group, even if one of those sub-groups does not match.
  Ideally, I'd like to be able to express that regexp without any subgroups that can possibly affect subsequent sub-group numbers.
o It doesn't handle the number of escapes at all, and will fail to
recognize that a quote that is preceded by an EVEN number of
escapes is not escaped.
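The odd/even rule itself is easy to state procedurally, even though the regexp above does not encode it. A minimal bash sketch, assuming the escape character is a backslash; the function name is_escaped_at is hypothetical and not part of any library:

```bash
# Hypothetical helper: succeed (status 0) if the character at 0-based index $2
# of string $1 is escaped, ie. preceded by an ODD number of backslashes.
is_escaped_at() {
    local s=$1 i=$2 n=0
    # Walk left from the character, counting the contiguous run of backslashes.
    while (( i > 0 )) && [[ ${s:i-1:1} == '\' ]]; do
        (( n++, i-- ))
    done
    (( n & 1 ))   # odd run: the last escape is active; even run: it is not
}

# Three backslashes before the double quote at index 7: escaped.
is_escaped_at 'say \\\"hi' 7 && echo "escaped"
# Two backslashes before the double quote at index 6: not escaped.
is_escaped_at 'say \\"hi' 6 || echo "not escaped"
```

The examples use a double quote as the quoting character only so that the test strings can be written inside single quotes without extra shell quoting.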
So my question is:
How best to solve these issues using only glibc-supported POSIX REs or grep / sed REs?
ie. how to allow arbitrary-length sequences of escapes of ODD (effective escape) or EVEN (ineffective escape) length to be recognized inside RegExps?
I really think POSIX REs could benefit from special syntax to handle such questions, like:
 [\\]{1,}\#&1\?$A\:$B
Where '}\#&1' means the test 'n & 1' on the number n of elements matched by the preceding [\\]{...} group, and '\?x\:y' means "if the last test is true, substitute x, otherwise y, in the RE".
Then one could easily express this and safely handle any number of escapes in RegExp-parsed strings. How to do that without some new RE syntax like this?
Very difficult, if not impossible / infeasible , with RegExp exprs alone.
Or am I wrong ?
Is there now an easy way to do arithmetic on run length of previous group in modern POSIX REs ?
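For comparison, one workaround that stays within POSIX EREs avoids counting altogether: treat a backslash plus the character after it as a single unit, so a run of escapes is consumed two characters at a time and its parity never has to be computed. A minimal sketch, assuming bash (whose [[ =~ ]] operator uses the C library's POSIX ERE matcher); the names re and first_quoted are illustrative only:

```bash
q="'"
# Opening quote, then any number of units, each either one character that is
# neither a backslash nor a quote, or a backslash followed by any one
# character; then the closing quote.  Expands to:  '(([^\\']|\\.)*)'
re="${q}(([^\\\\${q}]|\\\\.)*)${q}"

# Hypothetical helper: print the body of the first single-quoted string in $1.
first_quoted() {
    [[ $1 =~ $re ]] && printf '%s\n' "${BASH_REMATCH[1]}"
}

first_quoted "'a quot\\'d string' 42"   # one backslash: the inner quote is escaped
first_quoted "'x\\\\' y'"               # two backslashes: the quote closes the string
```

This still uses one inner group (number 2), since POSIX EREs have no non-capturing groups, but the group count is fixed and does not depend on the input.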
Example 1 :
$ declare -g bs=$'\\' bsbs=$'\\\\' q="'";
$ echo "'a quot\\'d string' 42" | sed -r 's/'"[${bs}${q}]((([^${bsbs}]?[^${bs}${q}])|(${bsbs}${bs}${q}))+)[${bs}${q}]"'[[:space:]]([0-9]+)/\1\t:\t\2/'
'a quot'd string    :   g
Example 2 :
$ echo "'a quot\\'d string' 42" | 
  sed -r 's/'"[${q}]((([^${bsbs}]?[^${q}])|(${bsbs}${q}))+)[${q}]"'[[:space:]]([0-9]+)/\1\t:\t\2/'
a quot\'d string    :   g
Note how the ${bs}-es @rowboat mentioned are removed, and still the same result, as would using only $bs, not $bsbs:
$ echo "'a quot\\'d string' 42" | sed -r 's/'"[${q}]((([^${bs}]?[^${q}])|(${bs}${q}))+)[${q}]"'[[:space:]]([0-9]+)/\1\t:\t\2/'
a quot\'d string    :   g
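The 'g' in the outputs above appears to come from group numbering and not from the number failing to match: counting opening parentheses from left to right, '\2' is the repeated alternation group, and the digits sit in group 5. A small sketch under that reading, using bash's [[ =~ ]] so that each group can be inspected; the variable names are illustrative:

```bash
q="'"
# Same shape as the expression in Example 2.  Groups, by opening parenthesis:
#   1 = whole quoted body, 2 = repeated alternation, 3 = first alternative,
#   4 = second alternative, 5 = the trailing number.
re="[${q}]((([^\\\\]?[^${q}])|(\\\\${q}))+)[${q}][[:space:]]([0-9]+)"

input="'a quot\\'d string' 42"
if [[ $input =~ $re ]]; then
    printf 'group 1: %s\n' "${BASH_REMATCH[1]}"
    printf 'group 5: %s\n' "${BASH_REMATCH[5]}"
fi
```

In the sed commands this would correspond to writing '\5' in place of '\2' in the replacement.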
Conclusion :
I am developing non-POSIX extensions to the "regex(7) - POSIX.2 regular expressions" library provided by glibc, and to PCRE, PERL, cl-ppcre (SBCL's Common Lisp RE library), and Emacs's RE library, for:
o defining a meaning for any named POSIX character class when suffixed by '-esc' or 'esc', eg. '[[:spaceesc:]]' or '[^[:space-esc:]]' or '[[:quote-esc:]]'. This means: a character that is ordinarily a member of character class 'X' is not a member of character class "${X}esc" (a synonym of "${X}-esc") IFF it is preceded by an ODD NUMBER of escape characters ('\': ASCII "\x5c").
 All character sequences that are subject to an :*esc: character
 class test will have legal '\\', '\xXX', '\0OOO', '\Uxxxxxx', or
 '\uXXXX' sequences replaced by:
 ASCII:\x5c , ASCII:\xXX (where XX are hex digits), 
 ASCII:\OOO (where OOO are Octal digits) ,
 24-bit unicode value with code point xxxxxx (x: hex digit) , and
 16-bit unicode value with code point xxxx (x: hex digit) ,
 respectively.
 Also '[[:quote:]]' and '[[:quoteesc:]]' classes must be
 supported that select characters (or non-escaped chars)
 with the Unicode 'Quotation Mark' binary attribute, and
 '[[:punct:]]' or '[[:punctesc:]]' would similarly apply
 to all (non-escaped) chars which have the Punctuation attribute.
 Perhaps similar '*cesc' or '*escc' character class suffixes
 could be provided that also support the C escapes:
  '\n','\r','\t','\v','\b','\l'... etc.
 If the /𝕦 (\U1D566) flag, or the UNICODE_NAMES flag, is specified,
 then Unicode Names can also be specified:
  \U1D566 == \U{MATHEMATICAL DOUBLE-STRUCK SMALL U}
          == \U{MATHEMATICAL_DOUBLE_STRUCK_SMALL_U}
          .
 There is no point in doing such an exercise unless UTF-8 names
 also are supported, IMHO .
 There is no point in just handling escaped spaces if full escape
 handling cannot also be enabled somehow, or does not come along with it.
 Actually, the SBCL Pure-Common-Lisp implementation is about
 the speediest and nicest to use amongst ANY RE implementation
 I have used , and already supports escaped classes & Unicode
 Names.
 The LIBC regex and glob implementations are EXTRA-ORDINARILY SLOW!
 This slows down BASH and command-line tools and all tools that
 use the POSIX RE library, such as Flex / Bison / Yacc,
 tremendously .
 Perhaps either :
 A) Techniques used in SBCL PPCRE, libppcre, and PERL RE library
    can be ported to LIBC Regex library, in a new 'Fast Regex'
    replacement that can optionally replace old implementation
    on demand ;
 B) LIBC RE library can be made to transparently replace itself
    with libppcre, or with a connection to a running SBCL instance
    with CL-PPCRE loaded, or with PERL with full PERL RE support.
    The connection would support a UNIX CMSG Message API for
    Compile, Match Against String, and Match Against FD / stream
    operations, and for Retrieval of the Match numbered N or
    named N, where N is in a set of Group Names or Numbers sent
    in advance as identifying parenthesis groups, and which can
    contain multiple dimensions (numbers in square brackets) to
    denote sub-expressions.
 Also I think that supporting a char-class LENGTH test, of the
 form:
   ']{x,}\#<test>\?<A>\:<B>' , meaning:
 "  If number of characters in character-class just closed
    satisfies test <test> , then RE fragment <A> is parsed / takes
    effect, else RE <B> takes effect.
 ", would be very useful - for <test> in:
 {=X,>X,<X,<=X,>=X,&X,|X,^X,&~X,|~X} , where X is a decimal number.
 But first, I am working on the escaped char-classes support.
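As a rough point of comparison for the escape-decoding step described for the ':*esc:' classes, bash's builtin printf already performs a similar substitution through its %b format. The sketch below only illustrates that existing behaviour; it is not the proposed extension, and the helper name decode_escapes is hypothetical:

```bash
# Hypothetical helper: expand backslash escapes in $1 the way printf's %b does.
# Bash documents \\, \xHH, \0nnn (octal), \uHHHH and \UHHHHHHHH for %b.
decode_escapes() {
    printf '%b' "$1"
}

# \x41 is hex for 'A', \0102 is octal for 'B', and \\ is a single backslash.
decode_escapes 'A=\x41 B=\0102 backslash=\\'; echo
```

The \u and \U forms are left out of the example because their output depends on the current locale's character encoding.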
I can't understand why no-one seems to understand what I was suggesting with this question.
I hope the above makes things clearer.
