21
$\begingroup$

The typical syntax for accessing an array (or list, map, and similar data structures) at a specific index is a[i]. I believe C first introduced it as syntactic sugar, though I wonder why it has stuck around when an alternative syntax could be a mere infix operator (something like a @ i), freeing up one precious pair of brackets for other uses, such as generics (angle brackets being notoriously awkward to parse).

This would intuitively make sense to me, as typically mixfix operators are rarely encountered otherwise in programming languages (exceptions being the function call and ternary operator). Yet I don't know a single language that would permit something along the lines of the following (Java-y):

// cities: Map<String, List<Inhabitant>>
cities @ cityName @ inhabitantIdx = new Inhabitant(...);

The only potential problem I can see is grouping (e.g., for accessing a member: (a @ i).field), though languages like Java already have a similar problem with the binary AND and OR operators (i.e. if ((field & mask) != 0) ...), and people get along with that just fine. What might then be the reason that programming languages across the spectrum have consistently stuck to this one single syntax for accessing arrays (AFAIK), when they show wild exploration in so many other areas?

$\endgroup$
5
  • 1
    $\begingroup$ Relevant. $\endgroup$ Commented Jun 10, 2024 at 14:45
  • 1
    $\begingroup$ Civet also permits e.g. arr.2 for accessing the third element and arr.-1 for accessing the last element, so you only need the bracket syntax for dynamically computed indices. $\endgroup$ Commented Jun 10, 2024 at 15:06
  • 19
    $\begingroup$ Infix is always a little awkward for non-associative operations like this. How should x@y@z be parsed? Both (x@y)@z (two-dimensional array) and x@(y@z) (double indirection) are useful, and it's not intuitive to me which one should be preferred. They might even both be semantically valid if you have weak typing and associative arrays. On the other hand, mix-fix forces you to write either x[y][z] or x[y[z]] and there is no ambiguity. $\endgroup$ Commented Jun 10, 2024 at 16:07
  • 5
    $\begingroup$ There's also a strong argument for making the syntax mirror that for function calls, since mathematically an array is just a function whose domain is the set of integers {0, ... n-1} (assuming zero-based). And function call syntax is ingrained from math, but I suppose it would likewise be interesting to ask what alternatives have been considered there. $\endgroup$ Commented Jun 10, 2024 at 16:11
  • 5
    $\begingroup$ On the other hand, the subscript notation from math might argue for _ as a name for this infix operator. Though in math, typography helps distinguish between (a_i)_j and a_(i_j), as the latter would have the j smaller and placed vertically lower. $\endgroup$ Commented Jun 10, 2024 at 16:13

6 Answers

44
$\begingroup$

The answer to "why" questions is often partly historical - unless there is a good reason to be different, following existing conventions helps limit a new language's "strangeness budget", and makes it more likely to succeed.

Using brackets for array dimensions goes back long past C, at least as far as the hugely influential FORTRAN and ALGOL, both designed in the late 1950s.

In these early languages, arrays were not considered a separate data type, and so accessing elements was not conceived as an operation with an array on one side, and a number on the other. Instead, both languages defined "subscripted variables", which could have multiple dimensions; the brackets contained the comma-separated list of "subscripts" into this multi-dimensional structure.

The 1958 FORTRAN manual describes it thus:

A variable can be made to represent any member of a 1, 2, or 3-dimensional Subscripted array of quantities by appending to it 1, 2, or 3 subscripts; the variable is then a subscripted variable. The subscripts are fixed point quantities whose values determine which member of the array is being referred to.

It gives the examples A(I), K(3), and BETA(5*J-2, K+2, L).

The preliminary report of the ALGOL committee from the same year, says:

  1. Subscripted Variables V designate quantities which are components of multidimensional arrays.
    Form: V ~ I [ C ]
    where C ~ E, E, ..., E is a list of arithmetic expressions as defined below. Each expression E occupies one subscript position of the subscripted variable, and is called a subscript. The complete list of subscripts is enclosed in the subscript brackets [ ].
    The array component referred to by a subscripted variable is specified by the actual numerical value of its subscripts (cf. arithmetic expressions).

In other words, the mathematical notation $x_{a,b,c}$ translated to FORTRAN X(A,B,C) and ALGOL x[a,b,c]

Note also how this fits with using parentheses for function application - in both cases, you have an identifier offset from a list of arguments or subscripts.

It is only in later languages that multi-dimensional arrays are treated as "arrays of arrays" (or, in C's case, as a special case of pointer arithmetic), so that the list of subscripts is replaced by a sequence of operators, x[a][b][c]. By then, "array subscripts appear in brackets" was a well-established convention.
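The difference between the two models can be sketched in Python. This is only an illustration, not how FORTRAN or ALGOL were implemented: the class name `Subscripted` and its row-major layout are assumptions made for the example. The first form is a single operation that receives the whole list of subscripts; the second is three separate indexing operations applied one after another.

```python
class Subscripted:
    """A flat, row-major block addressed by a list of subscripts (hypothetical)."""

    def __init__(self, *dims):
        self.dims = dims
        size = 1
        for d in dims:
            size *= d
        self.cells = [0] * size

    def _offset(self, subscripts):
        # x[i] passes a bare value, x[i, j] passes a tuple; normalise to a tuple.
        if not isinstance(subscripts, tuple):
            subscripts = (subscripts,)
        if len(subscripts) != len(self.dims):
            raise IndexError("wrong number of subscripts")
        offset = 0
        for s, d in zip(subscripts, self.dims):
            if not 0 <= s < d:
                raise IndexError("subscript out of range")
            offset = offset * d + s
        return offset

    def __getitem__(self, subscripts):
        return self.cells[self._offset(subscripts)]

    def __setitem__(self, subscripts, value):
        self.cells[self._offset(subscripts)] = value


# "Subscripted variable": one operation, three subscripts.
x = Subscripted(2, 3, 4)
x[1, 2, 3] = 42

# "Array of arrays": three chained indexing operations.
nested = [[[0] * 4 for _ in range(3)] for _ in range(2)]
nested[1][2][3] = 42
```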

$\endgroup$
5
  • 1
    $\begingroup$ COBOL also uses () syntax. And MacLisp used (arrayname a b c ...). So using function-like syntax has a long tradition. $\endgroup$ Commented Jun 10, 2024 at 17:04
  • $\begingroup$ Good answer, I think it misses the simplest explanation though that arrays are matrices and [] looks like matrix notation $\endgroup$ Commented Jun 11, 2024 at 13:09
  • 2
    $\begingroup$ @ScottishTapWater As far as I know, $[1, 2, 3]$ defines the content of a matrix, but if $x$ was a matrix, $x[1, 2, 3]$ would represent some kind of product, not picking an element from a 3-dimensional matrix, as FORTRAN and ALGOL had it. The use of the term "subscript" in those early specs makes clear that the typeset notation $x_{1,2,3}$ is what they had in mind; the parentheses or brackets appear to just be a way to group the list of subscripts (similarly, the TeX markup I've used here is x_{1,2,3}, because x_1,2,3 would only subscript the 1: $x_1,2,3$). $\endgroup$ Commented Jun 11, 2024 at 13:34
  • 1
    $\begingroup$ @ScottishTapWater If you're aware of any examples of the x[i] syntax being used as an alternative to $x_i$ for a subscript of a vector or a matrix in mathematical writing (or other contexts that aren't programming languages) prior to 1958, please let me know - like you, I suspect that this notation existed in mathematics before Algol 58 used it, but I didn't find any examples. $\endgroup$ Commented Jun 11, 2024 at 14:00
  • 1
    $\begingroup$ @IMSoP, you're correct there, I more just meant the square brackets look familiar so it seems a rational choice $\endgroup$ Commented Jun 11, 2024 at 15:06
32
$\begingroup$

There are a couple of arguments in favour of the x[y] syntax for arrays, over an infix operator.

One is that we want array access to bind the left operand more tightly than the right operand. For example, in x + y[z] we want the access to have higher precedence than +, but in x[y + z] we want the opposite. An infix operator normally has the same precedence on both sides.

Another is that infix operators need to decide on associativity. Neither x[y][z] nor x[y[z]] is clearly the better choice in general for the meaning of x @ y @ z. Probably multidimensional array access x[y][z] is more commonly used, but left-associativity would mean the operator binds slightly looser on the left, contradicting the above. (That is, if @ is left-associative then x @ y binds where y is the right operand, instead of y @ z where y is the left operand.)

Both of these reasons can be summarised by saying that we would often need to use parentheses anyway to disambiguate usages of an infix array access operator; so we might as well make the parentheses part of the syntax, and then we don't need the additional symbol for an infix operator.
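Python happens to have a left-associative infix @ operator (normally matrix multiplication) with the same precedence as *, so both effects can be tried out directly. The wrapper class `At` below is a hypothetical helper written for this sketch, not an existing library type; it simply makes `a @ i` mean `a[i]`.

```python
class At:
    """Wraps a list or dict so that `a @ i` means `a[i]` (illustration only)."""

    def __init__(self, data):
        self.data = data

    def __matmul__(self, index):
        item = self.data[index]
        # Re-wrap containers so that chained access keeps working.
        return At(item) if isinstance(item, (list, dict)) else item


grid = At([[10, 20, 30], [40, 50, 60]])
row = At([10, 20, 30])
perm = At([2, 0, 1])
i, j = 1, 2

# Associativity: @ groups to the left, so this is (grid @ i) @ j, i.e. grid[i][j].
two_dim = grid @ i @ j            # 60

# Double indirection, row[perm[0]], needs explicit parentheses.
indirect = row @ (perm @ 0)       # 30

# Left operand: @ binds tighter than +, which matches 100 + row[i].
left = 100 + row @ i              # 120

# Right operand: the same precedence now works against us.
right_default = row @ i + 1       # (row @ i) + 1 == 21, not row[i + 1]
right_grouped = row @ (i + 1)     # 30, the parentheses are back
```

With brackets, row[i + 1] and row[perm[0]] need no extra grouping, which is the point made above: the delimiters end up in the code either way.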


Another thing to consider is that in some languages, the "index" in an array access is not necessarily an expression. For example, many languages allow access to multidimensional arrays using commas like x[y, z] even when y, z is not a valid expression in those languages. There are also array slices like Python's x[y:z] or x[::-1], where y:z and ::-1 are not expressions.

In such languages, array access cannot be parsed the same way as a binary infix operator which expects expressions for both operands. However, languages which want to use an infix operator could get around this by allowing tuples and slices to be expressions.
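Python illustrates both halves of this. A minimal sketch, where `Probe` and `is_expression` are hypothetical helpers written for the example: the slice notation is only accepted inside the brackets, while the built-in slice() object is the "slices as expressions" workaround. Note that in Python specifically, `1, 2` is already a tuple expression, so the comma case applies to other languages.

```python
import ast


class Probe:
    """Returns whatever key the subscript syntax hands to __getitem__."""

    def __getitem__(self, key):
        return key


def is_expression(source):
    """True if `source` parses as a standalone Python expression."""
    try:
        ast.parse(source, mode="eval")
        return True
    except SyntaxError:
        return False


x = Probe()

key_slice = x[1:3]       # slice(1, 3, None)
key_reverse = x[::-1]    # slice(None, None, -1)
key_pair = x[1, 2]       # (1, 2)

# The slice notation is not an expression outside the brackets...
slice_alone = is_expression("1:3")          # False
reverse_alone = is_expression("::-1")       # False

# ...but an explicit slice object is, and it produces the same key.
explicit = is_expression("slice(1, 3)")     # True
same_key = x[slice(1, 3)] == x[1:3]         # True
```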

$\endgroup$
10
  • 1
    $\begingroup$ The first point I completely hadn't thought of (you'd need to have different precedences for the left and right of the operator...). Though I do personally believe multi-dimensional array access to be more common, so I'd intuitively say to interpret it as (x @ y) @ z. Do you believe it'd nonetheless be worth it trying this syntax in the context of a hobby language, at the risk of making everything unreadable? $\endgroup$ Commented Jun 10, 2024 at 17:17
  • $\begingroup$ @linux_user36 Multidimensional array access is probably more common overall than indirection in languages which support it, but note that this is contrary to the general rule of binding the left operand more tightly - left-associativity (i.e. where x @ y @ z means (x @ y) @ z) implies a slightly lower precedence on the left than on the right, since the y gets bound on the right instead of on the left. $\endgroup$ Commented Jun 10, 2024 at 17:28
  • 8
    $\begingroup$ It's also worth considering that multidimensional array access might be written like x[y, z] instead. A language might not implement multidimensional arrays as "arrays of arrays" (this is generally less performant due to more pointer indirection), in which case the syntax x[y][z] might be semantically incorrect in most usages anyway. As for what you should do in a hobby language, the good thing about hobby languages is that you can try something out and if it's bad, nobody except you will be annoyed, and you can change it later without annoying anyone else too. $\endgroup$ Commented Jun 10, 2024 at 17:31
  • 4
    $\begingroup$ A thousand times this. I have spent the last decade of my life writing C and C++, and I still add unnecessary parentheses to logical conditions. Remembering precedence and associativity is extra mental load and unnecessary chances to make a mistake. $\endgroup$ Commented Jun 11, 2024 at 13:33
  • $\begingroup$ Whether array of arrays style is less performant depends on what "array of arrays" actually means in the language. In C, for example, array of arrays is generally more performant than the primary alternative of arrays of pointers (to arrays). The former is roughly the same as a "true" multi-dimensional array would be. Java, on the other hand, doesn't even have a standard analogue of the former. Its arrays of arrays necessarily require traversing an extra reference for each additional dimension. $\endgroup$ Commented Jun 11, 2024 at 15:24
23
$\begingroup$

it is, sometimes

In Haskell, list indexing is the binary operator !! (and indexing a Data.Array array is the binary operator !).

In APL, one of the ways of doing array access is the dyadic function ⌷.

$\endgroup$
3
  • 4
    $\begingroup$ And in C, a[b] is formally equivalent to *((a) + (b)), which is infix with respect to the +. This also means that a[b] has a property typical of infix operations, but very unusual for array-access expressions in general: that the operands are interchangeable. In C, (a)[b] is wholly equivalent to (b)[a]. $\endgroup$ Commented Jun 12, 2024 at 13:28
  • $\begingroup$ Note that pointer indexing operator + does not convert both operands to a common type (and indeed cannot be used to add two pointers to each other), so there's no inherent reason why the operands should be interchangeable. Further, when the minus token is used as an infix operator with a pointer type as its left operand, it may represent either of two very different operations depending upon the type of the right hand operand. $\endgroup$ Commented Jun 12, 2024 at 21:31
  • $\begingroup$ @JohnBollinger: In turn C inherited this from BCPL, where E1 + E2 is rvalue/integer addition, while E1 ! E2 or E1*[E2] (depending on version) are lvalue/pointer addition. (NB: E1[E2, E3, … En] is a function call.) The manuals say “The representations of Vectors, Lvalues, and integers is such that the following relations are true: / E1*[E2] = rv (E1+E2) / lv E1 * [ E2 ] = E1 + E2” and “Vector subscription (E1!E2 [sic] is implemented using PLUS and RV”. $\endgroup$ Commented Jun 14, 2024 at 0:07
3
$\begingroup$

A single-dimensional array access is syntactically very similar to a single-argument function call; indeed, in some languages the syntax is identical. A function call doesn't use any explicit operator, merely the juxtaposition of two expressions, the second of which must be a parenthesized primary expression. I suspect the use of square brackets in C was motivated by their use in earlier languages, but functionally a pair of square brackets behaves like a pair of parentheses preceded by an "add and dereference" infix operator.
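The "add and dereference" reading can be modelled in a few lines of Python. This is a toy sketch under loud assumptions: `memory`, `deref` and `subscript` are made-up names, addresses are plain list positions, and real C pointer arithmetic also scales by the element size, which is ignored here.

```python
# A toy "memory": addresses are plain integer positions in this list.
memory = [0] * 16


def deref(address):
    """Model of C's unary * applied to an address."""
    return memory[address]


def subscript(a, b):
    """Model of a[b] read as *((a) + (b)): add, then dereference."""
    return deref(a + b)


base = 4                              # pretend an array starts at address 4
memory[base:base + 3] = [10, 20, 30]

third = subscript(base, 2)            # 30
swapped = subscript(2, base)          # 30 as well, because + commutes
```

Because the only operator involved is +, the two operands can be swapped without changing the result, which is not something one expects of array access in general.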

In FORTRAN, the requirement that an array subscript appear in parentheses was likely motivated by a couple of considerations:

  1. There needed to be a non-alphanumeric character between the array name and the index, because FOO BAR would have been syntactically equivalent to FOOB AR, FO OBAR, FOOBAR, or any other combination of those letters (in order) and intervening spaces.

  2. The range of available non-alphanumeric characters was extremely limited: ., ,, *, /, +, -, $, ', =, (, ). I'm not sure what dollar sign was used for, and it might have been possible to have rules for the use of ' that would allow it to be used as an index in some contexts and a quoted string literal in others, but all of the other symbols were needed for other purposes.

Once FORTRAN established the convention of indexing by juxtaposition of an identifier and a parentheses-enclosed primary expression without using an intervening operator other than the opening delimiter, other languages likewise dispensed with the use of an intervening operator.

$\endgroup$
7
  • $\begingroup$ "The range of available alphanumeric characters" – is this missing a "non-"? $\endgroup$ Commented Jun 12, 2024 at 23:12
  • $\begingroup$ @PaŭloEbermann: Thanks. Corrected. I was also just thinking that if one views the parentheses in 1/(x+y) as indicating what needs to be written lower (below the bar), but that they would be omitted in contexts where one could actually print things lower, then a(i) as equivalent to putting i in a subscript after a, with no other operator, would make sense, though I have no idea if that figures into the design intention of FORTRAN. $\endgroup$ Commented Jun 13, 2024 at 0:07
  • $\begingroup$ I agree with the reasoning behind FORTRAN's choice but there was no ' in character sets used with the IBM 704: bitsavers.org/pdf/ibm/704/24-6661-2_704_Manual_1955.pdf#page=35 . It's probably why LISP 1.5 had no abbreviation for QUOTE. $\endgroup$ Commented Jun 13, 2024 at 0:41
  • $\begingroup$ That character set has multiple representations for zero, and a few other characters that aren't used in FORTRAN. I don't know the history of character sets, but I'm pretty sure FORTRAN needed an apostrophe for string literals. $\endgroup$ Commented Jun 13, 2024 at 3:51
  • $\begingroup$ I'm not convinced by this interpretation. There's no "juxtaposition of two expressions" going on in early languages, there is a specific syntax with an identifier, some brackets, and a list of expressions. The brackets aren't there to group anything, they are the syntax for marking a function call or subscripted variable. Interpreting as an operator is a post hoc rationalisation once you start treating "array" as a first-class type, and allowing any expression to have a subscript after it. $\endgroup$ Commented Jun 13, 2024 at 11:35
2
$\begingroup$

Another consideration not mentioned so far is that no readily available symbols had the right mnemonic value at the time these subscript syntaxes were devised.

@ (U+0040 at sign) typically meant “each at” or “at a rate of”, that is, a kind of multiplication. You can still find this usage on receipts: 2 apples @ \$1.35 is 2 apples × \$1.35/apple, or \$2.70. The “at a location” sense has become more common because it was repurposed for email addresses, and later Twitter-style username tags.

Likewise, # (U+0023 number sign) was often “pound”, which is still used in contexts like “100# paper” (100 pounds per ream) or the “pound key” on a telephone. If you want “sheep number n”, nowadays sheep # n is a fine choice, but ca. 1960 you would’ve been better off with № (U+2116 numero sign) as your glyph for a “number sign”. (In fact you still might: #X can be highlighted as a hashtag where №X gets left alone.)

Because of reliance on mnemonic value, it may also be unclear which operand order to pick:

  • array @ index — in this array, the value at this index
  • name @ host — the user by this name, at this host

Whereas an analogy with an existing notation, such as a mathematical subscript, reduces the chance for confusion.

$\endgroup$
0
$\begingroup$

K has both infix functional indexing and bracket indexing:

a: (1 2; (3 4 5; 6 7))  // nested list
a@0  // 1 2
a[0] // 1 2
a@1  // (3 4 5; 6 7)
a[1] // (3 4 5; 6 7)

Infix ops have equal precedence but evaluation is right-to-left, so using @ on a nested structure often requires parens:

(a@0)@1  // 2, equivalent to a[0][1]

But that would be tedious, so K also has a . operator for "deep indexing" with a right array argument:

a.1 0  // 3 4 5, equivalent to a[1][0]
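
For readers who don't know K, a rough Python analogue of deep indexing is a fold of the ordinary index operation over a path. `deep_index` is a hypothetical helper written for this sketch, not a claim about how K implements its . operator.

```python
from functools import reduce
from operator import getitem

# The same nested list as in the K example above.
a = [[1, 2], [[3, 4, 5], [6, 7]]]


def deep_index(data, path):
    """Apply data[p0][p1]... for each step of path, left to right."""
    return reduce(getitem, path, data)


first = deep_index(a, [1, 0])     # [3, 4, 5], same as a[1][0]
second = deep_index(a, [0, 1])    # 2, same as a[0][1]
```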
$\endgroup$
