Well stated, @TobySpeight.
I agree with you.
We're running this query in a loop, so speed is important.
for ($cluster_hosts as $hostname) {
$res = mysql_query("SELECT *
If this were a real question I was answering,
I would immediately lean on that for loop.
It's an anti-pattern.
The rule is pretty much to avoid making
repeated queries to the DB backend if
you possibly can.
Let the backend do the looping,
rather than the app.
And then of course in a production query
the SELECT * should be SELECT a, b, c ...,
to improve maintainability, and to avoid
retrieving columns the app will just ignore.
Pruning SELECT a, b, c down to SELECT a, b
can alter the query plan, for example
when we have a
covering index
so we needn't consult disk blocks
containing the rows and can rely just
on the index blocks.
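
Here's a minimal sketch of that covering-index effect. It uses Python's built-in sqlite3 rather than MySQL, purely so it runs anywhere; the table `t`, its columns, and the index name `t_a_b` are made up for illustration. The exact plan wording varies by SQLite version, so treat the printed text as indicative only.

```python
import sqlite3


def plan_details(conn, sql):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
# Hypothetical table: columns a, b, c, with an index on (a, b) only.
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER, c TEXT)")
conn.execute("CREATE INDEX t_a_b ON t (a, b)")

# Needs column c, so the row itself must be consulted after the index lookup.
wide_plan = plan_details(conn, "SELECT a, b, c FROM t WHERE a = 1")
# Needs only a and b, both of which live in the index.
narrow_plan = plan_details(conn, "SELECT a, b FROM t WHERE a = 1")

print("SELECT a, b, c:", wide_plan)
print("SELECT a, b:   ", narrow_plan)
```

In SQLite the narrower query should report a covering index while the wider one should not. MySQL reports the same idea differently, as "Using index" in the Extra column of EXPLAIN.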
But the bigger issue is that we don't actually
care about the query plan for that SELECT *.
Suppose it's typically a 2-second query,
and we have N cluster hosts, so total elapsed
time will be about 2 × N seconds.
The appropriate thing to do is to (quickly!)
create an N-row indexed temp table to JOIN against,
or to supply a big ugly IN conjunct in
the WHERE clause.
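
Both alternatives can be sketched roughly as follows. This is an illustration under assumptions, not the code from the docs: it uses Python's sqlite3 instead of PHP and MySQL so that it is self-contained, and the `metrics` table, its columns, and the `wanted` temp table are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for whatever the original query reads.
conn.execute("CREATE TABLE metrics (hostname TEXT, load_avg REAL)")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?)",
    [("db1", 0.5), ("db2", 1.5), ("web1", 0.1), ("web2", 0.2)],
)
cluster_hosts = ["db1", "db2", "web1"]

# The anti-pattern: one query per host, with the app doing the looping.
looped = []
for hostname in cluster_hosts:
    looped += conn.execute(
        "SELECT hostname, load_avg FROM metrics WHERE hostname = ?", (hostname,)
    ).fetchall()

# Alternative 1: load the N hostnames into an indexed temp table and JOIN.
conn.execute("CREATE TEMP TABLE wanted (hostname TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO wanted VALUES (?)", [(h,) for h in cluster_hosts])
joined = conn.execute(
    "SELECT m.hostname, m.load_avg "
    "FROM metrics m JOIN wanted w ON w.hostname = m.hostname"
).fetchall()

# Alternative 2: one big IN conjunct, built from placeholders (never from
# string-concatenated values).
placeholders = ", ".join("?" for _ in cluster_hosts)
in_list = conn.execute(
    f"SELECT hostname, load_avg FROM metrics WHERE hostname IN ({placeholders})",
    cluster_hosts,
).fetchall()

print(sorted(looped) == sorted(joined) == sorted(in_list))
```

All three approaches return the same rows; the difference is that the last two hand the backend the whole list of hosts at once.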
And then we can analyze what we really care
about: Does that single giant query,
under whatever plan it gets, complete in less
than 2 × N seconds?
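
A rough way to put that on the stopwatch is sketched below. It assumes Python's sqlite3 and an in-memory database as a stand-in backend, so the absolute numbers say nothing about a real MySQL server; `build_db`, `time_looped`, `time_single`, and the `metrics` schema are hypothetical helpers invented for this sketch.

```python
import sqlite3
import time


def build_db(n_hosts, rows_per_host):
    """Create a small in-memory table with an index on hostname."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE metrics (hostname TEXT, sample INTEGER)")
    conn.executemany(
        "INSERT INTO metrics VALUES (?, ?)",
        ((f"host{h}", s) for h in range(n_hosts) for s in range(rows_per_host)),
    )
    conn.execute("CREATE INDEX metrics_host ON metrics (hostname)")
    return conn


def time_looped(conn, hosts):
    """N separate queries, one per host; returns (rows, elapsed seconds)."""
    start = time.perf_counter()
    rows = []
    for h in hosts:
        rows += conn.execute(
            "SELECT hostname, sample FROM metrics WHERE hostname = ?", (h,)
        ).fetchall()
    return rows, time.perf_counter() - start


def time_single(conn, hosts):
    """One query covering all hosts; returns (rows, elapsed seconds)."""
    start = time.perf_counter()
    marks = ", ".join("?" for _ in hosts)
    rows = conn.execute(
        f"SELECT hostname, sample FROM metrics WHERE hostname IN ({marks})", hosts
    ).fetchall()
    return rows, time.perf_counter() - start


conn = build_db(n_hosts=50, rows_per_host=20)
hosts = [f"host{h}" for h in range(50)]
looped_rows, looped_secs = time_looped(conn, hosts)
single_rows, single_secs = time_single(conn, hosts)
print(f"looped: {looped_secs:.6f}s  single: {single_secs:.6f}s")
```

Against a networked backend the per-query round trip usually dominates the looped figure, which an in-memory database cannot show, so the comparison needs to be measured on the real system.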
There are several bits of intuition behind that:
- The app loop hides information.
- Random I/O seeking takes longer than a sequential tablescan.
- Multiple rows can fit into a single disk block (in wide table row blocks and especially in narrow index blocks).
- Repeated queries might (re)fetch the very same blocks on each repetition.
Hiding information from the backend
optimizer is a big no-no.
We know that N cluster hosts will
be dealt with before we click "done"
on the stopwatch, but we didn't advise
the optimizer of that. So it can't
devise the best plan, since we're not
letting it see into the future.
It's entirely possible that the best
plan for N small queries is random reads
based on an index, but the best plan for
a single giant query is to tablescan
since we'll eventually inspect every
block anyway.
A question should disclose CREATE INDEX
details when performance is a concern, sure.
But any UNIQUE index should always be
revealed, as it affects correctness:
it tells us how to interpret the
given relation.
We always want to know about
PRIMARY KEY columns.
There are good reasons,
for example managing space on diverse
storage media,
to do an equi-join on two UNIQUE columns
of a pair of tables,
in which case we essentially have
one relation rather than two.
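
To illustrate why the UNIQUE constraints matter for interpreting such a join, here is a small sketch. It assumes Python's sqlite3, and the tables `host_hot` and `host_cold` are hypothetical, standing in for one logical relation split across two kinds of storage.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two hypothetical tables, each with a uniqueness guarantee on hostname.
conn.execute("CREATE TABLE host_hot (hostname TEXT PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE host_cold (hostname TEXT UNIQUE NOT NULL, notes TEXT)")
conn.executemany(
    "INSERT INTO host_hot VALUES (?, ?)", [("db1", "up"), ("db2", "down")]
)
conn.executemany(
    "INSERT INTO host_cold VALUES (?, ?)", [("db1", "rack 4"), ("db2", "rack 7")]
)

# Equi-join on the two UNIQUE columns: at most one output row per hostname,
# so the result reads like a single, wider relation.
rows = conn.execute(
    "SELECT h.hostname, h.status, c.notes "
    "FROM host_hot h JOIN host_cold c ON c.hostname = h.hostname "
    "ORDER BY h.hostname"
).fetchall()
print(rows)

# The constraint is what guarantees this: a second 'db1' row is rejected.
try:
    conn.execute("INSERT INTO host_cold VALUES ('db1', 'duplicate')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)
```

Without the UNIQUE declaration a reader cannot tell whether the join might fan out into several rows per host, which is why a question should reveal it.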
As I mentioned, @TobySpeight raises a good point.
But there are some subtleties to the particular
question in the docs.
If we wish to revisit the advice given by the docs,
maybe we'd rather tackle a simpler question
that uses another language,
perhaps Rust or Python?
EXPLAIN SELECT would make it a paragon. If we need to get it right, then perhaps we need a proposal in a Community answer here, so we can refine it until we have a consensus?