9

I'm trying to accomplish a query that requires a calculated column using a subquery that passes the date reference via a variable. I'm not sure if I'm not "doing it right" but essentially the query never finishes and spins for minutes on end. This is my query:

select @groupdate:=date_format(order_date,'%Y-%m'), count(distinct customer_email) as num_cust,
(
  select count(distinct cev.customer_email) as num_prev
  from _pj_cust_email_view cev
  inner join _pj_cust_email_view as prev_purch on (prev_purch.order_date < @groupdate) and (cev.customer_email=prev_purch.customer_email)
  where cev.order_date > @groupdate
) as prev_cust_count
from _pj_cust_email_view
group by @groupdate;

Subquery has an inner join accomplishes the self-join that only gives me the count of people that have purchased prior to the date in @groupdate. The EXPLAIN is below:

+----+----------------------+---------------------+------+---------------+-----------+---------+---------------------------+--------+---------------------------------+
| id | select_type          | table               | type | possible_keys | key       | key_len | ref                       | rows   | Extra                           |
+----+----------------------+---------------------+------+---------------+-----------+---------+---------------------------+--------+---------------------------------+
|  1 | PRIMARY              | _pj_cust_email_view | ALL  | NULL          | NULL      | NULL    | NULL                      | 140147 | Using temporary; Using filesort |
|  2 | UNCACHEABLE SUBQUERY | cev                 | ALL  | IDX_EMAIL     | NULL      | NULL    | NULL                      | 140147 | Using where                     |
|  2 | UNCACHEABLE SUBQUERY | prev_purch          | ref  | IDX_EMAIL     | IDX_EMAIL | 768     | cart_A.cev.customer_email |      1 | Using where                     |
+----+----------------------+---------------------+------+---------------+-----------+---------+---------------------------+--------+---------------------------------+

And the structure of the table _pj_cust_email_view is as such:

'_pj_cust_email_view', 'CREATE TABLE `_pj_cust_email_view` (
  `order_date` varchar(10) CHARACTER SET utf8 DEFAULT NULL,
  `customer_email` varchar(255) CHARACTER SET utf8 DEFAULT NULL,
  KEY `IDX_EMAIL` (`customer_email`),
  KEY `IDX_ORDERDATE` (`order_date`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1'

Again, as I said earlier, I'm not really sure that this is the best way to accomplish this. Any criticism, direction is appreciated!

Update

I've made a little progress, and I'm now doing the above procedurally by iterating through all known months instead of months in the database and setting the vars ahead of time. I don't like this still. This is what I've got now:

Sets the user defined vars

set @startdate:='2010-08', @enddate:='2010-09';

Gets total distinct emails in the given range

select count(distinct customer_email) as num_cust
from _pj_cust_email_view
where order_date between @startdate and @enddate;

Gets the total count of customers who had purchased prior to the given range

select count(distinct cev.customer_email) as num_prev
  from _pj_cust_email_view cev
  inner join _pj_cust_email_view as prev_purch on (prev_purch.order_date < @startdate) and (cev.customer_email=prev_purch.customer_email)
  where cev.order_date between @startdate and @enddate;

Where @startdate is set to the start of the month and @enddate signifies the end of that month's range.

I really feel like this still can be done in one full query.

2
  • See update above for the current solution which returns what I'm looking for, but iteratively and I'd like it to be in one full query. Commented Feb 25, 2011 at 14:56
  • It would help if you provided some sample input data and some expected output. Commented Feb 27, 2011 at 5:10

2 Answers 2

8
+200

I don't think you need to use subqueries at all, nor do you need to iterate over months.

Instead, I recommend you create a table to store all months. Even if you prepopulate it with 100 years of months, it would only have 1200 rows in it, which is trivial.

CREATE TABLE Months (
    start_date DATE, 
    end_date DATE, 
    PRIMARY KEY (start_date, end_date)
);
INSERT INTO Months (start_date, end_date) 
VALUES ('2011-03-01', '2011-03-31');

Store the actual start and end dates, so you can use the DATE data type and index the two columns properly.

edit: I think I understand your requirement a bit better, and I've cleaned up this answer. The following query may be right for you:

SELECT DATE_FORMAT(m.start_date, '%Y-%m') AS month,
  COUNT(DISTINCT cev.customer_email) AS current,
  GROUP_CONCAT(DISTINCT cev.customer_email) AS current_email,
  COUNT(DISTINCT prev.customer_email) AS earlier,
  GROUP_CONCAT(DISTINCT prev.customer_email) AS earlier_email
FROM Months AS m 
LEFT OUTER JOIN _pj_cust_email_view AS cev
  ON cev.order_date BETWEEN m.start_date AND m.end_date
INNER JOIN Months AS mprev
  ON mprev.start_date <= m.start_date
LEFT OUTER JOIN _pj_cust_email_view AS prev
  ON prev.order_date BETWEEN mprev.start_date AND mprev.end_date
GROUP BY month;

If you create the following compound index in your table:

CREATE INDEX order_email on _pj_cust_email_view (order_date, customer_email);

Then the query has the best chance of being an index-only query, and will run a lot faster.

Below is the EXPLAIN optimization report from this query. Note type: index for each table.

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: m
         type: index
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 6
          ref: NULL
         rows: 4
        Extra: Using index; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: mprev
         type: index
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 6
          ref: NULL
         rows: 4
        Extra: Using where; Using index; Using join buffer
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: cev
         type: index
possible_keys: order_email
          key: order_email
      key_len: 17
          ref: NULL
         rows: 10
        Extra: Using index
*************************** 4. row ***************************
           id: 1
  select_type: SIMPLE
        table: prev
         type: index
possible_keys: order_email
          key: order_email
      key_len: 17
          ref: NULL
         rows: 10
        Extra: Using index

Here's some test data:

INSERT INTO Months (start_date, end_date) VALUES
('2011-03-01', '2011-03-31'),
('2011-02-01', '2011-02-28'),
('2011-01-01', '2011-01-31'),
('2010-12-01', '2010-12-31');

INSERT INTO _pj_cust_email_view VALUES
('ron', '2011-03-10'),
('hermione', '2011-03-15'),
('hermione', '2011-02-15'),
('hermione', '2011-01-15'),
('hermione', '2010-12-15'),
('neville', '2011-01-10'),
('harry', '2011-03-19'),
('harry', '2011-02-10'),
('molly', '2011-03-25'),
('molly', '2011-01-10');

Here's the result given that data, including the concatenated list of emails to make it easier to see.

+---------+---------+--------------------------+---------+----------------------------------+
| month   | current | current_email            | earlier | earlier_email                    |
+---------+---------+--------------------------+---------+----------------------------------+
| 2010-12 |       1 | hermione                 |       1 | hermione                         | 
| 2011-01 |       3 | neville,hermione,molly   |       3 | hermione,molly,neville           | 
| 2011-02 |       2 | hermione,harry           |       4 | harry,hermione,molly,neville     | 
| 2011-03 |       4 | molly,ron,harry,hermione |       5 | molly,ron,hermione,neville,harry | 
+---------+---------+--------------------------+---------+----------------------------------+
Sign up to request clarification or add additional context in comments.

9 Comments

I think one of the problems is gleaning the desired logic for previous orders. E.g., is it the overall count of distinct customers with order dates prior to the current start date? If that were the case as suggested by the OP, then using your test data, the previous order column for 2011-03 should be 4 (hermione, neville, harry, and molly all have an order prior to 2011-03-01) and for 2011-02 it should be 2 (hermione,neville).
@Thomas: I agree, it's unclear from the OP's question what the desired behavior is.
Sorry for the absence - been a tough week. The intent is exactly as thomas has described here, that it is a count of all distincts in all previous ranges to the current range, inclusive of the current distinct email. I'm trying to find a way to better describe it, and I can ammend the question above once I try the other suggestions here if further clarification is needed.
IMO, this is returns the correct answer given the clarification of the logic by the OP.
@philwinkle - Specifically what Bill is showing you is the list of all previous orders for anyone with an order in the current range (as opposed to all orders previous to the current range regardless of whether the given email has an order in the current range).
|
0

Although Bill has a nice query using multiple tables, this one does it with the SQL variables too, so no extra table. The inner query joins to your _pj_cust_email_view table and does a limit 10 to signify only going back 10 months from the current month. So, no hard-coding of dates, it is computed on the fly... if you want more or less months, just change the LIMIT clause.

By setting the @dt := as the LAST field in the internal query, only THEN does the date basis get assigned for the next record cycle to create your qualifying dates...

select justDates.FirstOfMonth,
       count( distinct EMCurr.customer_Email ) UniqThisMonth,
       count( distinct EMLast.customer_Email ) RepeatCustomers
   from 
      ( SELECT 
                 @dt FirstOfMonth,
                 last_day( @dt ) EndOfMonth,
                 @dt:= date_sub(@dt, interval 1 month) nextCycle
            FROM 
                 (select @dt := date_sub( current_date(), interval dayofmonth( current_date())-1 day )) vars,
                _pj_cust_email_view limit 10 
                ) JustDates
        join _pj_cust_email_view EMCurr
           on EMCurr.order_Date between JustDates.FirstOfMonth and JustDates.EndOfMonth
        left join _pj_cust_email_view EMLast
           on EMLast.order_Date < JustDates.FirstOfMonth
           and EMCurr.customer_Email = EMLast.customer_Email
    group by 
       1

4 Comments

I really like what you've done here. I'm going to give it a whirl and I'll let you know the results very soon.
@philwinkle, the only thing I didn't do was the date_format() call, but that should simply wrap around the FirstOfMonth date column.
Try as I might, man, I can't get this query to work. I've been hacking at it for a few hours. It seems like there are a few syntax issues and you're treating my view table as a column on the FROM subquery and limiting 10 to alias as JustDates.
Actually, I created a table in MY MySQL of the column structures your 'CREATE TABLE _pj_cust_email_view ( ... ). I then populated it with about 20 records and ran the query. So, I'm not sure what is missing. I was under the impression the _pj_cust_email_view WAS ACTUALLY a table, and yes, joined to that table, but you can actually use ANY table for that internal query to get 10 records. I'll be out some this evening, but will check in later.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.