Joined: 02 Apr 2008 Posts: 11 Location: Harrisburg, PA
Short query below runs very long. I include the cols in the table and the only index info. I was advised to replace the current approach to obtaining the table row with the 'latest' timestamp with code that uses the index. How would I do that.
The table:
Code:
HP_ID CHAR (9)
HP_ADDR_TKN INTEGER
NWK_ENR_TKN INTEGER
ENR_PAR_OPT_CD CHAR (3)
ENR_PAR_OPT_EFF_DATE DATE
ENR_PAR_OPT_CNL_DATE DATE
ANW_LGN_ID CHAR(8)
ANW_TSP_GRP TIMESTAMP
Index:
Code:
NWK_ENR_TKN 1 A INTEGER 4
ENR_PAR_OPT_CD 2 A CHAR 3
ENR_PAR_OPT_EFF_DT 3 A DATE 4
ANW_TSP_GRP 4 A TIMESTMP 10
Existing query:
Code:
SELECT A.HP_ID
, A.HP_ADDR_TKN
, A.ENR_PAR_OPT_CD
FROM T401ANPO A
WHERE A.ENR_PAR_OPT_CAN_DT > CURRENT DATE
AND A.ENR_PAR_OPT_EFF_DT <> A.ENR_PAR_OPT_CAN_DT
AND A.ANW_TSP_GRP =
(SELECT MAX (B.ANW_TSP_GRP)
FROM T401ANPO B
WHERE A.HP_ID = B.HP_ID
AND A.HP_ADDR_TKN = B.HP_ADDR_TKN
AND A.ENR_PAR_OPT_CD = B.ENR_PAR_OPT_CD)
ORDER BY A.HP_ID
, A.ENR_PAR_OPT_CD;
SELECT
A.HP_ID
, A.HP_ADDR_TKN
, A.ENR_PAR_OPT_CD
, MAX (A.ANW_TSP_GRP)
FROM T401ANPO A
WHERE A.ENR_PAR_OPT_CAN_DT > CURRENT DATE
AND A.ENR_PAR_OPT_EFF_DT <> A.ENR_PAR_OPT_CAN_DT
Group By
A.HP_ID
, A.HP_ADDR_TKN
, A.ENR_PAR_OPT_CD
ORDER BY
A.HP_ID
, A.ENR_PAR_OPT_CD;
Also, Do a RUNSTAT and REORG , Check with DBA for that.
You will need additional Index on A.ENR_PAR_OPT_CAN_DT column which will be a combination of A.ENR_PAR_OPT_CAN_DT and A.ENR_PAR_OPT_EFF_DT.
Chances are 0.0001% to have TIMESTAMP duplicate and second, MAX gets either one if its duplicates and at that point it doesn't matter which one it picks.
Chances are 0.0001% to have TIMESTAMP duplicate and second, MAX gets either one if its duplicates and at that point it doesn't matter which one it picks.
...unless the initial idea was: to select ALL rows with MAX values...
The source, and the role of this particular TIMESTAMP is unknown.
MAX will always gives you one row regardless of any idea.
...but database may (or may not) include multiple rows with exactly the same value equal to this MAX.
The original query would extract all those rows if they only existed. Not sure if this was the intention of the author, but the query structure looks like an attempt to do so (in a very inefficient way).
The modified query always returns only one single row for each MAX value.
If the TS really didn't want to get multiple rows with all TIMESTAMP values equal to their local MAX, then there would be absolutely no need to create the subquery in the initial sample.
If multiple rows with equal TIMESTAMPs are really needed then I would suggest to re-organize the query itself to a more reasonable process, - before trying to optimize indexes, etc. etc. etc.
Code:
SELECT A.HP_ID
, A.HP_ADDR_TKN
, A.ENR_PAR_OPT_CD
FROM T401ANPO A ,
(SELECT HP_ID
, HP_ADDR_TKN
, ENR_PAR_OPT_CD
, MAX(ANW_TSP_GRP) AS MAX_TSP
FROM T401ANPO
WHERE ENR_PAR_OPT_CAN_DT > CURRENT DATE
AND ENR_PAR_OPT_EFF_DT <> ENR_PAR_OPT_CAN_DT ) B
WHERE A.HP_ID = B.HP_ID
AND A.HP_ADDR_TKN = B.HP_ADDR_TKN
AND A.ENR_PAR_OPT_CD = B.ENR_PAR_OPT_CD
AND A.ANW_TSP_GRP = B.MAX_TSP
ORDER BY A.HP_ID
, A.ENR_PAR_OPT_CD;
The original example would run its inner subquery many times - again, and again for every row selected by the outer query. This seems to be a sort of suicide for big tables.
The modified example here runs its subquery only once to create so to speak a temporary table B, and then the outer query performs sequential JOIN operation between tables A, and B.
This must be much more efficient approach.
In the initial sample the field under question is named
ANW_TSP_GRP
I dare to guess, this stands for "(something) Timestamp Group", isn't it?
If so, then it must be almost obvious that groups of rows with equal TIMESTAMPs are expected.
Hence, my concerns about multiple rows to be selected are getting real.
I hope the TS has any idea about the actual intentions of those who created the original query? This understanding must be always the first step before trying any mechanical/automated "optimization" of any query, or any program code at all.
Not really. Because temp B has no keys when joined against A, you are missing Group By within B for MAX. You are building B say for 1M Rows and now again you are trying to JOIN that against 1M with possibly no keys since its temp table.
Quote:
...but database may (or may not) include multiple rows with exactly the same value equal to this MAX.
That's exactly what MAX issued to get one if they are same.
Quote:
This is the difference.
The original query and modified query results will always same if only one timestamp row present then you get that as MAX and if none present then no rows comes out from both the queries.
As suggested in my first reply, add the INDEX (A.HP_ID A.HP_ADDR_TKN needs to have indexes too if they are key) and do a RUNSTAT/REORG then problem should go away.
Try this and let us know most of the times <> creates problems too.
Code:
SELECT
A.HP_ID
, A.HP_ADDR_TKN
, A.ENR_PAR_OPT_CD
, MAX (A.ANW_TSP_GRP)
FROM T401ANPO A
WHERE
NOT EXISTS (
SELECT 1 FROM T401ANPO B
B.ENR_PAR_OPT_CAN_DT <= CURRENT DATE
AND B.ENR_PAR_OPT_EFF_DT = A.ENR_PAR_OPT_CAN_DT
AND B.HP_ID = A.HP_ID
AND B.HP_ADDR_TKN = A.HP_ADDR_TKN
AND B.ENR_PAR_OPT_CD = A.ENR_PAR_OPT_CD
)
Group By
A.HP_ID
, A.HP_ADDR_TKN
, A.ENR_PAR_OPT_CD
ORDER BY
A.HP_ID
, A.ENR_PAR_OPT_CD;
Not really. Because temp B has no keys when joined against A, you are missing Group By within B for MAX. You are building B say for 1M Rows and now again you are trying to JOIN that against 1M with possibly no keys since its temp table.
I bet the approach with JOIN instead of multiple subqueries on each outer query's row should win the game, even without special indexes optimization.
Rohit Umarjikar wrote:
Quote:
...but database may (or may not) include multiple rows with exactly the same value equal to this MAX.
That's exactly what MAX issued to get one if they are same.
I'm sorry, but this is full misunderstanding of the original issue.
Rohit Umarjikar wrote:
Quote:
This is the difference.
The original query and modified query results will always same if only one timestamp row present then you get that as MAX and if none present then no rows comes out from both the queries.
Again, the long explanation has been given that possibility of multiple rows is the reason for the complex original query.
Rohit Umarjikar wrote:
As suggested in my first reply, add the INDEX (A.HP_ID A.HP_ADDR_TKN needs to have indexes too if they are key) and do a RUNSTAT/REORG then problem should go away.
Again: step #1 should be always verification of the real concept of the code, and only the final step - any mechanical optimization of the process.
I bet the approach with JOIN instead of multiple subqueries on each outer query's row should win the game, even without special indexes optimization.
That will give syntax error for not using GROUP BY at first attempt. Second additional self-joins to all the columns of temp B is needed when joined with A to even make it optimized to the imagination.
Quote:
Again, the long explanation has been given that possibility of multiple rows is the reason for the complex original query.
Original query is quite simple and straightforward and let us keep it simple here than complicating and confusing it with any possibilities which don't exists.
Quote:
Again: step #1 should be always verification of the real concept of the code, and only the final step - any mechanical optimization of the process.
No. If the code is working fine and one starts seeing the performance issues then nothing is wrong with the code but wrong with INDEX(s)/REORG/RUNSTAT. DBA should be the first resort and then the code. Without having correct indexes ,RUNSTATS/REORGs done (specially when INSERTS and DELETEs)however you mold the query it will give performance issues at some point for sure since the underlying data is incremental.
"Two added to one - if that could but be done,"
It said, "with one's fingers and thumbs!"
Recollecting with tears how, in earlier years,
It had taken no pains with its sums.
"Taking Three as the subject to reason about -
A convenient number to state -
We add Seven, and Ten, and then multiply out
By One Thousand diminished by Eight.
"The result we proceed to divide, as you see,
By Nine Hundred and Ninety Two:
Then subtract Seventeen, and the answer must be
Exactly and perfectly true.
"The method employed I would gladly explain,
While I have it so clear in my head,
If I had but the time and you had but the brain -
But much yet remains to be said.
My intention was, to give the idea of the more efficient approach to the same task, but not to present a ready-to-copy-paste code to run.
At the experts forum this should not be a problem to fix a syntax error in the suggested process. I really missed GROUP BY part, but it doesn't change anything in my post. My mistake, it should really look like this
Code:
SELECT A.HP_ID
, A.HP_ADDR_TKN
, A.ENR_PAR_OPT_CD
FROM T401ANPO A ,
(SELECT HP_ID
, HP_ADDR_TKN
, ENR_PAR_OPT_CD
, MAX(ANW_TSP_GRP) AS MAX_TSP
FROM T401ANPO
WHERE ENR_PAR_OPT_CAN_DT > CURRENT DATE
AND ENR_PAR_OPT_EFF_DT <> ENR_PAR_OPT_CAN_DT
GROUP BY HP_ID
, HP_ADDR_TKN
, ENR_PAR_OPT_CD ) B
WHERE A.HP_ID = B.HP_ID
AND A.HP_ADDR_TKN = B.HP_ADDR_TKN
AND A.ENR_PAR_OPT_CD = B.ENR_PAR_OPT_CD
AND A.ANW_TSP_GRP = B.MAX_TSP
ORDER BY A.HP_ID
, A.ENR_PAR_OPT_CD;
Again: I bet it should work much, much faster than the one from initial post.
Again: the possible problem with multiple TIMESTAMP in a group of equal values can be easily concluded from all previous detailed explanation, if not to ignore them.
In order to verify my statements about the difference in performance for two approaches to this type of queries I have chosen a similar table from my existing database, with about 600K+ rows in it; just to run the performance test.
The running of the original query (adapted to my own table names, and field names which I do not display here)
Code:
SELECT A.HP_ID
, A.HP_ADDR_TKN
, A.ENR_PAR_OPT_CD
FROM T401ANPO A
WHERE A.ENR_PAR_OPT_CAN_DT > CURRENT DATE
AND A.ENR_PAR_OPT_EFF_DT <> A.ENR_PAR_OPT_CAN_DT
AND A.ANW_TSP_GRP =
(SELECT MAX (B.ANW_TSP_GRP)
FROM T401ANPO B
WHERE A.HP_ID = B.HP_ID
AND A.HP_ADDR_TKN = B.HP_ADDR_TKN
AND A.ENR_PAR_OPT_CD = B.ENR_PAR_OPT_CD)
ORDER BY A.HP_ID
, A.ENR_PAR_OPT_CD;
- it took approx. 1:30 minutes.
The running of the updated query (also adapted to the same table names, and field names)
Code:
SELECT A.HP_ID
, A.HP_ADDR_TKN
, A.ENR_PAR_OPT_CD
FROM T401ANPO A ,
(SELECT HP_ID
, HP_ADDR_TKN
, ENR_PAR_OPT_CD
, MAX(ANW_TSP_GRP) AS MAX_TSP
FROM T401ANPO
WHERE ENR_PAR_OPT_CAN_DT > CURRENT DATE
AND ENR_PAR_OPT_EFF_DT <> ENR_PAR_OPT_CAN_DT
GROUP BY HP_ID
, HP_ADDR_TKN
, ENR_PAR_OPT_CD ) B
WHERE A.HP_ID = B.HP_ID
AND A.HP_ADDR_TKN = B.HP_ADDR_TKN
AND A.ENR_PAR_OPT_CD = B.ENR_PAR_OPT_CD
AND A.ANW_TSP_GRP = B.MAX_TSP
ORDER BY A.HP_ID
, A.ENR_PAR_OPT_CD;
- it produced the same results (including multiple rows with equal MAX(TIMESTAMP) values!), and it took only 0:55 mins (e.g. 55 seconds).
The difference in performance of two approaches is almost 2 times without any special optimization, or adjustment of settings.
I expect that for tables of 100M+ rows the difference in performance would be even bigger. I have my own experience with performance issues; and always optimization of used methods of processing gave much better results compared to only optimization of indexes, and other dilettantish methods.
I was advised to replace the current approach to obtaining the table row with the 'latest' timestamp with code that uses the index.
This indeed means that only one value of unique max timestamp is expected and in any case of duplicates it shall be an error. let TS confirms the understanding first before assuming the designed query is for the sole purpose of duplicate timestamp values.
Option1:
1. Do the REORG/RUNSTAT and rerun to see the performance change.
2. If proper index can be allowed to added based on EXPLAIN then that's much better.
OPTION2:
Code:
SELECT
A.HP_ID
, A.HP_ADDR_TKN
, A.ENR_PAR_OPT_CD
, MAX (A.ANW_TSP_GRP)
FROM T401ANPO A
WHERE A.ENR_PAR_OPT_CAN_DT > CURRENT DATE
AND A.ENR_PAR_OPT_EFF_DT <> A.ENR_PAR_OPT_CAN_DT
Group By
A.HP_ID
, A.HP_ADDR_TKN
, A.ENR_PAR_OPT_CD
ORDER BY
A.HP_ID
, A.ENR_PAR_OPT_CD;
OPTION3 :
Code:
SELECT
A.HP_ID
, A.HP_ADDR_TKN
, A.ENR_PAR_OPT_CD
, MAX (A.ANW_TSP_GRP)
FROM T401ANPO A
WHERE
NOT EXISTS (
SELECT 1 FROM T401ANPO B
B.ENR_PAR_OPT_CAN_DT <= CURRENT DATE
AND B.ENR_PAR_OPT_EFF_DT = A.ENR_PAR_OPT_CAN_DT
AND B.HP_ID = A.HP_ID
AND B.HP_ADDR_TKN = A.HP_ADDR_TKN
AND B.ENR_PAR_OPT_CD = A.ENR_PAR_OPT_CD
)
Group By
A.HP_ID
, A.HP_ADDR_TKN
, A.ENR_PAR_OPT_CD
ORDER BY
A.HP_ID
, A.ENR_PAR_OPT_CD;
Quote:
The difference in performance of two approaches is almost 2 times without any special optimization, or adjustment of settings.
Might be the case that Results from the first query is still present in cashes and second query actually picking it form there than actually going to physical Db. Two sites have different TS/TB attributes , for e.g. Bufferpool value.
TS has many options to try now and come back which one works or none works.
Option 1 - can be considered, but there is a little chance to get serious performance improvement. I'd recommended it to be done after improving the code algorithm itself, not starting with this step.
Option2 - I tried with my similar test table. It really runs about 3-4 seconds with my database, but multiple rows with equal values of detected MAX(TIMESTAMP_GROUP) are missing from output.
This type of query is definitely very fast because it has ignored the essential part of the original query: obvious attempt to extract possible duplicate values equal to each MAX(TIMESTAMP_GROUP)
Option3 - there are syntax errors:
- (obvious one) keyword WHERE is missing
- (not obvious one) need to compare
B.ENR_PAR_OPT_EFF_DT = B.ENR_PAR_OPT_CAN_DT
instead of
B.ENR_PAR_OPT_EFF_DT = A.ENR_PAR_OPT_CAN_DT
When those errors fixed it runs about 2:15 minutes, producing the same results as option2 (e.g. with missing duplicate values).
The reason of bad performance is obvious without any RUNSTAT: all intermediate data produced in the middle of this query are significantly bigger than any other example given in this topic.
SELECT A.HP_ID
, A.HP_ADDR_TKN
, A.ENR_PAR_OPT_CD
FROM T401ANPO A ,
(SELECT HP_ID
, HP_ADDR_TKN
, ENR_PAR_OPT_CD
, MAX(ANW_TSP_GRP) AS MAX_TSP
FROM T401ANPO
WHERE ENR_PAR_OPT_CAN_DT > CURRENT DATE
AND ENR_PAR_OPT_EFF_DT <> ENR_PAR_OPT_CAN_DT
GROUP BY HP_ID
, HP_ADDR_TKN
, ENR_PAR_OPT_CD ) B
WHERE A.HP_ID = B.HP_ID
AND A.HP_ADDR_TKN = B.HP_ADDR_TKN
AND A.ENR_PAR_OPT_CD = B.ENR_PAR_OPT_CD
AND A.ANW_TSP_GRP = B.MAX_TSP
ORDER BY A.HP_ID
, A.ENR_PAR_OPT_CD;
- it produced the same results (including multiple rows with equal MAX(TIMESTAMP) values!), and it took only 0:55 mins (e.g. 55 seconds).
I detected a hidden bug in this SQL code. In a very rare case it can miss a few rows. (So far I found only one such missing row out of my 600K+ test table).
Correct version should be as follows
Code:
SELECT A.HP_ID
, A.HP_ADDR_TKN
, A.ENR_PAR_OPT_CD
FROM T401ANPO A ,
(SELECT HP_ID
, HP_ADDR_TKN
, ENR_PAR_OPT_CD
, MAX(ANW_TSP_GRP) AS MAX_TSP
FROM T401ANPO
WHERE ENR_PAR_OPT_CAN_DT > CURRENT DATE
AND ENR_PAR_OPT_EFF_DT <> ENR_PAR_OPT_CAN_DT
GROUP BY HP_ID
, HP_ADDR_TKN
, ENR_PAR_OPT_CD ) B
WHERE A.HP_ID = B.HP_ID
AND A.HP_ADDR_TKN = B.HP_ADDR_TKN
AND A.ENR_PAR_OPT_CD = B.ENR_PAR_OPT_CD
AND A.ANW_TSP_GRP = B.MAX_TSP
AND A.ENR_PAR_OPT_EFF_DT <> A.ENR_PAR_OPT_CAN_DT
ORDER BY A.HP_ID
, A.ENR_PAR_OPT_CD;
You need to add condition as well to your outer SELECT, it is missing right now in the correct version.
Quote:
It really runs about 3-4 seconds with my database
That's awesome.
Quote:
Option3 - there are syntax errors:
My bad
Thanks for running them.
Quote:
The reason of bad performance is obvious without any RUNSTAT:
RUNSTAT and REORG is done once a week or twice if table has INSERTs and DELETEs everyday. So the table that you have might have both done on time and TS don't. Also the EXPLAIN in each of the query in your case may not be same in TS case. To know the exact savings TS must run at their site and come back to us with which works.. If TS care about duplicates timestamp then probably I need to twist a bit and post back but will only do if TS really cares about that.
You need to add condition as well to your outer SELECT, it is missing right now in the correct version.
This is not technical/programming question, but rather a business rule question.
The answer is: rather no than yes, - depending on the assumed business rules implemented by this query.
This extra condition is not used in the initial subquery provided by TS.
E.g. the MAX value is chosen by the inner subquery from a larger subset of rows. So, in some cases adding this extra condition to the inner subquery would change the MAX value found by it to some lesser value.
1) When the condition
ENR_PAR_OPT_CAN_DT > CURRENT DATE
is used only in the outer query, there is a chance that the MAX value returned by the inner query does not belong to the more limited subset of rows found so far by the outer query, and those rows will be excluded from the final selection.
2) When the condition
ENR_PAR_OPT_CAN_DT > CURRENT DATE
is used in both inner, and outer queries, this situation cannot happen: the same subset of rows is used to find the MAX value, and to match rows from outer query.
So the final selection when this condition is used in both queries in general may be larger, compared to the same selection in only outer query.
BTW: the same problem may happen with another extra condition,
ENR_PAR_OPT_EFF_DT <> ENR_PAR_OPT_CAN_DT
Most likely, they both must be used only in the outer query, to perform final selection.
But again: this in fact depends on the business rules to be implemented - this is a question to TS who seems to be not interested in the problem he raised?
Since you added <> For table A and also added when getting max timetamp similarly , you need > current date condition added to table A along with <>.
Moreover, if you want to mimic what TS did from the ioriginal post then you don’t need <> and > current date in table B but very much needed in table A. Without doing this step the results are going to mismatch from TS.
Joined: 10 May 2007 Posts: 2454 Location: Hampshire, UK
Why don't you just answer the OPs query and get on with life? OP has not come back so all you 2 are doing is conjecture. Do this conversation with PMs if you wish to continue answering the question about how long the dtring is when you do not even know if it is a bit of string.
OP (topic starter) get back with some feedback - that, at least, is the minimal politeness.