IBM Mainframe Forum Index
 
Log In
 
IBM Mainframe Forum Index Mainframe: Search IBM Mainframe Forum: FAQ Register
 

Optimization Required for ICETOOL duplicate removal


IBM Mainframe Forums -> DFSORT/ICETOOL
Post new topic   Reply to topic
View previous topic :: View next topic  
Author Message
VivekKhanna

New User


Joined: 09 Feb 2009
Posts: 57
Location: India

PostPosted: Tue May 08, 2012 11:09 pm
Reply with quote

Hi Team,

I am performing SORT and Duplicate Removal (using ICETOOL) on a tape file holding 900 million records. Consider an example of file with key information as:

Code:

FIELD             START_POS              LENGTH
EMP_ID              1                       10
EMP_DEPT            15                      10
DEPT_DIV            25                      04
SEQ_NB              11                      04



The requirement says that:

a. Perform SORT based on ascending order of EMP_ID, EMP_DEPT, DEPT_DIV, and descending order of SEQ_NB.
b. Use the output from previous step and Perform duplicate removal using ICETOOL using the following control card:

Code:
SELECT FROM(SORTIN) TO(SORTOUT)-     
 ON(15,10,CH)-                         
 ON(25,04,CH)-
 ON(01,10,CH)-                         
 FIRST DISCARD(SORTXSUM) 


I have performed the above in two different steps. Here I am facing performance issues, since the time taken by both the steps is (2.5+3.5) hrs approx. All the intermediate and final files are on tape. DCB used is:

Code:
DCB = (LRECL=242,BLKSIZE=0,RECFM=FB,BUFNO=10)


Changing the BLKSIZE,BUFNO be of any help in this regard?

NOTE: I cannot use SORT using SUM FIELDS=NONE, for duplicate removal.
Back to top
View user's profile Send private message
Bill Woodger

Moderator Emeritus


Joined: 09 Mar 2011
Posts: 7309
Location: Inside the Matrix

PostPosted: Wed May 09, 2012 12:17 am
Reply with quote

Isn't the selection, without specifying SORT FIELDS=NONE or OPTION COPY in a xxxxCNTL file, going to sort again?

Do you need two separate sorted files, one just sorted and one, on different key, de-duped?

If you can set out exactly why there SORT followed by SELECT(with SORT) are needed, it'll be a help.
Back to top
View user's profile Send private message
Skolusu

Senior Member


Joined: 07 Dec 2007
Posts: 2205
Location: San Jose

PostPosted: Wed May 09, 2012 12:37 am
Reply with quote

VivekKhanna wrote:
The requirement says that:

a. Perform SORT based on ascending order of EMP_ID, EMP_DEPT, DEPT_DIV, and descending order of SEQ_NB.
b. Use the output from previous step and Perform duplicate removal using ICETOOL using the following control card:

Code:
SELECT FROM(SORTIN) TO(SORTOUT)-     
 ON(15,10,CH)-                         
 ON(25,04,CH)-
 ON(01,10,CH)-                         
 FIRST DISCARD(SORTXSUM) 


I have performed the above in two different steps. Here I am facing performance issues, since the time taken by both the steps is (2.5+3.5) hrs approx. All the intermediate and final files are on tape. DCB used is:

Code:
DCB = (LRECL=242,BLKSIZE=0,RECFM=FB,BUFNO=10)


Changing the BLKSIZE,BUFNO be of any help in this regard?

NOTE: I cannot use SORT using SUM FIELDS=NONE, for duplicate removal.


Your requirement does NOT match the Control cards you show on the SELECT . How did you change the order from ascending order of EMP_ID, EMP_DEPT, DEPT_DIV, to EMP_DEPT,DEPT_DIV,EMP_ID ???

Just for the record, you do not need 2 passes of data. Here is a DFSORT JCL which will give you the first record upon sorting the data in ascending order of EMP_ID, EMP_DEPT, DEPT_DIV, and descending order of SEQ_NB.


Code:

//STEP0100 EXEC PGM=ICETOOL   
//TOOLMSG  DD SYSOUT=*         
//DFSMSG   DD SYSOUT=*         
//IN       DD DSN=Your Input tape file,DISP=SHR
//OUT      DD SYSOUT=*                   
//TOOLIN   DD *                           
  SELECT FROM(IN) TO(OUT) FIRST USING(CTL1) -
      ON(01,10,CH) ON(15,10,CH) ON(25,4,CH)
//*
//CTL1CNTL DD *
  SORT FIELDS=(01,10,CH,A,    $ EMP-ID
               15,10,CH,A,    $ EMP-DEPT   
               25,04,CH,A,    $ DEPT-DIV
               11,04,CH,D)    $ SEQ_NB
//*
Back to top
View user's profile Send private message
Bill Woodger

Moderator Emeritus


Joined: 09 Mar 2011
Posts: 7309
Location: Inside the Matrix

PostPosted: Wed May 09, 2012 4:37 am
Reply with quote

What order is your original file in?

Is the order of fields for your SORT correct or does the SORT or SELECT contain a typo?

If your file is in the correct order, except for this SEQ-NB (sequence number?) and the purpose of the sequence number is to ensure that the last record is the duplicate that is retained, then I believe you can do the whole thing without even sorting the file once. To put it another way, is SEQ-NB ascending within the rest of the key, and you want to use it to keep the last record with duplicate key?

Are you that lucky?

SELECT as Kolusu showed, but with LASTDUP, and SORT FIELDS=NONE/OPTION COPY in the CTL1CNTL.

If that's what you've got, I'll PM you two invoices, one for the resource saving and the other for the mindreading.
Back to top
View user's profile Send private message
VivekKhanna

New User


Joined: 09 Feb 2009
Posts: 57
Location: India

PostPosted: Wed May 09, 2012 3:16 pm
Reply with quote

Hi Bill

1. There is no order for original file. Original file is a table unload.

2. Order for SORT and SELECT is different. It is clearly mentioned in the above requirement.

Quote:
a. Perform SORT based on ascending order of EMP_ID, EMP_DEPT, DEPT_DIV, and descending order of SEQ_NB.


3. MAX(SEQ_NB) gives me the latest record, for the EMP_ID, EMP_DEPT and DEPT_DIV. I need to pick the record for above key combination with MAX(SEQ_NB).

4. LUCKY?? Didn't get that.

Hope the above replies resolve your queries.
Back to top
View user's profile Send private message
Bill Woodger

Moderator Emeritus


Joined: 09 Mar 2011
Posts: 7309
Location: Inside the Matrix

PostPosted: Wed May 09, 2012 3:43 pm
Reply with quote

The better you describe your requirement, the better answers you get. If you'd mentioned that it was a database unload...

Lucky, because if it was like that, already in order, you'd have had no sort to do at all.

You mentioned fields in different order in the SORT and in the SELECT. Was that deliberate?

The SELECT, unless told otherwise, does a SORT.

If you only need one order, and the file is currently unordered, then Kolusu's code is going to do exactly what you want, and it is going to save a whole SORT. Which is pretty lucky.

You only need to SORT twice if you need the output in two different orders for some reason. So just one step needed, Kolusu's, unless you need two orders and forgot to mention that as well, and mistyped the keys.
Back to top
View user's profile Send private message
Skolusu

Senior Member


Joined: 07 Dec 2007
Posts: 2205
Location: San Jose

PostPosted: Wed May 09, 2012 9:20 pm
Reply with quote

Bill Woodger wrote:

SELECT as Kolusu showed, but with LASTDUP, and SORT FIELDS=NONE/OPTION COPY in the CTL1CNTL.


Bill,

I think you overlooked that I am sorting on the Seq num DESCENDING. BTW you need LAST and not LASTDUP in this case to pick any Unique combo records. As is SELECT with FIRST is much more efficient than LASTDUP or LAST.

VivekKhanna wrote:

3. MAX(SEQ_NB) gives me the latest record, for the EMP_ID, EMP_DEPT and DEPT_DIV. I need to pick the record for above key combination with MAX(SEQ_NB).


If this indeed your requirement then the above Shown JCL by me will give you the desired results. Did you try it?
Back to top
View user's profile Send private message
Bill Woodger

Moderator Emeritus


Joined: 09 Mar 2011
Posts: 7309
Location: Inside the Matrix

PostPosted: Wed May 09, 2012 10:10 pm
Reply with quote

I had my failed mind-reading hat on. Without the knowledge that it was a database unload, I was thinking it must be some type of ordinary file, with some sequence to it already, and with a sequence number already that must/should/could be ascending on the natural sequence of the file.

Then, if someone wanted to "de-dupe" but keep the last record, not the first, and therefore sorted 980 million records with the sequence number Descending to get the last first, being the only purpose of the sort, then the whole sorting of the file could be avoided with an appropriate SELECT.

However, the mind-reading hat, as mentioned, failed. I should keep it locked away. I've been told hats don't suit me anyway.

Turns out it was a database unload, sequence "whatever", needs sorting anyway, so why not descending on the sequence number, with NOEQUALS if they are unique (and they are not part of the de-dupe key).

I didn't know that FIRST was more efficient, so that is filed away now, thanks.
Back to top
View user's profile Send private message
View previous topic :: :: View next topic  
Post new topic   Reply to topic View Bookmarks
All times are GMT + 6 Hours
Forum Index -> DFSORT/ICETOOL

 


Similar Topics
Topic Forum Replies
No new posts Duplicate transid's declared using CEDA CICS 3
No new posts Shift left VB record without x00 endi... DFSORT/ICETOOL 11
No new posts how to calculate SUM value for VB fil... DFSORT/ICETOOL 1
No new posts how to calculate SUM for VB file usin... JCL & VSAM 1
No new posts Null values are considered in Total c... DFSORT/ICETOOL 6
Search our Forums:

Back to Top