Optimization Required for ICETOOL duplicate removal

VivekKhanna · New User Joined: 09 Feb 2009 Posts: 57 Location: India

Hi Team,

I am performing a SORT and duplicate removal (using ICETOOL) on a tape file holding 900 million records. Consider an example of a file with the following key information:

Bill Woodger · Posted: Wed May 09, 2012 12:17 am

Without SORT FIELDS=COPY or OPTION COPY in an xxxxCNTL data set, isn't the SELECT going to sort again?

Do you need two separate sorted files, one just sorted and one, on different key, de-duped?

If you can set out exactly why the SORT followed by the SELECT (with its own SORT) is needed, it'll be a help.

Skolusu · Posted: Wed May 09, 2012 12:37 am

Bill Woodger · Posted: Wed May 09, 2012 4:37 am

What order is your original file in?

Is the order of fields for your SORT correct or does the SORT or SELECT contain a typo?

If your file is in the correct order, except for this SEQ-NB (sequence number?) and the purpose of the sequence number is to ensure that the last record is the duplicate that is retained, then I believe you can do the whole thing without even sorting the file once. To put it another way, is SEQ-NB ascending within the rest of the key, and you want to use it to keep the last record with duplicate key?

Are you that lucky?

Use SELECT as Kolusu showed, but with LASTDUP, and with SORT FIELDS=COPY or OPTION COPY in the CTL1CNTL.
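To make the no-sort idea concrete, here is a minimal Python sketch of the logic (not ICETOOL itself). It assumes the input is already grouped by the de-dupe key with the sequence number ascending inside each group. The record layout (account, seq_nb, payload) and the function name are hypothetical. Note that it keeps the last record of every key, including keys that occur only once, so check which SELECT operand gives that behaviour for your data.

```python
from itertools import groupby


def keep_last_per_key(records, key):
    """Yield the last record of each run of equal keys.

    Assumes the input is already grouped by key, so no sort is done.
    """
    for _, run in groupby(records, key=key):
        last = None
        for last in run:  # walk the run; 'last' ends up as its final record
            pass
        yield last


# Hypothetical records: (account, seq_nb, payload), already in key order
# with seq_nb ascending inside each key.
records = [
    ("A1", 1, "old"),
    ("A1", 2, "new"),
    ("B7", 1, "only"),
    ("C3", 1, "old"),
    ("C3", 2, "mid"),
    ("C3", 3, "new"),
]

deduped = list(keep_last_per_key(records, key=lambda r: r[0]))
```

The whole file is read once and nothing is reordered, which is where the saving over a full sort of the file would come from.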

If that's what you've got, I'll PM you two invoices, one for the resource saving and the other for the mindreading.

VivekKhanna · New User Joined: 09 Feb 2009 Posts: 57 Location: India

Hi Bill

1. There is no order for the original file. The original file is a table unload.

2. The field order for SORT and SELECT is different. It is clearly mentioned in the above requirement.

Bill Woodger · Posted: Wed May 09, 2012 3:43 pm

The better you describe your requirement, the better answers you get. If you'd mentioned that it was a database unload...

Lucky, because if it was like that, already in order, you'd have had no sort to do at all.

You mentioned fields in different order in the SORT and in the SELECT. Was that deliberate?

The SELECT, unless told otherwise, does a SORT.

If you only need one order, and the file is currently unordered, then Kolusu's code is going to do exactly what you want, and it is going to save a whole SORT. Which is pretty lucky.

You only need to SORT twice if you need the output in two different orders for some reason. So just one step needed, Kolusu's, unless you need two orders and forgot to mention that as well, and mistyped the keys.
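As a rough illustration of why the separate SORT step buys nothing when only one order is needed, here is a small Python sketch. The function `select_first` is a hypothetical stand-in for "sort on the key, then keep the first record per key"; it is not how ICETOOL is implemented, and the records are made up. It relies on Python's sort being stable, so records with equal keys keep their input order.

```python
from itertools import groupby


def select_first(records, key):
    """Sort by key, then keep the first record of each key (one sort, one pass)."""
    ordered = sorted(records, key=key)  # stable: equal keys keep input order
    return [next(run) for _, run in groupby(ordered, key=key)]


def by_account(record):
    return record[0]


# Hypothetical unordered unload: (account, payload)
unload = [
    ("C3", "x"),
    ("A1", "first"),
    ("B7", "only"),
    ("A1", "second"),
    ("C3", "y"),
]

# One step: the select does its own sort.
one_step = select_first(unload, by_account)

# Two steps: a separate sort on the same key first, then the select sorts again.
two_step = select_first(sorted(unload, key=by_account), by_account)
```

Both variables end up holding the same three records, so in this sketch the first sort in the two-step version is pure overhead.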

Skolusu · Posted: Wed May 09, 2012 9:20 pm

Bill Woodger · Posted: Wed May 09, 2012 10:10 pm

I had my failed mind-reading hat on. Without the knowledge that it was a database unload, I was thinking it must be some type of ordinary file, with some sequence to it already, and with a sequence number already that must/should/could be ascending on the natural sequence of the file.

Then, if someone wanted to "de-dupe" but keep the last record, not the first, and had therefore sorted 900 million records with the sequence number descending to get the last record first, with that being the only purpose of the sort, then the whole sort of the file could be avoided with an appropriate SELECT.

However, the mind-reading hat, as mentioned, failed. I should keep it locked away. I've been told hats don't suit me anyway.

Turns out it was a database unload, sequence "whatever", needs sorting anyway, so why not descending on the sequence number, with NOEQUALS if they are unique (and they are not part of the de-dupe key).
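The "descending on the sequence number, then keep the first" approach can be sketched in Python as below. This only models the logic. The record layout (account, seq_nb, payload) and the function name are hypothetical, and it assumes seq_nb is numeric and unique within each key, so ties never occur and the order of equal-keyed records is not a concern.

```python
from itertools import groupby


def latest_per_key(records):
    """Sort by key ascending and seq_nb descending, then keep the first per key."""
    # Negating seq_nb gives descending order inside each key (numeric seq_nb assumed).
    ordered = sorted(records, key=lambda r: (r[0], -r[1]))
    # The de-dupe key is the account only; seq_nb is not part of it.
    return [next(run) for _, run in groupby(ordered, key=lambda r: r[0])]


# Hypothetical unordered unload: (account, seq_nb, payload)
unload = [
    ("B7", 1, "only"),
    ("A1", 2, "new"),
    ("C3", 1, "old"),
    ("A1", 1, "old"),
    ("C3", 3, "new"),
    ("C3", 2, "mid"),
]

latest = latest_per_key(unload)
```

Because the highest sequence number sorts to the front of each key, keeping the first record per key retains the most recent one.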

I didn't know that FIRST was more efficient, so that is filed away now, thanks.