In order to detect duplicates, the file have to be sorted on Field1 and Field2, so you will always need 2 sorts.
The only way to do this in one step is to use ICETOOL or SYNCTOOL.
But 2 records with same Fields 1 & 2 may have different values in Fields 3, 4 & 5.
How do you know which one is relevant and which one is not ?
Joined: 09 Mar 2011 Posts: 7309 Location: Inside the Matrix
If the data is not already in any order, to de-duplicate on two different sets of keys, you will have to SORT twice.
Pay attention to Marso's point. Unless the data of the second key is "connected" to the first key, you need to know which "second key" data you need to keep from the records with non-unique "first key".
Code:
A 1
A 0
A 9
A 2
B 1
Those represent your two keys (specifically you have two groups of keys, but logically one key each for de-duplication).
From the above, you have quite a lot of possible final outcomes.
Code:
De-dupe on A with lowest second key, de-dupe on second key
A 0
B 1
De-dupe on A with first second key, de-dupe on first second key
A 1
De-dupe on A with first second key, de-dupe on last second key
B 1
De-dupe on A with highest second key, de-dupe on second key
B 1
A 9
De-dupe on A with last second key, de-dupe on second key
B 1
A 2
I'd imagine that SUM FIELDS=NONE would perform better than "doing it by hand" as your colleague suggested (you'd still need two passes of the file, and two steps without SyncTOOL). Don't you get SyncTOOL for free?
First thing then is to establish exactly what data you want to keep as the base record when dropping duplicate records. You have multiple elements to your two keys, so make sure at the level of all the elements.
The only way you're going to get two SORTs in one step is with JOINKEYS, and then you would need SUM FIELDS=NONE for the JOINKEYS, which would require the availability of JNFnCNTL files, which your SyncSORT may not have (you have to experiment, as I don't think it is documented in the manual even if you can actually use them). The second JOINKEYS dataset should be "empty". JOIN UNPAIRED,F1,ONLY.
Then you do the second SORT in the Maintask.
I've never compared this to doing two separate steps with two different SORTs.
You do have some options with XDUPS and DUPKEYS as well as SUM FIELDS=NONE, so you may be able to express simply how you need to do the task. Once you know what data to keep...
As to what you customer feels at any given time, that is fine, as long as it does not require Magic to carry out. There's no such thing as Magic.
While as everyone said the performance is based on total volume of data you process and passeses you do so unless you put something in place and observe for a week or so you can't tell which one is better.
Joined: 09 Mar 2011 Posts: 7309 Location: Inside the Matrix
Rohit,
I think you're going to have to think about both of those.
To post untested solutions really means they need to work although there may be typos. Not just plain not work.
The first is only going to de-duplicate on the first key if the records happen to turn out to be in first-key sequence when sorted on the second key (remember the original data is not in any order, and even if it were it would require a relationship between the keys which would not require sorting twice anyway - no the case so far).
The ICETOOL one is going to SORT on one thing and de-duplicate on a different thing. SELECT does a SORT. But if you specify a SORT in the USING file, that is the SORT it will do for the SELECT.
Bill, I am sure I have some learnings on these operators and yes I agree that one needs to put an additional efforts to make it working but at least they start and get a shell to build with. thanks.
Btw this sort fields=(1,5,ch,a,6,5,ch,a) can be rewritten to sort fields=(1,10,ch,a) so this is really a ONE key.