Deduplicate and sort based on different fields

shankarm · Active User Joined: 17 May 2010 Posts: 175 Location: India

Hello,

Hope you are all fine.

I have five fields in total.
- I want to remove duplicate records based on fields 1 and 2,
- then sort the records on fields 3, 4 and 5.

I can do it in two steps as follows,

Let's say each field has a length of 5.

Step1: sort card (remove the duplicate records based on fields 1 and 2)
sort fields=(1,5,ch,a,6,5,ch,a)
sum fields=none

Step2: sort based on fields 3, 4 and 5.
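
For the layout assumed above (five 5-byte character fields starting in column 1, which is only an assumption for illustration), the Step 2 card would presumably look something like this sketch:

```
* Assumes fields 3, 4 and 5 occupy columns 11-15, 16-20 and 21-25
  SORT FIELDS=(11,5,CH,A,16,5,CH,A,21,5,CH,A)
```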

Do we have a way to accomplish this in one step? Please advise.

Bill Woodger · Posted: Tue Oct 07, 2014 5:22 pm

Is the data already in Field1/Field2 order?

You could get it in "one step" by using two ICETOOL operators.

What is the purpose of the one-step-only?

FIELDS=(1,10,CH,A) is equivalent to what you coded.
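
As a rough sketch of what "one step" with two ICETOOL operators could look like: the first operator keeps one record per Field1/Field2 value, the second sorts the survivors on the remaining fields. The dataset names, the DD names IN, T1 and OUT, the CTL1 prefix and the space allocations are all placeholders, and the positions assume five 5-byte character fields starting in column 1. It is untested, and it still sorts the data twice, just within a single job step.

```
//DEDUP    EXEC PGM=ICETOOL
//TOOLMSG  DD SYSOUT=*
//DFSMSG   DD SYSOUT=*
//* Dataset names and space values below are placeholders
//IN       DD DSN=YOUR.INPUT.FILE,DISP=SHR
//T1       DD DSN=&&T1,UNIT=SYSDA,SPACE=(CYL,(5,5)),DISP=(,PASS)
//OUT      DD DSN=YOUR.OUTPUT.FILE,DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(5,5),RLSE)
//TOOLIN   DD *
* Keep one record for each distinct value of fields 1 and 2
  SELECT FROM(IN) TO(T1) ON(1,10,CH) FIRST
* Sort the remaining records on fields 3, 4 and 5
  SORT FROM(T1) TO(OUT) USING(CTL1)
/*
//CTL1CNTL DD *
  SORT FIELDS=(11,15,CH,A)
/*
```

FIRST keeps one record per key value; whether that is the record you actually want depends on your data.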

Marso · Posted: Tue Oct 07, 2014 6:44 pm

In order to detect duplicates, the file has to be sorted on Field1 and Field2, so you will always need 2 sorts.
The only way to do this in one step is to use ICETOOL or SYNCTOOL.

But 2 records with the same Fields 1 & 2 may have different values in Fields 3, 4 & 5.
How do you know which one is relevant and which one is not?

shankarm · Active User Joined: 17 May 2010 Posts: 175 Location: India

The data is not in any order. I have to sort and de-duplicate on different sets of fields.

I understand that we can do it in two steps, but my customer feels two sort steps will affect performance.

I believe we don't have a license for ICETOOL.

One of my colleagues gave me this. I haven't tested it yet; I will keep you posted.

SORT FIELDS=(7,3,CH,A,11,1,CH,A),EQUALS
OUTREC IFTHEN=(WHEN=INIT,OVERLAY=(81:SEQNUM,1,ZD,RESTART=(1,9)))
OUTFIL INCLUDE=(81,1,ZD,EQ,1),BUILD=(1,80)
/*

enrico-sorichetti · Posted: Wed Oct 08, 2014 10:12 pm

Bill Woodger · Posted: Wed Oct 08, 2014 10:33 pm

If the data is not already in any order, to de-duplicate on two different sets of keys, you will have to SORT twice.

Pay attention to Marso's point. Unless the data of the second key is "connected" to the first key, you need to know which "second key" data you need to keep from the records with non-unique "first key".

Rohit Umarjikar · Posted: Thu Oct 09, 2014 5:18 am

<<If I understand your requirement>>

I am not sure if the below would work for you (NOT TESTED); just try to place your keys accordingly and let us know.

Bill Woodger · Posted: Thu Oct 09, 2014 5:43 am

Rohit,

I think you're going to have to think about both of those.

Posting untested solutions really means they need to work in principle, even if there may be typos. Not just plain not work.

The first is only going to de-duplicate on the first key if the records happen to turn out to be in first-key sequence when sorted on the second key. Remember, the original data is not in any order, and even if it were, that would require a relationship between the keys which would make sorting twice unnecessary anyway (not the case so far).

The ICETOOL one is going to SORT on one thing and de-duplicate on a different thing. SELECT does a SORT. But if you specify a SORT in the USING file, that is the SORT it will do for the SELECT.

Rohit Umarjikar · Posted: Thu Oct 09, 2014 11:05 am

Bill, I am sure I still have some learning to do on these operators, and yes, I agree that one needs to put in additional effort to make it work, but at least they give a shell to start building from. Thanks.

Btw, this sort fields=(1,5,ch,a,6,5,ch,a) can be rewritten as sort fields=(1,10,ch,a), so this is really ONE key.