# Deduplicate and sort based on different fields

Author Message
shankarm

Active User

Joined: 17 May 2010
Posts: 175
Location: India

 Posted: Tue Oct 07, 2014 4:54 pm

Hello, hope you are all fine.

I have 5 fields in total. I want to:
- remove the duplicate records based on fields 1 and 2
- then sort the records based on fields 3, 4 and 5

I can do it in two steps as follows. Let's say each field has a length of 5.

Step 1: sort card (remove the duplicate records based on fields 1 and 2):
 Code:
  SORT FIELDS=(1,5,CH,A,6,5,CH,A)
  SUM FIELDS=NONE

Step 2: sort based on fields 3, 4 and 5.

Do we have a way to accomplish this in one step? Please advise.
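For reference, the two-step approach described above could be coded like this. This is a sketch, not tested here; the positions are assumptions based on the stated field length of 5 (fields 1-2 in positions 1-10, fields 3-5 in positions 11-25).

```
* Step 1 control cards: de-duplicate on fields 1 and 2
  SORT FIELDS=(1,10,CH,A)
  SUM FIELDS=NONE

* Step 2 control cards (input is step 1's output): sort on fields 3, 4 and 5
  SORT FIELDS=(11,15,CH,A)
```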
Bill Woodger

Moderator Emeritus

Joined: 09 Mar 2011
Posts: 7310
Location: Inside the Matrix

 Posted: Tue Oct 07, 2014 5:22 pm

Is the data already in Field1/Field2 order? You could get it in "one step" by using two ICETOOL operators. What is the purpose of the one-step-only?

FIELDS=(1,10,CH,A) is equivalent to what you coded.
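The two-ICETOOL-operator shape mentioned here would look something like the following. This is an untested sketch: the dataset names are placeholders, and fields 3-5 are assumed to start at position 11.

```
//S1       EXEC PGM=ICETOOL
//TOOLMSG  DD SYSOUT=*
//DFSMSG   DD SYSOUT=*
//IN       DD DSN=your.input.file,DISP=SHR
//T1       DD DSN=&&T1,DISP=(,PASS),UNIT=SYSDA,SPACE=(CYL,(10,10))
//OUT      DD DSN=your.output.file,DISP=(,CATLG),UNIT=SYSDA,SPACE=(CYL,(10,10))
//TOOLIN   DD *
  SORT FROM(IN) TO(T1) USING(CTL1)
  SORT FROM(T1) TO(OUT) USING(CTL2)
/*
//CTL1CNTL DD *
  SORT FIELDS=(1,10,CH,A)
  SUM FIELDS=NONE
/*
//CTL2CNTL DD *
  SORT FIELDS=(11,15,CH,A)
/*
```

Note this is still two passes over the data (one per SORT operator) in a single job step.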
Marso

REXX Moderator

Joined: 13 Mar 2006
Posts: 1349
Location: Israel

 Posted: Tue Oct 07, 2014 6:44 pm

In order to detect duplicates, the file has to be sorted on Field1 and Field2, so you will always need two sorts. The only way to do this in one step is to use ICETOOL or SYNCTOOL.

But two records with the same Fields 1 & 2 may have different values in Fields 3, 4 & 5. How do you know which one is relevant and which one is not?
shankarm

Active User

Joined: 17 May 2010
Posts: 175
Location: India

 Posted: Wed Oct 08, 2014 9:51 pm

The data is not in any order. I have to sort and de-duplicate based on different sets of fields. I understand that we can do it in two steps, but my customer feels two sort steps will affect the performance. I believe we don't have a licence for ICETOOL.

One of my colleagues gave me this; I haven't tested it yet. Will keep you posted.
 Code:
  SORT FIELDS=(7,3,CH,A,11,1,CH,A),EQUALS
  OUTREC IFTHEN=(WHEN=INIT,OVERLAY=(81:SEQNUM,1,ZD,RESTART=(1,9)))
  OUTFIL INCLUDE=(81,1,ZD,EQ,1),BUILD=(1,80)
/*
enrico-sorichetti

Superior Member

Joined: 14 Mar 2007
Posts: 10782
Location: italy

Posted: Wed Oct 08, 2014 10:12 pm

 Quote: I understand that we can do it in two steps but my customer feels two sort steps will affect the performance.

What impacts the performance is not the number of steps but the number of passes over the input and intermediate data!

And You will always have TWO passes:
1) sort on fields 1/2 to get rid of the duplicates
2) sort on fields 3/4/5 for the final sequence
Bill Woodger

Moderator Emeritus

Joined: 09 Mar 2011
Posts: 7310
Location: Inside the Matrix

Posted: Wed Oct 08, 2014 10:33 pm

If the data is not already in any order, to de-duplicate on two different sets of keys, you will have to SORT twice.

Pay attention to Marso's point. Unless the data of the second key is "connected" to the first key, you need to know which "second key" data you need to keep from the records with non-unique "first key".
 Code:
A 1
A 0
A 9
A 2
B 1

Those represent your two keys (specifically you have two groups of keys, but logically one key each for de-duplication).

From the above, you have quite a lot of possible final outcomes.

 Code:
De-dupe on A with lowest second key, de-dupe on second key
A 0
B 1

De-dupe on A with first second key, de-dupe on first second key
A 1

De-dupe on A with first second key, de-dupe on last second key
B 1

De-dupe on A with highest second key, de-dupe on second key
B 1
A 9

De-dupe on A with last second key, de-dupe on second key
B 1
A 2

I'd imagine that SUM FIELDS=NONE would perform better than "doing it by hand" as your colleague suggested (you'd still need two passes of the file, and two steps without SyncTOOL). Don't you get SyncTOOL for free?

First thing then is to establish exactly what data you want to keep as the base record when dropping duplicate records. You have multiple elements to your two keys, so make sure at the level of all the elements.

The only way you're going to get two SORTs in one step is with JOINKEYS, and then you would need SUM FIELDS=NONE for the JOINKEYS, which would require the availability of JNFnCNTL files, which your SyncSORT may not have (you have to experiment, as I don't think it is documented in the manual even if you can actually use them). The second JOINKEYS dataset should be "empty". JOIN UNPAIRED,F1,ONLY.

Then you do the second SORT in the Maintask.
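A very rough sketch of that shape follows, assuming 80-byte FB records, the de-dup key in positions 1-10, and the final sort key in positions 11-25 (all assumed positions, placeholder dataset names). Whether JNF1CNTL is honoured by a given SyncSORT level is exactly the experiment described above, so treat this as untested.

```
//SORTJNF1 DD DSN=your.input.file,DISP=SHR
//SORTJNF2 DD DUMMY                   the "empty" second side
//JNF1CNTL DD *
  SUM FIELDS=NONE
/*
//SYSIN    DD *
  JOINKEYS F1=SORTJNF1,FIELDS=(1,10,A)
  JOINKEYS F2=SORTJNF2,FIELDS=(1,10,A)
  JOIN UNPAIRED,F1,ONLY
  REFORMAT FIELDS=(F1:1,80)
* Second sort, on fields 3/4/5, done in the Maintask
  SORT FIELDS=(11,15,CH,A)
/*
```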

I've never compared this to doing two separate steps with two different SORTs.

You do have some options with XDUPS and DUPKEYS as well as SUM FIELDS=NONE, so you may be able to express simply how you need to do the task. Once you know what data to keep...

As to what your customer feels at any given time, that is fine, as long as it does not require Magic to carry out. There's no such thing as Magic.
Rohit Umarjikar

Global Moderator

Joined: 21 Sep 2010
Posts: 2673
Location: NYC,USA

Posted: Thu Oct 09, 2014 5:18 am

I am not sure if the below would work for you (NOT TESTED); just place your keys accordingly and let us know.

 Code:
//STEP0100 EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN   DD DSN=Your FB input 1000 byte file,DISP=SHR
//SORTOUT  DD SYSOUT=*
//SYSIN    DD *
  SORT FIELDS=('Provide Keys for your final SortOrder'),EQUALS
  OUTREC IFTHEN=(WHEN=GROUP,KEYBEGIN=('Your initial keys'),PUSH=(1001:SEQ=8))
  OUTFIL BUILD=(1,1000),INCLUDE=(1001,8,ZD,EQ,1)

If you have ICETOOL (otherwise ignore):

 Code:
//S1       EXEC PGM=ICETOOL
//TOOLMSG  DD SYSOUT=*
//DFSMSG   DD SYSOUT=*
//IN       DD DSN=... input file
//OUT      DD DSN=... output file
//TOOLIN   DD *
  SELECT FROM(IN) TO(OUT) ON('Your initial keys') FIRST USING(CTL1)
/*
//CTL1CNTL DD *
  SORT FIELDS=('Provide Keys for your final Sort Order')
/*

As everyone said, the performance depends on the total volume of data you process and the passes you make over it, so unless you put something in place and observe it for a week or so, you can't tell which one is better.
Bill Woodger

Moderator Emeritus

Joined: 09 Mar 2011
Posts: 7310
Location: Inside the Matrix

 Posted: Thu Oct 09, 2014 5:43 am

Rohit, I think you're going to have to think about both of those. To post untested solutions really means they need to work, although there may be typos. Not just plain not work.

The first is only going to de-duplicate on the first key if the records happen to turn out to be in first-key sequence when sorted on the second key (remember, the original data is not in any order, and even if it were, it would require a relationship between the keys which would not require sorting twice anyway - not the case so far).

The ICETOOL one is going to SORT on one thing and de-duplicate on a different thing. SELECT does a SORT, but if you specify a SORT in the USING file, that is the SORT it will do for the SELECT.
Rohit Umarjikar

Global Moderator

Joined: 21 Sep 2010
Posts: 2673
Location: NYC,USA

 Posted: Thu Oct 09, 2014 11:05 am

Bill, I am sure I still have some learning to do on these operators, and yes, I agree that one needs to put in additional effort to make them work, but at least they give a shell to start building with. Thanks.

By the way, SORT FIELDS=(1,5,CH,A,6,5,CH,A) can be rewritten as SORT FIELDS=(1,10,CH,A), so this is really ONE key.