IBM Mainframe Forum Index
 
Log In
 
IBM Mainframe Forum Index Mainframe: Search IBM Mainframe Forum: FAQ Register
 

Deduplicate and sort based on different fields


IBM Mainframe Forums -> SYNCSORT
Post new topic   Reply to topic
View previous topic :: View next topic  
Author Message
shankarm

Active User


Joined: 17 May 2010
Posts: 175
Location: India

PostPosted: Tue Oct 07, 2014 4:54 pm
Reply with quote

Hello,

Hope you are all fine.

I have 5 fields totally.
- i want to remove the duplicate records based on field 1 and 2
- then sort the records based on field 3,4 & 5.

I can do it in two steps as follows,

Lets say each field is of lenght 5.

Step1: sort card (remove the duplicates records based on field 1 and 2)
sort fields=(1,5,ch,a,6,5,ch,a)
sum fields=none

Step2: sort based on field 3,4 and 5.

Do we have a way to accomplish this in one step? Please advise.
Back to top
View user's profile Send private message
Bill Woodger

Moderator Emeritus


Joined: 09 Mar 2011
Posts: 7309
Location: Inside the Matrix

PostPosted: Tue Oct 07, 2014 5:22 pm
Reply with quote

Is the data already in Field1/Field2 order?

You could get it in "one step" by using two ICETOOL operators.

What is the purpose of the one-step-only?

FIELDS=(1,10,CH,A) is equivalent to what you coded.
Back to top
View user's profile Send private message
Marso

REXX Moderator


Joined: 13 Mar 2006
Posts: 1353
Location: Israel

PostPosted: Tue Oct 07, 2014 6:44 pm
Reply with quote

In order to detect duplicates, the file have to be sorted on Field1 and Field2, so you will always need 2 sorts.
The only way to do this in one step is to use ICETOOL or SYNCTOOL.

But 2 records with same Fields 1 & 2 may have different values in Fields 3, 4 & 5.
How do you know which one is relevant and which one is not ?
Back to top
View user's profile Send private message
shankarm

Active User


Joined: 17 May 2010
Posts: 175
Location: India

PostPosted: Wed Oct 08, 2014 9:51 pm
Reply with quote

data is not in anyorder. i have to sort and deduplicate based on different set of fields.

I understand that we can do it in two steps but my customer feels two sort steps will affect the performance.

I believe we dont have lisence for icetool.

One of my collegue gave me this, i didnt test this yet. will keep you posted.

SORT FIELDS=(7,3,CH,A,11,1,CH,A),EQUALS
OUTREC IFTHEN=(WHEN=INIT,OVERLAY=(81 EQNUM,1,ZD,RESTART=(1,9)))
OUTFIL INCLUDE=(81,1,ZD,EQ,1),BUILD=(1,80)
/*
Back to top
View user's profile Send private message
enrico-sorichetti

Superior Member


Joined: 14 Mar 2007
Posts: 10873
Location: italy

PostPosted: Wed Oct 08, 2014 10:12 pm
Reply with quote

Quote:
I understand that we can do it in two steps but my customer feels two sort steps will affect the performance.

Your customer is an idiot icon_cool.gif

what impact the performance is not the number of steps but the number of passes over input and intermediate data !

and You will always have TWO passes
1) sort on 1/2 to get rid of the duplicates
2) sort on 3/4/5 for the final sequence
Back to top
View user's profile Send private message
Bill Woodger

Moderator Emeritus


Joined: 09 Mar 2011
Posts: 7309
Location: Inside the Matrix

PostPosted: Wed Oct 08, 2014 10:33 pm
Reply with quote

If the data is not already in any order, to de-duplicate on two different sets of keys, you will have to SORT twice.

Pay attention to Marso's point. Unless the data of the second key is "connected" to the first key, you need to know which "second key" data you need to keep from the records with non-unique "first key".
Code:

A 1
A 0
A 9
A 2
B 1


Those represent your two keys (specifically you have two groups of keys, but logically one key each for de-duplication).

From the above, you have quite a lot of possible final outcomes.

Code:
De-dupe on A with lowest second key, de-dupe on second key
A 0
B 1

De-dupe on A with first second key, de-dupe on first second key
A 1

De-dupe on A with first second key, de-dupe on last second key
B 1

De-dupe on A with highest second key, de-dupe on second key
B 1
A 9

De-dupe on A with last second key, de-dupe on second key
B 1
A 2


I'd imagine that SUM FIELDS=NONE would perform better than "doing it by hand" as your colleague suggested (you'd still need two passes of the file, and two steps without SyncTOOL). Don't you get SyncTOOL for free?

First thing then is to establish exactly what data you want to keep as the base record when dropping duplicate records. You have multiple elements to your two keys, so make sure at the level of all the elements.

The only way you're going to get two SORTs in one step is with JOINKEYS, and then you would need SUM FIELDS=NONE for the JOINKEYS, which would require the availability of JNFnCNTL files, which your SyncSORT may not have (you have to experiment, as I don't think it is documented in the manual even if you can actually use them). The second JOINKEYS dataset should be "empty". JOIN UNPAIRED,F1,ONLY.

Then you do the second SORT in the Maintask.

I've never compared this to doing two separate steps with two different SORTs.

You do have some options with XDUPS and DUPKEYS as well as SUM FIELDS=NONE, so you may be able to express simply how you need to do the task. Once you know what data to keep...

As to what you customer feels at any given time, that is fine, as long as it does not require Magic to carry out. There's no such thing as Magic.
Back to top
View user's profile Send private message
Rohit Umarjikar

Global Moderator


Joined: 21 Sep 2010
Posts: 3051
Location: NYC,USA

PostPosted: Thu Oct 09, 2014 5:18 am
Reply with quote

<<If I understand your requirement>>

I am not sure if below would work for you ( NOT TESTED), just try to place your keys accordingly and let us know.

Code:
//STEP0100 EXEC PGM=SORT                                       
//SYSOUT   DD SYSOUT=*                                         
//SORTIN   DD DSN=Your FB input 1000 byte file,DISP=SHR
//SORTOUT  DD SYSOUT=*                                         
//SYSIN    DD *                                               
  SORT FIELDS=('Provide Keys for your final SortOrder'),EQUALS             
  OUTREC IFTHEN=(WHEN=GROUP,KEYBEGIN=('Your initial keys'),PUSH=(1001:SEQ=8)) 
  OUTFIL BUILD=(1,1000),INCLUDE=(1001,8,ZD,EQ,1)


If you have a ICETOOL ( Otherwise Ignore)

Code:
//S1 EXEC PGM=ICETOOL
//TOOLMSG DD SYSOUT=*
//DFSMSG DD SYSOUT=*
//IN DD DSN=... input file
//OUT DD DSN=...  output file
//TOOLIN DD *
SELECT FROM(IN) TO(OUT) ON('Your initial keys') FIRST USING(CTL1)
/*
//CTL1CNTL DD *
  SORT FIELDS=('Provide Keys for your final Sort Order')
/*               


While as everyone said the performance is based on total volume of data you process and passeses you do so unless you put something in place and observe for a week or so you can't tell which one is better.
Back to top
View user's profile Send private message
Bill Woodger

Moderator Emeritus


Joined: 09 Mar 2011
Posts: 7309
Location: Inside the Matrix

PostPosted: Thu Oct 09, 2014 5:43 am
Reply with quote

Rohit,

I think you're going to have to think about both of those.

To post untested solutions really means they need to work although there may be typos. Not just plain not work.

The first is only going to de-duplicate on the first key if the records happen to turn out to be in first-key sequence when sorted on the second key (remember the original data is not in any order, and even if it were it would require a relationship between the keys which would not require sorting twice anyway - no the case so far).

The ICETOOL one is going to SORT on one thing and de-duplicate on a different thing. SELECT does a SORT. But if you specify a SORT in the USING file, that is the SORT it will do for the SELECT.
Back to top
View user's profile Send private message
Rohit Umarjikar

Global Moderator


Joined: 21 Sep 2010
Posts: 3051
Location: NYC,USA

PostPosted: Thu Oct 09, 2014 11:05 am
Reply with quote

Bill, I am sure I have some learnings on these operators and yes I agree that one needs to put an additional efforts to make it working but at least they start and get a shell to build with. thanks.

Btw this sort fields=(1,5,ch,a,6,5,ch,a) can be rewritten to sort fields=(1,10,ch,a) so this is really a ONE key.
Back to top
View user's profile Send private message
View previous topic :: :: View next topic  
Post new topic   Reply to topic View Bookmarks
All times are GMT + 6 Hours
Forum Index -> SYNCSORT

 


Similar Topics
Topic Forum Replies
No new posts Need to set RC4 through JCL SORT DFSORT/ICETOOL 5
No new posts How to split large record length file... DFSORT/ICETOOL 10
No new posts JCL sort card - get first day and las... JCL & VSAM 9
No new posts Sort First/last record of a subset th... DFSORT/ICETOOL 7
No new posts how to calculate SUM value for VB fil... DFSORT/ICETOOL 1
Search our Forums:

Back to Top