IBM Mainframe Forum Index
 
Log In
 
IBM Mainframe Forum Index Mainframe: Search IBM Mainframe Forum: FAQ Register
 

Syncsort - Remove duplicates in a csv file


IBM Mainframe Forums -> JCL & VSAM
Post new topic   Reply to topic
View previous topic :: View next topic  
Author Message
mksranjani

New User


Joined: 03 Feb 2010
Posts: 29
Location: Chennai

PostPosted: Fri Jun 10, 2011 2:20 pm
Reply with quote

hi,

I have a comma separated file of lrecl 646. I need to remove the duplicates from the file. I tried with sort function 'SUM FIELDS=NONE' . But it did not remove the duplicates. can someone help?
Back to top
View user's profile Send private message
expat

Global Moderator


Joined: 14 Mar 2007
Posts: 8797
Location: Welsh Wales

PostPosted: Fri Jun 10, 2011 2:26 pm
Reply with quote

Show a sample of your input, and the wrong output.
The input output file DSORG, RECFM and LRECL
Back to top
View user's profile Send private message
mksranjani

New User


Joined: 03 Feb 2010
Posts: 29
Location: Chennai

PostPosted: Fri Jun 10, 2011 2:37 pm
Reply with quote

hi, please find below the sort i used

Code:
//STEP0001 EXEC PGM=SORT                           
//SYSOUT    DD SYSOUT=*                             
//SYSPRINT DD SYSOUT=*                             
//SORTIN    DD DSN=INPUT FILE...                     
//SORTOUT  DD DSN=OUTPUT FILE....                   
//                     DISP=(NEW,CATLG,DELETE),             
//                     UNIT=TESTDA,SPACE=(CYL,(10,50),RLSE),
//                     DCB=(LRECL=646,RECFM=FB,BLKSIZE=0)   
//SYSIN       DD *                                     
 SORT FIELDS=(1,646,CH,A)                           
 SUM FIELDS=NONE         
/*

Input file has the same dcb parameters as that of output.
In the output file, duplicates were not removed. it was the same as that of input.

My input data looks like below

Code:
1234567891,A,123456789,H64398376,,
5874639368,B,AAA123456,AAAAAAA,BBBBBB,
5874639368,B,AAA123456,AAAAAAA,BBBBBB,
Back to top
View user's profile Send private message
enrico-sorichetti

Superior Member


Joined: 14 Mar 2007
Posts: 10872
Location: italy

PostPosted: Fri Jun 10, 2011 2:43 pm
Reply with quote

just checked with

Code:
****** ***************************** Top of Data ******************************
000001 //ENRICO1  JOB NOTIFY=&SYSUID,
000002 //             MSGLEVEL=(1,1),CLASS=A,MSGCLASS=H
000003 //*
000004 //S1      EXEC PGM=SORT
000005 //SYSPRINT  DD SYSOUT=*
000006 //SYSOUT    DD SYSOUT=*
000007 //SORTIN    DD *
000008 1234567891,A,123456789,H64398376,,
000009 5874639368,B,AAA123456,AAAAAAA,BBBBBB,
000010 5874639368,B,AAA123456,AAAAAAA,BBBBBB,
000011 //SORTOUT   DD SYSOUT=*,
000012 //             DCB=(RECFM=FB,LRECL=80)
000013 //SYSIN     DD *
000014   SORT   FIELDS=(1,80,CH,A)
000015   SUM    FIELDS=NONE
000016 //*
****** **************************** Bottom of Data ****************************


and it worked, so something is missing in Your description of the problem
Back to top
View user's profile Send private message
expat

Global Moderator


Joined: 14 Mar 2007
Posts: 8797
Location: Welsh Wales

PostPosted: Fri Jun 10, 2011 2:44 pm
Reply with quote

And the output looks exactly as I would expect.

I think that you have omitted some vital information here. I would hazard a guess that this input has come from another operating system and may have a unicode seperator embedded into the data, which means that you have multiple records on each line.

However, if this is not the case you need to describe your requirement in far more detail than you already have.
Back to top
View user's profile Send private message
expat

Global Moderator


Joined: 14 Mar 2007
Posts: 8797
Location: Welsh Wales

PostPosted: Fri Jun 10, 2011 2:45 pm
Reply with quote

Oh, I now see that the input data has changed icon_rolleyes.gif
Back to top
View user's profile Send private message
enrico-sorichetti

Superior Member


Joined: 14 Mar 2007
Posts: 10872
Location: italy

PostPosted: Fri Jun 10, 2011 2:49 pm
Reply with quote

I just edited the post adding the code tags icon_redface.gif
Back to top
View user's profile Send private message
mksranjani

New User


Joined: 03 Feb 2010
Posts: 29
Location: Chennai

PostPosted: Fri Jun 10, 2011 2:52 pm
Reply with quote

Hi,
Sorry, please see if the below information helps.
Input file is a comma separated file. Each field has variable lengths.
Each line has only one record.
Total lrecl of the file is 646.
I want to remove duplicate records.
Back to top
View user's profile Send private message
enrico-sorichetti

Superior Member


Joined: 14 Mar 2007
Posts: 10872
Location: italy

PostPosted: Fri Jun 10, 2011 2:58 pm
Reply with quote

the additionl information posted is quite useless, it adds nothing to the topic
You want to check the overall byte by byte matching or the real value matching

Code:
xxxx,aaa,bbb, 0,......
xxxx,aaa,bbb,00,.....


would fail the byte by byte matching
but logically the two records are the same, a <numeric> 0 is the same as a numeric 00
and the extra blank might be considered as harmless withespace

what is that You did not understand in the snippet I posted after testing!
Back to top
View user's profile Send private message
mksranjani

New User


Joined: 03 Feb 2010
Posts: 29
Location: Chennai

PostPosted: Fri Jun 10, 2011 2:59 pm
Reply with quote

hi Enricho, it does work. but is it possible to remove the duplicates without sorting the records.
Back to top
View user's profile Send private message
mksranjani

New User


Joined: 03 Feb 2010
Posts: 29
Location: Chennai

PostPosted: Fri Jun 10, 2011 3:01 pm
Reply with quote

hi, i want to match it byte by byte.
Back to top
View user's profile Send private message
enrico-sorichetti

Superior Member


Joined: 14 Mar 2007
Posts: 10872
Location: italy

PostPosted: Fri Jun 10, 2011 3:17 pm
Reply with quote

nothing that You could not have found out Yourself a forum search with DUPLICATE PRESERVE

Code:
****** ***************************** Top of Data ******************************
000001 //ENRICO1  JOB NOTIFY=&SYSUID,
000002 //             MSGLEVEL=(1,1),CLASS=A,MSGCLASS=H
000003 //*
000004 //S1      EXEC PGM=SORT
000005 //SYSPRINT  DD SYSOUT=*
000006 //SYSOUT    DD SYSOUT=*
000007 //SORTIN    DD *
000008 5874639369,B,AAA123456,AAAAAAA,BBBBBB,
000009 5874639369,B,AAA123456,AAAAAAA,BBBBBB,
000010 1234567891,A,123456789,H64398376,,
000011 5874639368,B,AAA123456,AAAAAAA,BBBBBB,
000012 5874639368,B,AAA123456,AAAAAAA,BBBBBB,
000013 //SORTOUT   DD SYSOUT=*,
000014 //             DCB=(RECFM=FB,LRECL=80)
000015 //SYSIN     DD *
000016   OPTION COPY
000017   INREC  IFTHEN=(WHEN=INIT,OVERLAY=(81:SEQNUM,8,ZD,RESTART=(1,80)))
000018   OUTFIL INCLUDE=(81,8,ZD,EQ,1),BUILD=(1,80)
000019 //*
****** **************************** Bottom of Data ****************************


modify accordingly to Your record length,
the above snippet works if the duplicate record are contiguous,
if they are not You might need two/more passes

if You are using DFSORT better wait for Frank or Kolusu for a better suggestion

they work in a different time zone and right now they might be still in bed
icon_biggrin.gif
Back to top
View user's profile Send private message
mksranjani

New User


Joined: 03 Feb 2010
Posts: 29
Location: Chennai

PostPosted: Fri Jun 10, 2011 3:30 pm
Reply with quote

thank you enricho.
I tried the above code you gave.

OPTION COPY
INREC IFTHEN=(WHEN=INIT,OVERLAY=(647:SEQNUM,8,ZD,RESTART=(1,646)))
OUTFIL INCLUDE=(647,8,ZD,EQ,1),BUILD=(1,646)

But got the below error
WER271A INREC STATEMENT : NUMERIC FIELD ERROR
Back to top
View user's profile Send private message
enrico-sorichetti

Superior Member


Joined: 14 Mar 2007
Posts: 10872
Location: italy

PostPosted: Fri Jun 10, 2011 3:36 pm
Reply with quote

Code:
WER271A INREC STATEMENT : NUMERIC FIELD ERROR


as said gazillions of times WER messages indicate that You are using SYNCSORT not DFSORT

topic moved where it belongs

Frank as Kolusu are DFSORT developers, You cannot expect Them to provide advice on a competitor' s product

and... by the way my name is Enrico, not Enricho icon_biggrin.gif
Back to top
View user's profile Send private message
mksranjani

New User


Joined: 03 Feb 2010
Posts: 29
Location: Chennai

PostPosted: Fri Jun 10, 2011 3:45 pm
Reply with quote

Thank you Enrico.. icon_smile.gif
Back to top
View user's profile Send private message
enrico-sorichetti

Superior Member


Joined: 14 Mar 2007
Posts: 10872
Location: italy

PostPosted: Fri Jun 10, 2011 6:11 pm
Reply with quote

tested with DFSORT, searched the manuals and here is explanation of the

Quote:
ICE111A E REFORMATTING FIELD ERROR


Quote:
The length (m for p,m or FIXLEN=m for %nn) for a RESTART field was greater than 256 bytes.


there should be something similar in Your sysout and manuals
Back to top
View user's profile Send private message
sqlcode1

Active Member


Joined: 08 Apr 2010
Posts: 577
Location: USA

PostPosted: Fri Jun 10, 2011 9:43 pm
Reply with quote

Untest on syncsort but should work...(Tested on DFSort...)
Code:

//SYSIN DD *                                                           
 INREC  IFTHEN=(WHEN=INIT,OVERLAY=(647:SEQNUM,8,ZD,RESTART=(001,256),   
                                   655:SEQNUM,8,ZD,RESTART=(257,256),   
                                   663:SEQNUM,8,ZD,RESTART=(513,134))),
        IFTHEN=(WHEN=(647,8,ZD,EQ,1,OR,655,8,ZD,EQ,1,OR,663,8,ZD,EQ,1),
                          OVERLAY=(671:C'Y'))                           
 SORT FIELDS=COPY                                                       
 OUTFIL INCLUDE=(671,1,CH,EQ,C'Y'),BUILD=(1,646)                       
/*                                                                                                       


Thanks,
Back to top
View user's profile Send private message
enrico-sorichetti

Superior Member


Joined: 14 Mar 2007
Posts: 10872
Location: italy

PostPosted: Fri Jun 10, 2011 9:54 pm
Reply with quote

still there is the question pending...
what if the duplicate records are not contiguous !
Back to top
View user's profile Send private message
sqlcode1

Active Member


Joined: 08 Apr 2010
Posts: 577
Location: USA

PostPosted: Fri Jun 10, 2011 10:33 pm
Reply with quote

mksranjani wrote:
hi Enricho, it does work. but is it possible to remove the duplicates without sorting the records.

I thought OP didn't wanted to sort records.

Correct me if I am wrong but even if records are not contiguous, I think its possible using outrec. It would take 2 pass only if after removing duplicates OP wants to go back to original order.

Thanks,
Back to top
View user's profile Send private message
enrico-sorichetti

Superior Member


Joined: 14 Mar 2007
Posts: 10872
Location: italy

PostPosted: Fri Jun 10, 2011 10:46 pm
Reply with quote

the solutions we posted got rid of the duplicates without altering the original ordering,
but will work only for contiguous duplicates
if the duplicates are not contiguous as I said before I feel that a multi pass approach is needed!

too bad that the TS is using Syncsort,
I guess that we are bound to miss Frank and Kolusu smart solutions!

I searched the forums for non contiguous duplicates keeping the original order and the first occurrence, but I was not able to find anything
Back to top
View user's profile Send private message
Phrzby Phil

Senior Member


Joined: 31 Oct 2006
Posts: 1042
Location: Richmond, Virginia

PostPosted: Fri Jun 10, 2011 11:08 pm
Reply with quote

1. add a seq# column
2. sort by other fields
3. dedupe
4. re-order by seq#
Back to top
View user's profile Send private message
sqlcode1

Active Member


Joined: 08 Apr 2010
Posts: 577
Location: USA

PostPosted: Fri Jun 10, 2011 11:15 pm
Reply with quote

Phrzby Phil,
Point #2 and 4 would require separate pass. We were thinking of single pass solution.

Thanks,
Back to top
View user's profile Send private message
enrico-sorichetti

Superior Member


Joined: 14 Mar 2007
Posts: 10872
Location: italy

PostPosted: Fri Jun 10, 2011 11:18 pm
Reply with quote

my bad for using the <wrong> terminology

anybody can do it with TWO sort steps

the point here are not the multiple passes of data ( they will be needed anyway )

the puzzle is how to do it with a single SORT/ICETOOL step
Back to top
View user's profile Send private message
superk

Global Moderator


Joined: 26 Apr 2004
Posts: 4652
Location: Raleigh, NC, USA

PostPosted: Fri Jun 10, 2011 11:21 pm
Reply with quote

This topic is for the Syncsort product. Could we all please refrain from commenting on it and let Alissa provide the appropriate solution.
Back to top
View user's profile Send private message
enrico-sorichetti

Superior Member


Joined: 14 Mar 2007
Posts: 10872
Location: italy

PostPosted: Sat Jun 11, 2011 12:39 pm
Reply with quote

Just exploring
a two step (two passes ) job for non contiguous duplicates ( using SORT)
Code:
****** ***************************** Top of Data ******************************
000001 //ENRICO1  JOB NOTIFY=&SYSUID,
000002 //             MSGLEVEL=(1,1),CLASS=A,MSGCLASS=H
000003 //*
000004 //CLEANUP EXEC PGM=CLEANUP
000005 //STEPLIB   DD DISP=SHR,DSN=SYS2.LINKLIB
000006 //*
000007 //S1      EXEC PGM=SORT
000008 //SYSPRINT  DD SYSOUT=*
000009 //SYSOUT    DD SYSOUT=*
000010 //SORTIN    DD *
000011 5874639369,B,AAA123456,AAAAAAA,BBBBBB,
000012 5874639369,B,AAA123456,AAAAAAA,BBBBBB,
000013 1234567891,A,123456789,H64398376,,
000014 5874639368,B,AAA123456,AAAAAAA,BBBBBB,
000015 5874639368,B,AAA123456,AAAAAAA,BBBBBB,
000016 5874639369,B,AAA123456,AAAAAAA,BBBBBB,
000017 5874639369,B,AAA123456,AAAAAAA,BBBBBB,
000018 1234567891,A,123456789,H64398376,,
000019 5874639368,B,AAA123456,AAAAAAA,BBBBBB,
000020 5874639368,B,AAA123456,AAAAAAA,BBBBBB,
000021 5874639369,B,AAA123456,AAAAAAA,BBBBBB,
000022 5874639369,B,AAA123456,AAAAAAA,BBBBBB,
000023 1234567891,A,123456789,H64398376,,
000024 5874639368,B,AAA123456,AAAAAAA,BBBBBB,
000025 5874639368,B,AAA123456,AAAAAAA,BBBBBB,
000026 5874639369,B,AAA123456,AAAAAAA,BBBBBB,
000027 5874639369,B,AAA123456,AAAAAAA,BBBBBB,
000028 1234567891,A,123456789,H64398376,,
000029 5874639368,B,AAA123456,AAAAAAA,BBBBBB,
000030 5874639368,B,AAA123456,AAAAAAA,BBBBBB,
000031 5874639369,B,AAA123456,AAAAAAA,BBBBBB,
000032 5874639369,B,AAA123456,AAAAAAA,BBBBBB,
000033 1234567891,A,123456789,H64398376,,
000034 5874639368,B,AAA123456,AAAAAAA,BBBBBB,
000035 5874639368,B,AAA123456,AAAAAAA,BBBBBB,
000036 //SORTOUT   DD DISP=(NEW,PASS),DSN=&&TEMP88,
000037 //             UNIT=SYSDA,SPACE=(CYL,(1,1)),
000038 //             DCB=(RECFM=FB,LRECL=88)
000039 //SYSIN     DD *
000040   INREC  OVERLAY=(81:SEQNUM,8,ZD)
000041   SORT   FIELDS=(1,80,CH,A,81,8,ZD,A)
000042   OUTREC BUILD=(1,88)
000043 //*
000044 //S2      EXEC PGM=SORT
000045 //SYSPRINT  DD SYSOUT=*
000046 //SYSOUT    DD SYSOUT=*
000047 //SORTIN    DD DISP=(OLD,PASS),DSN=&&TEMP88
000048 //SORTOUT   DD SYSOUT=*,DCB=(LRECL=80)
000049 //SYSIN     DD *
000050   SORT   FIELDS=(81,8,ZD,A)
000051   INREC  IFTHEN=(WHEN=INIT,OVERLAY=(89:SEQNUM,8,ZD,RESTART=(1,80)))
000052   OUTFIL INCLUDE=(89,8,ZD,EQ,1),BUILD=(1,80)
000053 //*
****** **************************** Bottom of Data ****************************


a one step (two passes anyway ) for the same ( using ICETOOL )
Code:
****** ***************************** Top of Data ******************************
000001 //ENRICO1  JOB NOTIFY=&SYSUID,
000002 //             MSGLEVEL=(1,1),CLASS=A,MSGCLASS=H
000003 //*
000004 //CLEANUP EXEC PGM=CLEANUP
000005 //STEPLIB   DD DISP=SHR,DSN=SYS2.LINKLIB
000006 //*
000007 //S1      EXEC PGM=ICETOOL
000008 //TOOLMSG   DD SYSOUT=*
000009 //SYSPRINT  DD SYSOUT=*
000010 //DFSMSG    DD SYSOUT=*
000011 //SYSOUT    DD SYSOUT=*
000012 //SORTIN    DD *
000013 5874639369,B,AAA123456,AAAAAAA,BBBBBB,
000014 5874639369,B,AAA123456,AAAAAAA,BBBBBB,
000015 1234567891,A,123456789,H64398376,,
000016 5874639368,B,AAA123456,AAAAAAA,BBBBBB,
000017 5874639368,B,AAA123456,AAAAAAA,BBBBBB,
000018 5874639369,B,AAA123456,AAAAAAA,BBBBBB,
000019 5874639369,B,AAA123456,AAAAAAA,BBBBBB,
000020 1234567891,A,123456789,H64398376,,
000021 5874639368,B,AAA123456,AAAAAAA,BBBBBB,
000022 5874639368,B,AAA123456,AAAAAAA,BBBBBB,
000023 5874639369,B,AAA123456,AAAAAAA,BBBBBB,
000024 5874639369,B,AAA123456,AAAAAAA,BBBBBB,
000025 1234567891,A,123456789,H64398376,,
000026 5874639368,B,AAA123456,AAAAAAA,BBBBBB,
000027 5874639368,B,AAA123456,AAAAAAA,BBBBBB,
000028 5874639369,B,AAA123456,AAAAAAA,BBBBBB,
000029 5874639369,B,AAA123456,AAAAAAA,BBBBBB,
000030 1234567891,A,123456789,H64398376,,
000031 5874639368,B,AAA123456,AAAAAAA,BBBBBB,
000032 5874639368,B,AAA123456,AAAAAAA,BBBBBB,
000033 5874639369,B,AAA123456,AAAAAAA,BBBBBB,
000034 5874639369,B,AAA123456,AAAAAAA,BBBBBB,
000035 1234567891,A,123456789,H64398376,,
000036 5874639368,B,AAA123456,AAAAAAA,BBBBBB,
000037 5874639368,B,AAA123456,AAAAAAA,BBBBBB,
000038 //TEMP88    DD DISP=(MOD,PASS),DSN=&&TEMP88,
000039 //             UNIT=SYSDA,SPACE=(CYL,(1,1)),
000040 //             DCB=(RECFM=FB,LRECL=88)
000041 //SORTOUT   DD SYSOUT=*,DCB=(LRECL=80)
000042 //TOOLIN    DD *
000043   SORT FROM(SORTIN) TO(TEMP88)  USING(CTL1)
000044   SORT FROM(TEMP88) TO(SORTOUT) USING(CTL2)
000045 //CTL1CNTL  DD *
000046   INREC  OVERLAY=(81:SEQNUM,8,ZD)
000047   SORT   FIELDS=(1,80,CH,A,81,8,ZD,A)
000048   OUTREC BUILD=(1,88)
000049 //CTL2CNTL  DD *
000050   SORT   FIELDS=(81,8,ZD,A)
000051   INREC  IFTHEN=(WHEN=INIT,OVERLAY=(89:SEQNUM,8,ZD,RESTART=(1,80)))
000052   OUTFIL INCLUDE=(89,8,ZD,EQ,1),BUILD=(1,80)
****** **************************** Bottom of Data ****************************


as usually a SORT expert might come up with a better solution icon_biggrin.gif
Back to top
View user's profile Send private message
View previous topic :: :: View next topic  
Post new topic   Reply to topic View Bookmarks
All times are GMT + 6 Hours
Forum Index -> JCL & VSAM Goto page 1, 2  Next

 


Similar Topics
Topic Forum Replies
No new posts How to split large record length file... DFSORT/ICETOOL 10
No new posts Extracting Variable decimal numbers f... DFSORT/ICETOOL 17
No new posts SFTP Issue - destination file record ... All Other Mainframe Topics 2
No new posts Access to non cataloged VSAM file JCL & VSAM 18
No new posts Sortjoin and Search for a String and ... DFSORT/ICETOOL 1
Search our Forums:

Back to Top