The task from the original topic seems to me quite interesting, since it is a special functionality I was not faced previously.
As said before, a straightforward solution creates intermediate set of rows as cartesian product with the size equal to the product of (file1 size) * (file2 size). For a limited size of input data this approach is acceptable, and may be preferred since it's quite simple. But for huge data files another approach is needed, to provide only single scan of the huge file.
One of possible solutions has been given at the end of the original topic. It works fine for big files, but nevertheless later I found a serious drawback in that sample. When selection key ranges are not prepared correctly, e.g. some key ranges are overlapped (for instance, KKKKKK-MMMMMM, and LLLLLL-NNNNNN), then incorrect output might be produced with no indication that something was wrong. Automatic verification and/or correction of input ranges is not very trivial task (BTW, it also might be a good exercise as training on SORT algorithms and methods).
A simple decision would be just indication that input ranges are not specified correctly, to prevent further use of wrong output. Only minor update to the previous code is needed.
Code:
//*=====================================================================
//* SORT AND SPLIT RANGE LIST INTO TWO SEPARATE LISTS: FROM/TO VALUES
//*=====================================================================
//*
//RANGES EXEC PGM=SYNCSORT
//*
//SYSOUT DD SYSOUT=*
//*
//SORTIN DD *
999999777777
333333555555
DDDDDDDDDDDD
AAAAAABBBBBB
PPPPPPQQQQQQ
$$$$$$$$$$$$
XXXXXXYYYYYY
KKKKKKLLLLLL
//*
//RANGES DD DISP=(NEW,PASS),
// UNIT=SYSDA,SPACE=(TRK,(50,50),RLSE),
// DSN=&&RANGES
//*
//SYSIN DD *
*
INCLUDE COND=(1,6,CH,LE,7,12,CH) eliminate bad ranges
*
SORT FIELDS=(1,6,CH,A) sort left range ascending
*
OUTFIL FNAMES=(RANGES), create list of left/right limits
BUILD=(1,6,74X'00', left limit with min of data
/,7,6,74X'FF') right limit with max of data
*keys data
*----- ----------------------------------------
*from1 0000000000000000000000000000000000000000
*to1 FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
*from2 0000000000000000000000000000000000000000
*to2 FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
* . . . . . . . . . . . . . . . . . . . . . . .
*fromN 0000000000000000000000000000000000000000
*toN FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
END
//*
//*=====================================================================
//* MERGE HUGE INPUT, AND TWO RANGE FILES; ALL MUST BE SORTED
//*=====================================================================
//BETWEEN EXEC PGM=SYNCSORT
//*
//SYSOUT DD SYSOUT=*
//*
//SORTIN01 DD DISP=(OLD,DELETE),DSN=&&RANGES left/right limits
//SORTIN02 DD * input data (huge)
$$$$$$123456
######234567
@@@@@@345678
AAAAAA123456
BBBBBB234567
CCCCCC345678
DDDDDD456789
EEEEEE123456
FFFFFF234567
GGGGGG345678
HHHHHH456789
IIIIII123456
JJJJJJ234567
KKKKKK345678
LLLLLL456789
MMMMMM123456
NNNNNN234567
OOOOOO345678
PPPPPP456789
QQQQQQ123456
RRRRRR234567
SSSSSS345678
TTTTTT456789
UUUUUU123456
VVVVVV234567
WWWWWW345678
XXXXXX456789
YYYYYY123456
ZZZZZZ234567
000000456789
111111345678
222222234567
333333123456
444444456789
555555345678
666666234567
777777123456
888888456789
999999345678
//*
//SELECTED DD SYSOUT=*
//*
//SYSIN DD *
*
MERGE FIELDS=(1,6,CH,A, merge using record keys
7,6,CH,A) place range records before/after data groups
*
OUTREC IFTHEN=(WHEN=GROUP, detect groups of allowed ranges
BEGIN=(7,1,CH,EQ,X'00'), marker of left range record
END=(7,1,CH,EQ,X'FF'), marker of right range record
PUSH=(81:7,6)) mark group records as valid
*
OUTFIL FNAMES=(SELECTED),
INCLUDE=(7,1,CH,NE,X'00', exclude left markers
AND,7,1,CH,NE,X'FF', exclude right markers
AND,81,6,CH,EQ,X'00') include only valid records
*
END
//*
The last solution from previous post prevents from producing wrong results of a huge dataset (when it becomes difficult to detect that some wrong data have been included). It causes JCL step to ABEND when given list of ranges is not normalized to non-overlapping list of left-right values.
A much better approach would be to fix the pairs of ranges in the input dataset, including any combination of the following:
1) eliminate wrong ranges, like ZZZZZZ-XXXXXX
2) combine the fully nested ranges, like AAAAAA-DDDDDD, and BBBBBB-CCCCCC into single AAAAAA-DDDDDD
3) combine partially overlapping ranges, like KKKKKK-MMMMMM, and LLLLLL-NNNNNN into KKKKKK-NNNNNN
4) combine adjacent ranges, like PPPPPP-QQQQQQ, and QQQQQQ-SSSSSS into single PPPPPPP-SSSSSS
Of course, this can be done using any computer tool - COBOL/REXX/Assembler/PLI/… +100 others. But it is preferable to always use the tool already used for master task being implemented; in this case SORT utility + JCL. It is not recommended to involve many different tools for one single task unless it is really inevitable.
Fortunately, modern SORT utilities do provide vast variety of operations on flat tables (stored in datasets) to fulfil the required input table normalization. The only thing required is - the ability to think in terms of table operations - very similar to those provided by languages like SQL.
Here is one of possible solutions, implemented as simple sequence of JCL steps, each performing SORT utility for 1-2-3 required operations on several intermediate tables.
Let's consider the steps one by one.
Code:
//*
//*SET DEBUG='SYSOUT=*'
// SET DEBUG=DUMMY
// SET LOG='SYSOUT=*'
//*SET LOG=DUMMY
//*
//*=====================================================================
//* PRE-PROCESS THE RANGE LIST:
//* 1) ELIMINATE BAD RANGES (WHERE LEFT > RIGHT VALUE)
//* 2) MERGE OVERLAPPING, and ADJACENT RANGES INTO LARGER SINGLE ONE
//* 3) PREPARE "VERTICAL" RANGE LIST TO BE USED WHEN MERGING HUGE DATABA
//*=====================================================================
//SPLIT EXEC PGM=SYNCSORT
//*
//SYSOUT DD &DEBUG
//*
//SORTIN DD *
999999777777
222222444444
333333333333
333333666666
555555555555
111111222222
000000222222
000000000000
888888999999
999999777777
333333555555
444444666666
DDDDDDDDDDDD
AAAAAACCCCCC
PPPPPPQQQQQQ
BBBBBBIIIIII
$$$$$$$$$$$$
TTTTTTYYYYYY
KKKKKKLLLLLL
//*
//REPORT DD &LOG
//SPLIT DD DISP=(NEW,PASS),
// UNIT=SYSDA,SPACE=(TRK,(10,10),RLSE),
// DSN=&&SPLIT
//*
//*
//SYSIN DD *
INCLUDE COND=(1,6,CH,LE,7,12,CH) eliminate bad ranges immediately
*
SORT FIELDS=COPY
*
OUTFIL FNAMES=(REPORT,SPLIT),
BUILD=(1,6,X,C'+1', mark nesting level to increase by 1
/,7,6,X,C'-1') mark nesting level to decrease by 1
*
END
//*
//*=====================================================================
This preprocessing step splits un-normalized list of ranges into vertical list of values, and ignores initially bad ranges
Next two steps do the same with end-values of each range
Code:
//*=====================================================================
//RIGHTS EXEC PGM=SYNCSORT
//*
//SYSOUT DD &DEBUG
//*
//SORTIN DD DISP=(OLD,DELETE),DSN=&&SPLIT
//*
//REPORT DD &LOG
//RIGHTS DD DISP=(NEW,PASS),
// UNIT=SYSDA,SPACE=(TRK,(10,10),RLSE),
// DSN=&&RIGHTS
//*
//*
//SYSIN DD *
SORT FIELDS=(1,6,CH,A, normal order of ranges
8,2,SFF,D) place start-range before end-range
*
OUTFIL FNAMES=(REPORT,RIGHTS),
REMOVECC,NODETAIL,
BUILD=(1,9,50X), reserve space for SUBTOTAL
SECTIONS=(1,6,
TRAILER3=(1,6,X,
* for each range border value calculate its nesting level
SUBTOTAL=(8,2,SFF,EDIT=(TTTTT))))
*
END
//*
//*=====================================================================
It calculates nesting level for all end-range values
//*=====================================================================
//ENDS EXEC PGM=SYNCSORT
//*
//SYSOUT DD &DEBUG
//*
//SORTIN DD DISP=(OLD,DELETE),DSN=&&RIGHTS
//*
//REPORT DD &LOG
//ENDS DD DISP=(NEW,PASS),
// UNIT=SYSDA,SPACE=(TRK,(10,10),RLSE),
// DSN=&&ENDS
//*
//*
//SYSIN DD *
INCLUDE COND=(11,5,ZD,EQ,+0) select only most outer right ranges
*
SORT FIELDS=(1,6,CH,A) sort ranges in normal order
*
OUTREC BUILD=(1,6,X,SEQNUM,5,ZD) assign seq number to each range
*
OUTFIL FNAMES=(REPORT,ENDS)
*
END
//*
//*=====================================================================
And again, it also re-orders the end-range values in ascending order, and assigns sequential number to each value:
Finally, two ordered lists of start/end ranges are to be joined
Code:
//*=====================================================================
//RANGES EXEC PGM=SYNCSORT
//*
//SYSOUT DD &DEBUG
//*
//STARTS DD DISP=(OLD,DELETE),DSN=&&STARTS
//ENDS DD DISP=(OLD,DELETE),DSN=&&ENDS
//*
//REPORT DD &LOG
//VRANGE DD DISP=(NEW,PASS),
// UNIT=SYSDA,SPACE=(TRK,(10,10),RLSE),
// DSN=&&VRANGE
//*
//*
//SYSIN DD *
* Join matching start-end range values by range seq number
JOINKEYS F1=STARTS,
FIELDS=(8,5,A), seq number of left range record
SORTED
JOINKEYS F2=ENDS,
FIELDS=(8,5,A), seq number of right range record
SORTED
*
REFORMAT FIELDS=(F1:1,6, use both: left border
F2:1,6) and right border
*
SORT FIELDS=COPY
*
OUTFIL FNAMES=(REPORT), viewable report on final ranges
HEADER1=(C'From To ',
/,C'------ ------'),
BUILD=(1,6,C'-',7,6)
*
OUTFIL FNAMES=(VRANGE), vertical list of ranges to merge later
BUILD=(1,6,74X'00', provide left range BEFORE any data value
/,7,6,74X'FF') provide right range AFTER any data value
*
END
//*
It produces the list of normalized ranges for the following values
Code:
From To
------ ------
$$$$$$-$$$$$$
AAAAAA-IIIIII
KKKKKK-LLLLLL
PPPPPP-QQQQQQ
TTTTTT-YYYYYY
000000-666666
888888-999999
The dataset DSN=&&VRANGE can be used in MERGE step of the previous post with no chance of wrong data to be extracted from the huge input dataset.
Final trick in this normalizing the input ranges list.
This is for ICETOOL/SYNCTOOL lovers.
Code:
//*
//*SET DEBUG='SYSOUT=*'
// SET DEBUG=DUMMY
// SET LOG='SYSOUT=*'
//*SET LOG=DUMMY
//*
//*=====================================================================
//* PRE-PROCESS THE RANGE LIST:
//* 1) ELIMINATE BAD RANGES (WHERE LEFT > RIGHT VALUE)
//* 2) MERGE OVERLAPPING, and ADJACENT RANGES INTO LARGER SINGLE ONE
//* 3) PREPARE "VERTICAL" RANGE LIST TO BE USED WHEN MERGING HUGE DATA
//*=====================================================================
//FIXRANGE EXEC PGM=SYNCTOOL
//*
//TOOLMSG DD &DEBUG
//SSMSG DD &DEBUG
//*
//SORTIN DD *
$$$$$$$$$$$$
$$$$$$######
$$$$$$@@@@@@
ZZZZZZZZZZZZ
YYYYYYZZZZZZ
XXXXXXZZZZZZ
999999777777
222222444444
333333333333
333333666666
555555555555
111111222222
000000222222
000000000000
888888999999
999999777777
333333555555
444444666666
DDDDDDDDDDDD
AAAAAACCCCCC
PPPPPPQQQQQQ
BBBBBBIIIIII
$$$$$$$$$$$$
TTTTTTYYYYYY
KKKKKKLLLLLL
//*
//SPLIT@ DD &LOG
//SPLIT DD UNIT=SYSDA,SPACE=(TRK,(10,10),RLSE)
//LEFTS@ DD &LOG
//LEFTS DD UNIT=SYSDA,SPACE=(TRK,(10,10),RLSE)
//STARTS@ DD &LOG
//STARTS DD UNIT=SYSDA,SPACE=(TRK,(10,10),RLSE)
//RIGHTS@ DD &LOG
//RIGHTS DD UNIT=SYSDA,SPACE=(TRK,(10,10),RLSE)
//ENDS@ DD &LOG
//ENDS DD UNIT=SYSDA,SPACE=(TRK,(10,10),RLSE)
//REPORT DD &LOG
//VRANGE DD DISP=(NEW,PASS),
// UNIT=SYSDA,SPACE=(TRK,(10,10),RLSE),
// DSN=&&VRANGE
//*
//TOOLIN DD *
SORT FROM(SORTIN) TO(SPLIT,SPLIT@) USING(S001)
SORT FROM(SPLIT) TO(LEFTS,LEFTS@) USING(S002)
SORT FROM(LEFTS) TO(STARTS,STARTS@) USING(S003)
SORT FROM(SPLIT) TO(RIGHTS,RIGHTS@) USING(S004)
SORT FROM(RIGHTS) TO(ENDS,ENDS@) USING(S005)
COPY JKFROM TO(REPORT,VRANGE) USING(S006)
//*
//S001CNTL DD *
INCLUDE COND=(1,6,CH,LE,7,12,CH) eliminate bad ranges immediately
SORT FIELDS=COPY
OUTFIL FNAMES=(SPLIT,SPLIT@),
BUILD=(1,6,X,C'+1', mark nesting level to increase by 1
/,7,6,X,C'-1') mark nesting level to decrease by 1
END
//*
//S002CNTL DD *
SORT FIELDS=(1,6,CH,D, reverse order of ranges
8,2,SFF,A) place end-range before start-range
OUTFIL FNAMES=(LEFTS,LEFTS@),
REMOVECC,NODETAIL,
BUILD=(1,9,50X), reserve space for SUBTOTAL
SECTIONS=(1,6,
TRAILER3=(1,6,X,
* for each range border value calculate its nesting level
SUBTOTAL=(8,2,SFF,EDIT=(TTTTT))))
END
//*
//S003CNTL DD *
INCLUDE COND=(11,5,ZD,EQ,+0) select only most outer left ranges
SORT FIELDS=(1,6,CH,A) sort chosen ranges in normal order
OUTREC BUILD=(1,6,X,SEQNUM,5,ZD) assign seq number to each range
OUTFIL FNAMES=(STARTS,STARTS@)
END
//*
//S004CNTL DD *
SORT FIELDS=(1,6,CH,A, normal order of ranges
8,2,SFF,D) place start-range before end-range
OUTFIL FNAMES=(RIGHTS,RIGHTS@),
REMOVECC,NODETAIL,
BUILD=(1,9,50X), reserve space for SUBTOTAL
SECTIONS=(1,6,
TRAILER3=(1,6,X,
* for each range border value calculate its nesting level
SUBTOTAL=(8,2,SFF,EDIT=(TTTTT))))
END
//*
//S005CNTL DD *
INCLUDE COND=(11,5,ZD,EQ,+0) select only most outer right ranges
SORT FIELDS=(1,6,CH,A) sort ranges in normal order
OUTREC BUILD=(1,6,X,SEQNUM,5,ZD) assign seq number to each range
OUTFIL FNAMES=(ENDS,ENDS@)
END
//*
//S006CNTL DD *
* Join matching start-end range values by range seq number
JOINKEYS F1=STARTS,
FIELDS=(8,5,A), seq number of left range record
SORTED
JOINKEYS F2=ENDS,
FIELDS=(8,5,A), seq number of right range record
SORTED
REFORMAT FIELDS=(F1:1,6, use both: left border
F2:1,6) and right border
SORT FIELDS=COPY
OUTFIL FNAMES=(REPORT), viewable report on final ranges
HEADER1=(C'From To ',
/,C'------ ------'),
BUILD=(1,6,C'-',7,6)
OUTFIL FNAMES=(VRANGE), vertical list of ranges to merge later
BUILD=(1,6,74X'00', provide left range BEFORE any data value
/,7,6,74X'FF') provide right range AFTER any data value
END
//*
//*=====================================================================
Code:
From To
------ ------
$$$$$$-@@@@@@
AAAAAA-IIIIII
KKKKKK-LLLLLL
PPPPPP-QQQQQQ
TTTTTT-ZZZZZZ
000000-666666
888888-999999