IBM Mainframe Forum Index
 

Split a very large FB file based on Key into small files.


IBM Mainframe Forums -> DFSORT/ICETOOL
VivekKhanna

New User


Joined: 09 Feb 2009
Posts: 57
Location: India

Posted: Mon Apr 16, 2012 8:27 pm

Hi,

I want to split a very large file into smaller files based on a key. The smaller files may hold different numbers of records, but the main requirement is:

All the records with matching keys should exist in the same file.

For example, my input file is:
Code:

----+----1----+----2----+----3----+----4----+----5
1111   AABBCCDD
1111   ABCDABCD
2222   ABABCDCD
3333   ACACBDBD
1111   AACCBBDD
3333   ARDECRED


The key in the above file is the first four bytes of the record. If the file is split into 2 files (maximum records per output file = 4), the output should be:

File 1:
Code:

----+----1----+----2----+----3----+----4----+----5
1111   AABBCCDD
1111   ABCDABCD
1111   AACCBBDD
2222   ABABCDCD


File 2:
Code:

----+----1----+----2----+----3----+----4----+----5
3333   ACACBDBD
3333   ARDECRED
Akatsukami

Global Moderator


Joined: 03 Oct 2009
Posts: 1788
Location: Bloomington, IL

Posted: Mon Apr 16, 2012 8:37 pm

How is the number of data sets determined?
VivekKhanna

New User


Joined: 09 Feb 2009
Posts: 57
Location: India

Posted: Mon Apr 16, 2012 8:40 pm

Maximum number of output datasets is 99.
Bill Woodger

Moderator Emeritus


Joined: 09 Mar 2011
Posts: 7309
Location: Inside the Matrix

Posted: Mon Apr 16, 2012 8:47 pm

I think you were asked a little more than that.

How do you determine, from your data, which of up to 99 output datasets gets a particular key written to it?

Your input sample shows unsorted, your output sorted, yes? Or is this the "magic" "dynamic" sort of thing?

Are all datasets to be written to every time the job is run?

Please explain your requirement as fully and clearly as you can. Read through it a few times before posting. Show samples for input and expected out. RECFM/LRECL for input and outputs.
VivekKhanna

New User


Joined: 09 Feb 2009
Posts: 57
Location: India

Posted: Mon Apr 16, 2012 9:03 pm

I don't need a particular key to go into a specific output dataset; I only want all records with the same key to be in the same dataset (not spanning multiple datasets).

All the datasets will be written. The output should be sorted; that can be achieved by sorting on the key field.

As this is a one-time execution, I want to avoid writing a COBOL program for it. The number of records per output dataset may differ, up to a maximum of 5. Consider the following example:

Code:

----+----1----+----2----+----3
1111   AABBCCDD
1111   AACCBBDD
1111   AADDCCBB
1111   AABCBCDD
2222   ACACBDBD
2222   AACCBBDD
2222   AACCDDBB
3333   AACCVVRR
3333   AACCDDBB
3333   AABBCCDD
3333   AABBDDCC


Since the input file has 11 records, 99 output files will be created, with only 3 holding data and the rest empty. The output should be as shown below:

File 1:
Code:

----+----1----+----2----+----3
1111   AABBCCDD
1111   AACCBBDD
1111   AADDCCBB
1111   AABCBCDD


File 2:
Code:

----+----1----+----2----+----3
2222   ACACBDBD
2222   AACCBBDD
2222   AACCDDBB


File 3:
Code:

----+----1----+----2----+----3
3333   AACCVVRR
3333   AACCDDBB
3333   AABBCCDD
3333   AABBDDCC
Skolusu

Senior Member


Joined: 07 Dec 2007
Posts: 2205
Location: San Jose

Posted: Mon Apr 16, 2012 10:25 pm

VivekKhanna wrote:

I don't need a particular key to go into a specific output dataset; I only want all records with the same key to be in the same dataset (not spanning multiple datasets).
As this is a one-time execution, I want to avoid writing a COBOL program for it. The number of records per output dataset may differ, up to a maximum of 5. Consider the following example:

Code:

----+----1----+----2----+----3
1111   AABBCCDD
1111   AACCBBDD
1111   AADDCCBB
1111   AABCBCDD
2222   ACACBDBD
2222   AACCBBDD
2222   AACCDDBB
3333   AACCVVRR
3333   AACCDDBB
3333   AABBCCDD
3333   AABBDDCC



You are NOT consistent with your rules. If the maximum number of records in an output file is 5, and as per the sample key 1111 has 4 records and key 2222 has 3 records, then if you consider 5 records shouldn't key 2222 also be in the same file as 1111?

What happens if you have 8 records for the key value 1111? 8 > 5, so should it be split into a new file? That would contradict your earlier rule that a key shouldn't be split across files. So make up your mind about how you want to split the file.

What is the LRECL and RECFM of the file?
VivekKhanna

New User


Joined: 09 Feb 2009
Posts: 57
Location: India

Posted: Tue Apr 17, 2012 2:59 pm

Quote:

I only want all records with the same key to be in the same dataset (not spanning multiple datasets).


This means that the output files should not be like:
Code:

----+----1----+----2----+----3
1111   AABBCCDD
1111   AACCBBDD
1111   AADDCCBB
1111   AABCBCDD
2222   ACACBDBD
Bill Woodger

Moderator Emeritus


Joined: 09 Mar 2011
Posts: 7309
Location: Inside the Matrix

Posted: Tue Apr 17, 2012 3:05 pm

If you are going to get anything out of this, you're going to have to be accurate and fully detailed in your description of the requirement.

Go through everything you have said.

Give that to a colleague and ask them to sketch out on paper what can be understood from it.

Go through everything you have been asked.

Provide answers for all those and run it by the colleague again.

If the colleague is still unclear, provide them with clarification.

Once complete, post all the answers here.

You're asking for an amount of work to be done, and no-one wants to do it three times because you are lacking in your ability to describe your requirement to others.
VivekKhanna

New User


Joined: 09 Feb 2009
Posts: 57
Location: India

Posted: Tue Apr 17, 2012 4:07 pm

OK, I'll restate my problem. I have a fixed-length file with the first four bytes as the key, as shown below:

Code:

----+----1----+----2----+----3
1111   AABBCCDD
1111   AACCBBDD
1111   AADDCCBB
1111   AABCBCDD
2222   ACACBDBD
2222   AACCBBDD
2222   AACCDDBB
3333   AACCVVRR
3333   AACCDDBB
3333   AABBCCDD
3333   AABBDDCC
4444   AABBCCDD
5555   AABBDDCC


The maximum number of records that an output file can hold is 5. I want to split this big file in such a way that records with the same key do not span multiple files. If I use a utility like ICEMAN (splitting purely by record count), it will break the above file up in the following manner:

File 1:
Code:

----+----1----+----2----+----3
1111   AABBCCDD
1111   AACCBBDD
1111   AADDCCBB
1111   AABCBCDD
2222   ACACBDBD


File 2:
Code:

----+----1----+----2----+----3
2222   AACCBBDD
2222   AACCDDBB
3333   AACCVVRR
3333   AACCDDBB
3333   AABBCCDD


File 3:
Code:

----+----1----+----2----+----3
3333   AABBDDCC
4444   AABBCCDD
5555   AABBDDCC


We can see that key '2222' spans two different files, File 1 and File 2. Similarly, key '3333' spans File 2 and File 3.

The requirement is that a key must not span different files. Thus the output should be:

File 1:
Code:

----+----1----+----2----+----3
1111   AABBCCDD
1111   AACCBBDD
1111   AADDCCBB
1111   AABCBCDD


File 2:
Code:

----+----1----+----2----+----3
2222   ACACBDBD
2222   AACCBBDD
2222   AACCDDBB


File 3:
Code:

----+----1----+----2----+----3
3333   AACCVVRR
3333   AACCDDBB
3333   AABBCCDD
3333   AABBDDCC
4444   AABBCCDD


File 4:
Code:

----+----1----+----2----+----3
5555   AABBDDCC



There is no strict rule that I have to send 5 records in a file; the record count of each file can differ (but a file can hold a maximum of 5 records). Please let me know if any other information is required.
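
The rule as restated in this post can be modelled in a few lines of Python. This is only a sketch of the logic, not a DFSORT solution: the function name `split_by_key` and its parameters are made up for illustration, and it assumes the input is already in key order and that no single key has more records than the per-file maximum.

```python
from itertools import groupby

def split_by_key(records, max_per_file, key_len=4):
    """Pack whole key groups into files of at most max_per_file records."""
    files, current = [], []
    # groupby only merges adjacent records, so the input must be in key order
    for _, group in groupby(records, key=lambda r: r[:key_len]):
        group = list(group)
        if len(group) > max_per_file:
            raise ValueError("a single key exceeds the per-file maximum")
        # start a new file when the whole group no longer fits in this one
        if current and len(current) + len(group) > max_per_file:
            files.append(current)
            current = []
        current.extend(group)
    if current:
        files.append(current)
    return files

# The 13 sample records from this post
sample = [
    "1111   AABBCCDD", "1111   AACCBBDD", "1111   AADDCCBB", "1111   AABCBCDD",
    "2222   ACACBDBD", "2222   AACCBBDD", "2222   AACCDDBB",
    "3333   AACCVVRR", "3333   AACCDDBB", "3333   AABBCCDD", "3333   AABBDDCC",
    "4444   AABBCCDD", "5555   AABBDDCC",
]
files = split_by_key(sample, max_per_file=5)
for number, contents in enumerate(files, start=1):
    print(f"File {number}: {len(contents)} records")
```

With the sample data this yields four files of 4, 3, 5 and 1 records, the same layout as File 1 to File 4 above.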
Bill Woodger

Moderator Emeritus


Joined: 09 Mar 2011
Posts: 7309
Location: Inside the Matrix

Posted: Tue Apr 17, 2012 5:47 pm

So your file isn't really that big?

You show a single record of one key along with four of another key in one output file. Is that vital, or can they be in two files?

What if you have more keys than fit in 99 output files?

What if you have more than five of one key?

If you GROUP on the key, with an ID, you could then use 99 OUTFIL INCLUDE specifying the ID value serially.

If you want to be "minimal", you'll have to have a sequence number at the very least as part of the grouping, and since you could have up to five keys going to the same file, the code would extend in complexity.
VivekKhanna

New User


Joined: 09 Feb 2009
Posts: 57
Location: India

Posted: Tue Apr 17, 2012 6:35 pm

I need to perform this kind of operation on a file holding about 400-900 million records, and I need to split it into 99 smaller files holding around 10 million records each. I need to make sure that the last record written to each smaller file is the last record for its key (for example 1111/2222). Yes, a smaller file can contain different keys.
dbzTHEdinosauer

Global Moderator


Joined: 20 Oct 2006
Posts: 6966
Location: porcelain throne

Posted: Tue Apr 17, 2012 6:40 pm

Viv,
thx,
now we all know what you need.
enrico-sorichetti

Superior Member


Joined: 14 Mar 2007
Posts: 10873
Location: italy

Posted: Tue Apr 17, 2012 6:50 pm

Your explanation of the requirement is still not clear ...

where does the 5 come from....
does it mean that every key has at most 5 occurrences....

since IT is pretty deterministic, it is not unusual (understatement)
to want to know a file's content

up to 99 ????

if the number of distinct keys in the input file is less than or equal to 99
each file will contain only all the occurrences of one key.

it is clear that all the occurrences of a key must be in one file.

for example to keep it short
for keys in sequence 1, 2, 3, ..., 98,99,100, ...
file1 ==> 1
file2 ==> 2
file3 ==> 3
...
file98 ==> 98
file99 ==> 99

and after that
file1 ==> 1, 100
file2 ==> 2, 101
file3 ==> 3, 102
...
file98 ==> 98,197
file99 ==> 99, 198

and after that
file1 ==> 1,100,199
file2 ==> 2,101,200

and so on
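
The round-robin assignment listed above comes down to a modulo on the ordinal of each distinct key. A minimal sketch in Python, where `file_for_key` is a made-up name and the keys are assumed to be numbered 1, 2, 3, ... in sequence as in the example; note that this scheme keeps each key in one file but does nothing to cap the number of records per file:

```python
from collections import defaultdict

def file_for_key(key_ordinal, n_files=99):
    """Map the n-th distinct key (1-based) to a file number 1..n_files."""
    return (key_ordinal - 1) % n_files + 1

# Reproduce the listing above for the first 200 distinct keys
assignment = defaultdict(list)
for ordinal in range(1, 201):
    assignment[file_for_key(ordinal)].append(ordinal)

print(assignment[1])   # keys routed to file1
print(assignment[98])  # keys routed to file98
```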

obtaining the correct information is much more difficult than providing a solution

Bill Woodger

Moderator Emeritus


Joined: 09 Mar 2011
Posts: 7309
Location: Inside the Matrix

Posted: Tue Apr 17, 2012 7:09 pm

OK.... a surprising turn of events from the sample data...

So, you have an input file. You want to write the data to an output file holding no more than 10,000,000 records. You must not split keys across files. You do this as many times as necessary until your input is exhausted.

Have you given us the real key-length and LRECL?

How many records can you have for the same key?

It would be much easier if the output could be 10-million-and-a-bit, the "bit" being those remaining records of the same key as the 10 millionth.

What are you going to do with the output?

If, on one run, you have "only" 400,000,000, do you want them split evenly across 99 files, or do you want them still in lumps of 10-million-and-a-bit?
VivekKhanna

New User


Joined: 09 Feb 2009
Posts: 57
Location: India

Posted: Tue Apr 17, 2012 9:27 pm

LRECL will be around 300 bytes.

There can be a maximum of 50 records for same key.

10 million + some X records is acceptable. (X can vary between 01-50).

Output will be FTPed.

On one run, there can be more than 400,000,000. The target system cannot hold more than 10 million + X records in a single file, so splitting is mandatory.
Bill Woodger

Moderator Emeritus


Joined: 09 Mar 2011
Posts: 7309
Location: Inside the Matrix

Posted: Tue Apr 17, 2012 9:35 pm

OK. At a maximum of 50 per key, that gives 200,000+ keys. Do you have an alphanumeric key, or is it a bit longer than you have shown?
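
As a rough check on the figures in this exchange, here is the arithmetic in Python. Two things are assumptions rather than statements from the thread: reading "200,000+" as the minimum number of keys in one 10-million-record file, and using a 36-character alphabet (A-Z, 0-9) for an alphanumeric key.

```python
total_records = 400_000_000   # lower end of the stated 400-900 million
per_file = 10_000_000         # records per output file
max_per_key = 50              # stated maximum records for one key

# Fewest keys that can fill one file, and fewest distinct keys overall
min_keys_per_file = per_file // max_per_key
min_distinct_keys = total_records // max_per_key

# Distinct values a 4-byte key can take
numeric_values = 10 ** 4        # digits only, as in the samples
alphanumeric_values = 36 ** 4   # assumed alphabet: A-Z plus 0-9

print(min_keys_per_file, min_distinct_keys)
print(numeric_values, alphanumeric_values)
```

On these assumptions a 4-byte key cannot supply the millions of distinct values the volumes require, which is why the question about the real key length matters.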

"Around 300" for the LRECL. You mean it is VB max of 300, or what?
Bill Woodger

Moderator Emeritus


Joined: 09 Mar 2011
Posts: 7309
Location: Inside the Matrix

Posted: Wed Apr 18, 2012 12:09 am

An idea, demonstrated with 80-byte fixed-length records.

A sequence number (10 digits) for every record.

A GROUP on the key, which pushes the first four digits of the sequence number (I've fudged this in the example, so it works with groups of 10, not 10,000,000).

OUTFIL to distribute the data based on the first four of the record which was the start of the GROUP.

I put a "SAVE" for any overflow over 99 files (two files in my example).

Code:
//BISPLT  EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTOF01 DD SYSOUT=*
//SORTOF02 DD SYSOUT=*
//SORTOFOV DD SYSOUT=*
//SORTOUT DD SYSOUT=*
//SYSIN DD *
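* Copy only; a new group starts whenever columns 1-5 change.
  OPTION COPY
* Put a 10-digit sequence number in 81-90; at the start of each group,
* copy digits 86-89 of it to 91-94 on every record of that group.
  INREC IFTHEN=(WHEN=INIT,OVERLAY=(81:SEQNUM,10,ZD)),
        IFTHEN=(WHEN=GROUP,KEYBEGIN=(1,5),PUSH=(91:86,4))
* Route on the pushed value; SAVE catches all unselected records.
  OUTFIL FILES=01,INCLUDE=(91,4,ZD,EQ,00),BUILD=(1,80)
  OUTFIL FILES=02,INCLUDE=(91,4,ZD,EQ,01),BUILD=(1,80)
  OUTFIL FILES=OV,SAVE,BUILD=(1,80)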
//SORTIN DD *                                   


Lightly tested with 34 records :-)
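
To see what the job above does to a small input, the routing can be simulated in Python. This is a model of the logic only, under the assumption that the input is in key order. `route_groups` and its parameters are invented names, and the sample records are placeholders that use the key counts from the earlier 13-record example.

```python
def route_groups(records, key_len=5, group_size=10, n_files=2):
    """Model of SEQNUM + WHEN=GROUP/PUSH + OUTFIL INCLUDE/SAVE routing."""
    outputs = [[] for _ in range(n_files)]
    overflow = []
    prev_key, file_id = None, None
    for seq, rec in enumerate(records, start=1):   # SEQNUM starts at 1
        key = rec[:key_len]
        if key != prev_key:                         # KEYBEGIN: new group
            # PUSH copies four digits from the group's first record;
            # with groups of 10 these are the digits above the units digit
            file_id = (seq // group_size) % 10000
            prev_key = key
        if file_id < n_files:                       # OUTFIL INCLUDE ... EQ id
            outputs[file_id].append(rec)
        else:                                       # OUTFIL SAVE
            overflow.append(rec)
    return outputs, overflow

# Placeholder records: 4, 3, 4, 1 and 1 records for keys 1111..5555
counts = (("1111", 4), ("2222", 3), ("3333", 4), ("4444", 1), ("5555", 1))
sample = [key + "   DATA" for key, n in counts for _ in range(n)]

outputs, overflow = route_groups(sample)
print([len(f) for f in outputs], len(overflow))
```

Key 3333 starts at record 8 and so goes to the first file in full, which gives that file 11 records: the "10-and-a-bit" behaviour discussed earlier in the thread.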

The multiple OUTFILS can easily be generated.
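
One possible way to generate them is a short Python script that prints the statements in the same pattern as the two shown in the job. The function name and its parameters are hypothetical; the column positions default to those of the 80-byte example and would need adjusting for the real LRECL and ID length.

```python
def outfil_statements(n_files=99, id_pos=91, id_len=4, lrecl=80):
    """Build OUTFIL statements in the pattern used in the job above."""
    if not 1 <= n_files <= 99:
        raise ValueError("FILES=nn allows a two-character suffix only")
    lines = []
    for n in range(1, n_files + 1):
        # File nn receives the groups whose pushed ID is nn - 1
        lines.append(
            f"  OUTFIL FILES={n:02d},"
            f"INCLUDE=({id_pos},{id_len},ZD,EQ,{n - 1:02d}),"
            f"BUILD=(1,{lrecl})"
        )
    # Catch-all for anything beyond the last numbered file
    lines.append(f"  OUTFIL FILES=OV,SAVE,BUILD=(1,{lrecl})")
    return lines

statements = outfil_statements()
print("\n".join(statements[:3]))
print(statements[-1])
```

Each FILES=nn statement still needs a matching SORTOFnn DD in the JCL, as in the job above.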
All times are GMT + 6 Hours