Split a very large FB file based on Key into small files.

VivekKhanna · New User Joined: 09 Feb 2009 Posts: 57 Location: India

Hi,

I want to split a very large file into smaller files based on a key. The smaller files may have different numbers of records, but the main requirement is:

All the records with matching keys should exist in the same file.

For example, my input file is:

Akatsukami · Posted: Mon Apr 16, 2012 8:37 pm

How is the number of data sets determined?

VivekKhanna · New User Joined: 09 Feb 2009 Posts: 57 Location: India

Maximum number of output datasets is 99.

Bill Woodger · Posted: Mon Apr 16, 2012 8:47 pm

I think you were asked a little more than that.

How do you determine, from your data, which of up to 99 output datasets gets a particular key written to it.

Your input sample shows unsorted, your output sorted, yes? Or is this the "magic" "dynamic" sort of thing?

Are all datasets to be written to every time the job is run?

Please explain your requirement as fully and clearly as you can. Read through it a few times before posting. Show samples for input and expected out. RECFM/LRECL for input and outputs.

VivekKhanna · New User Joined: 09 Feb 2009 Posts: 57 Location: India

I don't need a particular key to go to a specific output dataset. I only need all records with the same key to be in the same dataset (not spanning multiple datasets).

All the datasets will be written. The output should be sorted, which can be achieved by sorting on the key fields.

This is a one-time execution, and I want to avoid writing a COBOL program for it. The number of records in each output dataset may differ, up to a maximum of 5. Consider the following example:

Skolusu · Posted: Mon Apr 16, 2012 10:25 pm

VivekKhanna · New User Joined: 09 Feb 2009 Posts: 57 Location: India

Bill Woodger · Posted: Tue Apr 17, 2012 3:05 pm

If you are going to get anything out of this, you're going to have to be accurate and fully detailed in your description of the requirement.

Go through everything you have said.

Give that to a colleague and ask them to sketch out on paper what can be understood from it.

Go through everything you have been asked.

Provide answers for all those and run it by the colleague again.

If the colleague is still unclear, provide them with clarification.

Once complete, post all the answers here.

You're asking for an amount of work to be done, and no-one wants to do it three times because you are lacking in your ability to describe your requirement to others.

VivekKhanna · New User Joined: 09 Feb 2009 Posts: 57 Location: India

OK, I'll restate my problem. I have a fixed-length file with the first four bytes as the key, as shown below:

Bill Woodger · Posted: Tue Apr 17, 2012 5:47 pm

So your file isn't really that big?

You show a single record of one key along with four records of another key in one output file. Is that vital, or can they be in two files?

What if you have more keys than fit in 99 output files?

What if you have more than five of one key?

If you GROUP on the key, with an ID, you could then use 99 OUTFIL INCLUDE specifying the ID value serially.

If you want to be "minimal", you'll have to have a sequence number at the very least as part of the grouping, and since you could have up to five keys going to the same file, the code would extend in complexity.
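The GROUP-with-ID suggestion above can be modelled outside of sort as a rough sketch. This is Python, not DFSORT syntax, and it only illustrates the logic: each run of equal keys gets a running ID, and the ID selects the output file. The function name `route_by_group_id`, the 4-byte key length, and the sample records are assumptions made for illustration, and the input is assumed to be already sorted on the key.

```python
from itertools import groupby

def route_by_group_id(records, key_len=4, max_files=99):
    """Give each run of equal keys a running ID and route it to the file with that ID.

    Assumes the records are already sorted (or at least grouped) on the key.
    """
    files = {}
    groups = groupby(records, key=lambda r: r[:key_len])
    for group_id, (_key, group) in enumerate(groups, start=1):
        if group_id > max_files:
            # One key per file only works while distinct keys <= max_files.
            raise ValueError("more distinct keys than output files")
        files[group_id] = list(group)
    return files

# Hypothetical sample data: first four bytes are the key.
sample = ["1111A", "1111B", "2222A", "3333A", "3333B", "3333C"]
routed = route_by_group_id(sample)
```

With one ID per key, the approach stops working as soon as there are more distinct keys than output files, which is why the sequence number enters the discussion.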

VivekKhanna · New User Joined: 09 Feb 2009 Posts: 57 Location: India

I need to perform this kind of operation on a file holding about 400-900 million records, and I need to split it into 99 smaller files holding around 10 million records each. I need to make sure that the last record written to each smaller file is the last record for its key (for example, 1111/2222). Yes, a smaller file can contain different keys.

dbzTHEdinosauer · Posted: Tue Apr 17, 2012 6:40 pm

Viv,
thx,
now we all know what you need.

enrico-sorichetti · Posted: Tue Apr 17, 2012 6:50 pm

Your explanation of the requirement is still not clear ...

where does the 5 come from....
does it mean that every key has at most 5 occurrences....

since IT is pretty deterministic, it is not unusual ( understatement )
to have the desire to know a file content

up to 99 ????

if the number of different keys in the input file is less than or equal to 99
each file will contain all the occurrences of only one key.

it is clear that all the occurrences of a key must be in one file.

for example to keep it short
for keys in sequence 1, 2, 3, ..., 98,99,100, ...
file1 ==> 1
file2 ==> 2
file3 ==> 3
...
file98 ==> 98
file99 ==> 99

and after that
file1 ==> 1, 100
file2 ==> 2, 101
file3 ==> 3, 102
...
file98 ==> 98,197
file99 ==> 99, 198

and after that
file1 ==> 1,100,199
file2 ==> 2,101,200

and so on

obtaining the correct information is much more difficult than providing a solution
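The round-robin layout listed above (key 1 to file1, key 2 to file2, ..., key 100 back to file1) can be sketched as follows. This is a Python model of the distribution only, not a sort solution; `round_robin_by_key`, the 4-byte key, and the small demo with 3 files instead of 99 are assumptions for illustration. Input is assumed to be sorted on the key.

```python
from itertools import groupby

def round_robin_by_key(records, key_len=4, n_files=99):
    """Deal whole key groups out to the files in turn, so no key is ever split."""
    files = [[] for _ in range(n_files)]
    groups = groupby(records, key=lambda r: r[:key_len])
    for i, (_key, group) in enumerate(groups):
        # The i-th distinct key (0-based) goes to file number (i mod n_files).
        files[i % n_files].extend(group)
    return files

# Hypothetical demo: 7 distinct keys, 3 files, key 0002 occurring twice.
demo = ["0001a", "0002a", "0002b", "0003a", "0004a", "0005a", "0006a", "0007a"]
dealt = round_robin_by_key(demo, n_files=3)
```

Because the groups are dealt in key order, each file stays sorted, but file sizes depend entirely on how many records each key happens to have.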

Bill Woodger · Posted: Tue Apr 17, 2012 7:09 pm

OK.... a surprising turn of events from the sample data...

So, you have an input file. You want to write the data to an output file with a maximum number of records in that which cannot exceed 10,000,000. You must not split keys across files. You do this as many times as necessary until your input is exhausted.

Have you given us the real key-length and LRECL?

How many records can you have for the same key?

It would be much easier if the output could be 10-million-and-a-bit, the "bit" being those remaining records of the same key as the 10 millionth.

What are you going to do with the output?

If, on one run, you have "only" 400,000,000, do you want them split evenly across 99 files, or do you want them still in lumps of 10-million-and-a-bit?
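The "10-million-and-a-bit" rule described in this post can be modelled as a simple pass over sorted input: keep writing to the current file until the limit is reached, then carry on until the key changes. The sketch below is Python, not DFSORT, and uses a limit of 3 rather than 10,000,000; the name `split_with_overrun` and the sample records are assumptions for illustration.

```python
def split_with_overrun(records, key_len=4, limit=10_000_000):
    """Cut a new file only when the limit is reached AND the key changes.

    Assumes the records are sorted on the key, so each key is one contiguous run.
    """
    files, current = [], []
    for rec in records:
        limit_reached = len(current) >= limit
        if limit_reached and rec[:key_len] != current[-1][:key_len]:
            files.append(current)
            current = []
        current.append(rec)
    if current:
        files.append(current)
    return files

# Hypothetical demo with a tiny limit.
demo_records = ["AAAA1", "AAAA2", "BBBB1", "BBBB2", "BBBB3", "CCCC1"]
lumps = split_with_overrun(demo_records, limit=3)
```

In the demo, the third record is the first BBBB, so the first file runs on past the limit until the BBBB records are finished.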

VivekKhanna · New User Joined: 09 Feb 2009 Posts: 57 Location: India

LRECL will be around 300 bytes.

There can be a maximum of 50 records for same key.

10 million + some X records is acceptable. (X can vary between 01-50).

Output will be FTPed.

On one run, there can be more than 400,000,000. The target system cannot hold more than 10 million + X records in a single file, so splitting is mandatory.

Bill Woodger · Posted: Tue Apr 17, 2012 9:35 pm

OK. At a maximum of 50 per key, that gives 200,000+ keys. Do you have an alphanumeric key, or is it a bit longer than you have shown?

"Around 300" for the LRECL. You mean it is VB max of 300, or what?
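The arithmetic behind the key question can be checked quickly. The sketch below is one reading of the numbers quoted in the thread (50 records per key at most, 10,000,000 records per file, at least 400,000,000 records in total). The comparison against a 4-byte key made only of digits, or of digits and uppercase letters, is an assumption added for illustration; the thread does not say what characters the real key uses.

```python
MAX_PER_KEY = 50
RECORDS_PER_FILE = 10_000_000
MIN_TOTAL_RECORDS = 400_000_000

# Fewest distinct keys needed to fill one file / the whole input.
min_keys_per_file = RECORDS_PER_FILE // MAX_PER_KEY
min_keys_total = MIN_TOTAL_RECORDS // MAX_PER_KEY

# Distinct values available in a 4-byte key (assumed alphabets).
numeric_4 = 10 ** 4          # digits only
alphanumeric_4 = 36 ** 4     # digits plus uppercase letters

print(min_keys_per_file, min_keys_total, numeric_4, alphanumeric_4)
```

Under these assumptions, neither alphabet yields enough distinct 4-byte values for the stated volumes, which would explain why the key length is being questioned.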

Bill Woodger · Posted: Wed Apr 18, 2012 12:09 am

An idea, demonstrated with 80-byte fixed-length records.

A sequence number (10 digits) for every record.

A GROUP on the key, which pushes the first four digits of the sequence number (I've fudged this in the example, so it works with groups of 10, not 10,000,000).

OUTFIL to distribute the data based on the first four of the record which was the start of the GROUP.

I put a "SAVE" for any overflow over 99 files (two files in my example).
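The logic of this idea can be modelled as below. This is a Python sketch, not the DFSORT control statements from the post: every record gets a 10-digit sequence number, the sequence number of the first record of each key group is carried across the group, and its leading digits choose the output file. As in the post, the demo is fudged to lumps of 10 records and two files, with anything beyond that going to an overflow list standing in for SAVE. The name `split_by_pushed_seq`, the parameter names, and the sample data are assumptions for illustration.

```python
from itertools import groupby

def split_by_pushed_seq(records, key_len=4, seq_width=10, prefix_len=9, max_files=2):
    """Route each key group by the leading digits of its first record's sequence number.

    Assumes the records are sorted on the key. With seq_width=10 and prefix_len=9
    the leading digits change every 10 records (the fudged demo size).
    """
    files, save = {}, []
    seq = 0
    for _key, group in groupby(records, key=lambda r: r[:key_len]):
        group = list(group)
        first_seq = f"{seq + 1:0{seq_width}d}"   # sequence number at start of group
        seq += len(group)
        bucket = int(first_seq[:prefix_len])     # value "pushed" to the whole group
        if bucket < max_files:
            files.setdefault(bucket + 1, []).extend(group)
        else:
            save.extend(group)                   # overflow beyond max_files
    return files, save

# Hypothetical demo: four keys with 8, 5, 7 and 5 records.
demo_in = [f"{k:04d}{n}" for k, count in ((1, 8), (2, 5), (3, 7), (4, 5)) for n in range(count)]
out_files, overflow = split_by_pushed_seq(demo_in)
```

Since a group is routed as a whole by where it starts, a file can run over the lump size by up to one key's worth of records, and the following file is correspondingly shorter.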