Storing, comparing, and processing a huge volume of data

Pradeep K M · New User Joined: 13 Jan 2017 Posts: 2 Location: India

Hi,

There are 100 flat files (on tape) with approximately 1 million records each, one created every month since 2009. The record length is 200 bytes. Let's call this set1. Every month I also receive another flat file with the same layout containing approximately 20 thousand records: set2. I need to compare set2 with set1 on an 18-byte key and then write the matched records to an output file.

Notes:

* It will be a monthly process. Set1 changes every month: the oldest of the 100 files goes out of scope and a new file is added.
* Set2 is not static; it changes every month.
* There are no general criteria I could use to reduce or eliminate data from the 100 flat files.
* DB2 is out of scope because this needs to be finished quickly. Working with DBAs and getting approvals, access, etc. takes quite a long time in our company.
* This will be used only in a batch job.

My questions are:

* How should I handle such a large volume of data efficiently in terms of storage, performance, CPU time, etc.?
* Should I create a single VSAM KSDS once to store the data from the 100 flat files (approximately 100M records in total after removing duplicates) and then do the compare? After the comparison I would write the output to a new file, remove the oldest data, and load the new file into the VSAM data set. There will also be scenarios where I need to update existing records (in the VSAM case).
* Or is it better to use a combined tape file, or the 100 tape files concatenated, instead of VSAM, which needs disk storage?
* If I use tape files, I expect the efficiency will be lower than with VSAM.
* Is there a way to split the data and work on it in parts, or is there a better idea?
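
As an illustration of the compare itself, separate from the VSAM-versus-tape question: set2 is small (about 20,000 keys of 18 bytes, roughly 360 KB), so one possible approach is to hold its keys in memory and read set1 sequentially once. The Python sketch below is not mainframe code and only shows the logic. It assumes the key starts at byte 0 of the record and that the set1 records are the ones written out; neither detail is stated in the post, and names such as `match_records` are hypothetical.

```python
import io

RECORD_LEN = 200   # from the post: fixed 200-byte records
KEY_LEN = 18       # from the post: 18-byte compare key
KEY_OFFSET = 0     # ASSUMPTION: key starts at byte 0; the post does not say


def read_records(stream, record_len=RECORD_LEN):
    """Yield fixed-length records from a binary stream."""
    while True:
        rec = stream.read(record_len)
        if not rec:
            break
        if len(rec) != record_len:
            raise ValueError("truncated record")
        yield rec


def record_key(rec):
    """Return the compare key of one record."""
    return rec[KEY_OFFSET:KEY_OFFSET + KEY_LEN]


def match_records(set2_stream, set1_streams, out_stream):
    """Write every set1 record whose key appears in set2; return the count.

    ASSUMPTION: the set1 record is the one written to the output.
    """
    # Only the small file is held in memory, as a set of keys.
    wanted = {record_key(r) for r in read_records(set2_stream)}
    written = 0
    # Each set1 file is read once, front to back, in the order given.
    for stream in set1_streams:
        for rec in read_records(stream):
            if record_key(rec) in wanted:
                out_stream.write(rec)
                written += 1
    return written


if __name__ == "__main__":
    def rec(key, fill):
        # Build one 200-byte test record: key padded to 18 bytes, then filler.
        return key.ljust(KEY_LEN) + fill * (RECORD_LEN - KEY_LEN)

    set2 = io.BytesIO(rec(b"A", b"2") + rec(b"B", b"2"))
    month1 = io.BytesIO(rec(b"A", b"1") + rec(b"C", b"1"))
    month2 = io.BytesIO(rec(b"B", b"1"))
    out = io.BytesIO()
    print(match_records(set2, [month1, month2], out))
```

The design choice here is that only set2 is held in memory, so set1 is read once in sequence and never needs a keyed index for the compare alone.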

Robert Sample · Posted: Mon Jan 16, 2017 6:20 pm

Pradeep K M

To Robert Sample:

Thanks for the response. Apologies if I didn't convey my message clearly. Is it OK to create the VSAM data set ONLY ONE TIME, initially from the 100 tape files, and then do inserts and rewrites against that same VSAM data set every month using the new tape file?

Robert Sample · Posted: Mon Jan 16, 2017 8:36 pm

I don't think you've explained nearly enough for an accurate determination to be made. Your original post said that the oldest tape file's data will be dropped each month. How do you determine which records in the VSAM data set are to be dropped each month? If you have a way to determine that, then a VSAM KSDS makes sense. Otherwise, as I pointed out earlier, you'll need to rebuild the VSAM data set every month and that will DEFINITELY be less efficient than just processing the tape files directly.

There may be other reasons to build a VSAM data set from the tape data -- online processes or other batch jobs that need the data. Without knowing a lot of the specifics, it is not possible for us to say whether or not building a VSAM data set makes sense.
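
To make the question of which records to drop concrete, here is a minimal Python sketch of one possible answer: stamp each stored record with the month of the file it last came from, so that a monthly purge can remove everything older than a cutoff. A plain dictionary stands in for the keyed data set. The month stamp, the function names, and the rule that a key seen again in a newer file takes the newer stamp are all assumptions for illustration; nothing in the thread says the real records carry such a field or follow that retention rule.

```python
def apply_month(store, month, records, key_of):
    """Insert or rewrite records from one monthly file.

    ASSUMPTION: each entry is stamped with a 'YYYYMM' string, and a key
    that reappears in a newer file takes the newer stamp.
    """
    for rec in records:
        store[key_of(rec)] = (month, rec)


def drop_older_than(store, cutoff_month):
    """Remove every entry stamped before cutoff_month; return how many."""
    # 'YYYYMM' strings sort in date order, so plain comparison works.
    stale = [k for k, (month, _) in store.items() if month < cutoff_month]
    for k in stale:
        del store[k]
    return len(stale)


if __name__ == "__main__":
    store = {}
    first_byte = lambda r: r[:1]   # hypothetical key extractor for the demo
    apply_month(store, "200901", [b"A-old", b"B-old"], first_byte)
    apply_month(store, "200902", [b"A-new"], first_byte)
    print(drop_older_than(store, "200902"), sorted(store))
```

If no such stamp can be derived from the data, this approach does not apply, which is the case where rebuilding every month becomes the only option.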