There are 100 flat files (TAPE) in total, with approximately 1 million records in each file, created monthly since 2009. The record length is 200 bytes; let's call this set1. Every month I will receive another flat file with the same layout, comprising approximately 20 thousand records (set2). I need to compare set2 against set1 on an 18-byte key and write the matched records to an output file.
Notes:
* This will be a monthly process. Set1 data changes every month: the oldest of the 100 files goes out of scope and a new file is added.
* Set2 data is not static; it changes every month.
* There is no general criterion by which I could reduce or eliminate the volume of data in the 100 flat files.
* DB2 is out of scope, as this needs to be finished quickly; working with DBAs on approvals, access, etc. takes quite a long time in our company.
* This will be used only in a batch job.
The queries that I have are:
* How should I handle such a huge volume of data efficiently in terms of storage, performance, CPU time, etc.?
* Should I create a single VSAM KSDS one time to store the data from the 100 flat files (approximately 100 million records in total after removing duplicates) and then do the compare? After the comparison I would write the output to a new file, remove the oldest data, and apply the new month's file to the VSAM data set. There will also be scenarios where I need to update existing records (in the VSAM case).
* Or is it better to use a combined tape file, or the 100 tape files concatenated, instead of going for VSAM, which needs disk storage?
* If I use the tape files directly, I feel the efficiency will be lower than with VSAM.
* Is there a method by which I could split the data and work on it in pieces, or is there some other, better idea?
Reply:
Quote:
* If I use the tape files directly, I feel the efficiency will be lower than with VSAM.
This makes ABSOLUTELY no sense. To create a VSAM data set from your tape data, you will have to read all 100 tape files, sort to remove duplicates, and then define a VSAM data set and load it from the remaining data. Simply reading the 100 tape files and doing your comparisons means you are NOT performing the latter steps of this process, which -- by definition -- means you are increasing efficiency.
Write a program in the language of your choice to read the smaller data set into memory (a COBOL array, for example), and use that to drive your processing. You can load the array in key sequence. This allows you to use binary SEARCH if the tape files are not sorted by key sequence, or merely make one pass through the array for each tape if they are sorted by key sequence. Either way, even adding the time to create the program, you'll use much less time each month than you would by creating a VSAM data set.
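To make that concrete, here is a minimal sketch of such a program, assuming set2 arrives (or is sorted) in ascending key sequence and the 100 tape files are concatenated under a single DD. The DD names (SET2IN, TAPEIN, MATCHOUT), the program name, and the 30,000-entry table limit are my own illustrative assumptions, not a tested implementation:
Code:
       IDENTIFICATION DIVISION.
       PROGRAM-ID. MATCHSET.
      * Load the ~20K set2 keys into a table in ascending key
      * order, then read the tape input (the 100 files can be
      * concatenated under the single TAPEIN DD) and binary-
      * search each record's 18-byte key with SEARCH ALL.
       ENVIRONMENT DIVISION.
       INPUT-OUTPUT SECTION.
       FILE-CONTROL.
           SELECT SET2-FILE  ASSIGN TO SET2IN.
           SELECT TAPE-FILE  ASSIGN TO TAPEIN.
           SELECT MATCH-FILE ASSIGN TO MATCHOUT.
       DATA DIVISION.
       FILE SECTION.
       FD  SET2-FILE RECORDING MODE F.
       01  SET2-REC            PIC X(200).
       FD  TAPE-FILE RECORDING MODE F.
       01  TAPE-REC.
           05  TAPE-KEY        PIC X(18).
           05  FILLER          PIC X(182).
       FD  MATCH-FILE RECORDING MODE F.
       01  MATCH-REC           PIC X(200).
       WORKING-STORAGE SECTION.
       01  WS-SET2-EOF         PIC X     VALUE 'N'.
       01  WS-TAPE-EOF         PIC X     VALUE 'N'.
       01  WS-KEY-COUNT        PIC 9(5)  COMP VALUE 0.
       01  WS-KEY-TABLE.
           05  WS-KEY-ENTRY    OCCURS 1 TO 30000 TIMES
                               DEPENDING ON WS-KEY-COUNT
                               ASCENDING KEY IS WS-KEY
                               INDEXED BY KX.
               10  WS-KEY      PIC X(18).
       PROCEDURE DIVISION.
           OPEN INPUT SET2-FILE TAPE-FILE
                OUTPUT MATCH-FILE
      *    Set2 is assumed already in ascending key sequence;
      *    sort it in a prior step if it is not.
           PERFORM UNTIL WS-SET2-EOF = 'Y'
               READ SET2-FILE
                   AT END MOVE 'Y' TO WS-SET2-EOF
                   NOT AT END
                       ADD 1 TO WS-KEY-COUNT
                       MOVE SET2-REC (1:18)
                         TO WS-KEY (WS-KEY-COUNT)
               END-READ
           END-PERFORM
      *    One pass over the tape input; write every record
      *    whose key is present in the set2 table.
           PERFORM UNTIL WS-TAPE-EOF = 'Y'
               READ TAPE-FILE
                   AT END MOVE 'Y' TO WS-TAPE-EOF
                   NOT AT END
                       SEARCH ALL WS-KEY-ENTRY
                           AT END CONTINUE
                           WHEN WS-KEY (KX) = TAPE-KEY
                               WRITE MATCH-REC FROM TAPE-REC
                       END-SEARCH
               END-READ
           END-PERFORM
           CLOSE SET2-FILE TAPE-FILE MATCH-FILE
           GOBACK.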
Thanks for the response. Apologies if I didn't convey my message clearly. Is it OK to create the VSAM data set ONLY ONE TIME initially from the 100 tape files, and then do inserts and rewrites to the same VSAM data set every month using the new tape file?
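In other words, the monthly apply step I have in mind would look roughly like this sketch (the DD names NEWIN and MASTER, the program name, and the layout are made up for illustration): each new record is inserted, and if the 18-byte key already exists the WRITE raises INVALID KEY and the record is rewritten instead.
Code:
       IDENTIFICATION DIVISION.
       PROGRAM-ID. VSAMUPSR.
      * Hypothetical monthly apply step: insert each new record
      * into the KSDS; if the 18-byte key already exists, the
      * WRITE raises INVALID KEY and we REWRITE instead.
       ENVIRONMENT DIVISION.
       INPUT-OUTPUT SECTION.
       FILE-CONTROL.
           SELECT NEW-FILE  ASSIGN TO NEWIN.
           SELECT KSDS-FILE ASSIGN TO MASTER
               ORGANIZATION IS INDEXED
               ACCESS MODE IS RANDOM
               RECORD KEY IS KSDS-KEY
               FILE STATUS IS WS-KSDS-STAT.
       DATA DIVISION.
       FILE SECTION.
       FD  NEW-FILE RECORDING MODE F.
       01  NEW-REC             PIC X(200).
       FD  KSDS-FILE.
       01  KSDS-REC.
           05  KSDS-KEY        PIC X(18).
           05  FILLER          PIC X(182).
       WORKING-STORAGE SECTION.
       01  WS-KSDS-STAT        PIC XX.
       01  WS-NEW-EOF          PIC X VALUE 'N'.
       PROCEDURE DIVISION.
           OPEN INPUT NEW-FILE
                I-O   KSDS-FILE
           PERFORM UNTIL WS-NEW-EOF = 'Y'
               READ NEW-FILE
                   AT END MOVE 'Y' TO WS-NEW-EOF
                   NOT AT END
                       MOVE NEW-REC TO KSDS-REC
                       WRITE KSDS-REC
                           INVALID KEY
                               REWRITE KSDS-REC
                       END-WRITE
               END-READ
           END-PERFORM
           CLOSE NEW-FILE KSDS-FILE
           GOBACK.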
Reply:
I don't think you've explained nearly enough for an accurate determination to be made. Your original post said that the oldest tape file's data will be dropped each month. How do you determine which records in the VSAM data set are to be dropped each month? If you have a way to determine that, then a VSAM KSDS makes sense. Otherwise, as I pointed out earlier, you'll need to rebuild the VSAM data set every month and that will DEFINITELY be less efficient than just processing the tape files directly.
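For example (and this is purely hypothetical, since your stated layout is just an 18-byte key plus data), if each record carried the year and month of the tape file it came from, a single sequential pass could age out the old months, along these lines:
Code:
       IDENTIFICATION DIVISION.
       PROGRAM-ID. PURGEOLD.
      * Purely hypothetical: assumes bytes 19-24 of each record
      * hold the YYYYMM of the source tape file, so old months
      * can be deleted in one sequential pass. Without such a
      * marker the KSDS cannot be aged out this way.
       ENVIRONMENT DIVISION.
       INPUT-OUTPUT SECTION.
       FILE-CONTROL.
           SELECT KSDS-FILE ASSIGN TO MASTER
               ORGANIZATION IS INDEXED
               ACCESS MODE IS SEQUENTIAL
               RECORD KEY IS KSDS-KEY
               FILE STATUS IS WS-STAT.
       DATA DIVISION.
       FILE SECTION.
       FD  KSDS-FILE.
       01  KSDS-REC.
           05  KSDS-KEY        PIC X(18).
           05  KSDS-YYYYMM     PIC X(6).
           05  FILLER          PIC X(176).
       WORKING-STORAGE SECTION.
       01  WS-STAT             PIC XX.
       01  WS-EOF              PIC X    VALUE 'N'.
      * Illustrative cutoff; in practice this would come from a
      * PARM or a control card, not a hard-coded literal.
       01  WS-CUTOFF           PIC X(6) VALUE '201001'.
       PROCEDURE DIVISION.
           OPEN I-O KSDS-FILE
           PERFORM UNTIL WS-EOF = 'Y'
               READ KSDS-FILE NEXT RECORD
                   AT END MOVE 'Y' TO WS-EOF
                   NOT AT END
                       IF KSDS-YYYYMM < WS-CUTOFF
                           DELETE KSDS-FILE RECORD
                       END-IF
               END-READ
           END-PERFORM
           CLOSE KSDS-FILE
           GOBACK.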
There may be other reasons to build a VSAM data set from the tape data -- online processes or other batch jobs that need the data. Without knowing a lot of the specifics, it is not possible for us to say whether or not building a VSAM data set makes sense.