Two-file comparison, complex match criteria

Deeptha · New User Joined: 11 May 2010 Posts: 6 Location: Bangalore

I have been unable to find a concrete solution for this one.

I have a file, say FileA, that has a million records, and another file, say FileB, that can have 50,000 records. What happens today is that a table, subscripted to hold 50,000 occurrences, gets loaded from FileB. The matching happens between FileA and the table.

Now, the match criteria are not straightforward equality checks. They are complex, for example like this:

1. A = B AND
2. (C = D OR C = Spaces) AND
3. (E = F OR E = Spaces) AND
4. G > H AND
5. I < J

Let us say A,C,E,G,I are from FileA and B,D,F,H,J are from the table.
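
As a minimal sketch of the five-part criteria above, written in Python rather than the production COBOL: the parameter names simply mirror the letters used in this post, and `SPACES` is a placeholder for however a blank field is represented, so treat both as assumptions.

```python
SPACES = ""  # placeholder for a blank field (assumed representation)


def is_match(a, c, e, g, i, b, d, f, h, j):
    """True when a FileA record (a, c, e, g, i) matches a table entry (b, d, f, h, j)."""
    return (
        a == b                          # 1. A = B
        and (c == d or c == SPACES)     # 2. C = D or C is blank
        and (e == f or e == SPACES)     # 3. E = F or E is blank
        and g > h                       # 4. G > H
        and i < j                       # 5. I < J
    )
```

Because the conditions are joined with AND, evaluation stops at the first one that fails, so putting the condition that fails most often first keeps the number of compares per table entry low.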

Note that both files are sorted on A, B, C, D, E, G, H, I & J.

If the match criteria is satisfied, then a matching report is generated. An additional field prefixed on the table is updated to Y for a match.

If NOT, that's where my problem is. The program checks for the following:

1. A < B AND E < 500
2. C < D AND E < 500
3. E < F AND E < 500
4. G < H AND E < 500
5. I > J AND E < 500

If any record matches the above conditions, an unmatched report is generated, the next record on FileA is read, and the search process begins again from the 1st occurrence of the table.

On the other hand, if records meet the following conditions:

1. A < B AND E > 500
2. C < D AND E > 500
3. E < F AND E > 500
4. G < H AND E > 500
5. I > J AND E > 500

then 500 is deducted from E (which is from FileA), and the search starts all over again from the beginning of the table.

If records satisfy:

1. A > B
2. C > D
3. E > F
4. G > H
5. I < J

Then the next element on the table is accessed for searching, and the process continues.

The icing on the cake is that, at the end of FileA, the entire table is unloaded, and during the unload the match indicator on each table entry is checked for the value Y. If it is Y, nothing happens; if not, another unmatched report is generated.
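
The end-of-run unload described above amounts to one pass over the table that picks out the entries whose indicator was never set. A small Python sketch, assuming each table entry is a dict and that the indicator lives under a hypothetical `match_ind` key:

```python
def unmatched_table_entries(table):
    """Yield the table entries whose match indicator was never set to 'Y'."""
    for entry in table:
        if entry["match_ind"] != "Y":
            yield entry


# Usage with made-up entries; the field names are illustrative only.
table = [
    {"key": "AAA", "match_ind": "Y"},
    {"key": "BBB", "match_ind": " "},
]
for entry in unmatched_table_entries(table):
    print("unmatched table entry:", entry["key"])
```

This pass is a single scan of at most 50,000 entries, so it is unlikely to be where the run time goes.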

The real problem in production is that this program runs for 22 hours, and the CPU consumption is 5 hours. We are trying to tune this program by eliminating the use of an intermediate table and using just 2 sequential files to process.

Please suggest the best way of doing it. My problem is the part where 500 is deducted from a value in FileA and the search process restarts all over again on the table. Is there a way around this?

Thanks

dick scherrer · Posted: Wed May 12, 2010 9:04 am

Hello and welcome to the forum,

Suggest you review the code for gross inefficiencies. The 50,000 records are not by chance being read over and over for each record in the main input. . .? This would be a mistake, but it could be happening and causing most of the lost/wasted time.

Suggest the way the table/array is searched be looked at. Eliminate things as quickly as possible to reduce the number of unneeded compares.
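
One way to eliminate entries quickly, sketched here in Python and assuming the table really is in ascending order on the equality key (the B field in the original post), is to binary-search for the block of entries with that key and run the remaining compares only inside that block. The function and variable names are illustrative, not taken from the program.

```python
from bisect import bisect_left, bisect_right


def candidate_range(table_keys, key):
    """Return (lo, hi) such that table_keys[lo:hi] are exactly the entries equal to key.

    table_keys must be sorted ascending; an absent key gives an empty range (lo == hi).
    """
    lo = bisect_left(table_keys, key)
    hi = bisect_right(table_keys, key)
    return lo, hi


# Usage: only the entries in keys[lo:hi] need the remaining compares.
keys = ["A", "B", "B", "C"]
lo, hi = candidate_range(keys, "B")
print(lo, hi)  # 1 3
```

In COBOL the nearest equivalent is SEARCH ALL against a table defined with an ASCENDING KEY clause, which likewise depends on the table being loaded in key order.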

Suggest that maybe multiple tables/arrays be used instead of only one?

It will help someone help you if you get rid of the alphabet soup and post some fieldnames that people can relate to. Also, post some input records, a sample table, and the output wanted when that input data and sample table are processed.

Take heart - some of my current processes read between 10 and 100 million very large records and do considerable array processing and run in only a couple of hours (depending on the system load).

Binop B · Posted: Wed May 12, 2010 10:03 am

Hi Deeptha...

First of all got to appreciate the way you have told us the requirement/problem... You have taken the effort to put most of the details here in an ordered way... Including Dick's suggestions will make it perfect...

Adding onto Dick's suggestions... my suggestion would be to try and understand the business functionality of this program... Probably this code was written by some amateur long back and once you know the business perspective it might help...

dick scherrer · Posted: Thu May 13, 2010 4:40 am

Hello,

If you do not follow up, it is nearly impossible for us to help. . .

Have you determined that the file to build the array is only opened once?

How is the array defined? Searched?

Deeptha · New User Joined: 11 May 2010 Posts: 6 Location: Bangalore

Dick :

You wouldn't believe it if I said I am still trying to find my way around solving this. The table, as I initially mentioned, is built from FileB before the matching routine starts, and it can hold 50,000 occurrences. How it is searched: it is almost like a PERFORM inside a loop; there is no SEARCH statement as such. It PERFORMs the match routine, and when the criteria are not met, it does a GO TO back to the same match routine. So the routine goes through several iterations.

What's making my work tough is this: I have a variable, FILEA-LOC-CODE, from the driver file that is checked for being greater than 500. If it is, 500 is deducted from FILEA-LOC-CODE and the search begins all over again on the table (from the 1st element). If the value is less than 500 and the record does not get through the match, it is considered unmatched.
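
To make the cost of that restart concrete, here is a small Python sketch of the sequence of location codes one driver record can end up being searched with if no match is found along the way. The function name is made up, and since the behaviour at exactly 500 is not stated above, the sketch assumes that only values strictly greater than 500 are reduced.

```python
def loc_codes_to_try(loc_code, step=500):
    """Yield the location codes tried for one FileA record, in order."""
    yield loc_code
    while loc_code > step:      # assumption: exactly 500 is not reduced
        loc_code -= step
        yield loc_code


print(list(loc_codes_to_try(1200)))  # [1200, 700, 200]
```

Each value in the list means another search from the 1st element of the table, so a record with a code of 1200 can cost up to three passes.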

In one of my wild attempts to get somewhere, I tried using a couple more SORT steps before the program runs (ironically, I noticed that the file that loads the table is not sorted). When I tried doing a few date manipulations with SORT, I achieved a CPU reduction of 11%.

Today's mission is to try singling out the matching process into SORT / ICETOOL and let the program handle only the non-matching process. I am keeping my fingers crossed. I was busy the whole of yesterday trying to get somewhere (which I did, but not to the extent I want), so I did not get a chance to give you the file / table structures. I will do so today.

Any help is MOST welcome.

Thanks

Robert Sample · Posted: Thu May 13, 2010 7:52 am

If your site has STROBE or another performance analysis tool, use it. STROBE can tell you exactly what line(s) of code the program is hitting the most, and which line(s) of code take the most CPU time -- not always the same lines, either.

If you don't have a performance analysis tool available, you can do it yourself. Get the counts for the various conditions, and organize your code to place the most common conditions first. Rewrite the code to reduce the IF statements as much as possible. For example,

dick scherrer · Posted: Thu May 13, 2010 8:16 am

Hello,

Deeptha · New User Joined: 11 May 2010 Posts: 6 Location: Bangalore

So here is where I got to today, at 3:12 PM IST.

The COBOL program that runs in PROD does a few date manipulations on a few fields on the table, e.g.:

IF TABLE-CENTURY = '0' MOVE '20' TO WORK-DATE-CC

The match criteria then use WORK-DATE to match against FileA's WORK-DATE.

The small progress I made today was to push all these date manipulations into a SORT card and make sure the COBOL code was free of them. I ran a test on this with a test input driver file of 50,000 records, and these are the stats:

PROD version of the program having 50,000 records in the input

CPU Time : 00:00:27.51
CPU Units : 1,045,690
Elapsed Time : 00:00:31.06

Test version of the program having 50,000 records in the input

CPU Time : 00:00:20.20
CPU Units : 767,773
Elapsed Time : 00:00:26.84
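
The savings implied by these figures can be checked with a few lines of Python. The numbers are the ones listed above; the function name is just illustrative.

```python
def pct_reduction(before, after):
    """Percentage saved going from `before` to `after`."""
    return (before - after) / before * 100.0


cpu_seconds = pct_reduction(27.51, 20.20)
cpu_units = pct_reduction(1_045_690, 767_773)
elapsed_seconds = pct_reduction(31.06, 26.84)

print(f"CPU time:  {cpu_seconds:.1f}%")      # 26.6%
print(f"CPU units: {cpu_units:.1f}%")        # 26.6%
print(f"Elapsed:   {elapsed_seconds:.1f}%")  # 13.6%
```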

I don't know if I should feel good about the roughly 26.6% benefit I get in CPU time. I'd like to do better than this...still trying...I give myself a day's time...if nothing, I am throwing in the towel!

Robert Sample · Posted: Thu May 13, 2010 4:43 pm

Deeptha · New User Joined: 11 May 2010 Posts: 6 Location: Bangalore

Robert :

Wish I had all the time in the world to tune this program. Unfortunately we are caught in a system where deadlines ARE the end of the world. I'm sorry you consider this a waste of your time.

Saying goodbye to this forum.

dbzTHEdinosauer · Posted: Thu May 13, 2010 7:24 pm

dick scherrer · Posted: Fri May 14, 2010 7:56 am

Hello,

Deeptha · New User Joined: 11 May 2010 Posts: 6 Location: Bangalore

It is unfortunate to see one of the 'Global Moderators' break the rule of respect that is mentioned in your forum rules: give respect, be patient and help. "Throw in the towel now and stop wasting our time." Is that your way of giving respect? I think that comment was premature, and by no stretch was it respectful.

I don't see why I should humor any of you.

Thanks for your time.

dick scherrer · Posted: Fri May 14, 2010 8:46 am

Hello,

dbzTHEdinosauer · Posted: Sat May 15, 2010 5:56 pm

Before one can design the process, one must understand the data.
goes without saying,
can't debug or improve the performance of a process,
without understanding the data.

This abstract
A, B, C, D <>= Q, M, J
nonsense
gives no-one a chance to participate fully.

That the TS has taken his ball and gone home is his problem,
but hopefully future TSs will describe their process a little better.