File comparison to find most efficient way

ashishsr123 · New User Joined: 06 May 2008 Posts: 33 Location: Chennai

Hi,

Interview question !

Scenario:

We have 3 files:
File A : contains Merchant ID only..10 in number.
File B: Contains transaction records for various merchants. Here are Merchants including merchants in file A. This file is HUGE ,having millions of records.

Output:
File C: Having transaction details of merchants in file A.
Note: File B has records which has merchant ID's as well. Merchant id is numeric number.

We have to find most efficient way to achieve this. A simple algorithm is required irrespective of any language.

Any ideas:

I suggest we sort the file b and then do binary search on file B against file merchant id is file A.( Now i see there will be lot issues with this as binary search would pic the first matching record and move out.)

We could also do a sequential search on file B and write record to file C when we have a match for merchant ID in file B.

Robert Sample · Posted: Wed May 27, 2009 6:30 pm

Write a short program which reads file A into a table in memory. Make one pass against file B, checking each merchant id against table in memory; if there is a match output file B record to file C.

Your suggestion requires random access to the file for the binary search; I'm not even sure it would be possible to implement this under VSAM. Binary search also wouldn't necessarily find the first matching record -- depending on how the keys occur, any matching record could be selected. You would then have to read backward and forward in the file B to find the other matches.

As far as your suggestion of a sequential search on B, why make one pass against millions of records for each record of A when a table will reduce it to one pass, total?

ashishsr123 · New User Joined: 06 May 2008 Posts: 33 Location: Chennai

dbzTHEdinosauer · Posted: Wed May 27, 2009 9:03 pm

use sort to match a and b and output to c.
no program, only control cards.

Robert Sample · Posted: Wed May 27, 2009 9:31 pm

Yes, occurs in COBOL.

sajjan jindal · New User Joined: 09 Sep 2007 Posts: 60 Location: india

I assume that the File B will consist a single matching record for the record in File - A (however the below algo wont need much of modification for multiple records into the File B)

1. Start Reading the records from both the files until EOF
2. If File-A-Rec > File-B-Rec
Read next record from the File-B.
3. Else If File-A-Rec < File-B-Rec
Read next record from the File-A.
4. Else if File-A-Rec = File-B-Rec
write the record to File-C-Rec.
Read next record from the File-A.
Read next record from the File-B.
5. go to step 1

dick scherrer · Posted: Mon Jul 06, 2009 7:16 pm

Hello,

For this particular question, possibly the most important bit of information wasn't provided/posted. . .

Are file A and file B in merchant id sequence or not?

If they are not in sequence, Rovert's suggestion would be most efficient.

If they are in sequence, DBZ's sort solution would be efficient.

If the rules were that this must be done with a program, code like pseudo code from sajjan jindal could work well. This is known as a "2-file match/merge" or "line balance". There is a "Sticky" near the top of the cobol forum that contains working code that will provide this functionality with very minor modification.

ashishsr123 · New User Joined: 06 May 2008 Posts: 33 Location: Chennai

Hello,

Dick:
Both the files are not sorted

sajjan jindal:
Your solution would not fit here as File B can have many matching records for File A.
Your solution would work in both the Files are sorted. Since it contains millions of records , it is not feasible.

I think Robert solution is simple and best.

Arun Raj · Posted: Sat Jul 11, 2009 7:21 pm