Comparing 2 unsorted files

andrea · New User Joined: 08 Jun 2012 Posts: 9 Location: Italia

Hi all,
first of all, greetings to everybody.

Now, here is what I need (I already searched for a solution, but I wasn't able to find one).

I have two files, produced by two different procedures (an unknown* old one and a brand-new one), that should match, but they are often slightly different.

*unknown = without source code and without any documentation, developed many years ago by people who have since retired.

The files are sequential flat files, without any key or anything similar.

I need to compare them and extract the lines that are equal, writing them to a new file in the same order in which they appear in the original files.

It would also be useful (but it's not a must) to extract the lines that are only in the first file and the lines that are only in the second one.

I posted my question here because the only suggestions I found require DFSORT or ICETOOL, which, when I tried them, changed the order of the extracted lines (I compared a COBOL program with a slightly modified copy of it, and the result file began with the blank lines).

It's important to say that I cannot know whether each file contains records that appear as duplicates, possibly far apart from each other.

krishna_ragav · New User Joined: 29 Oct 2010 Posts: 10 Location: Chennai

Hi,

Need more clarity on your requirement. There should be some fields which must be present in both the files. Please look for common fields and try your options.

Nic Clouston · Posted: Mon Jun 11, 2012 5:01 pm

You can get sort to add a sequence number as it READS the files, then do your analysis and then sort the output files on the sequence number that was added on input and build your output without the keys. Samples abound. But I suspect you need keys unless you use the whole record as the key.
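To make the idea concrete outside of sort, here is a minimal Python sketch of the same four steps: number the records on input, sort on the whole record, pair equal records, then sort back on the added number and drop it. The function name `match_keep_order` and the sample data are made up for illustration; this is not DFSORT syntax, only the logic a sort job would follow.

```python
def match_keep_order(file1, file2):
    """Return the records of file1 that also occur in file2, in file1's order."""
    # Step 1: add a sequence number to each record as it is read.
    tagged1 = [(rec, seq) for seq, rec in enumerate(file1, start=1)]
    tagged2 = [(rec, seq) for seq, rec in enumerate(file2, start=1)]

    # Step 2: sort both inputs using the whole record as the key.
    tagged1.sort()
    tagged2.sort()

    # Step 3: merge the sorted inputs, pairing equal records one-to-one.
    matched = []
    i = j = 0
    while i < len(tagged1) and j < len(tagged2):
        if tagged1[i][0] == tagged2[j][0]:
            matched.append(tagged1[i])
            i += 1
            j += 1
        elif tagged1[i][0] < tagged2[j][0]:
            i += 1
        else:
            j += 1

    # Step 4: sort back on the sequence number and build output without it.
    matched.sort(key=lambda pair: pair[1])
    return [rec for rec, _ in matched]


# Hypothetical sample data, one string per record.
print(match_keep_order(["B", "A", "C", "A"], ["A", "C", "D"]))  # ['A', 'C']
```

Because equal records are paired one-to-one, a record that appears twice in the first input but once in the second is reported as matched only once.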

andrea · New User Joined: 08 Jun 2012 Posts: 9 Location: Italia

I saw some solutions using those utilities in the DFSORT/ICETOOL section.
But every one I tried modifies the original order of my files.

It's important that I can say, for example, that the n-th record of the first file doesn't exist in the second, or vice versa.

Imagine we have:
File1:
AAAAA
BBBBB
CCCCC
DDDDD
AAAAA
EEEEE

File2:
AAAAA
BBBBB
FFFFF
CCCCC
CCCCC
EEEEE

In this case, records 1, 2, 3, 6 of File1 are contained in File2; records 4 and 5 are not.
Records 1, 2, 4, 6 of File2 are contained in File1; records 3 and 5 are not.
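The rule implied by this example (a duplicate counts as contained only while the other file still has an unused copy of it) can be sketched in Python as follows. This is only an illustration of the expected result, not a mainframe solution; the function name `split_by_presence` is hypothetical, and it assumes each file fits in memory as a list of records.

```python
from collections import Counter


def split_by_presence(records, other):
    """Return (found, missing): the 1-based record numbers of `records`
    that do / do not have a still-unused equal record in `other`."""
    available = Counter(other)  # copies of each record held by `other`
    found, missing = [], []
    for number, rec in enumerate(records, start=1):
        if available[rec] > 0:
            available[rec] -= 1  # each copy in `other` can pair only once
            found.append(number)
        else:
            missing.append(number)
    return found, missing


file1 = ["AAAAA", "BBBBB", "CCCCC", "DDDDD", "AAAAA", "EEEEE"]
file2 = ["AAAAA", "BBBBB", "FFFFF", "CCCCC", "CCCCC", "EEEEE"]

print(split_by_presence(file1, file2))  # ([1, 2, 3, 6], [4, 5])
print(split_by_presence(file2, file1))  # ([1, 2, 4, 6], [3, 5])
```

Note that this pairs records regardless of position: the first copy found is the one that matches, which is an assumption about how duplicates should be treated.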

Bill Woodger · Posted: Mon Jun 11, 2012 6:13 pm

Don't you have a file comparison product available?

Anuj Dhawan · Posted: Mon Jun 11, 2012 7:04 pm

For a start, let's consider the entire record as the key. But we still need to know the LRECL of the inputs, how many records are in each file, and what options you can choose from, e.g. SORT (which one: DFSORT or SyncSort), COBOL, or any other language.

Pandora-Box · Posted: Mon Jun 11, 2012 7:57 pm

For this input

File1

dick scherrer · Posted: Mon Jun 11, 2012 10:14 pm

Hello,

If there is no "key" and the records have no "sequence" other than arrival order, how is a duplicate or a difference identified?

It may help if you post some "real" data (not 500+ bytes, but only enough to demonstrate the actual data). If the data is sensitive, change the values consistently between the 2 sample input files.

dbzTHEdinosauer · Posted: Mon Jun 11, 2012 11:41 pm

this is the kind of thing that superc does very well.

taking advantage of the C.4.1 Update control file (LINE Compare Type) generated,
one could parse this and use as input to sort to generate the
files desired (matches, inserts, deletes).

you want the files compared line by line.
sort does not do this well when there is no key involved.

what you want is a utility that will, and that would be superc (ispf option 3.12).

because of superc's output methodology, you would only have the sequence number of the record in one-or-the-other file
(records > 133 would not display the complete record),
which could then be used as a key to obtain the complete record for sort to generate the actual files.
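
as a rough picture of what a line-by-line compare produces (matches, inserts, deletes, each with its record number), here is a small Python sketch. it uses Python's standard difflib module, not superc, and it does not reproduce the superc update control file layout; the function name `line_compare` is made up for illustration.

```python
from difflib import SequenceMatcher


def line_compare(old, new):
    """Compare two lists of records in order.

    Returns (matches, deletes, inserts): matches are the records common to
    both in order; deletes are (record number, record) pairs found only in
    `old`; inserts are (record number, record) pairs found only in `new`.
    """
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    matches, deletes, inserts = [], [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            matches.extend(old[i1:i2])
        else:
            # "delete" has an empty range in new, "insert" an empty range
            # in old, and "replace" contributes to both lists.
            deletes.extend((n + 1, old[n]) for n in range(i1, i2))
            inserts.extend((n + 1, new[n]) for n in range(j1, j2))
    return matches, deletes, inserts


file1 = ["AAAAA", "BBBBB", "CCCCC", "DDDDD", "AAAAA", "EEEEE"]
file2 = ["AAAAA", "BBBBB", "FFFFF", "CCCCC", "CCCCC", "EEEEE"]

matches, deletes, inserts = line_compare(file1, file2)
print(matches)
print(deletes)
print(inserts)
```

the record numbers in the deletes and inserts lists play the role of the key described above: they identify which complete records to pull from each input.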