Fuzzy logic matching between strings

karthick Ramesh · New User Joined: 19 Mar 2008 Posts: 12 Location: India

Hi all,

I am looking for a substring matching solution that finds the longest sequence of letters that are common to two strings and appear in the same order in both (not necessarily consecutively). An example is below:

Duplicate - Chris
Original - Christopher

Original - Andrew
Duplicate - Drew

We are trying something along the lines of a fuzzy logic search.

Is this kind of match possible? If yes, can someone please help me?

dick scherrer · Posted: Tue Apr 15, 2008 12:43 pm

Hello,

Yes, it is possible.

You would need some arrays and some parsing logic that starts with the full length of the shorter string and loops thru the longer string using reference modification. If a match on the full length is found, you're done. If not, reduce the length of the search value by 1 and try again. If still not found, "shift" over one position in the search value and continue.

Continue reducing the search length and shifting until you find a match or until the search length is 1 at which time, i'd say "no match" unless a single character match is to be considered a "hit". You could also set up your process to end with a "no hit" when the length got down to some other number than 1. It is your process, so use what rule makes sense for your requirement.

It will use very many cpu cycles if the strings are long.
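The shrinking-and-shifting search described above can be sketched roughly as follows. This is only an illustration in Python, not COBOL; the function name `longest_common_run`, the `min_len` cut-off and the decision to ignore case are assumptions made for the example. In COBOL the slice `short[start:start + length]` would correspond to reference modification such as `NAME(START:LENGTH)`.

```python
def longest_common_run(a: str, b: str, min_len: int = 2) -> str:
    """Return the longest contiguous piece of the shorter string that
    also appears in the longer string, or "" if none reaches min_len."""
    # Assumption for this sketch: the comparison ignores case.
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    short, long_ = short.lower(), long_.lower()

    # Start with the full length of the shorter string, then shrink by 1.
    for length in range(len(short), min_len - 1, -1):
        # "Shift" the search window one position at a time.
        for start in range(len(short) - length + 1):
            piece = short[start:start + length]
            if piece in long_:
                return piece  # first hit at this length is the longest
    return ""  # search length dropped below min_len: "no match"


if __name__ == "__main__":
    print(longest_common_run("Chris", "Christopher"))
    print(longest_common_run("Andrew", "Drew"))
```

The nested loops, each with a substring scan inside, are why this approach gets expensive when the strings are long.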

enrico-sorichetti · Posted: Tue Apr 15, 2008 12:53 pm

I would write an assembler subroutine to do it, using the new z/arch string instructions.

karthick Ramesh

Hi All,

Actually this is part of my COBOL program to avoid duplicates. We are trying various possible ways to close this gap using only COBOL.

I'm trying to use Dick's ideas, but I couldn't work out how to find the shorter string within a longer string.

Currently I'm trying this logic.
Here I'm trying to find the letters common to the two strings and give a score. But there is some risk that other names which match this rule may also end up in the result.

Can anyone tell me how to avoid counting duplicate letters in a string?

Like Chris - Chriss (the count for 's' should be 1 and the second 's' should be omitted).

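           MOVE 1 TO X.
           MOVE 1 TO Y.
           MOVE 0 TO SCORE.
      *    Compare every letter of W-FIRST-NM with every letter of
      *    V-FIRST-NM. Both loops stop at the first space, so each
      *    name must be followed by at least one space in its field.
           PERFORM UNTIL W-FIRST-NM(X:1) = SPACE
               PERFORM UNTIL V-FIRST-NM(Y:1) = SPACE
                   IF W-FIRST-NM(X:1) = V-FIRST-NM(Y:1)
                       ADD 1 TO SCORE
                   END-IF
                   ADD 1 TO Y
               END-PERFORM
               MOVE 1 TO Y
               ADD 1 TO X
           END-PERFORM.

      *    The threshold of 3 is still to be decided.
           IF SCORE > 3
               DISPLAY 'SCORE :' SCORE
               DISPLAY 'W : ' W-FIRST-NM
               DISPLAY 'V : ' V-FIRST-NM
               PERFORM MATCH
           END-IF.
      *    EXIT is a reserved word and cannot name a paragraph;
      *    EXIT-PARA is a placeholder name.
           GO TO EXIT-PARA.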

enrico-sorichetti · Posted: Tue Apr 15, 2008 2:28 pm

what You are trying to implement is what in the unix world is known as regular expressions parsing

if Your program uses db2 You might try to implement the approach described in

http://www.ibm.com/developerworks/db2/library/techarticle/0301stolze/0301stolze.html

dick scherrer · Posted: Tue Apr 15, 2008 7:31 pm

Hello,

Your code shows the "compare" processing one byte at a time. Please re-read my suggestion - the length the first time thru will be the length of the shorter string, not 1. I believe that when the process proceeds and the length is reduced to 1, the search for a duplicate is over.

I'm leaning towards a length of 2 also being "the end", but that would be up to your "rules".

The way i would proceed, i would not have a "score" - either i would have a hit or not by comparing some piece of the string (like the original request showed) against the other.

Even if you kept the majority of your process in cobol, you might consider writing this bit of code in assembler and making it a callable sub-routine. Actually, i'd make this a sub-routine regardless of the language used to code it. As most organizations no longer have professional assembler coders on-staff, assembler may or may not be an option for you.

No matter how you implement, i suggest that the complete "rule" be documented and understood before the code is attempted.

dick scherrer · Posted: Wed Apr 16, 2008 7:18 am

Hello,

Looking over my previous reply:

skkp2006 · Posted: Thu Apr 17, 2008 2:46 pm

Hi karthick,

If you have DB2 you can have a look at the LOCATE function.
This will help you to find the relative position of the string within another string.

karthick Ramesh

Hi,

Can anyone tell me how to avoid duplicate letters in a string?

e.g.:
praveen -> praven
peter -> petr
simmon -> simon
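One reading of the examples above is "keep the first occurrence of each letter and drop every later repeat" (note that peter -> petr drops a repeat that is not adjacent). A minimal sketch of that reading, in Python for illustration only; the function name `drop_repeated_letters` and the case-insensitive comparison are assumptions, not something stated in the thread. In COBOL the same idea would be a loop over the name with a small table of letters already seen.

```python
def drop_repeated_letters(name: str) -> str:
    """Keep only the first occurrence of each letter, preserving order."""
    seen = set()   # letters already copied to the output (lower-cased)
    kept = []      # output characters in their original order
    for ch in name:
        key = ch.lower()  # assumption: 'S' and 's' count as the same letter
        if key not in seen:
            seen.add(key)
            kept.append(ch)
    return "".join(kept)


if __name__ == "__main__":
    for name in ("praveen", "peter", "simmon", "Chriss"):
        print(name, "->", drop_repeated_letters(name))
```

If only adjacent doubles should be removed (so that peter stays peter), the test would instead compare each letter with the one immediately before it.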

dick scherrer · Posted: Fri Apr 18, 2008 7:43 pm

Hello,

karthick Ramesh

Hi Dick/All,
We have arrived at a rule which gives a fuzzy logic compare on two names. This example is taken from Java; can anyone tell me how this is possible in COBOL?

This mainly focuses on the largest ordered substring common to both strings. If the names were "Chrisr" and "Christopher" the score would be 6.

First, a 2D array is set up that is large enough for both words' lengths + 1 - the initial values are 0 for the first row and first column.

This is indexed from 0 to length in Java - for COBOL this will have to be from 1 to length + 1.

The algorithm then goes through each letter in each word and computes the substring length - i've left out the actual algorithm as i couldn't word it in English very well! Basically, it compares the characters and if they are the same it takes the score at position x-1, y-1 and adds 1 to it. If the score at position x-1, y or x, y-1 is higher than at x, y, this is carried over to represent the longest string so far. The tables below show this in action:

    c h r i s t o p h e r
b   0 0 0 0 0 0 0 0 0 0 0
h   0 1 1 1 1 1 1 1 1 1 1
r   0 1 2 2 2 2 2 2 2 2 2
r   0 1 2 2 2 2 2 2 2 2 3
i   0 1 2 3 3 3 3 3 3 3 3
s   0 1 2 3 4 4 4 4 4 4 4

Here, b is not present in christopher, so its row is all 0. As soon as an h is encountered the score increases, and by checking whether the entries to the left or above are greater, the highest score is carried through to the final position.

    c h r i s t o p h e r
c   1 1 1 1 1 1 1 1 1 1 1
h   1 2 2 2 2 2 2 2 2 2 2
r   1 2 3 3 3 3 3 3 3 3 3
r   1 2 3 3 3 3 3 3 3 3 4
i   1 2 3 4 4 4 4 4 4 4 4
s   1 2 3 4 5 5 5 5 5 5 5

So the last entry in the array (bottom right) is taken as the score, and the match between the two strings is decided from it.
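The table-filling rule described above matches the usual longest-common-subsequence recurrence. A rough sketch follows, in Python for illustration rather than the original Java or the target COBOL; the name `lcs_score` and the lower-casing of both inputs are assumptions for this example. In COBOL the table would be a two-level OCCURS array with subscripts running from 1 to length + 1, as noted above.

```python
def lcs_score(a: str, b: str) -> int:
    """Length of the longest ordered (not necessarily adjacent)
    sequence of letters common to both strings."""
    a, b = a.lower(), b.lower()  # assumption: comparison ignores case

    # One extra row and column, all zeros, so x-1 and y-1 always exist.
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]

    for x in range(1, len(a) + 1):
        for y in range(1, len(b) + 1):
            if a[x - 1] == b[y - 1]:
                # Same letter: extend the score found at x-1, y-1.
                table[x][y] = table[x - 1][y - 1] + 1
            else:
                # Otherwise carry over the larger neighbour (above or left).
                table[x][y] = max(table[x - 1][y], table[x][y - 1])

    # The bottom-right entry is the final score.
    return table[len(a)][len(b)]


if __name__ == "__main__":
    print(lcs_score("bhrris", "christopher"))  # rows of the first table
    print(lcs_score("chrris", "christopher"))  # rows of the second table
```

The work grows with the product of the two name lengths, which stays small for first names.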

dick scherrer · Posted: Wed Apr 23, 2008 2:27 am

Hello Karthick,

Please re-post your last reply. What has been posted is not consistent.

You mention: