IBM Mainframe Forum Index
 
Log In
 
IBM Mainframe Forum Index Mainframe: Search IBM Mainframe Forum: FAQ Register
 

Fuzzy logic matching between strings


IBM Mainframe Forums -> COBOL Programming
Post new topic   Reply to topic
View previous topic :: View next topic  
Author Message
karthick Ramesh

New User


Joined: 19 Mar 2008
Posts: 12
Location: India

PostPosted: Tue Apr 15, 2008 12:23 pm
Reply with quote

Hi all,

A substring matching solution that looks for longest sequence of letters that are common and ordered within two strings (not necessarily in sequence). An example below:

Duplicate - Chris
Original - Christopher

Ori - Andrew
Dup - Drew

we trying some thing related to a fuzzy logic kind of search.

Is this kind of match is possible, if yes please can some me.
Back to top
View user's profile Send private message
dick scherrer

Moderator Emeritus


Joined: 23 Nov 2006
Posts: 19244
Location: Inside the Matrix

PostPosted: Tue Apr 15, 2008 12:43 pm
Reply with quote

Hello,

Yes, it is possible.

You would need some arrays and some parsing logic that started with the full length of the shorter string and looped thru the larger string using reference modification. If a match on the full length is found, you're done. If not, reduce the length of the shorter string by 1 and try again. If not found, "shift" over one positon in the search value and continue.

Continue reducing the search length and shifting until you find a match or until the search length is 1 at which time, i'd say "no match" unless a single character match is to be considered a "hit". You could also set up your process to end with a "no hit" when the length got down to some other number than 1. It is your process, so use what rule makes sense for your requirement.

It will use very many cpu cycles if the strings are long.
Back to top
View user's profile Send private message
enrico-sorichetti

Superior Member


Joined: 14 Mar 2007
Posts: 10873
Location: italy

PostPosted: Tue Apr 15, 2008 12:53 pm
Reply with quote

I would write an assembler subroutine to do it,

using the new z/arch string instructions

Quote:
The string-instruction facility (or logical string
assist) provides instructions for (1) moving a
string of bytes until a specified ending byte is
found, (2) logically comparing two strings until
an inequality or a specified ending byte is
found, and (3) searching a string of a specified
length for a specified byte.


the instruction most fitting would be
Quote:
COMPARE UNTIL SUBSTRING EQUAL


sample call parameters
haystack,needle, min_match_length,found_position,found_match_length

not too difficult to implement
order of magnitudes better performance than cobol
Back to top
View user's profile Send private message
karthick Ramesh

New User


Joined: 19 Mar 2008
Posts: 12
Location: India

PostPosted: Tue Apr 15, 2008 1:50 pm
Reply with quote

Hi All,

Actually this a part of my COBOL Program to avoid duplicates. We are trying various possible ways to close this gap only using COBOL.

Im trying using the ideas of Dick, But i couldnt get how to find the Shorter String with in a longer string.

Currently im trying this logic,
Here im trying to find the common letters B/w two strings and try giving a score. But we may have some risk like other name which maches this case may also end with result.

Can any one tell me How to avoid duplicate letters in a string.

Like Chris - Chriss (Count for 's' should be 1 and the 2 nd 's' to be omited)

MOVE 1 TO X.
MOVE 1 TO Y.
MOVE 0 TO SCORE.
PERFORM UNTIL
W-FIRST-NM(X:1) = SPACES
PERFORM UNTIL
V-FIRST-NM(Y:1) = SPACES
IF W-FIRST-NM(X:1) = V-FIRST-NM(Y:1)
ADD 1 TO SCORE
END-IF
ADD 1 TO Y
END-PERFORM
MOVE 1 TO Y
ADD 1 TO X
END-PERFORM.

IF SCORE > 3(To Be Decided)
DISPLAY 'SCORE :' SCORE.
DISPLAY 'W : ' W-FIRST-NM.
DISPLAY 'V : ' V-FIRST-NM.
PERFORM MATCH .
GO TO EXIT.
Back to top
View user's profile Send private message
enrico-sorichetti

Superior Member


Joined: 14 Mar 2007
Posts: 10873
Location: italy

PostPosted: Tue Apr 15, 2008 2:28 pm
Reply with quote

what You are trying to implement is what in the unix world is known as regular expressions parsing

if Your program uses db2 You might try to implement the approach described in

http://www.ibm.com/developerworks/db2/library/techarticle/0301stolze/0301stolze.html
Back to top
View user's profile Send private message
dick scherrer

Moderator Emeritus


Joined: 23 Nov 2006
Posts: 19244
Location: Inside the Matrix

PostPosted: Tue Apr 15, 2008 7:31 pm
Reply with quote

Hello,

Your code shows the "compare" processing one byte at a time. Please re-read my suggestion - the length the first time thru will be the length of the shorter string, not 1. I believe that when the process proceeds and the length is reduced to 1, the search for a duplicate is over.

I'm leaning towards a length of 2 also being "the end", but that would be up to your "rules".

The way i would proceed, i would not have a "score" - either i would have a hit or not by comparing some piece of the string (like the original request showed) against the other.

Even if you kept the majority of your process in cobol, you might consider writing this bit of code in assembler and making it a callable sub-routine. Actually, i'd make this a sub-routine regardless of the language used to code it. As most organizations no longer have professional assembler coders on-staff, assembler may or may not be an option for you.

No matter how you implement, i suggest that the complete "rule" be documented and understood before the code is attempted.
Back to top
View user's profile Send private message
dick scherrer

Moderator Emeritus


Joined: 23 Nov 2006
Posts: 19244
Location: Inside the Matrix

PostPosted: Wed Apr 16, 2008 7:18 am
Reply with quote

Hello,

Looking over my previous reply:

Quote:
Your code shows the "compare" processing one byte at a time. Please re-read my suggestion - the length the first time thru will be the length of the shorter string, not 1. I believe that when the process proceeds and the length is reduced to 1, the search for a duplicate is over.

I'm leaning towards a length of 2 also being "the end", but that would be up to your "rules".

The way i would proceed, i would not have a "score" - either i would have a hit or not by comparing some piece of the string (like the original request showed) against the other.


Re-reading now, it reads (to me anyway) a bit too strong. I did not mean to "direct", but suggest. Of course the implementation would completely depend on your "rules".

You mentioned "Can any one tell me How to avoid duplicate letters in a string. ". What if the name has valid double-letters (william, "Claes Norreen", etc?
Back to top
View user's profile Send private message
skkp2006

New User


Joined: 14 Jul 2006
Posts: 93
Location: Chennai,India

PostPosted: Thu Apr 17, 2008 2:46 pm
Reply with quote

Hi karthick,

If you have DB2 you can have a look at the LOCATE function.
This will help you to find the relative position of the string within another string.

Code:
SELECT LOCATE('DREW', 'ANDREW')
  FROM SYSIBM.SYSDUMMY1;


Have a look at the below link.

http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/index.jsp?topic=/com.ibm.db2.doc.sqlref/flocate.htm


Syam
Back to top
View user's profile Send private message
karthick Ramesh

New User


Joined: 19 Mar 2008
Posts: 12
Location: India

PostPosted: Fri Apr 18, 2008 2:31 pm
Reply with quote

Hi,

Can any one tell me How to avoid duplicate letters in a string.

eg:
praveen -> praven
peter -> petr
simmon -> simon[/code][/quote]
Back to top
View user's profile Send private message
dick scherrer

Moderator Emeritus


Joined: 23 Nov 2006
Posts: 19244
Location: Inside the Matrix

PostPosted: Fri Apr 18, 2008 7:43 pm
Reply with quote

Hello,

Quote:
Can any one tell me How to avoid duplicate letters in a string.
maybe not until you do a complete job of defining the "rules".

Please look at your original request and now this one. They are not the same.

Once all of the rules are known, we may be able to offer suggestions.
Back to top
View user's profile Send private message
karthick Ramesh

New User


Joined: 19 Mar 2008
Posts: 12
Location: India

PostPosted: Tue Apr 22, 2008 10:02 am
Reply with quote

Hi Dick/All,
We have arrived at rule which gives the fuzzy logic compare on two names. This piece of example is taken from Java and can any one tell me how is this possible in COBOL?

This mainly focus on the largest ordered substring common between both strings.If the names were "Chrisr" and "Christopher" the score would be 6.

First, a 2D array is set up that is large enough for both words length +1 - the initial values are 0 for the first row and columns.

This is indexed from 0 to length in java - for cobol this will have to be from 1 to length +1

The algorithm then goes through each word letter in each word and computes the substring length - i've left out the actual algorithm as i coudn't word it in english very well! Basically, it compares the characters and if they are the same it takes the score at position x-1, y-1 and adds 1 to it. If the score at position x-1, y or y-1, x is higher than x,y this is carried over to represent the longest string so far.The below tables show this in action:

christopher
b00000000000
h01111111111
r01222222222
r01222222223
i01233333333
s01234444444

Here, B is not present in christopher and so its entry is 0 - as soon as an h is encountered the score increases and by checking if the entries to the left or above are greater the highest score is carried to the final position

christopher
c11111111111
h12222222222
r12333333333
r12333333334
i12344444444
s12345555555

So based on this the last digit from the Array is taken as the Score, from which the match b/w two strings are decided.
Back to top
View user's profile Send private message
dick scherrer

Moderator Emeritus


Joined: 23 Nov 2006
Posts: 19244
Location: Inside the Matrix

PostPosted: Wed Apr 23, 2008 2:27 am
Reply with quote

Hello Karthick,

Please re-post your last reply. What has been posted is not consistent.

You mention:
Quote:
If the names were "Chrisr" and "Christopher" the score would be 6.

However, your "arrays" do not contain "chrisr". Somewhere a "b" got into the values - why? What caused the letters to be "re-arranged" from the input sequence ("chrris")? You have made 4 and 5 red, but i see no 6?

When you re-post the arrays, please use the "Code" tag near the top of the Reply Panel - it makes for better readability.

Until an understandable rule is posted, it will be most difficult to offer suggestions.

Once you re-post, i'll clean up the topic.
Back to top
View user's profile Send private message
View previous topic :: :: View next topic  
Post new topic   Reply to topic View Bookmarks
All times are GMT + 6 Hours
Forum Index -> COBOL Programming

 


Similar Topics
Topic Forum Replies
No new posts Rexx pattern matching on PS qualifer ... CLIST & REXX 1
No new posts Finding faulty logic Subscript out of... COBOL Programming 5
No new posts File matching functionality in Easytr... DFSORT/ICETOOL 14
This topic is locked: you cannot edit posts or make replies. Need assistance in job scheduling logic. Mainframe Interview Questions 2
No new posts Rexx Logic error while adding seperat... CLIST & REXX 3
Search our Forums:

Back to Top