Portal | Manuals | References | Downloads | Info | Programs | JCLs | Mainframe wiki | Quick Ref

Author Message
karthick Ramesh

New User

Joined: 19 Mar 2008
Posts: 12
Location: India

 Posted: Tue Apr 15, 2008 12:23 pm    Post subject: Fuzzy logic matching between strings Hi all, A substring matching solution that looks for longest sequence of letters that are common and ordered within two strings (not necessarily in sequence). An example below: Duplicate - Chris Original - Christopher Ori - Andrew Dup - Drew we trying some thing related to a fuzzy logic kind of search. Is this kind of match is possible, if yes please can some me.

dick scherrer

Site Director

Joined: 23 Nov 2006
Posts: 19270
Location: Inside the Matrix

 Posted: Tue Apr 15, 2008 12:43 pm    Post subject: Hello, Yes, it is possible. You would need some arrays and some parsing logic that started with the full length of the shorter string and looped thru the larger string using reference modification. If a match on the full length is found, you're done. If not, reduce the length of the shorter string by 1 and try again. If not found, "shift" over one positon in the search value and continue. Continue reducing the search length and shifting until you find a match or until the search length is 1 at which time, i'd say "no match" unless a single character match is to be considered a "hit". You could also set up your process to end with a "no hit" when the length got down to some other number than 1. It is your process, so use what rule makes sense for your requirement. It will use very many cpu cycles if the strings are long.
enrico-sorichetti

Global Moderator

Joined: 14 Mar 2007
Posts: 10374
Location: italy

Posted: Tue Apr 15, 2008 12:53 pm    Post subject: Reply to: Fuzzy logic matching between strings

I would write an assembler subroutine to do it,

using the new z/arch string instructions

 Quote: The string-instruction facility (or logical string assist) provides instructions for (1) moving a string of bytes until a specified ending byte is found, (2) logically comparing two strings until an inequality or a specified ending byte is found, and (3) searching a string of a specified length for a specified byte.

the instruction most fitting would be
 Quote: COMPARE UNTIL SUBSTRING EQUAL

sample call parameters
haystack,needle, min_match_length,found_position,found_match_length

not too difficult to implement
order of magnitudes better performance than cobol
karthick Ramesh

New User

Joined: 19 Mar 2008
Posts: 12
Location: India

 Posted: Tue Apr 15, 2008 1:50 pm    Post subject: Hi All, Actually this a part of my COBOL Program to avoid duplicates. We are trying various possible ways to close this gap only using COBOL. Im trying using the ideas of Dick, But i couldnt get how to find the Shorter String with in a longer string. Currently im trying this logic, Here im trying to find the common letters B/w two strings and try giving a score. But we may have some risk like other name which maches this case may also end with result. Can any one tell me How to avoid duplicate letters in a string. Like Chris - Chriss (Count for 's' should be 1 and the 2 nd 's' to be omited) MOVE 1 TO X. MOVE 1 TO Y. MOVE 0 TO SCORE. PERFORM UNTIL W-FIRST-NM(X:1) = SPACES PERFORM UNTIL V-FIRST-NM(Y:1) = SPACES IF W-FIRST-NM(X:1) = V-FIRST-NM(Y:1) ADD 1 TO SCORE END-IF ADD 1 TO Y END-PERFORM MOVE 1 TO Y ADD 1 TO X END-PERFORM. IF SCORE > 3(To Be Decided) DISPLAY 'SCORE :' SCORE. DISPLAY 'W : ' W-FIRST-NM. DISPLAY 'V : ' V-FIRST-NM. PERFORM MATCH . GO TO EXIT.
enrico-sorichetti

Global Moderator

Joined: 14 Mar 2007
Posts: 10374
Location: italy

 Posted: Tue Apr 15, 2008 2:28 pm    Post subject: Reply to: Fuzzy logic matching between strings what You are trying to implement is what in the unix world is known as regular expressions parsing if Your program uses db2 You might try to implement the approach described in http://www.ibm.com/developerworks/db2/library/techarticle/0301stolze/0301stolze.html
dick scherrer

Site Director

Joined: 23 Nov 2006
Posts: 19270
Location: Inside the Matrix

 Posted: Tue Apr 15, 2008 7:31 pm    Post subject: Hello, Your code shows the "compare" processing one byte at a time. Please re-read my suggestion - the length the first time thru will be the length of the shorter string, not 1. I believe that when the process proceeds and the length is reduced to 1, the search for a duplicate is over. I'm leaning towards a length of 2 also being "the end", but that would be up to your "rules". The way i would proceed, i would not have a "score" - either i would have a hit or not by comparing some piece of the string (like the original request showed) against the other. Even if you kept the majority of your process in cobol, you might consider writing this bit of code in assembler and making it a callable sub-routine. Actually, i'd make this a sub-routine regardless of the language used to code it. As most organizations no longer have professional assembler coders on-staff, assembler may or may not be an option for you. No matter how you implement, i suggest that the complete "rule" be documented and understood before the code is attempted.
dick scherrer

Site Director

Joined: 23 Nov 2006
Posts: 19270
Location: Inside the Matrix

Posted: Wed Apr 16, 2008 7:18 am    Post subject:

Hello,

 Quote: Your code shows the "compare" processing one byte at a time. Please re-read my suggestion - the length the first time thru will be the length of the shorter string, not 1. I believe that when the process proceeds and the length is reduced to 1, the search for a duplicate is over. I'm leaning towards a length of 2 also being "the end", but that would be up to your "rules". The way i would proceed, i would not have a "score" - either i would have a hit or not by comparing some piece of the string (like the original request showed) against the other.

Re-reading now, it reads (to me anyway) a bit too strong. I did not mean to "direct", but suggest. Of course the implementation would completely depend on your "rules".

You mentioned "Can any one tell me How to avoid duplicate letters in a string. ". What if the name has valid double-letters (william, "Claes Norreen", etc?
skkp2006

New User

Joined: 14 Jul 2006
Posts: 93
Location: Chennai,India

Posted: Thu Apr 17, 2008 2:46 pm    Post subject: Reply to: Fuzzy logic matching between strings

Hi karthick,

If you have DB2 you can have a look at the LOCATE function.
This will help you to find the relative position of the string within another string.

 Code: SELECT LOCATE('DREW', 'ANDREW')   FROM SYSIBM.SYSDUMMY1;

Have a look at the below link.

http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/index.jsp?topic=/com.ibm.db2.doc.sqlref/flocate.htm

Syam
karthick Ramesh

New User

Joined: 19 Mar 2008
Posts: 12
Location: India

 Posted: Fri Apr 18, 2008 2:31 pm    Post subject: Hi, Can any one tell me How to avoid duplicate letters in a string. eg: praveen -> praven peter -> petr simmon -> simon[/code][/quote]
dick scherrer

Site Director

Joined: 23 Nov 2006
Posts: 19270
Location: Inside the Matrix

Posted: Fri Apr 18, 2008 7:43 pm    Post subject:

Hello,

 Quote: Can any one tell me How to avoid duplicate letters in a string.
maybe not until you do a complete job of defining the "rules".

Please look at your original request and now this one. They are not the same.

Once all of the rules are known, we may be able to offer suggestions.
karthick Ramesh

New User

Joined: 19 Mar 2008
Posts: 12
Location: India

 Posted: Tue Apr 22, 2008 10:02 am    Post subject: Hi Dick/All, We have arrived at rule which gives the fuzzy logic compare on two names. This piece of example is taken from Java and can any one tell me how is this possible in COBOL? This mainly focus on the largest ordered substring common between both strings.If the names were "Chrisr" and "Christopher" the score would be 6. First, a 2D array is set up that is large enough for both words length +1 - the initial values are 0 for the first row and columns. This is indexed from 0 to length in java - for cobol this will have to be from 1 to length +1 The algorithm then goes through each word letter in each word and computes the substring length - i've left out the actual algorithm as i coudn't word it in english very well! Basically, it compares the characters and if they are the same it takes the score at position x-1, y-1 and adds 1 to it. If the score at position x-1, y or y-1, x is higher than x,y this is carried over to represent the longest string so far.The below tables show this in action: christopher b00000000000 h01111111111 r01222222222 r01222222223 i01233333333 s01234444444 Here, B is not present in christopher and so its entry is 0 - as soon as an h is encountered the score increases and by checking if the entries to the left or above are greater the highest score is carried to the final position christopher c11111111111 h12222222222 r12333333333 r12333333334 i12344444444 s12345555555 So based on this the last digit from the Array is taken as the Score, from which the match b/w two strings are decided.
dick scherrer

Site Director

Joined: 23 Nov 2006
Posts: 19270
Location: Inside the Matrix

Posted: Wed Apr 23, 2008 2:27 am    Post subject:

Hello Karthick,

You mention:
 Quote: If the names were "Chrisr" and "Christopher" the score would be 6.

However, your "arrays" do not contain "chrisr". Somewhere a "b" got into the values - why? What caused the letters to be "re-arranged" from the input sequence ("chrris")? You have made 4 and 5 red, but i see no 6?

When you re-post the arrays, please use the "Code" tag near the top of the Reply Panel - it makes for better readability.

Until an understandable rule is posted, it will be most difficult to offer suggestions.

Once you re-post, i'll clean up the topic.
 All times are GMT + 6 Hours
 Page 1 of 1

Search our Forum:

 Topic Author Forum Replies Posted Similar Topics Is CICS BMS logic on z/OS different ... Andi1982 CICS 2 Fri Dec 01, 2017 1:56 pm Matching and unmatched records!! Vignesh Sid SYNCSORT 3 Thu Nov 09, 2017 11:22 am JES2 job size field matching Windows ... SRICOBSAS All Other Mainframe Topics 4 Tue Sep 05, 2017 5:49 pm Occurrence of various strings within ... TS70363 JCL & VSAM 1 Thu Apr 27, 2017 8:18 pm fuzzy compare Martylin JCL & VSAM 9 Thu Mar 30, 2017 2:45 pm

 © 2003-2017 IBM MAINFRAME Software Support Division
 Job Vacancies | Forum Rules | Bookmarks | Subscriptions | FAQ | Polls | Contact Us