View previous topic :: View next topic
|
Author |
Message |
karthick Ramesh
New User
Joined: 19 Mar 2008 Posts: 12 Location: India
|
|
|
|
Hi all,
A substring matching solution that looks for longest sequence of letters that are common and ordered within two strings (not necessarily in sequence). An example below:
Duplicate - Chris
Original - Christopher
Ori - Andrew
Dup - Drew
we trying some thing related to a fuzzy logic kind of search.
Is this kind of match is possible, if yes please can some me. |
|
Back to top |
|
|
dick scherrer
Moderator Emeritus
Joined: 23 Nov 2006 Posts: 19244 Location: Inside the Matrix
|
|
|
|
Hello,
Yes, it is possible.
You would need some arrays and some parsing logic that started with the full length of the shorter string and looped thru the larger string using reference modification. If a match on the full length is found, you're done. If not, reduce the length of the shorter string by 1 and try again. If not found, "shift" over one positon in the search value and continue.
Continue reducing the search length and shifting until you find a match or until the search length is 1 at which time, i'd say "no match" unless a single character match is to be considered a "hit". You could also set up your process to end with a "no hit" when the length got down to some other number than 1. It is your process, so use what rule makes sense for your requirement.
It will use very many cpu cycles if the strings are long. |
|
Back to top |
|
|
enrico-sorichetti
Superior Member
Joined: 14 Mar 2007 Posts: 10873 Location: italy
|
|
|
|
I would write an assembler subroutine to do it,
using the new z/arch string instructions
Quote: |
The string-instruction facility (or logical string
assist) provides instructions for (1) moving a
string of bytes until a specified ending byte is
found, (2) logically comparing two strings until
an inequality or a specified ending byte is
found, and (3) searching a string of a specified
length for a specified byte. |
the instruction most fitting would be
Quote: |
COMPARE UNTIL SUBSTRING EQUAL |
sample call parameters
haystack,needle, min_match_length,found_position,found_match_length
not too difficult to implement
order of magnitudes better performance than cobol |
|
Back to top |
|
|
karthick Ramesh
New User
Joined: 19 Mar 2008 Posts: 12 Location: India
|
|
|
|
Hi All,
Actually this a part of my COBOL Program to avoid duplicates. We are trying various possible ways to close this gap only using COBOL.
Im trying using the ideas of Dick, But i couldnt get how to find the Shorter String with in a longer string.
Currently im trying this logic,
Here im trying to find the common letters B/w two strings and try giving a score. But we may have some risk like other name which maches this case may also end with result.
Can any one tell me How to avoid duplicate letters in a string.
Like Chris - Chriss (Count for 's' should be 1 and the 2 nd 's' to be omited)
MOVE 1 TO X.
MOVE 1 TO Y.
MOVE 0 TO SCORE.
PERFORM UNTIL
W-FIRST-NM(X:1) = SPACES
PERFORM UNTIL
V-FIRST-NM(Y:1) = SPACES
IF W-FIRST-NM(X:1) = V-FIRST-NM(Y:1)
ADD 1 TO SCORE
END-IF
ADD 1 TO Y
END-PERFORM
MOVE 1 TO Y
ADD 1 TO X
END-PERFORM.
IF SCORE > 3(To Be Decided)
DISPLAY 'SCORE :' SCORE.
DISPLAY 'W : ' W-FIRST-NM.
DISPLAY 'V : ' V-FIRST-NM.
PERFORM MATCH .
GO TO EXIT. |
|
Back to top |
|
|
enrico-sorichetti
Superior Member
Joined: 14 Mar 2007 Posts: 10873 Location: italy
|
|
Back to top |
|
|
dick scherrer
Moderator Emeritus
Joined: 23 Nov 2006 Posts: 19244 Location: Inside the Matrix
|
|
|
|
Hello,
Your code shows the "compare" processing one byte at a time. Please re-read my suggestion - the length the first time thru will be the length of the shorter string, not 1. I believe that when the process proceeds and the length is reduced to 1, the search for a duplicate is over.
I'm leaning towards a length of 2 also being "the end", but that would be up to your "rules".
The way i would proceed, i would not have a "score" - either i would have a hit or not by comparing some piece of the string (like the original request showed) against the other.
Even if you kept the majority of your process in cobol, you might consider writing this bit of code in assembler and making it a callable sub-routine. Actually, i'd make this a sub-routine regardless of the language used to code it. As most organizations no longer have professional assembler coders on-staff, assembler may or may not be an option for you.
No matter how you implement, i suggest that the complete "rule" be documented and understood before the code is attempted. |
|
Back to top |
|
|
dick scherrer
Moderator Emeritus
Joined: 23 Nov 2006 Posts: 19244 Location: Inside the Matrix
|
|
|
|
Hello,
Looking over my previous reply:
Quote: |
Your code shows the "compare" processing one byte at a time. Please re-read my suggestion - the length the first time thru will be the length of the shorter string, not 1. I believe that when the process proceeds and the length is reduced to 1, the search for a duplicate is over.
I'm leaning towards a length of 2 also being "the end", but that would be up to your "rules".
The way i would proceed, i would not have a "score" - either i would have a hit or not by comparing some piece of the string (like the original request showed) against the other.
|
Re-reading now, it reads (to me anyway) a bit too strong. I did not mean to "direct", but suggest. Of course the implementation would completely depend on your "rules".
You mentioned "Can any one tell me How to avoid duplicate letters in a string. ". What if the name has valid double-letters (william, "Claes Norreen", etc? |
|
Back to top |
|
|
skkp2006
New User
Joined: 14 Jul 2006 Posts: 93 Location: Chennai,India
|
|
Back to top |
|
|
karthick Ramesh
New User
Joined: 19 Mar 2008 Posts: 12 Location: India
|
|
|
|
Hi,
Can any one tell me How to avoid duplicate letters in a string.
eg:
praveen -> praven
peter -> petr
simmon -> simon[/code][/quote] |
|
Back to top |
|
|
dick scherrer
Moderator Emeritus
Joined: 23 Nov 2006 Posts: 19244 Location: Inside the Matrix
|
|
|
|
Hello,
Quote: |
Can any one tell me How to avoid duplicate letters in a string. |
maybe not until you do a complete job of defining the "rules".
Please look at your original request and now this one. They are not the same.
Once all of the rules are known, we may be able to offer suggestions. |
|
Back to top |
|
|
karthick Ramesh
New User
Joined: 19 Mar 2008 Posts: 12 Location: India
|
|
|
|
Hi Dick/All,
We have arrived at rule which gives the fuzzy logic compare on two names. This piece of example is taken from Java and can any one tell me how is this possible in COBOL?
This mainly focus on the largest ordered substring common between both strings.If the names were "Chrisr" and "Christopher" the score would be 6.
First, a 2D array is set up that is large enough for both words length +1 - the initial values are 0 for the first row and columns.
This is indexed from 0 to length in java - for cobol this will have to be from 1 to length +1
The algorithm then goes through each word letter in each word and computes the substring length - i've left out the actual algorithm as i coudn't word it in english very well! Basically, it compares the characters and if they are the same it takes the score at position x-1, y-1 and adds 1 to it. If the score at position x-1, y or y-1, x is higher than x,y this is carried over to represent the longest string so far.The below tables show this in action:
christopher
b00000000000
h01111111111
r01222222222
r01222222223
i01233333333
s01234444444
Here, B is not present in christopher and so its entry is 0 - as soon as an h is encountered the score increases and by checking if the entries to the left or above are greater the highest score is carried to the final position
christopher
c11111111111
h12222222222
r12333333333
r12333333334
i12344444444
s12345555555
So based on this the last digit from the Array is taken as the Score, from which the match b/w two strings are decided. |
|
Back to top |
|
|
dick scherrer
Moderator Emeritus
Joined: 23 Nov 2006 Posts: 19244 Location: Inside the Matrix
|
|
|
|
Hello Karthick,
Please re-post your last reply. What has been posted is not consistent.
You mention:
Quote: |
If the names were "Chrisr" and "Christopher" the score would be 6. |
However, your "arrays" do not contain "chrisr". Somewhere a "b" got into the values - why? What caused the letters to be "re-arranged" from the input sequence ("chrris")? You have made 4 and 5 red, but i see no 6?
When you re-post the arrays, please use the "Code" tag near the top of the Reply Panel - it makes for better readability.
Until an understandable rule is posted, it will be most difficult to offer suggestions.
Once you re-post, i'll clean up the topic. |
|
Back to top |
|
|
|