parsec: provide suggestions in IllegalItemError error message #7283
Scott-Owen-James wants to merge 7 commits into cylc:master
Conversation
```python
filtered_keys: list[tuple[str, float]] = []
key_counter = Counter(key)
for possible_key in possible_keys:
    possible_key_counter = Counter(possible_key)

    # simple ratio for whole key
    similarity = key_counter & possible_key_counter
    ratio = (similarity.total() * 2) / (len(key) + len(possible_key))

    # possibly more accurate ratios for individual words
    parsed_possible_key = possible_key.split(" ")
    for possible_word in parsed_possible_key:
        possible_word_counter = Counter(possible_word)
        similarity = key_counter & possible_word_counter
        word_ratio = (similarity.total() * 2)
        word_ratio = word_ratio / (len(key) + len(possible_word_counter))
        if word_ratio > ratio:
            ratio = word_ratio
    filtered_keys.append((possible_key, ratio))

filtered_keys.sort(key=lambda x: x[1], reverse=True)
if filtered_keys[0][1] < 0.2:
    return []

final_keys = [filtered_keys[0]]
if len(filtered_keys) > 1 and filtered_keys[1][1] > final_keys[0][1] * .9:
    final_keys.append(filtered_keys[1])

return [
    final_keys[i][0]
    for i in range(0, len(final_keys))
]
```
Not sure I can easily follow this logic. Is it looking for any keys or words within keys which closely match the length of the given key, or words within the given key?
I have no idea if that is robust to the types of keys that will be seen here. I would be tempted to look up some fuzzy matching algorithms to see if there is anything simple that could match by letter rather than length?
Search ranking algorithms can be made fairly simple with some regex.
It works by finding the letters shared between the entered key and the possible key, totalling them, then dividing by the total number of letters. The second part, the per-word ratios, covers the case where a person only wrote part of a possible key, e.g. a key of 'execution' would find 'execution polling intervals' and 'execution retry delays'. It does match by letter, but I don't know about using regex.
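For anyone who wants to poke at the scoring outside of parsec, here's a minimal standalone sketch of the idea (the function name is invented for illustration, and the per-word ratio here divides by the word's length, whereas the PR's code divides by the Counter's length, i.e. the number of unique characters):

```python
from collections import Counter

def similarity_ratio(key: str, possible_key: str) -> float:
    """Sketch: score shared letters between the entered key and a candidate."""
    key_counter = Counter(key)
    # whole-key ratio: total shared letters over combined length
    shared = key_counter & Counter(possible_key)
    ratio = (sum(shared.values()) * 2) / (len(key) + len(possible_key))
    # per-word ratios rescue partial keys like 'execution'
    for word in possible_key.split(" "):
        shared = key_counter & Counter(word)
        ratio = max(ratio, (sum(shared.values()) * 2) / (len(key) + len(word)))
    return ratio
```

With this sketch, 'execution' scores 1.0 against 'execution retry delays' because the first word matches exactly.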
Proper fuzzy search would be great, but probably quite hard to implement.
Because this is a one-off use, we don't really want to add an external dependency to do this, so unless there's something in the Python standard library, I'm happy with something crude which does the job.
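For what it's worth, there is something in the Python standard library: `difflib.get_close_matches` (a wrapper around `SequenceMatcher`) takes a cutoff and a maximum number of suggestions. Its main limitation for this use case is that it compares whole strings, so a partial key like 'execution' scores poorly against long multi-word keys (key names below are just examples):

```python
import difflib

possible_keys = [
    "execution polling intervals",
    "execution retry delays",
    "submission polling intervals",
]

# cutoff plays the same role as the 0.2 threshold; n caps the suggestions
matches = difflib.get_close_matches(
    "execution poling intervals", possible_keys, n=2, cutoff=0.6
)
```

Here the typo'd key matches 'execution polling intervals' first, but 'execution' on its own falls below the cutoff against every one of these keys.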
I think I follow this now. A docstring showing it step by step would be good, since it's non-trivial to follow, at least for me.
I believe what it's doing is breaking down all the possible keys and the given key into character counts, so something like 'foo' becomes {'f': 1, 'o': 2}, then intersecting the given key against each possible key:

```python
similarity = key_counter & possible_key_counter
```

This leaves just the counts of each char present in both; it then ranks candidates by total common character count and checks that the similarity of the top-ranked result is over a threshold.
I guess this is a type of fuzzy matching, I've not seen it done like this before but it seems like it should work, especially given the relatively constrained number of possible keys.
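The intersection step described above, as a tiny self-contained example:

```python
from collections import Counter

key_counter = Counter("foo")        # Counter({'o': 2, 'f': 1})
candidate = Counter("food")         # Counter({'o': 2, 'f': 1, 'd': 1})
common = key_counter & candidate    # min count of each char present in both
# total common character count is the ranking signal
total = sum(common.values())
```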
Docstring added to clarify.
```python
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
from cylc.flow.parsec.util import filter_keys
import pytest
```
This could use a docstring too
Co-authored-by: Oliver Sanders <[email protected]>
Not sure if it's intended, but this includes keys from global.cylc. This would be fine, except it seemed to give a different response than expected. The illegal items I have tested in flow.cylc give pretty good suggestions; however, for global.cylc neither suggestion was even close (note: I have updated the error text for testing).

Two things there: 1) 'template variables' will be valid soon, so it's not a perfect example, but 2) there are fewer allowed keys for global.cylc, so it has less to pick from. When given a user key in global.cylc which is closer to a match, it does find and suggest that match.

So I think some playing with the threshold value is needed so we don't suggest completely different keys, unless it is a design choice that we would rather recommend something than nothing?
Despite the threshold needing some tweaking to exclude unrelated suggestions, it is quite impressive, and it is far more helpful that it only suggests keys for the section the illegal key is found in. So the same illegal key in different sections of the document produces different suggestions.

This turns it from just a bit of a spell checker into something pretty powerful, imo.
```python
    filtered_keys.append((possible_key, ratio))

filtered_keys.sort(key=lambda x: x[1], reverse=True)
if filtered_keys[0][1] < 0.2:
```
I think this threshold value needs some tweaking, as it allows a key to be suggested even if only a few letters are the same:

```
IllegalItemError: [install]template variable - template variable is not a valid configuration, did you mean max depth (score: 0.46)
```
Yeah, you could probably tweak it to get higher or lower accuracy. The reason I went with 20% is that I was aiming to set a very low bar: while testing, some matches only scored that high but were still valid, such as typos in short words, or someone entering a different word from a multi-word key.
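To illustrate how low a still-valid score can sit (key names chosen for the example; using only the whole-key ratio, without the per-word rescue):

```python
from collections import Counter

def whole_key_ratio(key: str, possible_key: str) -> float:
    """Whole-key letter-overlap ratio, as in the first step of the PR."""
    shared = Counter(key) & Counter(possible_key)
    return (sum(shared.values()) * 2) / (len(key) + len(possible_key))

# a single word from a long multi-word key only just clears the 20% bar
ratio = whole_key_ratio("mail", "task event mail interval")
```

This comes out around 0.29, so a higher threshold would start discarding matches like this one (although the per-word pass would rescue this particular case).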
```python
if len(filtered_keys) > 1 and filtered_keys[1][1] > final_keys[0][1] * .9:
    final_keys.append(filtered_keys[1])
```
What's the reason for checking whether the second recommended key is close to the first, rather than just using the same threshold value for both? This works fine, but seems unnecessary.
This is to avoid returning one very accurate key and then a second very inaccurate one, e.g. where the best match is 90% and the second best is 40%. It's mainly there to allow multiple options to be returned without accidentally returning 17.
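A toy illustration of the relative cut-off (key names and scores invented for the example):

```python
scored = [
    ("execution retry delays", 0.90),
    ("execution polling intervals", 0.85),
    ("runtime", 0.40),
]
best_score = scored[0][1]
# a runner-up is only suggested if it is within 10% of the best score
suggestions = [key for key, score in scored if score > best_score * 0.9]
```

With an absolute threshold instead, the 0.40 runner-up would be suggested alongside a 0.90 best match, which is the situation this check avoids (the real code additionally caps the list at two entries).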
It should give the relevant suggestions for the file being worked on (i.e. it should work for both flow.cylc and global.cylc). The example you gave is a config being added in #7223. Was this the result of a mixed checkout or rebase of some form?
When someone incorrectly sets an item, the error message now returns a list of likely intended keys, to save users having to dig through the docs to find out what the correct format is.
closes #4662
Check List
- I have read CONTRIBUTING.md and added my name as a Code Contributor.
- Applied any dependency changes to setup.cfg (and conda-environment.yml if present).
- Targets the relevant ?.?.x branch.