Problem with Polish signs (letters) like ąśćęóżźł using named entity recognition interface #342
Comments
|
This should be easy to fix; there is a regex for splitting words in react-nlp-annotate (I think). Thanks for reporting. Do you have a sample string or regex for splitting (so we can write a test)? |
|
This sentence should be split on spaces and commas: "Chrząszcz brzmi w trzcinie w Szczebrzeszynie, w szczękach chrząszcza trzeszczy miąższ." I looked in react-nlp-annotate and found this function with this regex. I don't have experience with JS, so I could not test this myself, but in R a simple [\w] catches all Polish letters (it also catches digits, so for words only I use \U0104 etc.): stringToSequence = (doc: string, sepRe: RegExp = /[a-zA-ZÀ-ÿ]+/g) The RegExp could be something like [a-zA-ZÀ-ÿ \U0104\U0106\U0118\U0141\U015A\U0143\U00D3\U0179\U017B\U0105\U0107\U0119\U0142\U015B\U0144\U00F3\U017A\U017C] These are the Unicode code points for all Polish special letters, both lower and upper case. |
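
For anyone who wants to try the suggestion above in JS, here is a minimal TypeScript sketch. It assumes JavaScript's `\uXXXX` escape syntax in place of R's `\UXXXX`, and it leaves out the space inside the suggested character class so that spaces still separate words. The names `POLISH_WORD_RE` and `extractWords` are made up for illustration and are not taken from react-nlp-annotate.

```typescript
// Hypothetical word-matching regex: the existing Latin ranges plus the
// Polish letters listed above, written with JavaScript \uXXXX escapes.
const POLISH_WORD_RE: RegExp =
  /[a-zA-ZÀ-ÿ\u0104\u0106\u0118\u0141\u015A\u0143\u00D3\u0179\u017B\u0105\u0107\u0119\u0142\u015B\u0144\u00F3\u017A\u017C]+/g;

// Return every run of matching letters found in the text.
function extractWords(text: string): string[] {
  // match() with a global regex returns all matches, or null if there are none.
  return text.match(POLISH_WORD_RE) || [];
}

const sample: string =
  "Chrząszcz brzmi w trzcinie w Szczebrzeszynie, w szczękach chrząszcza trzeszczy miąższ. ";

// Each Polish word should come back whole, with the comma and period dropped.
console.log(extractWords(sample));
```

The same sample sentence could serve as the test string that was asked for.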
|
Great! We should have this fixed easily :) The relevant file containing the regex is string-to-sequence.js (the relevant snippet was pasted by @kiwimic above). We'll need to be able to pass a regex as a prop into that library to support custom regexes, but as @kiwimic suggested, we should be able to just paste in his regex codes and it will automatically work for Polish. The full process for getting this into the UDT would be...
I want to give a couple days for someone else to take a stab at this so to increase the |
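
As a rough illustration of what "passing a regex in" could look like, here is a sketch of a splitter that takes the word regex as a parameter. The `Segment` type and the `splitIntoSegments` function are hypothetical; the real output shape and prop name in react-nlp-annotate may well differ.

```typescript
// Hypothetical segment type; the real library's output shape may differ.
type Segment = { text: string; isWord: boolean };

// Split a document into word and non-word segments using a caller-supplied
// global regex. The default is the regex quoted earlier in this thread.
function splitIntoSegments(doc: string, wordRe: RegExp = /[a-zA-ZÀ-ÿ]+/g): Segment[] {
  if (!wordRe.global) {
    throw new Error("wordRe must have the g flag");
  }
  wordRe.lastIndex = 0; // start from the beginning even if the regex was used before

  const segments: Segment[] = [];
  let cursor = 0;
  let match: RegExpExecArray | null;
  while ((match = wordRe.exec(doc)) !== null) {
    if (match[0].length === 0) {
      wordRe.lastIndex += 1; // guard against patterns that can match the empty string
      continue;
    }
    if (match.index > cursor) {
      // Text between two words (spaces, punctuation, unmatched letters).
      segments.push({ text: doc.slice(cursor, match.index), isWord: false });
    }
    segments.push({ text: match[0], isWord: true });
    cursor = match.index + match[0].length;
  }
  if (cursor < doc.length) {
    segments.push({ text: doc.slice(cursor), isWord: false });
  }
  return segments;
}
```

Keeping the non-word segments means the original text can be rebuilt by joining the pieces, which makes a custom regex easy to test.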
|
I've updated the text_entity_recognition specification to allow for a custom word-splitting regex. Of course, that's out of scope for just fixing the Polish letters, but it's relevant to the issue of testing and custom sequence splitting. |
|
See also: #366 (comment) |
|
@kiwimic this can now be fixed by putting It's a bit annoying this needs to be done in the JSON setup, so I've created an issue to address getting it into the regular configuration: #374 |
|
I think it's reasonable to add the Polish letters to the default regex of react-nlp-annotate as well, so I'll reopen the issue to address that feature. |
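
One possible default, sketched here as an alternative to listing code points by hand, is the Unicode property escape `\p{L}` (any letter). It needs the `u` flag and a runtime with ES2018 regex support. This is only a suggestion, not what react-nlp-annotate ships, and the names below are hypothetical.

```typescript
// \p{L} matches any Unicode letter. The regex is built via the constructor,
// so the pattern is only interpreted at run time.
const ANY_LETTER_RE: RegExp = new RegExp("\\p{L}+", "gu");

// Return every run of letters, regardless of alphabet.
function extractLetterRuns(text: string): string[] {
  return text.match(ANY_LETTER_RE) || [];
}

// Digits and punctuation are not letters, so they act as separators.
console.log(extractLetterRuns("potrzebuję różyczkę"));
```

The trade-off is that this also treats letters from every other script as word characters, which may or may not be wanted as a default.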




I'm having a problem with Polish letters. universal-data-tool (web version and Windows app) splits a word wherever there is a Polish character (UTF-8) and treats the pieces as distinct words. In the example from the .png screenshot, "potrzebuję" and "różyczkę" are single words.
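
The behaviour can be reproduced outside the tool with a short TypeScript sketch, assuming the word regex quoted in this thread (`/[a-zA-ZÀ-ÿ]+/g`) is the one in use. The names `CURRENT_WORD_RE` and `wordsFound` are hypothetical.

```typescript
// The word regex quoted in this thread: ASCII letters plus the range À-ÿ
// (U+00C0 to U+00FF).
const CURRENT_WORD_RE: RegExp = /[a-zA-ZÀ-ÿ]+/g;

// Return every run of characters the regex treats as a word.
function wordsFound(text: string): string[] {
  return text.match(CURRENT_WORD_RE) || [];
}

// "ę" (U+0119) and "ż" (U+017C) lie outside À-ÿ, so they break the words apart;
// "ó" (U+00F3) is inside the range and stays attached.
console.log(wordsFound("potrzebuję różyczkę"));
// matches: potrzebuj, ró, yczk
```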
