The Wayback Machine - https://web.archive.org/web/20220409072927/https://github.com/UniversalDataTool/universal-data-tool/issues/342
Problem with polish signs (letters) like ąśćęóżźł using named entity recognition interface #342

Open
kiwimic opened this issue Oct 23, 2020 · 7 comments
Labels
bug enhancement good first issue hacktoberfest

Comments

@kiwimic

@kiwimic kiwimic commented Oct 23, 2020

I'm having a problem with Polish letters. universal-data-tool (web version and Windows app) splits a word wherever there is a Polish letter (UTF-8) and treats each piece as a distinct word. In the example from the .png screenshot, "potrzebuję" and "różyczkę" are single words.
[image: udt]

    "\U0104",        #Ą
    "\U0106",        #Ć
    "\U0118",        #Ę
    "\U0141",        #Ł
    "\U015A",        #Ś
    "\U0143",        #Ń
    "\U00D3",        #Ó
    "\U0179",        #Ź
    "\U017B",        #Ż
    "\U0105",        #ą
    "\U0107",        #ć
    "\U0119",        #ę
    "\U0142",        #ł
    "\U015B",        #ś
    "\U0144",        #ń
    "\U00F3",        #ó
    "\U017A",        #ź
    "\U017C"),      #ż,
@kiwimic kiwimic changed the title Problem with polish signs (letters) like ąśćęóżźł Problem with polish signs (letters) like ąśćęóżźł using named entity recognition interface Oct 23, 2020
@seveibar
Collaborator

@seveibar seveibar commented Oct 23, 2020

This should be easy to fix; I think there is a regex for splitting words in react-nlp-annotate.
We could even make it customizable.

Thanks for reporting. Do you have a sample string or regex for splitting? (so we can write a test?)

@seveibar seveibar added bug enhancement good first issue labels Oct 23, 2020
@kiwimic
Author

@kiwimic kiwimic commented Oct 23, 2020

As a result, this sentence should be split on spaces and commas:

"Chrząszcz brzmi w trzcinie w Szczebrzeszynie, w szczękach chrząszcza trzeszczy miąższ. "

I looked at react-nlp-annotate and found this function with this regex. I don't have experience with JS, so I could not test this myself, but in R a simple [\w] catches all Polish letters (and also numeric values, so for words only I use \U0104 etc.).

stringToSequence = (doc: string, sepRe: RegExp = /[a-zA-ZÀ-ÿ]+/g)

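The RegExp could be something like [a-zA-ZÀ-ÿ\u0104\u0106\u0118\u0141\u015A\u0143\u00D3\u0179\u017B\u0105\u0107\u0119\u0142\u015B\u0144\u00F3\u017A\u017C]+ (written with JavaScript's \u escapes rather than R's \U, and with no space inside the brackets so that spaces still separate words).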

These are the Unicode code points for all Polish special letters, both lower and upper case.
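
To make the difference concrete, here is a minimal sketch comparing the default pattern quoted above with the extended one. The helper name `matchWords` is hypothetical and only for illustration; it is not the actual `stringToSequence` implementation, whose body is not shown in this thread.

```javascript
// Hypothetical helper for illustration only (not react-nlp-annotate code).
// Returns every substring of `doc` matched by `wordRe` (needs the g flag).
function matchWords(doc, wordRe) {
  return doc.match(wordRe) || []
}

// Default pattern quoted in this thread: ASCII letters plus U+00C0..U+00FF.
const defaultRe = /[a-zA-ZÀ-ÿ]+/g

// Same pattern extended with the Polish code points listed above.
const polishRe =
  /[a-zA-ZÀ-ÿ\u0104\u0106\u0118\u0141\u015A\u0143\u00D3\u0179\u017B\u0105\u0107\u0119\u0142\u015B\u0144\u00F3\u017A\u017C]+/g

// The listed code points are precomposed letters, so normalize to NFC first.
const sample = "potrzebuję różyczkę".normalize("NFC")

console.log(matchWords(sample, defaultRe)) // [ 'potrzebuj', 'ró', 'yczk' ]
console.log(matchWords(sample, polishRe)) // [ 'potrzebuję', 'różyczkę' ]
```

With the default pattern, "ó" survives because U+00F3 already falls inside the À-ÿ range, while "ę" and "ż" lie outside it and break the words apart.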

@seveibar
Collaborator

@seveibar seveibar commented Oct 23, 2020

Great! We should have this fixed easily :)

The relevant file containing the Regex is: string-to-sequence.js. (the relevant snippet was pasted by @kiwimic above)

To support custom regexes in general, we'll need to be able to pass a regex as a prop into that library. But as @kiwimic suggested, we should be able to just paste his character codes into the default regex, and it will then work for Polish automatically.

The full process for getting this into the UDT would be...

  1. Open a PR to react-nlp-annotate adding the Polish characters. Upon merging, it will automatically publish a new npm module! You can test that it's working by running yarn storybook and creating a story with Polish characters, putting some text in one of the *.story.js files.
  2. Add the new react-nlp-annotate version to the UniversalDataTool with yarn add react-nlp-annotate and open a PR to this repo. The new UDT is published on merge!

I want to give it a couple of days for someone else to take a stab at this, so as to increase the 🚌 factor!

@seveibar
Collaborator

@seveibar seveibar commented Oct 23, 2020

I've updated the text_entity_recognition specification to allow for a custom word splitting regex. Of course, that's out of scope for just fixing the polish signs, but relevant to the issue of testing and custom sequence splitting.

@seveibar
Collaborator

@seveibar seveibar commented Nov 2, 2020

See also: #366 (comment)

@seveibar
Collaborator

@seveibar seveibar commented Nov 12, 2020

@kiwimic this can now be fixed by putting [a-zA-ZÀ-ÿ\\u0104\\u0106\\u0118\\u0141\\u015A\\u0143\\u00D3\\u0179\\u017B\\u0105\\u0107\\u0119\\u0142\\u015B\\u0144\\u00F3\\u017A\\u017C]+ in the wordSplitRegex property of the dataset via #373

It's a bit annoying this needs to be done in the JSON setup, so I've created an issue to address getting it into the regular configuration: #374
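
For readers wondering why the backslashes are doubled, here is a rough sketch of how such a JSON value could turn into a working regex. Only the `wordSplitRegex` property name and the `text_entity_recognition` type come from this thread. Placing the property under `interface`, and building the regex with `new RegExp(..., "g")`, are assumptions made for illustration.

```javascript
// Hypothetical dataset fragment. String.raw keeps the text exactly as it
// would appear in a .json file, including the doubled backslashes.
const datasetJson = String.raw`{
  "interface": {
    "type": "text_entity_recognition",
    "wordSplitRegex": "[a-zA-ZÀ-ÿ\\u0104\\u0106\\u0118\\u0141\\u015A\\u0143\\u00D3\\u0179\\u017B\\u0105\\u0107\\u0119\\u0142\\u015B\\u0144\\u00F3\\u017A\\u017C]+"
  }
}`

const dataset = JSON.parse(datasetJson)

// After JSON decoding, each "\\u0104" is the six-character text \u0104,
// which the regex engine then reads as the single character Ą.
const pattern = dataset.interface.wordSplitRegex

// Assumption: the consumer compiles the string with the global flag.
const wordRe = new RegExp(pattern, "g")

const words =
  "Chrząszcz brzmi w trzcinie w Szczebrzeszynie, w szczękach".match(wordRe) || []
console.log(words)
```

Writing the escapes with a single backslash would also be valid JSON. In that case the JSON parser decodes them into the letters themselves, which gives an equivalent character class.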

@seveibar seveibar closed this Nov 12, 2020
@seveibar
Collaborator

@seveibar seveibar commented Nov 12, 2020

I think it's reasonable to add Polish letters to the default regex of react-nlp-annotate as well, so I'll reopen the issue to address that feature.
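
One possible shape for a broader default, which is not something proposed in this thread, would be a Unicode property escape instead of an explicit character list. The sketch below assumes a JavaScript engine that supports `\p{L}` with the `u` flag; whether react-nlp-annotate's build targets allow that is not known here.

```javascript
// Assumption: match any run of Unicode letters rather than listing code points.
// \p{L} covers letters only, so digits are still excluded (unlike \w in R).
const letterRe = /\p{L}+/gu

const sentence =
  "Chrząszcz brzmi w trzcinie w Szczebrzeszynie, w szczękach chrząszcza trzeszczy miąższ. "

// Normalize to NFC so accented letters are single precomposed code points;
// combining marks on their own are not in \p{L}.
const tokens = sentence.normalize("NFC").match(letterRe) || []

console.log(tokens.length) // 11
console.log(tokens)
```

The trade-off is that this matches letters from every script, which may or may not be the desired default for all datasets.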

@seveibar seveibar reopened this Nov 12, 2020
2 participants