Problem with polish signs (letters) like ąśćęóżźł using named entity recognition interface #342

kiwimic · 2020-10-23T13:01:05Z

I'm having a problem with Polish letters. universal-data-tool (both the web version and the Windows app) splits a word wherever it contains a Polish character (UTF-8) and treats each piece as a distinct word. In the example from the .png screenshot, "potrzebuję" and "różyczkę" are single words.

    c(
    "\U0104",        #Ą
    "\U0106",        #Ć
    "\U0118",        #Ę
    "\U0141",        #Ł
    "\U015A",        #Ś
    "\U0143",        #Ń
    "\U00D3",        #Ó
    "\U0179",        #Ź
    "\U017B",        #Ż
    "\U0105",        #ą
    "\U0107",        #ć
    "\U0119",        #ę
    "\U0142",        #ł
    "\U015B",        #ś
    "\U0144",        #ń
    "\U00F3",        #ó
    "\U017A",        #ź
    "\U017C")        #ż


seveibar · 2020-10-23T17:44:45Z

This should be easy to fix. There is a regex for splitting words in react-nlp-annotate (I think).
We could even make it customizable.

Thanks for reporting. Do you have a sample string or regex for splitting? (so we can write a test?)

kiwimic · 2020-10-23T19:58:44Z

As a result, this sentence should be split on spaces and commas:

"Chrząszcz brzmi w trzcinie w Szczebrzeszynie, w szczękach chrząszcza trzeszczy miąższ. "

I looked at react-nlp-annotate and found this function with this regex. I don't have experience with JS, so I could not test this myself, but in R a simple [\w] matches all Polish letters (and also digits, so for words only I use \U0104 etc.).

stringToSequence = (doc: string, sepRe: RegExp = /[a-zA-ZÀ-ÿ]+/g)

The RegExp could be something like [a-zA-ZÀ-ÿ \U0104\U0106\U0118\U0141\U015A\U0143\U00D3\U0179\U017B\U0105\U0107\U0119\U0142\U015B\U0144\U00F3\U017A\U017C]

These are the Unicode code points for all the special Polish letters, both lower and upper case.
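To make the difference concrete, here is a minimal sketch comparing the default pattern with an extended one on the sample sentence. `splitWords` is a hypothetical helper written for this illustration, not the library's actual `stringToSequence` implementation. The sketch also assumes two adjustments for JavaScript: the escapes are written with a lowercase `\u` (JavaScript regexes have no `\U` escape), and the space inside the character class is dropped so that spaces still separate words.

```javascript
// Hypothetical helper (not the real stringToSequence): it just returns the
// substrings that the word regex matches.
function splitWords(doc, wordRe) {
  return doc.match(wordRe) || [];
}

// Default pattern quoted in the thread.
const defaultRe = /[a-zA-ZÀ-ÿ]+/g;

// Same pattern extended with the Polish code points listed above,
// written with lowercase \u escapes as JavaScript expects.
const polishRe =
  /[a-zA-ZÀ-ÿ\u0104\u0106\u0118\u0141\u015A\u0143\u00D3\u0179\u017B\u0105\u0107\u0119\u0142\u015B\u0144\u00F3\u017A\u017C]+/g;

// normalize("NFC") keeps each accented letter as a single code point.
const sample =
  "Chrząszcz brzmi w trzcinie w Szczebrzeszynie, w szczękach chrząszcza trzeszczy miąższ. "
    .normalize("NFC");

const withDefault = splitWords(sample, defaultRe);
const withPolish = splitWords(sample, polishRe);

console.log(withDefault.length, withDefault); // words are cut at ą, ę, ż
console.log(withPolish.length, withPolish); // whole words
```

With the default pattern "Chrząszcz" comes out as "Chrz" and "szcz", because ą lies outside the À-ÿ range. The extended pattern keeps it whole.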

seveibar · 2020-10-23T21:01:07Z

Great! We should have this fixed easily :)

The relevant file containing the Regex is: string-to-sequence.js. (the relevant snippet was pasted by @kiwimic above)

To support custom regexes we'll need to be able to pass a regex as a prop into that library, but as @kiwimic suggested, we should be able to just paste in his regex codes and Polish will work automatically.

The full process for getting this into the UDT would be:

1. Open a PR to react-nlp-annotate adding the Polish characters. Merging it automatically publishes a new npm module. You can test that it works by running yarn storybook and creating a story with Polish characters, putting some text in one of the *.story.js files.
2. Add the new react-nlp-annotate version to the UniversalDataTool with yarn add react-nlp-annotate and open a PR to this repo. The new UDT is published on merge.

I want to give someone else a couple of days to take a stab at this, to increase the 🚌 factor!

seveibar · 2020-10-23T22:09:37Z

I've updated the text_entity_recognition specification to allow for a custom word-splitting regex. Of course, that's out of scope for just fixing the Polish characters, but it is relevant to the issue of testing and custom sequence splitting.

seveibar · 2020-11-02T18:10:33Z

See also: #366 (comment)

seveibar · 2020-11-12T17:49:19Z

@kiwimic this can now be fixed by putting [a-zA-ZÀ-ÿ\\u0104\\u0106\\u0118\\u0141\\u015A\\u0143\\u00D3\\u0179\\u017B\\u0105\\u0107\\u0119\\u0142\\u015B\\u0144\\u00F3\\u017A\\u017C]+ in the wordSplitRegex property of the dataset via #373

It's a bit annoying this needs to be done in the JSON setup, so I've created an issue to address getting it into the regular configuration: #374
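The doubled backslashes in that pattern are JSON string escaping: once the JSON is parsed, each `\\u0104` becomes the regex escape `\u0104`. The sketch below illustrates this under stated assumptions. Only the `wordSplitRegex` name comes from this thread; where the property sits inside the dataset JSON, and the use of the `g` flag, are assumptions made for the example.

```javascript
// Hypothetical dataset fragment. String.raw keeps the backslashes exactly as
// they would appear in the JSON text.
const datasetJson = String.raw`{
  "wordSplitRegex": "[a-zA-ZÀ-ÿ\\u0104\\u0106\\u0118\\u0141\\u015A\\u0143\\u00D3\\u0179\\u017B\\u0105\\u0107\\u0119\\u0142\\u015B\\u0144\\u00F3\\u017A\\u017C]+"
}`;

// JSON.parse turns each "\\" into a single backslash, so the resulting string
// contains regex escapes such as \u0104.
const dataset = JSON.parse(datasetJson);

// Assumption: the pattern is compiled with the global flag to find every word.
const jsonWordRe = new RegExp(dataset.wordSplitRegex, "g");

const jsonWords = "potrzebuję różyczkę".normalize("NFC").match(jsonWordRe) || [];
console.log(jsonWords); // the two words from the original report, unsplit
```

Writing a single backslash in the JSON would also work here, since JSON itself decodes `\u0104` to the character Ą, but the doubled form keeps the pattern readable as a regex after parsing.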

seveibar · 2020-11-12T17:50:15Z

I think it's reasonable to add the Polish characters to the default regex of react-nlp-annotate as well, so I'll reopen the issue to address that feature.
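One possible shape for a broader default, offered only as a sketch and not something proposed in this thread, is a Unicode property escape. `\p{L}` with the `u` flag matches any letter in any script, so no per-language list is needed. Whether every environment react-nlp-annotate targets supports property escapes (an ES2018 feature) is an assumption to verify.

```javascript
// Alternative technique, not the pattern discussed above: match runs of any
// Unicode letter. The "u" flag is required for \p{...} to work.
const anyLetterRe = /\p{L}+/gu;

// Digits and punctuation are not letters, so they still act as separators.
const letterWords =
  "Chrząszcz brzmi w trzcinie, abc123 żółć".normalize("NFC").match(anyLetterRe) || [];

console.log(letterWords);
```

Unlike `\w`, this excludes digits and underscores. Decomposed (NFD) input would still split at combining marks, which is why the example normalizes to NFC first.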

kiwimic changed the title ~~Problem with polish signs (letters) like ąśćęóżźł~~ Problem with polish signs (letters) like ąśćęóżźł using named entity recognition interface Oct 23, 2020

seveibar added bug enhancement good first issue labels Oct 23, 2020

seveibar added the hacktoberfest label Oct 23, 2020

seveibar mentioned this issue Nov 2, 2020

Some CSV Files don't import all rows #366

Open

seveibar closed this Nov 12, 2020

seveibar reopened this Nov 12, 2020


UniversalDataTool / universal-data-tool Public
