The Wayback Machine - https://web.archive.org/web/20220408231053/https://github.com/UniversalDataTool/universal-data-tool/issues/366

Some CSV Files don't import all rows #366

Open
pranav70 opened this issue Nov 2, 2020 · 15 comments
Labels
bug good first issue

Comments

@pranav70 pranav70 commented Nov 2, 2020

I am unable to import more than 11 records at a time when using a CSV file with a single column, such as the attached file (convert the Excel file to CSV before use). Only the first 11 records from the file are ever added. It is very shocking to see an open-source project with 1.2k+ stars having such a naive issue/bug.
@hysios @beru @pgrimaud @seveibar @miguelcarvalho13
Note: This file contains 14 data rows plus 1 header row, and 33 lines in total.

Test.xlsx

Collaborator

@seveibar seveibar commented Nov 2, 2020

@pranav70 can you provide a CSV file? The CSV format has a lot of variations. It could be your converter.

Author

@pranav70 pranav70 commented Nov 2, 2020

Contributor

@hysios hysios commented Nov 2, 2020

Test.xlsx is badly formatted when opened in my Excel. Below is a screenshot from my PC:
image

Author

@pranav70 pranav70 commented Nov 2, 2020

Collaborator

@seveibar seveibar commented Nov 2, 2020

@pranav70 for me to help you, you'll need to provide the CSV file you're using here. I don't have Microsoft Excel and don't use Windows. It seems @hysios wasn't able to open it, so you may have an invalid file.

The CSV needs to be RFC 4180 compatible, which is the standard we use (as do most CSV parsing libraries, AFAIK).
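
For readers unfamiliar with RFC 4180: a quoted field may contain line breaks, commas, and doubled quotes, so the number of lines in a file can legitimately exceed the number of rows. The sketch below illustrates this with Python's standard `csv` module. The sample text and the helper `count_lines_and_records` are made up for illustration; they are not the file from this issue and not the parser UDT uses.

```python
import csv
import io

# Hypothetical sample (NOT the file attached to this issue): one header row
# plus three data rows. The first data row contains an embedded line break,
# the second contains a comma and escaped (doubled) quotes.
SAMPLE = (
    "document\r\n"
    '"line one\r\nline two"\r\n'
    '"He said ""hi"", then left"\r\n'
    "plain text\r\n"
)


def count_lines_and_records(text: str) -> tuple[int, int]:
    """Return (physical line count, parsed record count) for CSV text."""
    physical_lines = len(text.splitlines())
    # newline="" leaves line breaks inside quoted fields untouched.
    records = list(csv.reader(io.StringIO(text, newline="")))
    return physical_lines, len(records)


lines, records = count_lines_and_records(SAMPLE)
print(f"{lines} physical lines, {records} records (including the header)")
```

Comparing the two counts is a quick way to tell whether a file relies on multi-line quoted fields before suspecting the importer.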

As for JSON, check https://github.com/UniversalDataTool/udt-format

JSONL is not currently supported, but is planned.

Author

@pranav70 pranav70 commented Nov 2, 2020

Here is the file that you can use.

Test.zip

Collaborator

@seveibar seveibar commented Nov 2, 2020

I think importing a CSV could be clearer (created a new issue and linked).

(The .txt file I'm uploading is actually a CSV; CSV uploads evidently aren't supported by GitHub.)

TestWithCorrectedHeader.txt

Screen Capture_select-area_20201102123505

Screen Capture_select-area_20201102123454

The file imports correctly but has the wrong number of samples (according to the number of rows in LibreOffice).

Collaborator

@seveibar seveibar commented Nov 2, 2020

I'm able to reproduce the bug. There are 15 rows in the CSV as shown below, but the UDT only displays 11 samples after importing.

Screen Capture_select-area_20201102125401

image

@seveibar seveibar changed the title from "Unable to import more than 11 records at a time using csv file" to "Some CSV Files don't import all rows" Nov 2, 2020
Author

@pranav70 pranav70 commented Nov 2, 2020

> I'm able to reproduce the bug. There are 15 rows in the CSV as shown below, but the UDT only displays 11 samples after importing.

Yes, that's exactly my issue. Also, the requirement that the header be "document" is not documented anywhere in the docs.

@seveibar seveibar added bug good first issue labels Nov 2, 2020
Collaborator

@seveibar seveibar commented Nov 2, 2020

Hey @hysios :) wanna take a stab at the fix? It's been a while since I gave you a video update shout out!

Collaborator

@seveibar seveibar commented Nov 2, 2020

Could be a parameter we're passing to the CSV parsing library or something
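
To illustrate how a single parser option can change the row count, here is a minimal sketch using Python's standard `csv` module. It is not UDT's parsing library, and nothing here confirms the actual cause of this bug. The data and the helper `parse` are hypothetical.

```python
import csv
import io

# Hypothetical data: a header and three data rows, the second of which
# is a quoted field spanning three physical lines.
DATA = 'document\nfirst\n"second\nthird\nfourth"\nfifth\n'


def parse(text, **reader_options):
    """Parse CSV text, forwarding any options to csv.reader."""
    return list(csv.reader(io.StringIO(text, newline=""), **reader_options))


# Default settings honor quotes, so the multi-line field stays one row.
default_rows = parse(DATA)

# With quote handling disabled, every physical line becomes its own row.
unquoted_rows = parse(DATA, quoting=csv.QUOTE_NONE)

print(len(default_rows), "rows with quoting,", len(unquoted_rows), "without")
```

Here the option inflates the count. A stray, unbalanced quote in the data can have the opposite effect and merge several rows into one, which is one possible way for an import to end up with fewer rows than expected.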

Author

@pranav70 pranav70 commented Nov 2, 2020

If I want to tag "@yourrightscamp" in a text like "@AJPlus @Kaepernick7 @yourrightscamp", I can't tag it on its own; I can only tag "7 @yourrightscamp". I think this is due to improper tokenization, as consecutive special characters are being treated as a single token.

tokenization_bug

bug

Collaborator

@seveibar seveibar commented Nov 2, 2020

@pranav70 yep you need custom tokenization for that. Issue #342 is about custom tokenization being provided as an interface parameter. Are you interested in code contribution? Both of these issues are good introductory issues if you are. Otherwise I'm guessing someone will get them soon.
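
As a rough sketch of what a pluggable tokenizer could look like, the Python below passes the tokenizer in as a parameter and compares three regex-based options on the example text from the report. All function names are hypothetical. The regexes are assumptions for illustration, not UDT's actual tokenizer or the interface proposed in #342.

```python
import re
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]


def coarse_tokenize(text: str) -> List[str]:
    """Runs of non-word characters become ONE token (e.g. ' @')."""
    return re.findall(r"\w+|\W+", text)


def fine_tokenize(text: str) -> List[str]:
    """Whitespace runs and each punctuation character are separate tokens."""
    return re.findall(r"\w+|\s+|[^\w\s]", text)


def mention_tokenize(text: str) -> List[str]:
    """Like fine_tokenize, but keeps '@name' mentions together."""
    return re.findall(r"@\w+|\w+|\s+|[^\w\s]", text)


def selectable_tokens(text: str, tokenize: Tokenizer = coarse_tokenize) -> List[str]:
    """Hypothetical labeling helper: return the non-whitespace tokens."""
    return [token for token in tokenize(text) if not token.isspace()]


TEXT = "@AJPlus @Kaepernick7 @yourrightscamp"
for tokenizer in (coarse_tokenize, fine_tokenize, mention_tokenize):
    print(tokenizer.__name__, selectable_tokens(TEXT, tokenizer))
```

With the coarse tokenizer the space and the following "@" are fused into one token, so a label cannot start exactly at the "@". The mention-aware variant makes each handle a single selectable unit.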

Author

@pranav70 pranav70 commented Nov 2, 2020

> @pranav70 yep you need custom tokenization for that. Issue #342 is about custom tokenization being provided as an interface parameter. Are you interested in code contribution? Both of these issues are good introductory issues if you are. Otherwise I'm guessing someone will get them soon.

Yes, absolutely, I would love to contribute. I have good experience with regular expressions, which I can help you with. I also have other issues and suggestions for improving this product that we can discuss further.

Collaborator

@seveibar seveibar commented Nov 2, 2020

Feel free to hit me up on Slack! There's also a contribution guide. We're trying to get a lot better at NLP so any help is appreciated :)

3 participants