The Wayback Machine - https://web.archive.org/web/20200909221419/https://github.com/dotnet/spark/pull/491
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement ML Features: Word2Vec #491

Merged
merged 20 commits into from Apr 29, 2020
Merged

Implement ML Features: Word2Vec #491

merged 20 commits into from Apr 29, 2020

Conversation

@GoEddie
Copy link
Contributor

GoEddie commented Apr 15, 2020

We are excited to review your PR.

So we can do the best job, please check:

  • There's a descriptive title that will make sense to other developers some time from now.
  • There's associated issues. All PR's should have issue(s) associated - unless a trivial self-evident change such as fixing a typo. You can use the format Fixes #nnnn in your description to cause GitHub to automatically close the issue(s) when your PR is merged.
  • Your change description explains what the change does, why you chose your approach, and anything else that reviewers should know.
  • You have included any necessary tests in the same PR.

Hi,

This implements Word2Vec and Word2VecModel so that the Word2Vec example can be completed (see https://spark.apache.org/docs/2.2.0/mllib-feature-extraction.html#word2vec), this is for issue #381

Thanks,

Ed Elliott

@GoEddie GoEddie mentioned this pull request Apr 15, 2020
6 of 40 tasks complete
@imback82
Copy link
Contributor

imback82 commented Apr 16, 2020

@GoEddie Each E2E test is taking at least one more minute with this change. Could you check and see if you can reduce it.

GOEddieUK
@GoEddie GoEddie changed the title Implement ML Features: Word2Vec [WIP] Implement ML Features: Word2Vec Apr 16, 2020
GOEddieUK added 2 commits Apr 16, 2020
GOEddieUK
GOEddieUK
@GoEddie
Copy link
Contributor Author

GoEddie commented Apr 16, 2020

@GoEddie Each E2E test is taking at least one more minute with this change. Could you check and see if you can reduce it.

@imback82 I removed the extra time from TestWord2VecTest but Word2VecModel still adds about 10 seconds because calling model.Fit causes spark to dump out loads of log messages, if I turn off logging (or set it to Error) then it is much faster - is it ok to disable logging or is it needed for other tests? I could set it to None, call model.Fit then set it back to Warn (or info?) again.

I tested with pyspark and that does the same thing, if logging is on then it adds the same time so it isn't a dotnet spark thing.

@imback82
Copy link
Contributor

imback82 commented Apr 17, 2020

if I turn off logging (or set it to Error) then it is much faster - is it ok to disable logging or is it needed for other tests?

Cool, can we set it to error for other tests? Also, please ping me when you remove WIP. Thanks!

GOEddieUK
@GoEddie GoEddie changed the title [WIP] Implement ML Features: Word2Vec Implement ML Features: Word2Vec Apr 17, 2020
@GoEddie
Copy link
Contributor Author

GoEddie commented Apr 17, 2020

Hi @imback82 - can you review it again now the build is faster again?


Word2VecModel model = word2vec.Fit(documentDataFrame);

const int expectedSynonyms = 2;

This comment has been minimized.

@imback82

imback82 Apr 18, 2020

Contributor

Why do we expect 2? (sorry I am not familiar with this model).

This comment has been minimized.

@GoEddie

GoEddie Apr 20, 2020

Author Contributor

findSynonyms takes the word to check and the maximum amount of synonyms to return so 2 is checking that the result is limited to 2 rows.

{
_spark = fixture.Spark;
//Calling Word2Vec.Fit is really slow with logging on, makes the test really slow
_spark.SparkContext.SetLogLevel("OFF");

This comment has been minimized.

@imback82

imback82 Apr 18, 2020

Contributor

Oh, even ERROR doesn't help?

GOEddieUK
Copy link
Contributor

imback82 left a comment

LGTM except for one nit comment, thanks @GoEddie!

src/csharp/Microsoft.Spark/ML/Feature/Word2Vec.cs Outdated Show resolved Hide resolved
GOEddieUK
@GoEddie
Copy link
Contributor Author

GoEddie commented Apr 28, 2020

Thanks @imback82 I have fixed that indentation issue!

GoEddie and others added 2 commits Apr 28, 2020
Co-Authored-By: Steve Suh <suhsteve@gmail.com>
GOEddieUK
Copy link
Member

suhsteve left a comment

LGTM

@imback82 imback82 merged commit 1e88f27 into dotnet:master Apr 29, 2020
1 check passed
1 check passed
dotnet.spark Build #20200428.4 succeeded
Details
@MikeRys MikeRys mentioned this pull request May 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
3 participants
You can’t perform that action at this time.