Implement ML Features: Word2Vec #491
@GoEddie Each E2E test is taking at least one more minute with this change. Could you check whether you can reduce it?
@imback82 I removed the extra time from TestWord2VecTest, but Word2VecModel still adds about 10 seconds because calling model.Fit causes Spark to dump out loads of log messages. If I turn off logging (or set it to Error) it is much faster. Is it OK to disable logging, or is it needed for other tests? I could set it to None, call model.Fit, then set it back to Warn (or Info?) again. I tested with pyspark and it does the same thing: with logging on it adds the same time, so it isn't a dotnet spark thing.
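The set-then-restore idea described above can be sketched in Python, since the same slowdown reportedly shows up under pyspark. This is a minimal sketch, not code from this PR: the helper name `quiet_spark_logs` and its parameters are hypothetical, and it only assumes the context object exposes a `setLogLevel(level)` method, as pyspark's `SparkContext` does. The level to restore is passed in explicitly rather than read back.

```python
from contextlib import contextmanager


@contextmanager
def quiet_spark_logs(spark_context, quiet_level="OFF", restore_level="WARN"):
    """Silence Spark logging inside a `with` block, then set it back.

    `quiet_spark_logs` is a hypothetical helper. `restore_level` is an
    assumption: choose whatever level the rest of the test suite expects.
    """
    spark_context.setLogLevel(quiet_level)
    try:
        yield spark_context
    finally:
        # Runs even if the body raises, so later tests get their logging back.
        spark_context.setLogLevel(restore_level)


# Usage with a real session (assumes pyspark is installed and that `spark`,
# `word2vec` and `document_df` already exist):
#
#   with quiet_spark_logs(spark.sparkContext):
#       model = word2vec.fit(document_df)
```

Restoring in a `finally` block keeps a failing fit from leaving logging switched off for the tests that run afterwards.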
Cool, can we set it to error for other tests? Also, please ping me when you remove WIP. Thanks!
Hi @imback82 - can you review it again now that the build is faster?
...csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/Word2VecModelTests.cs
Word2VecModel model = word2vec.Fit(documentDataFrame);
const int expectedSynonyms = 2;
GoEddie commented Apr 20, 2020
findSynonyms takes the word to look up and the maximum number of synonyms to return, so the 2 checks that the result is limited to 2 rows.
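To make the role of that second argument concrete, here is a toy stand-in in plain Python. It is not Spark's implementation and uses no Spark API: the function name `find_synonyms`, the cosine-similarity ranking, and the 2-d vectors are all assumptions invented for illustration. It only shows how a "maximum number of synonyms" argument caps the number of rows returned.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_synonyms(vectors, word, num):
    """Return at most `num` (word, similarity) pairs closest to `word`, best first."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word  # the query word itself is not listed as its own synonym
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num]


# Hypothetical 2-d vectors, made up purely for this example.
toy_vectors = {
    "spark": [1.0, 0.0],
    "hadoop": [0.9, 0.1],
    "flink": [0.8, 0.3],
    "banana": [0.0, 1.0],
}

result = find_synonyms(toy_vectors, "spark", 2)
print(len(result))  # 2, even though three other words are available
```

Asserting on the row count, as the test in this PR does with `expectedSynonyms`, checks the cap without depending on which words a trained model happens to rank highest.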
src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/Word2VecTests.cs
{
    _spark = fixture.Spark;
    // Calling Word2Vec.Fit is really slow with logging on, which makes the test really slow
    _spark.SparkContext.SetLogLevel("OFF");
Thanks @imback82, I have fixed that indentation issue!
LGTM


GoEddie commented Apr 15, 2020
Hi,
This implements Word2Vec and Word2VecModel so that the Word2Vec example can be completed (see https://spark.apache.org/docs/2.2.0/mllib-feature-extraction.html#word2vec). This is for issue #381.
Thanks,
Ed Elliott