Verify word embedding model downloader #5532
Labels: enhancement, good first issue, NLP, P2, up-for-grabs


An internal user reported a stall during .Fit() of the word embedding transform.
On first use, the word embedding transform downloads the word embedding model from the CDN.
To test:
Check the local folder and ~/.local/share/mlnet-resources/WordVectors/ for a file named wiki.en.vec
Example code:
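A minimal sketch of such a pipeline, assuming the Microsoft.ML NuGet package is referenced. The TextData class, the column names ("Tokens", "FeaturizedText", "Features"), and the sample sentence are illustrative assumptions. The cache check assumes that LocalApplicationData resolves to ~/.local/share, as it normally does on Linux. It is meant as a starting point for reproducing the download, not as a definitive repro.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

// Hypothetical input row type, used only for this sketch.
public class TextData
{
    public string Text { get; set; }
}

public static class Program
{
    public static void Main()
    {
        // Cache path taken from this issue; assumes LocalApplicationData
        // maps to ~/.local/share (the usual case on Linux).
        var cached = Path.Combine(
            Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
            "mlnet-resources", "WordVectors", "wiki.en.vec");
        Console.WriteLine($"Cached before Fit(): {File.Exists(cached)}");

        var mlContext = new MLContext();
        var data = mlContext.Data.LoadFromEnumerable(new[]
        {
            new TextData { Text = "ML.NET downloads the fastText model on first use." }
        });

        // Clean the text the way the issue describes: drop numbers,
        // keep diacritics, lowercase. Tokens go to their own column.
        var options = new TextFeaturizingEstimator.Options
        {
            OutputTokensColumnName = "Tokens",
            KeepNumbers = false,
            KeepDiacritics = true,
            CaseMode = TextNormalizingEstimator.CaseMode.Lower
        };

        var pipeline = mlContext.Transforms.Text
            .FeaturizeText("FeaturizedText", options, "Text")
            .Append(mlContext.Transforms.Text.ApplyWordEmbedding(
                "Features", "Tokens",
                WordEmbeddingEstimator.PretrainedModelKind.FastTextWikipedia300D));

        // Fit() is where the model download (and the reported stall) would occur.
        var timer = Stopwatch.StartNew();
        var model = pipeline.Fit(data);
        Console.WriteLine($"Fit() took {timer.Elapsed}");
    }
}
```

Deleting wiki.en.vec from the cache folder before running should force a fresh download, so the elapsed time gives a rough signal of whether the downloader stalls.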
The code here shows a full example of FeaturizeText for use with ApplyWordEmbedding. Specifically, it creates the tokens for ApplyWordEmbedding by removing numbers, keeping diacritics, and lowercasing, to match how the fastText model was created. This text cleaning reduces the out-of-vocabulary (OOV) issue in the word embedding. For any specific dataset, these options can be tested.
Side note:
We should make a sample of FeaturizeText with ApplyWordEmbedding. I wrote the above since I couldn't locate one to link to in this issue.
Additional user report: #5450 (comment)