vectorize_all for stri_detect_* #404
Comments
|
Good idea, but this can also be easily implemented with the |
|
To clarify: I was not suggesting using something like my hacky function, but wondering whether a faster implementation in Rcpp would make sense. I am also not sure what Let me state my argument a bit differently: My use case is text processing of large character vectors for use in ML models. If certain words from large word lists appear in a text column, I can modify a feature column in my data.table. Therefore, using Since I use |
|
So in other words, you're advocating for a set of functions for:
Let's call them Could you provide some "emulated" examples - like virtual calls on some specific inputs and the desired outputs you'd like to see? |
|
Sure. I have added examples with word boundaries and
fruit <- c("banana pineapple", "apple banana pear", "applebanana pear")
# Case 1
stri_match_any(str = fruit, patterns = c("\\bbanana\\b","\\bapple\\b"))
[1] TRUE TRUE FALSE
# Same as:
stri_detect_regex(fruit, "(\\bbanana\\b)|(\\bapple\\b)")
[1] TRUE TRUE FALSE
# Case 2
stri_match_all(str = fruit, patterns = c("\\bpear\\b","\\bapple\\b"))
[1] FALSE TRUE FALSE
# Same as
stri_detect_regex(fruit, "(?=.*\\bpear\\b)(?=.*\\bapple\\b)")
[1] FALSE TRUE FALSE |
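For comparison, the two cases above can be emulated today with existing stringi calls. This is only a rough sketch: `detect_combine` is a hypothetical helper name, not part of stringi. It runs `stri_detect_regex` once per pattern and combines the results element-wise, so it illustrates the desired semantics, not the speed being asked for.

```r
library(stringi)

# Hypothetical helper (not part of stringi): test every string against
# every pattern, then combine the per-pattern logical vectors element-wise.
detect_combine <- function(str, patterns, combine = `|`) {
  hits <- lapply(patterns, function(p) stri_detect_regex(str, p))
  Reduce(combine, hits)
}

fruit <- c("banana pineapple", "apple banana pear", "applebanana pear")

# Case 1: at least one pattern matches
any_hit <- detect_combine(fruit, c("\\bbanana\\b", "\\bapple\\b"), `|`)

# Case 2: every pattern matches
all_hit <- detect_combine(fruit, c("\\bpear\\b", "\\bapple\\b"), `&`)
```

Passing the operator as an argument keeps "any" and "all" in one function. The cost is one full pass over `str` per pattern, which is what a native implementation would presumably try to avoid.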
|
If this is only about searching for fixed patterns, possibly a Trie-like data structure could do the trick, especially if the number of patterns was large. Matching of whole words could be done using ICU's BreakIterator, internally. At first glance, I'm afraid that any other implementation will not be significantly more efficient than running |
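For the fixed, whole-word case, a rough approximation is already possible at the R level. The sketch below assumes a hypothetical helper `detect_any_word` (not part of stringi). It does not build a trie: it splits each string into words with `stri_extract_all_words` and relies on R's `%in%` for the lookup.

```r
library(stringi)

# Hypothetical helper (not part of stringi): TRUE where a string contains at
# least one whole word from `words`. Uses word tokenization plus %in%
# instead of a trie.
detect_any_word <- function(str, words) {
  tokens <- stri_extract_all_words(str)
  vapply(tokens, function(tok) any(tok %in% words), logical(1))
}

fruit <- c("banana pineapple", "apple banana pear", "applebanana pear")
has_word <- detect_any_word(fruit, c("banana", "apple"))
```

Each string is tokenized only once regardless of how many words are in the list, which is the property that matters when the word list is large.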
|
I mean, I kind of like the idea of these functions, generally. |


I very much like that option in stri_replace_* and am wondering why stri_detect_* does not have it. I have built a function that does it for me and adds the functionality to combine matches with a logical operator, but it would be great to access this with full C++ speed.