Algorithm for string processing

Question

I am looking for a algorithm for string processing, I have searched for it but couldn't find a algorithm that meets my requirements. I will explain what the algorithm should do with an example.

There are two sets of word sets defined as shown below:

**Main_Words**: swimming, driving, playing
**Words_in_front**: I am, I enjoy, I love, I am going to go

The program will search through a huge set of words as soon it finds a word that is defined in Main_Words it will check the words in front of that Word to see if it has any matching words defined in Words_in_front.

i.e If the program encounters the word "Swimming" it has to check if the words in front of the word "Swimming" are one of these: I am, I enjoy, I love, I am going to go.

Are there any algorithms that can do this?

It depends ... what approach have you already tried? What language would you use to implement this? — maditya
– maditya, Commented Mar 30, 2013 at 9:36
I want to implement this using java. I know i can find the words defined in the main_words, i am not sure of the logic that i should use to check the words in front. — Sindu_
– Sindu_, Commented Mar 30, 2013 at 9:38
Is your "huge set of words" that you're searching through a long string of text, or literally a set? Also, how long might the longest words_in_front member be? — Noyo
– Noyo, Commented Mar 30, 2013 at 9:52
From how you described the problem, I'd say you could simply use if and .equals(). What's wrong with that? — jogojapan
– jogojapan, Commented Mar 30, 2013 at 11:25

Noyo · Accepted Answer · 2013-03-30 15:59:49Z

A straightforward way to do this would be to just do a linear scan through the text, always keeping track of the last N+1 words (or characters) you see, where N is the number of words (or characters) in the longest phrase contained in your words_in_front collection. When you have a "main word", you can just check whether the sequence of N words/characters before it ends with any of the prefixes you have.

This would be a bit faster if you transformed your words_in_front set into a nicer data structure, such as a hashmap (perhaps keyed by last letter in the phrase..) or a prefix/suffix tree of some sort, so you wouldn't have to do an .endsWith over every single member of the set of prefixes each time you have a matching "main word." As was stated in another answer, there is much room for optimization and a few other possible implementations, but there's a start.

SomeWittyUsername · Accepted Answer · 2013-03-30 09:36:33Z

1

Create a map/dictionary/hash/associative array (whatever is defined in your language) with key in Main_Words and Words_in_front are the linked list attached to the entry pointed by the key. Whenever you encounter a word matching a key, go to the table and see if in the attached list there are words that match what you have in front.

That's the basic idea, it can be optimized for both speed and space.

answered Mar 30, 2013 at 9:36

SomeWittyUsername

18.4k3 gold badges46 silver badges87 bronze badges

Comments

Matthew Strawbridge · Accepted Answer · 2013-03-30 10:33:12Z

1

You should be able to build a regular expression along these lines:

I (am|enjoy|love|am going to go) (swimming|driving|playing)

answered Mar 30, 2013 at 10:33

Matthew Strawbridge

20.8k10 gold badges81 silver badges97 bronze badges

Collectives™ on Stack Overflow

Algorithm for string processing

3 Answers 3

Comments

Comments

Comments

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

Comments

Related