In essence, what I'm looking to do is split a string into one-word and two-word segments and train classifiers against them. The code first lowercases the string, then strips every character that isn't alphanumeric or a space.
It then splits the result into a list of one-word entries, and from that builds a new *[]string filled with every sequential two-word pair. For example, "Hello, World! Foo" yields ["hello", "world", "foo"] and ["hello world", "world foo"].
In particular, I'm concerned about the way I'm creating the two-word *[]string slice, and the way I'm using pointers. The first is a speed concern; the second is, well, a memory-efficiency concern.
package spamfilter

import (
    "regexp"
    "strings"

    "github.com/jbrukh/bayesian"
)

const (
    Spam     bayesian.Class = "Spam"
    Spamlike bayesian.Class = "Spamlike"
    NotSpam  bayesian.Class = "NotSpam"
)
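// SpamFilter pairs a classifier over single words with one over
// sequential word pairs.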
type SpamFilter struct {
    singleWordClassifier *bayesian.Classifier
    doubleWordClassifier *bayesian.Classifier
}
func CreateNewSpamFilter() *SpamFilter {
    v1 := bayesian.NewClassifier(Spam, Spamlike, NotSpam)
    v2 := bayesian.NewClassifier(Spam, Spamlike, NotSpam)
    tmp := SpamFilter{v1, v2}
    return &tmp
}
func LoadSpamFilterFromFile(fileName1 string, fileName2 string) *SpamFilter {
    tmp1, err1 := bayesian.NewClassifierFromFile(fileName1)
    tmp2, err2 := bayesian.NewClassifierFromFile(fileName2)
    switch {
    case err1 != nil:
        panic(err1)
    case err2 != nil:
        panic(err2)
    }
    return &SpamFilter{tmp1, tmp2}
}
func splitStringToSingleWords(input string) *[]string {
    reg, err := regexp.Compile("[^A-Za-z0-9 ]+")
    if err != nil {
        panic(err)
    }
    tmp := reg.ReplaceAllString(input, "")
    split := strings.Fields(tmp)
    return &split
}
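One variation I've been weighing, not shown above, is compiling the pattern once at package level so it isn't rebuilt on every call; nonAlnum and the function name here are placeholders of mine:

// Compile once at init time; MustCompile panics at startup if the
// fixed pattern is bad, so the per-call error branch disappears.
var nonAlnum = regexp.MustCompile("[^A-Za-z0-9 ]+")

func splitStringToSingleWordsCached(input string) []string {
    return strings.Fields(nonAlnum.ReplaceAllString(input, ""))
}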
func SplitStringToWords(input string) (*[]string, *[]string) {
    splitSingle := splitStringToSingleWords(strings.ToLower(input))
    // Guard against empty input: make([]string, -1) would panic otherwise.
    splitDouble := make([]string, 0)
    if len(*splitSingle) > 1 {
        splitDouble = make([]string, len(*splitSingle)-1)
        for i := 0; i < len(*splitSingle)-1; i++ {
            splitDouble[i] = (*splitSingle)[i] + " " + (*splitSingle)[i+1]
        }
    }
    return splitSingle, &splitDouble
}
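Along the same lines, I sketched a pointer-free variant, since a slice is already a small header (pointer, length, capacity) and returning it by value avoids the *[]string indirection. It reuses the package-level nonAlnum from the sketch above, and the name is again a placeholder:

// Hypothetical value-returning variant of SplitStringToWords.
func splitStringToWordsByValue(input string) ([]string, []string) {
    single := strings.Fields(nonAlnum.ReplaceAllString(strings.ToLower(input), ""))
    double := make([]string, 0)
    if len(single) > 1 {
        double = make([]string, len(single)-1)
        for i := range double {
            // Join each word with its successor into a two-word entry.
            double[i] = single[i] + " " + single[i+1]
        }
    }
    return single, double
}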
func (filter *SpamFilter) TrainStringsInSpamFilter(trainType bayesian.Class, singleWords *[]string, doubleWords *[]string) *SpamFilter {
    (*filter).singleWordClassifier.Learn(*singleWords, trainType)
    (*filter).doubleWordClassifier.Learn(*doubleWords, trainType)
    return filter
}
func (filter *SpamFilter) TestStringThroughSpamFilter(testString string) (*[]float64, *[]float64) {
    splitSingle, splitDouble := SplitStringToWords(testString)
    p1, _, _ := (*filter).singleWordClassifier.ProbScores(*splitSingle)
    p2, _, _ := (*filter).doubleWordClassifier.ProbScores(*splitDouble)
    return &p1, &p2
}
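For context, this is roughly how I intend to wire everything together; the sample strings are invented, real training would use a proper corpus, and fmt would need importing:

func exampleUsage() {
    filter := CreateNewSpamFilter()

    // Train both classifiers on a labelled example.
    single, double := SplitStringToWords("Buy cheap pills now")
    filter.TrainStringsInSpamFilter(Spam, single, double)

    single, double = SplitStringToWords("Lunch at noon tomorrow")
    filter.TrainStringsInSpamFilter(NotSpam, single, double)

    // Score an unseen string against both classifiers.
    p1, p2 := filter.TestStringThroughSpamFilter("cheap lunch pills")
    fmt.Println(*p1, *p2)
}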
How does this look, generally speaking? What should I have done differently?
Specifically:
- Is my use of pointers correct?
- Am I segmenting the strings in the most efficient possible way? I intend to use this on a rather large dataset.
- Will this be threadsafe/goroutine-friendly for large datasets? (See the concurrency sketch below.)
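For that last point, this is the sort of harness I'd run under the -race flag; numWorkers and the sample text are placeholders, sync would need importing, and since I don't believe bayesian.Classifier documents any thread-safety guarantees, concurrent Learn calls would presumably need a mutex:

// Concurrency smoke test: score from several goroutines at once.
func hammerFilter(filter *SpamFilter) {
    const numWorkers = 8 // placeholder
    var wg sync.WaitGroup
    for i := 0; i < numWorkers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            // Read-only scoring; concurrent training would need locking.
            filter.TestStringThroughSpamFilter("some sample text")
        }()
    }
    wg.Wait()
}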