Optimising single-delimiter string tokenisation

Question

I am trying to optimise how I tokenise tab-delimited strings:

static void split_line(string &line, input_record &rec)
{
    int col = 0;
    char *row = &line[0];
    char *token = strtok(row, "\t");
    while (token)
    {
        switch(++col)
        {
            case 2 : rec.sequence = token; break;
            case 4 : rec.content = token; break;
            case 9 : rec.position = atoi(token); return;
        }
        token = strtok(NULL, "\t");
    }
}

where

struct input_record
{
    string sequence;
    string content;
    unsigned int position;
};

This function is called on the result of each getline in the calling code, so split_line runs over 100 million times.

Each line has at least 15 columns (of which 2, 4 and 9 are useful), and I assign the relevant tokens to members of a struct. The first 10 columns of a line contain roughly 150 characters (variable length), and I am handling millions of records. Currently it takes ~0.8µs to process a line, and this is a bottleneck in my code.

Does anyone have advice for squeezing more speed out of this?

And strtok is rarely a good solution. For single separators, it's probably slower than the alternatives (like std::find) as well.
– James Kanze, Commented Feb 4, 2015 at 18:32
And what are the types of the values you are assigning? If they're std::string, there's probably a significant amount of copying (and rescanning, to determine the length).
– James Kanze, Commented Feb 4, 2015 at 18:33
@JamesKanze The call to strtok - the other operations barely register. I am indeed assigning to strings, although I tried directly assigning the strtok return and didn't register any speed-up (this is worth trying again, I may have been stupid). I will try std::find, but I can't imagine it being faster. Thanks man!
– PidgeyBAWK, Commented Feb 4, 2015 at 18:41
Then this is not a clean cut'n'paste. What are the ampersands doing in the declaration of split_line? And you have a colon on your switch statement. And how is string defined?
– kdopen, Commented Feb 5, 2015 at 0:27
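
The two suggestions in the comments above (use std::find for a single separator, and avoid rescanning tokens to find their length) can be sketched roughly as follows. This is only an illustration, not code from the thread: the name split_line_find and its bool return value are made up here, and it assumes column 9 holds a non-negative decimal integer.

```cpp
#include <algorithm>
#include <cstdlib>
#include <string>

struct input_record
{
    std::string sequence;
    std::string content;
    unsigned int position = 0;
};

// Hypothetical helper: fills rec from columns 2, 4 and 9 of a tab-separated
// line without modifying the line. Returns true once column 9 has been read.
bool split_line_find(const std::string &line, input_record &rec)
{
    const char *start = line.data();
    const char *const end = start + line.size();
    for (int col = 1; col <= 9; ++col)
    {
        // stop points at the next tab, or at end if this is the last column
        const char *stop = std::find(start, end, '\t');
        switch (col)
        {
            // assign(first, last) copies a known range, so no strlen rescan
            case 2: rec.sequence.assign(start, stop); break;
            case 4: rec.content.assign(start, stop); break;
            case 9:
                // The guard keeps strtoul from skipping the tab of an empty
                // column 9 (tab is whitespace) and reading column 10 instead.
                rec.position = (start == stop) ? 0u
                    : static_cast<unsigned int>(std::strtoul(start, nullptr, 10));
                return true;
        }
        if (stop == end)
            return false;  // the line has fewer than 9 columns
        start = stop + 1;
    }
    return false;
}
```

Because every tab ends a column here, empty columns are counted, which differs from what strtok does. Whether this is actually faster than strtok would need to be measured on the real data.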

R Sahu · Accepted Answer · 2015-02-05 15:55:01Z

I tried the following change to your function. It does not use strtok; it does the tokenising in the function itself. I got a small speed-up.

void split_line2(string &line, input_record &rec)
{
    int col = 0;
    char *start = &line[0];
    char sep = ','; // Using ',' instead of '\t' for my testing.
    for ( char* iter = start; *iter != '\0'; ++iter )
    {
       if ( *iter == sep )
       {
          *iter = '\0';
          switch(++col)
          {
             case 2 : rec.sequence = start; break;
             case 4 : rec.content = start; break;
             case 9 : rec.position = atoi(start); return;
          }
          start = iter+1;
       }
    }
}

Here's the test program and results from my testing:

#include <iostream>
#include <string>
#include <ctime>
#include <cstring>
#include <cstdlib>   // atoi

using std::string;

struct input_record
{
    string sequence;
    string content;
    unsigned int position;
};

void split_line1(string &line, input_record &rec)
{
    int col = 0;
    char *row = &line[0];
    char *token = strtok(row, ",");
    while (token)
    {
        switch(++col)
        {
            case 2 : rec.sequence = token; break;
            case 4 : rec.content = token; break;
            case 9 : rec.position = atoi(token); return;
        }
        token = strtok(NULL, ",");
    }
}

void split_line2(string &line, input_record &rec)
{
    int col = 0;
    char *start = &line[0];
    char sep = ',';
    for ( char* iter = start; *iter != '\0'; ++iter )
    {
       if ( *iter == sep )
       {
          *iter = '\0';
          switch(++col)
          {
             case 2 : rec.sequence = start; break;
             case 4 : rec.content = start; break;
             case 9 : rec.position = atoi(start); return;
          }
          start = iter+1;
       }
    }
}

void test1(std::string &line, int n)
{
   clock_t start = clock();

   for ( int i = 0; i < n; ++i )
   {
      input_record rec;
      split_line1(line, rec);
   }

   clock_t end = clock();
   std::cout << "Time: " << (end-start)/CLOCKS_PER_SEC << std::endl;
}

void test2(std::string &line, int n)
{
   clock_t start = clock();

   for ( int i = 0; i < n; ++i )
   {
      input_record rec;
      split_line2(line, rec);
   }

   clock_t end = clock();
   std::cout << "Time: " << (end-start)/CLOCKS_PER_SEC << std::endl;
}

int main(int argc, char** argv)
{
   std::string line(argv[1]);
   int n = atoi(argv[2]);
   test1(line, n);
   test2(line, n);
}

Results:

~/Stack-Overflow/cpp>>./test-507 "1,abcd,3,xyz,4,5,6,7,8,9,10,11" 50000000
Time: 3
Time: 2

~/Stack-Overflow/cpp>>./test-507 "1,abcd,3,xyz,4,5,6,7,8,9,10,11" 100000000
Time: 6
Time: 4

~/Stack-Overflow/cpp>>./test-507 "1,abcd,3,xyz,4,5,6,7,8,9,10,11" 200000000
Time: 13
Time: 9

~/Stack-Overflow/cpp>>./test-507 "1,abcd,3,xyz,4,5,6,7,8,9,10,11" 400000000
Time: 26
Time: 18
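
The timings above are whole seconds because (end-start)/CLOCKS_PER_SEC is an integer division. For finer resolution, something along these lines could be used instead. It is a rough sketch, and ns_per_call is a made-up helper name, not part of the test program above.

```cpp
#include <chrono>

// Hypothetical helper: calls fn() n times and returns the average wall-clock
// time per call in nanoseconds, as a floating-point value.
template <typename Fn>
double ns_per_call(Fn &&fn, long n)
{
    const auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < n; ++i)
        fn();
    const auto t1 = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = t1 - t0;
    return elapsed.count() / static_cast<double>(n);
}
```

A call might look like ns_per_call([&] { input_record rec; split_line2(line, rec); }, n). One thing to keep in mind when benchmarking this way: split_line1 and split_line2 both write '\0' into the line, so a loop that reuses the same std::string only tokenises the full line on its first iteration. Copying the line into a scratch buffer before each call, and timing the copy on its own as a baseline, is one way around that.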

Update

If I change the struct from containing std::string to char*, there are substantial savings.

struct input_record
{
    char* sequence;
    char* content;
    unsigned int position;
};

~/Stack-Overflow/cpp>>./test-507 "1,abcd,3,xyz,4,5,6,7,8,9,10,11" 400000000
Time: 16
Time: 8

If that is an option, it will save you a bunch of time.

If you change the struct to char*, it will not work, because the tokens are not persistent between calls to strtok; you need to allocate space for each token and copy it, e.g. with strdup.
– AndersK, Commented Feb 5, 2015 at 17:41
@CyberSpock, That is definitely something the OP has to be cognizant of. Depending on their use case, using char* may or may not be an option.
– R Sahu, Commented Feb 5, 2015 at 17:44
The OP has strings in his struct, so there it is not a problem. I was just pointing out that the time difference may be smaller if you also have to keep allocating memory, rather than just setting a couple of pointers.
– AndersK, Commented Feb 5, 2015 at 17:46
Thanks for the edit, noticed the bug in your code there! Also, good point @CyberSpock. I did try this, and managed to get a 2x speedup!
– PidgeyBAWK, Commented Feb 6, 2015 at 15:49
Please let me know where you saw the bug. I want to update the answer with the correction.
– R Sahu, Commented Feb 6, 2015 at 16:11

R Sahu · Accepted Answer · 2015-02-04 18:36:23Z

You will be able to avoid one call to strtok by returning from the function or breaking from the while loop when col == 9. Of course, you don't need to check whether ++col < 10 in that case.

int col = 0;
char *row = &line[0];
char *token = strtok(row, "\t");
while (token)
{
    ++col;
    switch(col)
    {
        case 2 : input_struct.a = token; break;
        case 4 : input_struct.b = token; break;
        case 9 : input_struct.f = atoi(token); return;
    }
    token = strtok(NULL, "\t");
}

or

int col = 0;
char *row = &line[0];
char *token = strtok(row, "\t");
while (token)
{
    ++col;
    switch(col)
    {
        case 2 : input_struct.a = token; break;
        case 4 : input_struct.b = token; break;
        case 9 : input_struct.f = atoi(token);
    }
    if ( col == 9 )
    {
       break;
    }
    token = strtok(NULL, "\t");
}

Thanks! That was a silly mistake of mine, I have edited accordingly in the question.
– PidgeyBAWK, Commented Feb 4, 2015 at 18:44
@PidgeyBAWK, was that an error in posting or was that an error in your working code?
– R Sahu, Commented Feb 4, 2015 at 18:49
Unfortunately the posting - I renamed a few variables etc when coding it up for clarity and must have missed it. Thanks again.
– PidgeyBAWK, Commented Feb 4, 2015 at 18:50

vnp · Accepted Answer · 2015-02-04 18:39:48Z

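Untested suggestion: get rid of the switch. Making a decision for each token surely eats up time.

char *tokens[9] = {};    /* zero-based: column N is stored in tokens[N - 1] */
int col = 0;
char *token = strtok(row, "\t");
while (token && col < 9) {
    tokens[col++] = token;
    token = strtok(NULL, "\t");
}
if (col == 9) {          /* all nine columns were found */
    input_struct.a = tokens[1];
    input_struct.b = tokens[3];
    input_struct.f = atoi(tokens[8]);
}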

I have tested this, and it was slightly slower at ~0.895-0.91µs. Thanks for your suggestion though, it was certainly worth a try.
– PidgeyBAWK, Commented Feb 4, 2015 at 18:54

Dieter Lücking · Accepted Answer · 2015-02-04 19:56:30Z

You could use non-intrusive splitting that extracts the desired tokens without altering the input string (strtok alters it). An (ugly) one might be:

#include <cstring>
#include <iostream>
#include <string>

inline const char* skip_token(std::size_t skip, char token, const char* line) {
    const char* result = line;
    while(skip && result) {
        result = std::strchr(result, token);
        if(result) {
            ++result;
            --skip;
        }
    }
    return result;
}

int main() {
    const char* line = "1-2-3-4-5-6-7-8-9-10-11-12   ";
    const char* first = nullptr;
    const char* last = nullptr;

    first = skip_token(1, '-', line);
    if(first) {
        last = skip_token(1, '-', first);
        if(last) {
            std::cout << std::string(first, last - 1) << '\n';
            first = skip_token(1, '-', last);  // search from the column start so an empty column is not skipped
            if(first) {
                last = skip_token(1, '-', first);
                if(last) {
                    std::cout << std::string(first, last - 1) << '\n';
                    first = skip_token(4, '-', last);  // search from the column start so an empty column is not skipped
                    if(first) {
                        last = skip_token(1, '-', first);
                        if(last) {
                            std::cout << std::string(first, last - 1) << '\n';
                        }
                    }
                }
            }
        }
    }
}

(In this sample '-' is the separator)

That's an interesting suggestion. I'm assuming you're suggesting intrusive splitting creates overhead due to mutating the input string? I will try this as soon as possible, thanks!
– PidgeyBAWK, Commented Feb 4, 2015 at 20:01

Chris Dodd · Accepted Answer · 2015-02-04 20:15:45Z

A couple of notes: you say you have tab-separated columns, but you use strtok, which won't work correctly if you have empty columns. strtok treats two or more consecutive delimiters as a single separator. This may be what you want, or your code may be accidentally working because you never have empty columns. If you want to allow for empty columns (and want them to be counted as columns), you should use strsep (a BSD/glibc extension, not standard C) instead of strtok.
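
To make the difference concrete, here is a small sketch comparing strtok with a splitter that treats every delimiter as a column boundary. The function names split_strtok and split_keep_empty are invented for this illustration, and std::string::find stands in for strsep so that the example sticks to standard C++.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Splits with strtok: runs of delimiters collapse, so empty columns vanish.
// The line is taken by value because strtok writes NULs into it.
std::vector<std::string> split_strtok(std::string line, const char *delims)
{
    std::vector<std::string> out;
    for (char *tok = std::strtok(&line[0], delims); tok;
         tok = std::strtok(nullptr, delims))
        out.emplace_back(tok);
    return out;
}

// Splits on every single delimiter: empty columns are kept and counted.
std::vector<std::string> split_keep_empty(const std::string &line, char delim)
{
    std::vector<std::string> out;
    std::string::size_type start = 0;
    for (;;)
    {
        const std::string::size_type stop = line.find(delim, start);
        if (stop == std::string::npos)
        {
            out.push_back(line.substr(start));  // last column
            return out;
        }
        out.push_back(line.substr(start, stop - start));
        start = stop + 1;
    }
}
```

For the input "a\t\tc", split_strtok yields two tokens ("a" and "c"), while split_keep_empty yields three, the middle one empty. With strtok, every column after an empty one is therefore numbered one too low.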

If you should be using strsep, or if you never have double tabs (so it doesn't matter), you might be able to make the code slightly faster by using strchr instead of strsep and unrolling your loop:

char *token = strchr(&line[0], '\t');  /* skip column 1; &line[0] because line is a std::string in the question */
if (token) {  /* column 2 */
    input_struct.a = ++token;
    token = strchr(token, '\t'); }
if (token) {
    *token++ = '\0';
    token = strchr(token, '\t'); }
if (token) {  /* column 4 */
    input_struct.b = ++token;
    token = strchr(token, '\t'); }
if (token) {
    *token++ = '\0';
    token = strchr(token, '\t'); }
if (token) token = strchr(token+1, '\t'); /* skip col 6 */
if (token) token = strchr(token+1, '\t');
if (token) token = strchr(token+1, '\t'); /* skip col 8 */
if (token) input_struct.f = atoi(++token);

This has the advantage of not inserting NULs where they aren't needed, and not having hard-to-predict branches.

Thanks for your comments. With regards to your solution, I don't think it works correctly (maybe you did not test it?) For example, input_struct.a will be equal to everything from the 2nd column onwards. I think this may be a problem with '\0' placement?
– PidgeyBAWK, Commented Feb 5, 2015 at 10:52
@PidgeyBAWK: only if there's no tab after the second column (no third column). Otherwise, the tab will be replaced by a NUL, terminating it. There is an issue that all the pointers point into line, so if you reuse that buffer, they'll all be corrupted, but that is the case with your original code too.
– Chris Dodd, Commented Feb 5, 2015 at 15:14
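
One possible explanation for the disagreement in these comments, offered as a guess rather than something either commenter stated: the unrolled code stores the column pointer first and writes the NUL afterwards. That is fine when the fields are char*, but if they are std::string (as in the question), the assignment copies immediately, up to whatever NUL exists at that moment, which would be the end of the line. A minimal sketch of the order that works with std::string, using a made-up helper name column2_copy:

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: returns column 2 of a tab-separated, NUL-terminated
// buffer as an owning std::string. The NUL is written before the copy is
// made, so the copy stops at the end of the column.
std::string column2_copy(char *line)
{
    char *start = std::strchr(line, '\t');
    if (!start)
        return std::string();          // no second column
    ++start;                           // first character of column 2
    char *stop = std::strchr(start, '\t');
    if (stop)
        *stop = '\0';                  // terminate column 2 first...
    return std::string(start);         // ...then copy it
}
```

Constructing std::string(start) before the tab has been overwritten would instead copy column 2 and everything after it, which matches the behaviour described in the first comment.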
