8
\$\begingroup\$

I am building a web server from scratch and trying to make certain tasks faster by embedding C code into the mix for performance. Specifically I'm worried about how the std::string class with the .find() and other functions compare to straight pointer arithmetic.

#include <iostream>
#include <map>
#include <string>

std::map<std::string, std::string> http_request;

void parse_header( void * );

int main()
{

    char * msg= "GET / HTTP/1.1\r\n"
                "Host: 192.241.213.46:6880\r\n"
                "Upgrade-Insecure-Requests: 1\r\n"
                "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"
                "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8\r\n"
                "Accept-Language: en-us\r\n"
                "Accept-Encoding: gzip, deflate\r\n"
                "Connection: keep-alive\r\n\r\n";

    parse_header( msg );

}

void parse_header( void *msg )
{
    char *head = (char *) msg;
    char *mid;
    char *tail = head;

    if( sizeof( msg ) == 0 )
    {
        return;
    }

    // Find request type
    while( *head++ != ' ');
    http_request[ "Type" ] = std::string( ( char * ) msg ).substr( 0 , ( head - 1) - tail );

    // Find path
    tail = head;
    while( *head++ != ' ');
    http_request[ "Path" ] = std::string( ( char * ) msg ).substr( tail - ( char *)msg , ( head - 1) - tail );

    // Find HTTP version
    tail = head;
    while( *head++ != '\r');
    http_request[ "Version" ] = std::string( ( char * ) msg ).substr( tail - ( char *)msg , ( head - 1) - tail );

    // Map all headers from a key to a value
    while( true )
    {
        tail = head + 1;
        while( *head++ != '\r' );
        mid = strstr( tail, ":" );   

        // Look for the failed strstr   
        if( tail > mid )
            break;

        http_request[ std::string( ( char * ) msg ).substr( tail - ( char *)msg , ( mid ) - tail  ) ] = std::string( ( char * ) msg ).substr( mid + 2 - ( char *) msg , ( head - 3 ) - mid );
    }

    // Determine if successful
    std::cout << http_request[ "Host" ] << std::endl; 
    std::cout << http_request[ "Upgrade-Insecure-Requests" ] << std::endl; 
    std::cout << http_request[ "Accept" ] << std::endl; 
    std::cout << http_request[ "User-Agent" ] << std::endl; 
    std::cout << http_request[ "Accept-Language" ] << std::endl; 
    std::cout << http_request[ "Accept-Encoding" ] << std::endl; 
    std::cout << http_request[ "Connection" ] << std::endl; 
}

Data comes in via the linux socket.send() function, so it is a void * type. I directly send that data to a parse_header function to create a std::map for easy access. Is it necessary to be this low-level, or can this speed be obtained by the STL?

Note: I killed OO design for a minimal example. The end-goal has an HTTP class.

Note++: I'd prefer to have this code review pertain to performance rather than c++ semantics, I'd like it to be as conforming as possible while also being fast and lightweight. This seems to be the purest form of the cluge of C/C++.

\$\endgroup\$
5
  • \$\begingroup\$ There is something wrong: sizeof(msg) == 0, I don't think this is what you meant. \$\endgroup\$ Commented Mar 6, 2017 at 8:12
  • \$\begingroup\$ You are right. sizeof the pointer was not what I was meaning! Thanks! \$\endgroup\$ Commented Mar 6, 2017 at 17:40
  • \$\begingroup\$ This is a false worry. The string implementation is efficient. \$\endgroup\$ Commented Mar 6, 2017 at 18:33
  • \$\begingroup\$ IF you are worried abut speed use nginx \$\endgroup\$ Commented Mar 6, 2017 at 18:35
  • \$\begingroup\$ @Loki I use nginx for the webserver, this code is for handling API requests, sorry for the confusion. I benchmarked the code against Jorge Omar Medra's example and my code comes out roughly ~20us or so faster( vs ~50us avg), which is most likely due to less function calls. I guess the question now is: the c++ code provided is more maintainable because everyone hates pointers, but is the difference in speed worth it if I'm the only programmer? I was thinking if you were getting thousands of requests a second 30us is more beneficial than 50us. \$\endgroup\$ Commented Mar 6, 2017 at 18:41

2 Answers 2

2
\$\begingroup\$

Note++: I'd prefer to have this code review pertain to performance rather than c++ semantics

I get it, I really do (he says unconvincingly); but I also really think that you're going about this backwards. Step 1 is always to "make it work"; you don't get to "make it fast, make it cheap" until after you get it working. So you should momentarily forget about performance and just work on fixing the bugs and infelicities; and then if it turns out to be slow, you find the bottlenecks and improve them. (I'm not going as far as to say you need a profiler; you can find bottlenecks with eyeballs and intuition too. But there's no point in micro-polishing code until the bugs are out.)


So: sizeof(msg)==0 is never true. It looks like you're wanting to check the size of the message itself (like, number of bytes in the message); so, you need to add a function parameter to get the message size. (void *msg, size_t msg_size) would do fine. In C++17 the new party line is probably to use a std::string_view.


while( *head++ != ' '); is a guaranteed segfault, as soon as I throw a malformed request at your server. Don't even give me the opportunity to segfault your code. Say what you mean:

while (head != msg_end && *head == ' ') ++head;

The repeated subexpression std::string( ( char * ) msg ) is a red flag: both because you're creating a string over and over (which does heap allocation, which is slow) and because you're using a C-style cast (type-casts are always red flags in both C and C++) and because you're casting away const. /me checks the function signature again — No, you're not casting away const. But you should be! A parsing function shouldn't be allowed to mutate the message buffer. Make that signature (const void *msg, size_t msg_size). Or even better: (const char *msg, const char *msg_end).


Iostreams are slow; don't use std::cout << ... if you care about micro-optimizing performance. — But in fact you shouldn't be printing output from within your parsing function anyway! Let's just have the parsing function return the map of headers.

Also, your variable names seem confused. I'd try to make each extracted string run from head to tail, since that's how we usually talk about things in English. So:

void parse_header(const char *msg, const char *msg_end)
{
    const char *head = msg;
    const char *tail = msg;

    // Find request type
    while (tail != msg_end && *tail != ' ') ++tail;
    http_request["Type"] = std::string(head, tail);

    // Find path
    while (tail != msg_end && *tail == ' ') ++tail;
    head = tail;
    while (tail != msg_end && *tail != ' ') ++tail;
    http_request["Path"] = std::string(head, tail);

    // Find HTTP version
    while (tail != msg_end && *tail == ' ') ++tail;
    head = tail;
    while (tail != msg_end && *tail != '\r') ++tail;
    http_request["Version"] = std::string(head, tail);
    if (tail != msg_end) ++tail;  // skip '\r'
    // TODO: what about the trailing '\n'?

    // Map all headers from a key to a value
    head = tail;
    while (head != msg_end && *head != '\r') {
        while (tail != msg_end && *tail != '\r') ++tail;
        const char *colon = memchr(head, tail, ':');
        if (colon == NULL) {
            // TODO: malformed headers, what should happen?
            break;
        }
        const char *value = colon+1;
        while (value != tail && *value == ' ') ++value;
        http_request[ std::string(head, colon) ] = std::string(value, tail);
        head = tail+1;
        // TODO: what about the trailing '\n'?
    }
    return http_request;
}

Completely untested, mind you; but the most important things about the above code are that - it uses the appropriate std::string constructor instead of repeatedly allocating a gigantic string and then substr'ing parts of it; - it follows your original code (good idea!) in repeating the same basic idiom over and over, mechanically, so that if there ends up being a bug in the logic it's easy to see all the places to apply the fix.


Another thing to try if you have lots of headers, or if you're planning to do lots of lookups in the resulting map, would be to try using std::unordered_map instead of std::map. Less pointer-chasing often means faster code.

\$\endgroup\$
6
  • \$\begingroup\$ Thanks for the reminder about seg faults. I was assuming too much in my code. I also had no idea about that particular std::string constructor, so yes, that definitely simplified the code! I'm also guessing you'd rather message_end rather than message_length because you'd have to have more math to see if the pointer is in the right position, correct (i.e., tail - msg < msg_end - msg)? Finally, is &msg[ sizeof( msg ) - 1] the cheapest way to get the end of an array? \$\endgroup\$ Commented Mar 6, 2017 at 21:10
  • \$\begingroup\$ If you're interested, I also benchmarked the two which can be found here pastebin.com/bLBzapgi. This solution is approximately 3x faster than the original.. really goes to show how expensive the heap is. \$\endgroup\$ Commented Mar 6, 2017 at 22:09
  • \$\begingroup\$ @ShawnicHedgehog: I think you're still misunderstanding what sizeof actually does. It is not a synonym for strlen. \$\endgroup\$ Commented Mar 7, 2017 at 0:38
  • 1
    \$\begingroup\$ I am aware that sizeof returns the amount of bytes. I'm misusing sizeof in the OP with a pointer. On my machine a character is 1 byte. If I wanted it to be perfect I would do sizeof( data ) / sizeof ( data[0] ) which would find the number of elements in an array. It works in this example because I'm specifying const char array[] instead of const char *array. I thought we are not able to update the OP's code, otherwise I would've fixed it. \$\endgroup\$ Commented Mar 7, 2017 at 0:41
  • 1
    \$\begingroup\$ Didn't mean to sound torte, hard to set a tone over the internet. Thanks for the advice, strchr seems to be what I'm looking for. \$\endgroup\$ Commented Mar 7, 2017 at 1:11
2
\$\begingroup\$

First of all, The Http server send all the headers Line by Line. In your sample i see you created a dummy sample with the next buffer:

char * msg= "GET / HTTP/1.1\r\n"
            "Host: 192.241.213.46:6880\r\n"
            "Upgrade-Insecure-Requests: 1\r\n"
            "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"
            "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8\r\n"
            "Accept-Language: en-us\r\n"
            "Accept-Encoding: gzip, deflate\r\n"
            "Connection: keep-alive\r\n\r\n";

I suppose that you first recive all the header before analyzing it with your method void parse_header( void *msg )... and that is the reason that the task to get each Key-Value is so hard because you are using parsers like this:

http_request[ std::string( ( char * ) msg ).substr( tail - ( char *)msg , ( mid ) - tail  ) ] = std::string( ( char * ) msg ).substr( mid + 2 - ( char *) msg , ( head - 3 ) - mid );

What will I do to improve your code:

  1. I will keep the map<string,string> to store all headers.

  2. Read the header one line by time and, before to continue with the other line, analyze it.

  3. I don't see any issue with the code to parse the first but I will modify it to analyze is a single line and simplify the parsers.

If you read and analyze one line by line, the code to map all the keys with values could be like this:

void parseFirstLine(string line)
{

    string key = "";
    string value = "";
    int  position, lpost;

    // Find request type
    position = line.find(' ');
    http_request[ "Type" ] = line.substr(0, position);
    position++; //Skip character ' '

    // Find path
    position = line.find(' ', lpost);
    http_request[ "Path" ] = line.substr(lpost, (position-lpost));
    position++; //Skip character ' '

    // Find HTTP version
    http_request[ "Version" ] = line.substr(position);
}

void parseHeader(string line)
{

    string key = "";
    string value = "";

    if(data.size() == 0) return;

    int posFirst = line.find(":",0); //Look for separator ':'

    key = line.substr(0, posFirst);
    value = line.substr(posFirst + 1);

    http_request[key] = value;
}

(*) This code could have some bugs because I'm writing this code without any developer environment to review this.

Is your decision where you get each line to analyze, at the moment you get it from socket reading or after finishing of read all the header.

\$\endgroup\$
2
  • \$\begingroup\$ I think you misunderstood.. I am the Http server; this is server-side code. Data is received using the socket recv function man7.org/linux/man-pages/man2/recv.2.html. This function requires a data buffer and a known data buffer length. To my knowledge it would be incredibly more difficult to read 1 line at a time since you have to assign a length, and you don't know the length because the first line is variable, and the specific headers required change if authorization is necessary. \$\endgroup\$ Commented Mar 6, 2017 at 17:39
  • \$\begingroup\$ Yes, @Shawnic-Hedgehog, i understood you and it is the same case for both sides. The server always recives a Http request which the header are segment by lines followed by the contest, the response has the same structure. I have some sample to read the Request line by line. I have a sample of how to read it, but use the lib curl: github.com/jorgemedra/HTTPSClient/blob/master/HTTPSClient/…. My post is in spanish. \$\endgroup\$ Commented Mar 6, 2017 at 18:14

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.