
I have C++ code that needs to store some data whenever an event is triggered. The data contains about 3000 floating point values, and each of these values needs to be written to a file when the event is triggered. On average, the event is triggered every 25 milliseconds, so we need to finish writing the 3000 floating point values within 25 milliseconds, before the next event is triggered. We can assume that each log is 10 minutes long, i.e. after data for 10 minutes has been logged in a file, another file is created to continue the same writing task.

Which is the most suitable way to write this amount of data? CSV or JSON can be quite time-consuming, and the writing would not finish within 25 ms. Is there any other method of file writing that can be used in C++ and can scale with such a large amount of data?

(Edit:) Hardware information:

SSD: NVMe

CPU: Intel Core i9

Cores: 24

RAM: 32 GB

(Edit 2:) Here is a summary of how the code is working:

struct DataStruct
{
    std::vector<float> vec;
    std::map<int, float> mp;
    float dt;
    std::map<std::string, std::vector<float>> strVec;
};

void StoreData(std::ostream &outputFile, DataStruct *data[10])
{
   for (int i = 0; i < 10; i++) // Assume that there are 10 objects
   {
      // Each object is described by 300 features.
      // These features are stored in data.
      // As you can see, the data types in the structure are
      // heterogeneous. Some of them are vectors, some
      // are maps. But the total number of variables in the
      // structure is 300.
   }
}
  • Some perspective: you need a data rate of roughly 480K bytes per second for 32 bit floats, and twice that for 64 bit, assuming a binary format. You should plan for faster rates than that. Commented Jun 13 at 0:57
  • Given the requirements here, you can write them to a file as binary: write(1,floats,size(floats)*8). You could also write them to /dev/null and still comply with the requirements. Commented Jun 13 at 5:23
  • Gave an answer for this very generic solution; this is a strong candidate to mark as a duplicate, and very likely there is already an answer like this one, but if not, this one will serve as a base. Commented Jun 13 at 5:50
  • Is there a specific reason why you don't want to give us more context? Tell us how the data is created - is there a specific device which creates the events? How is the data processed afterwards? What is the overall goal of the system? Commented Jun 13 at 18:43
  • @AndrewHenle Output needs to be read after the code execution is finished, i.e. we can read the output offline. Commented Jun 16 at 8:33

4 Answers


The needed raw bandwidth is 480 kB/s if you have 32 bit floats, and that should not really stress any modern disk. Spinning disks are typically in the 100 MB/s range, and an NVMe disk will be much faster. Do note that all disks tend to perform best with large sequential operations, and this is especially true for spinning disks.

CSV

Using a text based format should take roughly three times as much space/bandwidth, so about 1.5 MB/s, and that should still not be a problem. It will likely make it easier to read the data in some third party system. Text based formats will also require a bit more computation, but that should probably not be a problem with a powerful 4 GHz processor.
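
If CSV is chosen, the formatting cost can be kept small by avoiding stream-based formatting. As a rough sketch, assuming a C++17 standard library with floating point std::to_chars support (the function name and buffer size here are illustrative):

#include <charconv>
#include <cstddef>
#include <string>
#include <vector>

// Format one event (~3000 floats) as one CSV line into a reusable string,
// avoiding the locale and allocation overhead of stream-based formatting.
void appendCsvRow(const std::vector<float>& values, std::string& out)
{
    char buf[32];
    for (std::size_t i = 0; i < values.size(); ++i) {
        const auto res = std::to_chars(buf, buf + sizeof(buf), values[i]);
        out.append(buf, res.ptr - buf);
        out.push_back(i + 1 == values.size() ? '\n' : ',');
    }
}

The accumulated string can then be written with a single write call per event and cleared for reuse.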

JSON

JSON should make it easier to add metadata to your format, and it will be about as large as CSV. You also have the option to base64-encode your data for a more modest size increase over binary formats.

Plain data

Just write the data directly to the file. This is simple to do, but it makes it a bit more difficult to add any kind of metadata to each data chunk, especially if you intend it to be read by some other software.
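
As a minimal sketch of that approach (the function name and usage are illustrative, not taken from the question's code):

#include <fstream>
#include <vector>

// Append one event's 3000 floats to an already-open binary file.
// At 4 bytes per float this is a single ~12 kB write per event.
void writeEvent(std::ofstream& out, const std::vector<float>& values)
{
    out.write(reinterpret_cast<const char*>(values.data()),
              values.size() * sizeof(float));
}

// Usage: std::ofstream out("log.bin", std::ios::binary);
//        writeEvent(out, values);   // every ~25 ms

Opening the file once per 10-minute log and reusing the stream keeps the per-event cost to a single write call.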

Protocol buffers

Protobuf is a fairly common format for encoding various kinds of data in a binary format. It supports length-delimited messages, which makes it fairly easy to append messages to a stream. It should also be fairly easy to read in some other system, since it is supported in multiple languages.
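
As a hedged sketch of the length-delimited idea, assuming a hypothetical frame.proto containing message Frame { repeated float values = 1; } compiled by protoc into a Frame class. The fixed 4-byte length prefix used here is a deliberate simplification, not necessarily the varint-delimited convention some protobuf utilities use:

#include <cstdint>
#include <fstream>
#include <string>

#include "frame.pb.h"  // hypothetical header generated by protoc from:
                       //   message Frame { repeated float values = 1; }

// Append one serialized Frame with a fixed 4-byte length prefix so a reader
// can find the message boundaries when scanning the file.
void appendFrame(std::ofstream& out, const Frame& frame)
{
    std::string payload;
    frame.SerializeToString(&payload);  // standard protobuf serialization call
    const std::uint32_t len = static_cast<std::uint32_t>(payload.size());
    out.write(reinterpret_cast<const char*>(&len), sizeof(len));
    out.write(payload.data(), static_cast<std::streamsize>(payload.size()));
}

A reader would then read the 4-byte length, read that many bytes, and call ParseFromString on them.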

Other things to consider

Buffering

You probably cannot guarantee that every single write completes within 25 ms, so you want to ensure that writing can never interfere with data gathering, or anything else that needs doing. That means some kind of buffering to absorb any variance in write times. Note that the OS can probably provide write buffering, and that might be enough, but it might not be enabled by default.
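
One way to get that decoupling is a small producer/consumer queue with a dedicated writer thread, so the event handler only copies the sample and never waits on the disk. The class below is only a sketch; the names are made up and error handling is omitted:

#include <condition_variable>
#include <deque>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Sketch of an asynchronous logger: the event handler calls push() and returns
// immediately; a background thread drains the queue and does the disk writes.
class AsyncLogger {
public:
    explicit AsyncLogger(const std::string& path)
        : out_(path, std::ios::binary), writer_([this] { run(); }) {}

    ~AsyncLogger() {
        { std::lock_guard<std::mutex> lock(m_); done_ = true; }
        cv_.notify_one();
        writer_.join();
    }

    void push(std::vector<float> sample) {   // called on every event (~25 ms)
        { std::lock_guard<std::mutex> lock(m_); q_.push_back(std::move(sample)); }
        cv_.notify_one();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(m_);
        while (!done_ || !q_.empty()) {
            cv_.wait(lock, [this] { return done_ || !q_.empty(); });
            while (!q_.empty()) {
                std::vector<float> s = std::move(q_.front());
                q_.pop_front();
                lock.unlock();   // do the slow write without holding the lock
                out_.write(reinterpret_cast<const char*>(s.data()),
                           s.size() * sizeof(float));
                lock.lock();
            }
        }
    }

    std::ofstream out_;
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<std::vector<float>> q_;
    bool done_ = false;
    std::thread writer_;   // declared last so it starts after the other members
};

A lock-free ring buffer would be another option, but at ~40 events per second a mutex-protected queue like this is more than enough.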

Compression

Chances are that your data is not completely random, and that might make it possible to reduce the amount of data by compressing it. There are algorithms like LZ4 specifically designed for speed, but regular deflate will likely work just fine. In some cases it might help to store the difference between values rather than the values themselves; just keep floating point inaccuracies in mind when doing any kind of computation. Note that compression tends to reduce the size difference between text based and binary formats.
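
For example, with the LZ4 library (assuming liblz4 is installed and the program is linked with -llz4), each block of samples could be compressed and stored with a small size header so it can be decoded later. This is only a sketch of the idea:

#include <cstdint>
#include <fstream>
#include <vector>

#include <lz4.h>   // liblz4; link with -llz4

// Compress one block of samples with LZ4 and append it together with a small
// header (uncompressed size, compressed size) so it can be decoded later.
void writeCompressedBlock(std::ofstream& out, const std::vector<float>& values)
{
    const char* src = reinterpret_cast<const char*>(values.data());
    const int srcSize = static_cast<int>(values.size() * sizeof(float));

    std::vector<char> dst(LZ4_compressBound(srcSize));
    const int dstSize = LZ4_compress_default(src, dst.data(), srcSize,
                                             static_cast<int>(dst.size()));
    if (dstSize <= 0) return;   // compression failed; handle as appropriate

    const std::uint32_t header[2] = { static_cast<std::uint32_t>(srcSize),
                                      static_cast<std::uint32_t>(dstSize) };
    out.write(reinterpret_cast<const char*>(header), sizeof(header));
    out.write(dst.data(), dstSize);
}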

Summary

I would probably go for protobuf serialization, since I'm familiar with it, and it is designed for speed. Probably with some form of compression just to reduce the file size.

  • "Spinning disks are typically in the 100 MB/s range" Be careful, as that's only for streaming data. For random IO patterns that number drops, and that reduction in bandwidth can literally be multiple orders of magnitude. Worst case would be a consumer-grade 5k RPM SATA drive that can only do 50 IOPS. Assuming a 512-byte block size, that's about 25 KB/sec. Even a top-end SAS drive with 4 kB sectors that could do 300 IOPS can only handle 1.2 MB/sec of random IO operations. Commented Jun 13 at 18:31
  • But this is a case of sequential writing? I feel you are questioning very simple concepts with very advanced concepts. As if we were talking about how gravity is 9.8m/s^2 and you question it with quantum mechanics or relativity. Commented Jun 14 at 12:11
  • @AndrewHenle For a drive writing only 50 kilobytes per second sequentially, I have a bit of space reserved in my garbage bin. Because that's where it belongs. Commented Jun 19 at 11:14
  • @gnasher729 Try benchmarking any spinning drive with pure random small-block IO operations first. And make sure you do it for long enough to eliminate the effects of any drive cache. Commented Jun 19 at 11:16
  • @AndrewHenle Yes, completely random IO will cause issues with a spinning disk. But the OP has both sequential data and an SSD, so that should not really be an issue for this use case. The OP's hardware seems so wildly overspecced that a cursory analysis is likely sufficient. For more demanding use cases you may need a much more thorough analysis, but I don't think this is the right place for that. Commented Jun 19 at 11:47

I don't think a standard format exists for something like this. The correct format is going to depend on a number of factors:

  • After the data is written, what will be reading the data? You don't want the format to be too obscure for downstream processing unless you know those processes will have the horsepower to handle it.
  • What hardware limitations do you have? You've noted performance metrics, and that's a good start, but this device is writing data to something. An SSD drive is faster than an HDD. Writing to RAM is faster than that.
  • The speed of the hardware used to store this data will provide additional constraints, as will the processor speed and available memory.

All of these factors will require you to experiment. My advice is to start with the simplest code writing the simplest format you can think of, and then measure actual performance.

A question like this is not answerable in the form, "do exactly X". Instead, gather all the constraints you have to work with, and run experiments. Track performance metrics for your various attempts, and the file format will reveal itself in time (hopefully less than 25 milliseconds).
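
For the measuring part, a small timing harness around whichever write path is being tried is usually enough to see how far under (or over) the 25 ms budget it lands. A sketch using std::chrono, with a raw binary write standing in for whatever format is under test:

#include <chrono>
#include <fstream>
#include <iostream>
#include <vector>

// Time one event's write so different formats can be compared against the
// 25 ms budget on the actual hardware and file system.
void timedWrite(std::ofstream& out, const std::vector<float>& values)
{
    const auto start = std::chrono::steady_clock::now();
    out.write(reinterpret_cast<const char*>(values.data()),
              values.size() * sizeof(float));
    out.flush();   // keep the flush only if each event must reach the OS immediately
    const auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                        std::chrono::steady_clock::now() - start).count();
    std::cout << "write took " << us << " us\n";
}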

  • Standard format exists, and it's just the format used in memory: an array of floating point values. You know, mantissa, exponent, all that. Hardware limitations don't quite matter for 480 KB/s; any modern write device can handle that, and has been able to since the 2000s. Commented Jun 13 at 5:45
  • @TZubiri - ok, but is that a standard format that solves the OP's problem? That's the key point, in my opinion. There are other standards, too. And writing data is only half the story. Data is also typically read, and then processed. The OP should consider those needs and requirements as well. Instead, my answer describes an approach the OP can use to discover a format that suits their needs. Commented Jun 13 at 13:00
  • The requirement was that the data needs to be written; no other information was given. A filesystem is a key-value store, which is probably what would be implemented with JSON or CSV anyway. What would the header field be? "Id,float"? "float"? Just skip the lossy and problematic float-to-string conversion and store it as is. Welcome to programming, this is literally a hello world level challenge. Commented Jun 14 at 12:18
  • @TZubiri I remember reading a "programming note" by Apple which just stated "the file system is not a database". Commented Jun 19 at 10:33

Use a standard JSON library, and on each event write an array of 3,000 numbers plus whatever metadata you want. That amount is trivial even for a spinning hard drive. JSON is an absolutely standardised and portable format.

If the 25 ms is only an average and events can sometimes come much closer together, create a queue containing the unformatted data.

With 300,000 numbers every 25 ms you would want an SSD drive and would need to do some measuring.
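
As an illustration only, using the nlohmann/json single-header library (one common choice, not the only one), each event could be written as a single JSON line; the field names here are made up:

#include <fstream>
#include <vector>

#include <nlohmann/json.hpp>   // https://github.com/nlohmann/json (single header)

// Write one event as a single line of JSON ("JSON lines" style) so the file
// stays appendable and can be parsed record by record afterwards.
void writeJsonEvent(std::ofstream& out, double timestampSec,
                    const std::vector<float>& values)
{
    nlohmann::json j;
    j["t"] = timestampSec;   // illustrative metadata field
    j["values"] = values;    // serialized as a JSON array of numbers
    out << j.dump() << '\n'; // dump() with no arguments produces compact JSON
}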


We can just write them to a file.

#include <unistd.h>   // for write()

int main(){
    const int SIZE = 1000000;
    static float f[SIZE];   // static so the 4 MB buffer does not live on the stack

    for(int i=0;i<SIZE;i++){f[i]=3.14159f;}

    //we write to stdout (fd 1), so we can pipe to a file with ./program > floatsfile
    //but you can replace with a file descriptor for a separate file
    write(1,f,sizeof(f)); //sizeof(f) is 4*SIZE bytes: float is 32 bits on x86_64
}

Benchmark:

root:~$ time ./a.out > floats2

real    0m0.008s
user    0m0.000s
sys     0m0.007s

File size

root@local:~$ ls -lh floats2
-rw-r--r-- 1 root root 3.9M Jun 13 02:35 floats2

Reading file: (first two floats at position 0)

root@local:~$ od -j 0 -N 8 -f floats2
0000000         3.14159         3.14159
0000010

This is about 300 times more data than you need and it works in less than a millisecond. (Data generation was disabled in the tests, so I was just writing random memory interpreted as floats, which is a bit more realistic and doesn't count the time spent generating data.)

I have a standard 4 GHz processor, so it makes sense that 1 to 4 MB can be written in less than a millisecond.

Reading the file in a program is roughly the reverse operation, read with read(), assign to an array of floats, dereference a pointer and you are good to go.

Note: one gotcha to consider is endianness. If you write the data to disk and read it back on a machine with a different CPU architecture, the byte order may be wrong. To detect this, write a known float at the beginning of the file and compare it against the expected value when reading. This is known as a magic value and is standard in file formats and protocols.
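
A sketch of that reverse operation, assuming the floatsfile produced above and using 3.14159f as the known magic value at the start of the file (error handling kept minimal):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(){
    const float expectedMagic = 3.14159f;   // the known value written first
    float magic;
    float buf[3000];                        // one event's worth of floats

    int fd = open("floatsfile", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    // Check the magic value first: if it does not compare equal, the file was
    // probably written with a different byte order (or a different layout).
    if (read(fd, &magic, sizeof(magic)) != (ssize_t)sizeof(magic) || magic != expectedMagic) {
        fprintf(stderr, "unexpected byte order or file format\n");
        return 1;
    }

    // Then read one event at a time and use the floats directly.
    ssize_t n = read(fd, buf, sizeof(buf));
    printf("read %zd bytes, first value %f\n", n, buf[0]);
    close(fd);
    return 0;
}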

  • A one-shot write test like this is most likely measuring the performance of the page cache and not the actual long-term sustainable bandwidth. Commented Jun 13 at 18:44
  • I'm getting 3.8 seconds for 1 billion float writes with more robust time measuring (clock_gettime(CLOCK_MONOTONIC)) and different data in every float (++ as uints). So 3.8 ms per million floats written, plus whatever fixed cost the function call has, which we saw was less than 10 ms. Commented Jun 14 at 12:05
  • The question is: can whatever method is used handle "huge" datasets of 3,000 floating point numbers written "within milliseconds", actually 40 times per second? Just use clock() / (double) CLOCKS_PER_SEC, which is plenty precise. If 40 times 3,000 numbers were any problem, you would get bored waiting for a billion floats to finish. Commented Jun 19 at 10:37
