edited Aug 16, 2018 at 18:05

Similar to @Txangel's probabilistic solution, but approaching 100x faster. Each line is kept independently with probability .01, so the output is roughly 1% of the input:

perl -ne 'print if (rand() < .01)' huge_file.csv > sample.csv

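If you need high performance and an exact sample size, and can live with a sample gap at the end of the file, you can oversample slightly and truncate with head. The following takes 1000 lines from a 1M-line file. The rate .0012 is a little above 1000/1,000,000 = .001, so the sampler almost always produces at least 1000 lines; head stops once it has them, which leaves the tail of the file unsampled: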

perl -ne 'print if (rand() < .0012)' huge_file.csv | head -1000 > sample.csv

Alternatively, chain a second sampling method in place of head.

answered Aug 16, 2018 at 17:57

geotheory