geotheory

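Similar to @Txangel's probabilistic solution, but approaching 100x faster. Each line is kept independently with probability .01, so the output is roughly 1% of the file rather than an exact number of lines: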

perl -ne 'print if (rand() < .01)' huge_file.csv > sample.csv
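
If the CSV has a header row, the one-liner above will usually drop it. A minimal sketch of one way to keep it, assuming the header is exactly the first line (the generated `huge_file.csv` and its `id,value` columns are hypothetical stand-ins so the example runs on its own):

```shell
# Build a hypothetical CSV: one header row plus 100,000 data rows.
{ echo "id,value"; seq 1 100000 | sed 's/$/,x/'; } > huge_file.csv

# $. is Perl's current input line number: always print line 1 (the header),
# and keep every other line with probability 0.01.
perl -ne 'print if ($. == 1 || rand() < .01)' huge_file.csv > sample.csv

head -n 1 sample.csv   # the header row
wc -l < sample.csv     # header plus roughly 1000 rows; varies from run to run
```

The sample size is still only approximate here; the header check just adds one guaranteed line on top of the random draw.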

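If you need high performance and an exact sample size, and can accept that lines near the end of the file will never be sampled, oversample slightly and truncate with head. The following takes 1000 lines from a 1-million-line file; the rate .0012 sits a little above 1000/1,000,000 = .001 so that at least 1000 lines almost always survive before head cuts the stream off: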

perl -ne 'print if (rand() < .0012)' huge_file.csv | head -1000 > sample.csv

Alternatively, chain a second sampling method in place of head, which avoids the gap at the end of the file.
