Timeline for How to randomly sample a subset of a file
Current License: CC BY-SA 3.0
12 events
| when | what | | by | license | comment |
|---|---|---|---|---|---|
| Dec 28, 2020 at 16:34 | comment | added | Sridhar Sarnobat | This is also useful if you're not operating on a "file" as such but a long-running stream, and you need to get output immediately. I'm writing a tool to migrate data between data centers which will run for a long time and I'd like to see a sample of the data but not all of it. | |
| Apr 12, 2019 at 15:17 | comment | added | Bruno Bronosky | If you need an exact number, you can always… Run this with a % greater than your need. Count the result. Remove lines matching count mod difference. | |
| Apr 15, 2018 at 18:42 | comment | added | Polymerase | This is the best answer: the lines are picked randomly while respecting the chronological order of the original file, in case this is a requirement. In addition, awk is more resource-friendly than shuf. | |
| Dec 6, 2016 at 18:35 | history | edited | Txangel | CC BY-SA 3.0 | added 50 characters in body |
| Dec 6, 2016 at 18:32 | comment | added | Txangel | @G-Man The question seems to talk about getting 10k lines from a million as an example. None of the other answers worked for me (because of the size of the files and hardware limitations), so I propose this as a reasonable compromise. It won't get you exactly 10k lines out of a million, but it might be close enough for most practical purposes. I've clarified it a bit more following your advice. Thanks. | |
| Dec 6, 2016 at 18:26 | history | edited | Txangel | CC BY-SA 3.0 | added 95 characters in body |
| Dec 5, 2016 at 21:48 | comment | added | G-Man Says 'Reinstate Monica' | P.S. Simplistic approaches using $RANDOM won’t work correctly for files larger than 32767 lines. The statement “Using $RANDOM doesn’t reach the entire file” is a bit broad. | |
| Dec 5, 2016 at 21:47 | comment | added | G-Man Says 'Reinstate Monica' | If a user wants approximately 1% of the non-blank lines, this is a pretty good answer. But if the user wants an exact number of lines (e.g., 1000 out of a 1000000-line file), this fails. As the answer you got it from says, it yields only a statistical estimate. And do you understand the answer well enough to see that it is ignoring blank lines? This might be a good idea, in practice, but undocumented features are, in general, not a good idea. | |
| S Dec 5, 2016 at 21:07 | history | suggested | phk | CC BY-SA 3.0 | formatting |
| Dec 5, 2016 at 20:32 | review | Suggested edits | |||
| S Dec 5, 2016 at 21:07 | |||||
| Dec 5, 2016 at 20:24 | review | First posts | |||
| Dec 5, 2016 at 20:32 | |||||
| Dec 5, 2016 at 20:23 | history | answered | Txangel | CC BY-SA 3.0 | |
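
The comments in this timeline describe the approach under discussion: each non-blank line is kept independently with a fixed probability, so the output preserves the original order, works on a stream, and yields only approximately the requested fraction. The answer's own awk command is not reproduced on this page, so the following is just a minimal Python sketch of that idea, not the author's code. The function name `sample_lines`, its parameters, and the `seed` argument are hypothetical and chosen for illustration.

```python
import random
from typing import Iterable, Iterator, Optional


def sample_lines(lines: Iterable[str], fraction: float,
                 seed: Optional[int] = None) -> Iterator[str]:
    """Yield each non-blank line with probability `fraction`, in input order.

    Each line is decided on independently as it arrives, so this also
    works on a long-running stream and needs no knowledge of its length.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    # Assumption: the seed exists only to make demonstrations repeatable.
    rng = random.Random(seed)
    for line in lines:
        if not line.strip():
            continue  # blank lines are skipped, as one comment points out
        if rng.random() < fraction:
            yield line


if __name__ == "__main__":
    # Hypothetical input: one million numbered lines, sampled at about 1%.
    data = [f"line {i}\n" for i in range(1_000_000)]
    picked = list(sample_lines(data, 0.01, seed=0))
    print(len(picked))  # near 10,000, but not guaranteed to be exactly that
```

Because every line is an independent coin flip, the count varies from run to run. If an exact number of lines is required, a different technique such as reservoir sampling would be needed, at the cost of not emitting anything until the input ends.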