Timeline for How to randomly sample a subset of a file

Current License: CC BY-SA 3.0

12 events

when	what		by	license	comment
Dec 28, 2020 at 16:34	comment	added	Sridhar Sarnobat		This is also useful if you're not operating on a "file" as such but a long-running stream, and you need to get output immediately. I'm writing a tool to migrate data between data centers which will run for a long time and I'd like to see a sample of the data but not all of it.
Apr 12, 2019 at 15:17	comment	added	Bruno Bronosky		If you need an exact number, you can always… Run this with a % greater than your need. Count the result. Remove lines matching count mod difference.
Apr 15, 2018 at 18:42	comment	added	Polymerase		This is the best answer: the lines are picked randomly while respecting the chronological order of the original file, in case this is a requirement. In addition, `awk` is more resource-friendly than `shuf`.
Dec 6, 2016 at 18:35	history	edited	Txangel	CC BY-SA 3.0	added 50 characters in body
Dec 6, 2016 at 18:32	comment	added	Txangel		@G-Man The question seems to use getting 10k lines from a million as an example. None of the other answers worked for me (because of the size of the files and hardware limitations), so I propose this as a reasonable compromise. It won't get you exactly 10k lines out of a million, but it might be close enough for most practical purposes. I've clarified it a bit more following your advice. Thanks.
Dec 6, 2016 at 18:26	history	edited	Txangel	CC BY-SA 3.0	added 95 characters in body
Dec 5, 2016 at 21:48	comment	added	G-Man Says 'Reinstate Monica'		P.S.  Simplistic approaches using `$RANDOM` won’t work correctly for files larger than 32767 lines. The statement “Using `$RANDOM` doesn’t reach the entire file” is a bit broad.
Dec 5, 2016 at 21:47	comment	added	G-Man Says 'Reinstate Monica'		If a user wants approximately 1% of the non-blank lines, this is a pretty good answer. But if the user wants an exact number of lines (e.g., 1000 out of a 1000000-line file), this fails. As the answer you got it from says, it yields only a statistical estimate. And do you understand the answer well enough to see that it is ignoring blank lines? This might be a good idea, in practice, but undocumented features are, in general, not a good idea.
Dec 5, 2016 at 21:07	history	suggested	phk	CC BY-SA 3.0	formatting
Dec 5, 2016 at 20:32	review	Suggested edits			completed Dec 5, 2016 at 21:07
Dec 5, 2016 at 20:24	review	First posts			completed Dec 5, 2016 at 20:32
Dec 5, 2016 at 20:23	history	answered	Txangel	CC BY-SA 3.0
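
The answer body is not reproduced in this timeline, so the following is only a minimal sketch of the kind of approach the comments describe: each non-blank line is kept independently with a fixed probability, which preserves the original order and yields an approximate rather than exact count. The function name `sample_lines` and its optional seed argument are hypothetical and not taken from the answer.

```shell
#!/bin/sh
# sample_lines FRACTION [SEED]
# Hypothetical helper (not from the answer): reads stdin and keeps each
# non-blank line with probability FRACTION, in the original order.
sample_lines() {
    awk -v p="$1" -v seed="${2:-}" '
        BEGIN {
            # Seed from the optional argument if given, else from the clock.
            if (seed != "") srand(seed); else srand()
        }
        # Skip empty lines; print the line when a uniform draw in [0,1)
        # falls below the requested fraction.
        !/^$/ && rand() < p + 0
    '
}
```

For example, `sample_lines 0.01 < big.log` would keep roughly 1% of the non-blank lines of a hypothetical `big.log`. Because the decision is made per line as input arrives, the same function can sit in a pipeline on a long-running stream; the number of lines returned varies from run to run.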