back2dos

I think you should first distinguish between code that is already (sufficiently) formatted and only needs to be designated as code, and code that is (too) poorly formatted and needs manual formatting anyway.

Formatted code has line breaks and indentation. That is: if a line is preceded by a single line break, you have a good candidate. If it has leading whitespace on top of that, you have a very good candidate.

Normal text uses two line breaks, or two spaces and a line break, for formatting, so there's a clear criterion for distinction.

In LISP code you will not find semicolons, in Ruby code you may not find parentheses, and in pseudo-code you might not find much at all. But in any (non-esoteric) language you will find decent code to be formatted with line breaks and indentation. There's nothing as universal as that, because in the end code is written to be read by humans.

So first, search for potential lines of code. Also, lines of code usually come in groups. If you have one, there's a good chance that the one above or below is a line of code as well.
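
The candidate search described above could be sketched roughly as follows. This is only an illustration under some assumptions: the function names `candidate_scores` and `group_candidates` are made up here, the 0/1/2 scoring is an arbitrary encoding of "good" and "very good" candidates, and a line is pulled into a group only when the line right below it is a candidate.

```python
def candidate_scores(text):
    """Score each line: 0 = not a candidate, 1 = good, 2 = very good.

    Hypothetical helper: a line is a candidate when the previous line is
    non-blank (a single line break) and does not end in two spaces
    (which would be a prose-style hard break).
    """
    lines = text.split("\n")
    scores = []
    for i, line in enumerate(lines):
        if not line.strip():
            scores.append(0)
            continue
        prev_nonblank = i > 0 and lines[i - 1].strip() != ""
        prose_break = i > 0 and lines[i - 1].endswith("  ")
        score = 0
        if prev_nonblank and not prose_break:
            score = 1
            if line[0] in " \t":  # leading whitespace: very good candidate
                score = 2
        scores.append(score)
    return scores


def group_candidates(text):
    """Return runs of adjacent line indices that are likely code.

    A non-blank line directly above a candidate joins the run too,
    since lines of code usually come in groups.
    """
    lines = text.split("\n")
    scores = candidate_scores(text)
    groups, current = [], []
    for i, line in enumerate(lines):
        next_is_candidate = i + 1 < len(lines) and scores[i + 1] > 0
        if line.strip() and (scores[i] > 0 or next_is_candidate):
            current.append(i)
        elif current:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups
```

Called on a post whose body is prose, a blank line, two lines of code, a blank line, and more prose, `group_candidates` would return a single group holding the indices of the two code lines.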

Once you have singled out potential lines of code, you can check them against quantifiable criteria and pick some threshold:

  • frequency of non-word characters
  • frequency of identifiers: very short words or very long words with CamelCase or under_score style
  • repetition of uncommon words
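
As a rough sketch, the three criteria could be turned into numbers like this. Everything specific here is an assumption made for illustration: the tiny stop list, the length limits for "very short" and "very long" words, the way the three ratios are simply added up, and the threshold of 0.5 would all need tuning against real posts.

```python
import re
from collections import Counter

# Tiny stand-in stop list (an assumption); a real one would be much larger.
COMMON_WORDS = {
    "I", "a", "an", "as", "at", "be", "by", "do", "if", "in", "is", "it",
    "no", "of", "on", "or", "so", "to", "we", "the", "and", "that", "for",
    "you", "with", "this", "are", "not", "but", "have", "was",
}
WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# A camelCase hump, or an underscore between two alphanumerics.
STYLED = re.compile(r"[a-z0-9][A-Z]|[A-Za-z0-9]_[A-Za-z0-9]")


def is_common(word):
    return word in COMMON_WORDS or word.lower() in COMMON_WORDS


def non_word_ratio(line):
    """Share of characters that are neither alphanumeric, '_' nor space."""
    text = line.strip()
    if not text:
        return 0.0
    symbols = sum(
        1 for ch in text if not (ch.isalnum() or ch.isspace() or ch == "_")
    )
    return symbols / len(text)


def identifier_ratio(line, long_len=6):
    """Share of words that are very short, or long and CamelCase/under_score."""
    words = WORD.findall(line)
    if not words:
        return 0.0
    hits = 0
    for w in words:
        short = len(w) <= 2 and not is_common(w)
        styled = len(w) >= long_len and STYLED.search(w) is not None
        if short or styled:
            hits += 1
    return hits / len(words)


def repetition_ratio(lines):
    """Share of words that are uncommon and occur more than once."""
    words = [w for line in lines for w in WORD.findall(line)]
    if not words:
        return 0.0
    counts = Counter(w for w in words if len(w) > 2 and not is_common(w))
    repeated = sum(n for n in counts.values() if n > 1)
    return repeated / len(words)


def code_score(lines):
    """Add up the three criteria over a group of candidate lines."""
    lines = [line for line in lines if line.strip()]
    if not lines:
        return 0.0
    symbols = sum(map(non_word_ratio, lines)) / len(lines)
    idents = sum(map(identifier_ratio, lines)) / len(lines)
    return symbols + idents + repetition_ratio(lines)


def looks_like_code(lines, threshold=0.5):
    # The threshold is a guess, not a measured value.
    return code_score(lines) >= threshold
```

A group such as a short C-style loop scores far higher than two sentences of ordinary prose, which is all the threshold has to separate once the candidate lines have been singled out.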

Also, now that there are Programmers and CS, Stack Overflow's scope is clearly narrowed down. One might consider marking all language tags as languages. Then, when posting, you'd be asked to pick at least one language tag, pick the language-agnostic tag, or explicitly omit a language tag.

In the first case you know which languages to look for; in the second, you might want to look for pseudo-code; and in the last, there probably won't be any code, because the question is about some technology, framework or the like.

Post Made Community Wiki by back2dos