Timeline for Simple method for reliably detecting code in text?

Current License: CC BY-SA 3.0

26 events

when	what		by	license	comment
Jun 8, 2015 at 14:48	comment	added	jkd		Nested parenthesis
Oct 3, 2014 at 21:39	comment	added	Loren Pechtel		Another pattern: ()
Jul 13, 2011 at 13:09	vote	accept	Jeff Atwood
Jul 8, 2011 at 17:53	comment	added	Jiaaro		I think you should add another item for Line length. Code tends to have short (<80 chars) lines separated by line breaks, which is a departure from most written language.
Jun 29, 2011 at 5:50	comment	added	tylerl		for what it's worth, you can generally refine your algorithm on a language-by-language basis -- detecting bash code may be very different than detecting erlang code, but the author will almost always tell you what language he's using by his choice of tags.
Jun 29, 2011 at 4:57	comment	added	Benoit		You won't detect my `SELECT DISTINCT name FROM people WHERE id IS NOT NULL`.
Jun 28, 2011 at 18:05	comment	added	Ziv		Just for the sake of completeness, in addition to camelCase you should check for underscored_lowercase_names.
Jun 28, 2011 at 17:36	comment	added	Scott Chamberlain		@Ken Bloom that is why there is a requirement to only parse the "Big Ten" by tags, LISP is on page 16 of tags
Jun 28, 2011 at 16:20	history	edited	Ken Bloom	CC BY-SA 3.0	added 308 characters in body
Jun 28, 2011 at 16:13	comment	added	Ken Bloom		None of these heuristics will identify LISP code, but I think we should be editing this list to add other features to the list.
Jun 28, 2011 at 15:46	comment	added	Ken Bloom		@TomWIJ: You can keep track of the frequency of each of these features, and consider some tests a potential false positive if they only happen once. A SpamAssassin-like approach might work.
Jun 28, 2011 at 14:24	comment	added	James P.		I'd start off by looking at tags on what languages the question is about to limit what processing is needed. Then use clues like this to form an approximation of where code is situated. Say, if a line has a clue that says it's a piece of code then backtrack to attempt to find the first line of code. Then step after line 3 like a debugger and repeat the process, this time looking forwards until there's some indication that we've reached the end of the code snippet. This way you could handle cases where lines of code are separated by plain text.
Jun 28, 2011 at 12:35	comment	added	David Murdoch		Also, you could implement HINTS in the textbox's margin whenever code is detected. Many code editors (Visual Studio w/ ReSharper, DreamWeaver, etc) do something similar when they find errors/warnings/suggestions in code.
Jun 28, 2011 at 11:45	comment	added	PhiLho		1. Indeed, even if some languages are proud to allow omitting them (JavaScript, Lua, Scala...) or use a different symbol (or none at all, like Python). 2. Some people like to write func (x) (or func( x ) and other variants). But sure the majority omits the space. 3. As pointed out, dot doesn't work well in some cases (URLs, IP addresses, typo). 5. Perhaps more reliable if detected at the start of a (typed) line. Also lines starting with # or two dashes. 6. + and & aren't so uncommon as abbreviations, I think. -- Overall, a good set of suggestions, I just wanted to help to refine them.
Jun 28, 2011 at 11:43	comment	added	JasonFruit		Also think about catching multiple-words-with-more-than-one-hyphen; that would help identify Lispy languages that might not otherwise be caught by these rules.
Jun 28, 2011 at 11:37	comment	added	PhiLho		Add "Usage of $ before non numeric words: $var is common in Perl and PHP (and Ruby?)."
Jun 28, 2011 at 10:48	comment	added	JoséNunoFerreira		and now i remembered this: you could add up the number of "rules" it triggers, to get it more accurate. basically, if you spot "()", a "WHILE" and a "!=", that's probably not three typos, it's code.
Jun 28, 2011 at 10:45	comment	added	JoséNunoFerreira		additionally, specific keywords that many languages have could help: WHILE, ELSE, IF, LOOP, BREAK, etc.
Jun 28, 2011 at 10:44	comment	added	Nobody		You could also try to detect camel- or pascal- case words, since those do not appear in regular language, except in typos. Also words such as `a_variable_name` for instance.
Jun 28, 2011 at 10:29	comment	added	mmmmmm		re the . as a typo - there would be no harm in flagging that as the author ought to edit anyway.
Jun 28, 2011 at 10:27	comment	added	Tamara Wijsman		Tips: 3 has a very low weight, because a dot between words can be the result of a typo. 5 should not match URLs. For 6 the ampersand is also frequently used outside the code context, so you might also weight that character less. Double check if the highlighter works, because it can highlight non-code text as I sometimes see in Notepad++.
Jun 28, 2011 at 9:43	history	made wiki			Post Made Community Wiki by Omar Kooheji
Jun 28, 2011 at 9:13	comment	added	thorsten müller		I thought of the syntax highlighter too. This would work even better if it were limited to consecutive key words. If there are more than 5 key words in a row and almost all other words are followed by () (=functions) or = (= variable assignments) or come after type identifiers (declarations)
Jun 28, 2011 at 8:42	history	edited	Yevgeniy Brikman	CC BY-SA 3.0	added 147 characters in body
Jun 28, 2011 at 8:37	history	edited	Yevgeniy Brikman	CC BY-SA 3.0	added 147 characters in body
Jun 28, 2011 at 8:31	history	answered	Yevgeniy Brikman	CC BY-SA 3.0
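
Several of the comments above suggest counting how many heuristics a line triggers rather than trusting any single one (the "SpamAssassin-like approach"). A minimal sketch of that idea follows. The feature names, regular expressions, weights, and threshold are all hypothetical, chosen only for illustration; they are not taken from the answer and would need tuning against real posts.

```python
import re

# Hypothetical features and weights, for illustration only.
# Each entry maps a feature name to (compiled pattern, weight).
FEATURES = {
    # camelCase identifiers such as doSomething
    "camel_case": (re.compile(r"\b[a-z]+[A-Z][A-Za-z]*\b"), 1.0),
    # underscored_lowercase_names
    "snake_case": (re.compile(r"\b[a-z]+(?:_[a-z0-9]+)+\b"), 1.0),
    # a word character directly followed by "(" as in func(x)
    "call_parens": (re.compile(r"\w\("), 1.0),
    # $ before a non-numeric word, as in Perl or PHP
    "dollar_var": (re.compile(r"\$[A-Za-z_]\w*"), 1.0),
    # common keywords; weighted lower because they also occur in prose
    "keyword": (re.compile(r"\b(?:while|else|if|loop|break)\b", re.IGNORECASE), 0.5),
    # comparison operators
    "comparison": (re.compile(r"!=|=="), 1.0),
}


def score_line(line):
    """Sum the weights of every feature that matches the line at least once."""
    return sum(weight for pattern, weight in FEATURES.values() if pattern.search(line))


def looks_like_code(line, threshold=2.0):
    """Flag a line only when several features agree (threshold is an assumption)."""
    return score_line(line) >= threshold
```

With these assumed weights, `looks_like_code("while (x != y) doSomething();")` is true because four features fire at once, while `"$var is common in Perl"` triggers only one feature and stays below the threshold. As Benoit's comment predicts, the SQL example `SELECT DISTINCT name FROM people WHERE id IS NOT NULL` matches none of these features; per-language patterns selected by the question's tags would be one way to address that.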
