26 events
when what by license comment
Jun 8, 2015 at 14:48 comment added jkd Nested parentheses
Oct 3, 2014 at 21:39 comment added Loren Pechtel Another pattern: ()
Jul 13, 2011 at 13:09 vote accept Jeff Atwood
Jul 8, 2011 at 17:53 comment added Jiaaro I think you should add another item for Line length. Code tends to have short (<80 chars) lines separated by line breaks, which is a departure from most written language.
Jun 29, 2011 at 5:50 comment added tylerl for what it's worth, you can generally refine your algorithm on a language-by-language basis -- detecting bash code may be very different than detecting erlang code, but the author will almost always tell you what language he's using by his choice of tags.
Jun 29, 2011 at 4:57 comment added Benoit You won't detect my SELECT DISTINCT name FROM people WHERE id IS NOT NULL.
Jun 28, 2011 at 18:05 comment added Ziv Just for the sake of completeness, in addition to camelCase you should check for underscored_lowercase_names.
Jun 28, 2011 at 17:36 comment added Scott Chamberlain @Ken Bloom that is why there is a requirement to only parse the "Big Ten" by tags, LISP is on page 16 of tags
Jun 28, 2011 at 16:20 history edited Ken Bloom CC BY-SA 3.0
added 308 characters in body
Jun 28, 2011 at 16:13 comment added Ken Bloom None of these heuristics will identify LISP code, but I think we should be editing this list to add other features to the list.
Jun 28, 2011 at 15:46 comment added Ken Bloom @TomWIJ: You can keep track of the frequency of each of these features, and consider some tests a potential false positive if they only happen once. A SpamAssassin-like approach might work.
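One way to read the SpamAssassin-like idea in this comment is as a weighted score in which a feature that fires only once counts for much less than one that fires repeatedly. The sketch below is only an illustration under that assumption: the rule names, patterns, weights, and the `single_hit_factor` parameter are all hypothetical and are not taken from the answer.

```python
import re

# Hypothetical features and weights, chosen only for illustration.
WEIGHTED_RULES = [
    ("call_parens", re.compile(r"\w\("), 1.0),
    ("semicolon_eol", re.compile(r";\s*$", re.MULTILINE), 1.5),
    ("dotted_name", re.compile(r"\w\.\w"), 0.5),
]


def code_score(text, single_hit_factor=0.25):
    """Sum rule weights, discounting rules that match exactly once.

    A single match is treated as a potential false positive (for example
    a typo or a URL), so it contributes only a fraction of its weight.
    """
    score = 0.0
    for _name, pattern, weight in WEIGHTED_RULES:
        hits = len(pattern.findall(text))
        if hits == 0:
            continue
        score += weight * (single_hit_factor if hits == 1 else 1.0)
    return score
```

With these assumed weights, two statements that each end in a semicolon and contain a call score far higher than a sentence that merely mentions a domain name, which triggers `dotted_name` once and is discounted.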
Jun 28, 2011 at 14:24 comment added James P. I'd start off by looking at tags on what languages the question is about to limit what processing is needed. Then use clues like this to form an approximation of where code is situated. Say, if a line has a clue that says it's a piece of code then backtrack to attempt to find the first line of code. Then step after line 3 like a debugger and repeat the process, this time looking forwards until there's some indication that we've reached the end of the code snippet. This way you could handle cases where lines of code are separated by plain text.
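The backtrack-then-scan-forward procedure described in this comment could be sketched roughly as follows. The two predicates are assumptions made up for the example: `has_clue` stands in for whatever strong signals are used, and `could_be_code` is a looser test for neighbouring lines. Neither comes from the answer itself.

```python
def has_clue(line):
    """Hypothetical strong signal: statement punctuation or empty parens."""
    stripped = line.rstrip()
    return stripped.endswith((";", "{", "}")) or "()" in stripped


def could_be_code(line):
    """Hypothetical weak signal: a clue, or leading indentation."""
    return has_clue(line) or line.startswith(("    ", "\t"))


def find_code_blocks(lines):
    """Return (start, end) index pairs, inclusive, of suspected code runs."""
    blocks = []
    i = 0
    while i < len(lines):
        if not has_clue(lines[i]):
            i += 1
            continue
        # Backtrack from the clue to the first plausible line of code,
        # without reaching into a block that was already recorded.
        lower = blocks[-1][1] + 1 if blocks else 0
        start = i
        while start > lower and could_be_code(lines[start - 1]):
            start -= 1
        # Scan forward until a line no longer looks like code.
        end = i
        while end + 1 < len(lines) and could_be_code(lines[end + 1]):
            end += 1
        blocks.append((start, end))
        i = end + 1
    return blocks
```

Because each clue starts its own search, two snippets separated by a paragraph of plain text come back as two separate ranges.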
Jun 28, 2011 at 12:35 comment added David Murdoch Also, you could implement HINTS in the textbox's margin whenever code is detected. Many code editors (Visual Studio w/ ReSharper, DreamWeaver, etc) do something similar when they find errors/warnings/suggestions in code.
Jun 28, 2011 at 11:45 comment added PhiLho 1. Indeed, even if some languages are proud to allow omitting them (JavaScript, Lua, Scala...) or use a different symbol (or none at all, like Python). 2. Some people like to write func (x) (or func( x ) and other variants). But sure the majority omits the space. 3. As pointed out, dot doesn't work well in some cases (URLs, IP addresses, typos). 5. Perhaps more reliable if detected at the start of a (typed) line. Also lines starting with # or two dashes. 6. + and & aren't so uncommon as abbreviations, I think. -- Overall, a good set of suggestions, I just wanted to help to refine them.
Jun 28, 2011 at 11:43 comment added JasonFruit Also think about catching multiple-words-with-more-than-one-hyphen; that would help identify Lispy languages that might not otherwise be caught by these rules.
Jun 28, 2011 at 11:37 comment added PhiLho Add "Usage of $ before non numeric words: $var is common in Perl and PHP (and Ruby?)."
Jun 28, 2011 at 10:48 comment added JoséNunoFerreira and now i remembered this: you could add up the number of "rules" it triggers, to get it more accurate. basically, if you spot "()", a "WHILE" and a "!=", that's probably not three typos, it's code.
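The counting idea in this comment can be illustrated with a small sketch: require several distinct rules to fire before calling something code. The three rules below mirror the comment's own example ("()", a keyword, and "!="); the rule names, the regular expressions, and the `min_rules` threshold are assumptions for illustration only.

```python
import re

# Hypothetical rule set based on the example in the comment.
RULES = {
    "empty_parens": re.compile(r"\(\)"),
    "keyword": re.compile(r"\b(?:WHILE|ELSE|IF|LOOP|BREAK)\b", re.IGNORECASE),
    "not_equal": re.compile(r"!="),
}


def triggered_rules(text):
    """Return the sorted names of all rules that match at least once."""
    return sorted(name for name, pattern in RULES.items() if pattern.search(text))


def looks_like_code(text, min_rules=3):
    """Treat text as code only when enough distinct rules fire together."""
    return len(triggered_rules(text)) >= min_rules
```

An ordinary sentence beginning with "If" still trips the keyword rule, which is exactly why a single hit is not treated as decisive here.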
Jun 28, 2011 at 10:45 comment added JoséNunoFerreira additionally, specific keywords that many languages have could help: WHILE, ELSE, IF, LOOP, BREAK, etc.
Jun 28, 2011 at 10:44 comment added Nobody You could also try to detect camel- or pascal- case words, since those do not appear in regular language, except in typos. Also words such as a_variable_name for instance.
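A rough sketch of the identifier check suggested here, assuming ASCII identifiers and using regular expressions that are illustrative rather than exhaustive (both pattern names are hypothetical):

```python
import re

# camelCase or PascalCase: a lowercase run followed by one or more
# capitalised segments, e.g. getUserName or PascalCase.
CAMEL_OR_PASCAL = re.compile(r"\b[A-Za-z][a-z0-9]+(?:[A-Z][a-z0-9]+)+\b")

# underscored_lowercase_names: lowercase segments joined by underscores.
SNAKE = re.compile(r"\b[a-z][a-z0-9]*(?:_[a-z0-9]+)+\b")


def find_identifiers(text):
    """Return camel/Pascal-case matches followed by snake_case matches."""
    return CAMEL_OR_PASCAL.findall(text) + SNAKE.findall(text)
```

Surnames such as McDonald also match the first pattern, so a hit like this probably deserves a low weight on its own.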
Jun 28, 2011 at 10:29 comment added mmmmmm re the . as a typo - there would be no harm in flagging that as the author ought to edit anyway.
Jun 28, 2011 at 10:27 comment added Tamara Wijsman Tips: 3 has a very low weight, because a dot between words can be the result of a typo. 5 should not match URLs. For 6 the ampersand is also frequently used outside the code context, thus you might also weight that character less. Double check if the highlighter works, because it can highlight non-code text as I sometimes see in Notepad++.
Jun 28, 2011 at 9:43 history made wiki Post Made Community Wiki by Omar Kooheji
Jun 28, 2011 at 9:13 comment added thorsten müller I thought of the syntax highlighter too. This would work even better if it were limited to consecutive keywords: more than 5 keywords in a row, with almost all other words followed by () (= functions) or = (= variable assignments), or coming after type identifiers (declarations).
Jun 28, 2011 at 8:42 history edited Yevgeniy Brikman CC BY-SA 3.0
added 147 characters in body
Jun 28, 2011 at 8:37 history edited Yevgeniy Brikman CC BY-SA 3.0
added 147 characters in body
Jun 28, 2011 at 8:31 history answered Yevgeniy Brikman CC BY-SA 3.0