Timeline for Simple method for reliably detecting code in text?
Current License: CC BY-SA 3.0
26 events
| when | what | | by | license | comment |
|---|---|---|---|---|---|
| Jun 8, 2015 at 14:48 | comment | added | jkd | Nested parenthesis | |
| Oct 3, 2014 at 21:39 | comment | added | Loren Pechtel | Another pattern: () | |
| Jul 13, 2011 at 13:09 | vote | accept | Jeff Atwood | ||
| Jul 8, 2011 at 17:53 | comment | added | Jiaaro | I think you should add another item for Line length. Code tends to have short (<80 chars) lines separated by line breaks, which is a departure from most written language. | |
| Jun 29, 2011 at 5:50 | comment | added | tylerl | for what it's worth, you can generally refine your algorithm on a language-by-language basis -- detecting bash code may be very different than detecting erlang code, but the author will almost always tell you what language he's using by his choice of tags. | |
| Jun 29, 2011 at 4:57 | comment | added | Benoit | You won't detect my SELECT DISTINCT name FROM people WHERE id IS NOT NULL. | |
| Jun 28, 2011 at 18:05 | comment | added | Ziv | Just for the sake of completeness: in addition to camelCase, you should check for underscored_lowercase_names. | |
| Jun 28, 2011 at 17:36 | comment | added | Scott Chamberlain | @Ken Bloom that is why there is a requirement to only parse the "Big Ten" by tags, LISP is on page 16 of tags | |
| Jun 28, 2011 at 16:20 | history | edited | Ken Bloom | CC BY-SA 3.0 | added 308 characters in body |
| Jun 28, 2011 at 16:13 | comment | added | Ken Bloom | None of these heuristics will identify LISP code, but I think we should be editing this list to add other features to the list. | |
| Jun 28, 2011 at 15:46 | comment | added | Ken Bloom | @TomWIJ: You can keep track of the frequency of each of these features, and consider some tests a potential false positive if they only happen once. A SpamAssassin-like approach might work. | |
| Jun 28, 2011 at 14:24 | comment | added | James P. | I'd start off by looking at the tags to see what languages the question is about, to limit what processing is needed. Then use clues like this to form an approximation of where code is situated. Say, if a line has a clue that says it's a piece of code, then backtrack to attempt to find the first line of code. Then step after line 3 like a debugger and repeat the process, this time looking forwards until there's some indication that we've reached the end of the code snippet. This way you could handle cases where lines of code are separated by plain text. | |
| Jun 28, 2011 at 12:35 | comment | added | David Murdoch | Also, you could implement HINTS in the textbox's margin whenever code is detected. Many code editors (Visual Studio w/ ReSharper, DreamWeaver, etc) do something similar when they find errors/warnings/suggestions in code. | |
| Jun 28, 2011 at 11:45 | comment | added | PhiLho | 1. Indeed, even if some languages are proud to allow omitting them (JavaScript, Lua, Scala...) or use a different symbol (or none at all, like Python). 2. Some people like to write func (x) (or func( x ) and other variants). But sure, the majority omits the space. 3. As pointed out, dot doesn't work well in some cases (URLs, IP addresses, typos). 5. Perhaps more reliable if detected at the start of a (typed) line. Also lines starting with # or two dashes. 6. + and & aren't so uncommon as abbreviations, I think. -- Overall, a good set of suggestions, I just wanted to help to refine them. | |
| Jun 28, 2011 at 11:43 | comment | added | JasonFruit | Also think about catching multiple-words-with-more-than-one-hyphen; that would help identify Lispy languages that might not otherwise be caught by these rules. | |
| Jun 28, 2011 at 11:37 | comment | added | PhiLho | Add "Usage of $ before non numeric words: $var is common in Perl and PHP (and Ruby?)." | |
| Jun 28, 2011 at 10:48 | comment | added | JoséNunoFerreira | and now i remembered this: you could add up the number of "rules" it triggers, to get it more accurate. basically, if you spot "()", a "WHILE" and a "!=", that's probably not three typos, it's code. | |
| Jun 28, 2011 at 10:45 | comment | added | JoséNunoFerreira | additionally, specific keywords that many languages have could help: WHILE, ELSE, IF, LOOP, BREAK, etc. | |
| Jun 28, 2011 at 10:44 | comment | added | Nobody | You could also try to detect camel- or Pascal-case words, since those do not appear in regular language, except in typos. Also words such as a_variable_name for instance. | |
| Jun 28, 2011 at 10:29 | comment | added | mmmmmm | re the . as a typo - there would be no harm in flagging that as the author ought to edit anyway. | |
| Jun 28, 2011 at 10:27 | comment | added | Tamara Wijsman | Tips: 3 has a very low weight, because a dot between words can be the result of a typo. 5 should not match URLs. For 6, the ampersand is also frequently used outside the code context, thus you might also weight that character less. Double check if the highlighter works, because it can highlight non-code text, as I sometimes see in Notepad++. | |
| Jun 28, 2011 at 9:43 | history | made wiki | Post Made Community Wiki by Omar Kooheji | ||
| Jun 28, 2011 at 9:13 | comment | added | thorsten müller | I thought of the syntax highlighter too. This would work even better if it were limited to consecutive key words. If there are more than 5 key words in a row and almost all other words are followed by () (= functions) or = (= variable assignments) or come after type identifiers (declarations) | |
| Jun 28, 2011 at 8:42 | history | edited | Yevgeniy Brikman | CC BY-SA 3.0 | added 147 characters in body |
| Jun 28, 2011 at 8:37 | history | edited | Yevgeniy Brikman | CC BY-SA 3.0 | added 147 characters in body |
| Jun 28, 2011 at 8:31 | history | answered | Yevgeniy Brikman | CC BY-SA 3.0 | |
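
Several of the comments above suggest adding up the rules a line triggers and giving weak clues (a dot between words, common keywords) a low weight. A minimal Python sketch of that idea follows. The rule names, regular expressions, weights, and threshold are all hypothetical values picked for illustration; they are not taken from the answer and would need tuning against real posts.

```python
import re

# Hypothetical rules: (name, pattern, weight). Weights are illustrative only.
RULES = [
    # identifier immediately followed by parentheses, one nesting level allowed
    ("call_parens", re.compile(r"\b\w+\((?:[^()]|\([^()]*\))*\)"), 2.0),
    ("camel_case", re.compile(r"\b[a-z]+(?:[A-Z][a-z0-9]+)+\b"), 1.5),
    ("snake_case", re.compile(r"\b[a-z0-9]+(?:_[a-z0-9]+)+\b"), 1.5),
    # $ before a non-numeric word, as in Perl or PHP
    ("dollar_var", re.compile(r"\$[A-Za-z_]\w*"), 1.5),
    # common keywords also occur in prose, so they count for little
    ("keyword", re.compile(r"\b(?:while|else|if|loop|break)\b", re.IGNORECASE), 0.5),
    ("operator", re.compile(r"!=|==|<=|>=|&&|\|\|"), 1.0),
    # upper-case SELECT ... FROM on one line
    ("sql", re.compile(r"\bSELECT\b.+\bFROM\b"), 2.0),
    # a dot between words may be a typo or a URL, hence the very low weight
    ("dotted", re.compile(r"\b[A-Za-z_]\w*\.[A-Za-z_]\w*\b"), 0.25),
]


def score_line(line):
    """Return (total weight, names of the rules that matched) for one line."""
    hits = [name for name, pattern, _ in RULES if pattern.search(line)]
    total = sum(weight for name, _, weight in RULES if name in hits)
    return total, hits


def looks_like_code(line, threshold=2.0):
    """Flag a line as code once the summed rule weights reach the threshold."""
    total, _ = score_line(line)
    return total >= threshold


if __name__ == "__main__":
    samples = [
        "while (x != 0) { count_items(x); }",
        "I think you should add another item for line length.",
        "$var is common in Perl and PHP",
    ]
    for sample in samples:
        print(looks_like_code(sample), score_line(sample), sample)
```

Each rule counts at most once per line, so the score reflects how many different kinds of clue appear rather than how often one of them repeats. A single weak clue, such as one `$var`, stays below the threshold in this sketch, while a combination of a call, an operator, and a keyword clears it.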