l0b0

Any implementation trying to detect "malicious characters" is flawed, when you look at the combined properties of such an implementation:

  • A "valid" subset of a character set is not so easy to define. Newline is a control character, and you definitely want to allow that in comments. You'd have your work cut out for days or weeks to create a sensible subset of Unicode (and combinations of characters) which could be considered "valid" across the globe.
  • An "invalid" subset is also difficult to define if you're making anything even remotely complex. For example, you don't want literal quotes in SQL, but you also don't want literal ampersands or inequality signs in HTML, or backslashes in JavaScript. If you have a series of input and output languages, the only way to be sure is to escape user input/output for each one, and test that the escaping works.
  • The set is valid only for a single version of a single character set, so it's not future-proof.
  • You still need to test the full character range to see if there are any security holes.
  • If you're not careful with what you accept, you'll end up annoying users, and they will leave. If you're lucky, one in a thousand will file a bug report.
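The escape-for-each-output-language point above can be sketched in Python. This is a minimal illustration, not the answer's own code; `sqlite3` stands in here for any SQL API with parameterized queries:

```python
import html
import json
import sqlite3

user_input = "Robert'); DROP TABLE Students;--"

# HTML: escape at output time, not at input time.
safe_html = html.escape(user_input)  # &, <, >, ", ' become entities

# JavaScript: emit the value as a JSON string literal instead of
# concatenating it into script source.
safe_js = json.dumps(user_input)

# SQL: don't escape by hand at all; let parameterized queries do it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")
conn.execute("INSERT INTO students (name) VALUES (?)", (user_input,))
stored = conn.execute("SELECT name FROM students").fetchone()[0]
assert stored == user_input  # stored verbatim, table still intact
```

The same input gets three different treatments, one per output context; no single "invalid character" list could cover all three.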

I'd go so far as to say that validating allowed characters reduces security, because it encourages sloppy implementation (lack of testing/escaping). If you escape where necessary you can instead just test the "nasty" characters, and if they work, you have pretty much guaranteed that other nasty characters will also be harmless to the system.
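One way to test the "nasty" characters described above is a round-trip check: escape, verify no raw markup survives, and verify that unescaping restores the original. A sketch using Python's standard html module (the sample strings are illustrative, not exhaustive):

```python
import html

# A handful of known-nasty strings; a real test suite would use many more.
NASTY = [
    "Robert'); DROP TABLE Students;--",
    "<script>alert(1)</script>",
    'mixed "quotes" and \\backslashes\\',
    "multi\nline\ninput",
]

for s in NASTY:
    rendered = html.escape(s)
    # No raw markup may survive escaping...
    assert "<" not in rendered and ">" not in rendered
    # ...and unescaping must restore the input exactly.
    assert html.unescape(rendered) == s
```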

All this is of course not to say that some characters are nonsensical in some fields, such as two in a numeric field. But even this is often not trivial:

  • Different languages have different decimal and thousand separators. 1,000 == 1 in much of Europe, and 1'000 is a valid way to write 1000 in some places. You don't want to tell those users their way of writing is wrong.
  • Leading zeros, plus signs, hashes and asterisks are all valid characters in a phone number. Some countries include an inland prefix ("(0)" in Switzerland) which you have to use within the country, and have to drop when dialing with the country code.
  • Contrary to what many sloppy web developers assume, email addresses can contain dashes, plus signs, and a whole lot of other characters which are routinely rejected, as if to prove that the company would do just fine if it weren't for all those pesky customers.
  • Names can contain quote characters (Gerard 't Hooft), numbers (John Doe the 5th), punctuation (John Doe, M.D.), and even SQL: Robert'); DROP TABLE Students;--
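The decimal-separator pitfall above can be made concrete. The helper below is hypothetical (not a real library function): it simply interprets a number string under an explicitly chosen locale convention instead of rejecting it:

```python
def parse_decimal(text: str, decimal_sep: str, group_sep: str) -> float:
    """Hypothetical helper: interpret a locale-formatted number string
    by stripping the grouping separator and normalizing the decimal one."""
    return float(text.replace(group_sep, "").replace(decimal_sep, "."))

assert parse_decimal("1,000", ".", ",") == 1000.0  # en-US: comma groups
assert parse_decimal("1,000", ",", ".") == 1.0     # de-DE: decimal comma
assert parse_decimal("1'000", ",", "'") == 1000.0  # de-CH: apostrophe groups
```

The same string "1,000" is a thousand or one depending on locale, which is exactly why a character allowlist can't answer the question by itself.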
