| Full source and
copyright held at:
http://www.technologyreview.com/articles/04/08/wo_garfinkel080404.asp?p=0
Technology Review
Fingerprinting Your Files
"Hash" functions identify digital content with mathematical certainty-but
is that enough to foil the hackers?
By Simson Garfinkel
The Net Effect 8/04/2004
Three cryptographers at Stanford University recently came up with
a clever solution to the persistent problem of identity theft on
the Internet. Wily hackers in Russia, China, and other countries
send out piles of e-mail messages looking like they came from some
financial institution such as Citibank or Paypal. Millions of consumers
get these messages, which have embedded HTML links in them that
take the unsuspecting recipient to look-alike websites run in faraway
places. You're prompted to enter a username and password and then-wham-the
hacker has the keys to your bank account.
But good usernames and passwords typed at bad websites isn't the
only such threat that consumers face. A potentially larger problem
is that many people use the same username and password combination
at multiple sites. This makes memorization easier, but it means
that an unscrupulous website operator can take a list of usernames
and passwords from, say, an Internet sweepstakes site and use it
to try to break into online bank accounts.
So Stanford cryptographers Blake Ross, Dan Boneh, and John Mitchell
have designed a clever plug-in for Internet Explorer that solves
this problem by scrambling what you type into the password field
so every website sees a different password-a password that's based
both on what you type and on the domain of the website itself.
Now, lots of people use some variant on this strategy. Their Hotmail
password might be "nosmis-hotmail" while their Yahoo! Personals
password is "nosmis-Yahoo!" But any strategy like this is pretty
simple to decipher. The password scrambling method that the Stanford
trio has devised is based on a mathematical function called a cryptographic
hash-a kind of one-way function that transforms what the user types
into a jumble of numbers and letters in a way that cannot be reversed.
Because the Stanford system calculates the cryptographic hash of
both the website's domain and the user's password, the hacker gets
different passwords than the legitimate ones.
One company that's using cryptographic hashes in a very public
way is Yahoo! Last year, Yahoo! redesigned the login process to
its website to make it sniff-proof. The standard way to do this
is to use encryption. But encryption can be slow-especially when
you are running one of the most popular sites on the Internet.
So what Yahoo! did instead was to modify its login page to use
a so-called challenge-response system based on a cryptographic hash.
When you try to log in, Yahoo!'s server downloads to your browser
a cryptographic hash function written in JavaScript. Along with
this function is a "challenge"-a short sequence of letters and numbers.
When you type your password into the login screen, your browser
takes your password, appends these characters provided by Yahoo!,
and calculates the cryptographic hash of the resulting string. The
browser then sends the resulting value back to Yahoo!, no encryption
needed. Even if you are at a cybercafe having your Web traffic sniffed
by Belgium hackers, there's no way for the bad guys to take the
resulting hash value and derive your original password.
This clever "challenge-response" system is also at the base of
the Mobil Speedpass system: it's what makes the Speedpass radio
frequency identification (RFID) tag so difficult to clone. Other
RFID systems don't use challenge-response, which makes attacking
them comparatively easy.
But what is this cryptographic hash function, anyway?
The Incredibly Useful Hash
Cryptographic hash functions are one of the fundamental building
blocks of today's digital economy. Nevertheless they remain in many
ways a mystery-both to the cryptographers who create them, and to
the general public that uses them every day.
Hash functions are sometimes called "fingerprint functions" because
they can be used to create a unique "fingerprint" of a digital file.
The fingerprints are usually 128-bit or 160-bit numbers that are
displayed as a sequence of hexadecimal digits. The fingerprint of
my name using the MD5 system, for example, is c55bbe0f3ba258f5b1cb6d5b62b0b360.
Hash functions are designed so that, in theory at least, no two
files will ever hash to the same value.
The MD5 is: Most of the hash functions used today are based on
a technique developed by MIT professor Ron Rivest in the 1980s.
(Rivest is probably best known for being the "R" in the RSA encryption
algorithm, the public key encryption algorithm that's built into
practically every Web browser.) At the time, Rivest and other mathematicians
were working out the details of the basic cryptographic operations
that we now take for granted. The hash functions were envisioned
as a kind of cryptographic compression system-a way to take a large
file and crunch it down to a short string of letters and numbers.
The idea was to use these fingerprints as a kind of surrogate for
the files themselves. Instead of digitally signing the entire file,
Rivest and others reasoned, you could digitally sign the hash. Because
public-key cryptography involves a lot of heavy-duty math, hash
functions make it almost as fast to sign an extremely long file
as to sign a short file.
One of the most basic things that you can do with a hash function
is to find out if a file has changed: just calculate the hash of
a file and write it down. Later on you calculate the hash again.
If the hash hasn't changed, then the odds are overwhelming that
the file hasn't changed either.
For example, say that you keep the finances of your small business
using QuickBooks and you want to go on vacation for a few days:
people need to use your computer but you want to make sure that
nobody modifies the QuickBooks data. One simple thing that you can
do is compute the cryptographic hash of the file before you leave
and write the number on an index card. When you come back from vacation,
just re-compute the hash. If the two values don't match, you know
that the file has been tampered with.
Of course, you don't need to stop with just one file. You could
compute the cryptographic hash of every file on your computer and
put them all into a new file-call that file hashes.txt. You could
then compute the hash of hashes.txt and write this "fingerprint"
on your note card. Repeat the process when you come back from vacation
and you'll have a fast way of knowing if any file on your entire
computer has changed. (You won't have any way of knowing which file
has changed, but that's a different problem.)
This idea of computing the hash of a hash is the basis of an intrusion
detection system called Tripwire that Purdue University computer
science professor Gene Spafford and his graduate student Gene Kim
invented back in the early 1990s. (Spafford and I have co-authored
five books on computer science.) Today, many different programs
use this Tripwire approach to assure the integrity of computer files
and databases.
Computing hashes of hashes is also the basis of a secure timestamp
service invented by Stuart Haber and Scott Stornetta while the two
were at Bellcore in 1990. The service, called Surety, makes it possible
to generate a cryptographically secure and unforgeable proof that
a given document, photograph, or other file existed at a particular
time on a particular date and that it hasn't been changed since.
The Surety technique works by computing a "hash tree" based on
the hash codes of every document being time-stamped. The root of
the tree is then published in a well-known location-it could, for
example, be printed in a classified advertisement in the New York
Times. You can prove that your document existed on the day in question
by showing that your document's fingerprint was needed to generate
the fingerprint-of-fingerprints that appeared in the newspaper.
Other companies and even the U.S. Postal Service have since created
their own electronic time-stamp service. But all of these systems
rely on an organization that acts as a "trusted third-party" that
in effect signs your document using their private key. The problem
with this approach is that the third-party needs to be completely
trustworthy: if that third-party decides to create a signature with
the wrong date, or some hacker manages to steal the third-party's
private key, there is no way to tell a fraudulent signature from
a valid one. It's also possible to create fraudulent Surety signatures,
of course, but you would need to either go back in time and change
what was printed in The New York Times, or else travel all over
the world, find every copy that was printed, and change the old
fingerprint-of-fingerprints to the new one.
How Hash Functions Work
So that's why hash functions are helpful. Now, let's see what they
actually look like. Among the most widely used hash functions today
are the so-called MD5 (for Message Digest #5). MD5 produces a hash
that is 128 bits long and that is commonly written as sequence of
32 hexadecimal (base 16) digits. If you were to take my name and
process it with MD5, you would get this seemingly random string:
c55bbe0f3ba258f5b1cb6d5b62b0b360
Or, to state it with more mathematical formality:
MD5("Simson Garfinkel")= c55bbe0f3ba258f5b1cb6d5b62b0b360
Each of those hexadecimal characters represents 4 bits; the MD5
value of my name is actually: 1100010101011011101111100000111100111011101
0001001011000111101011011000111001011011011 010101101101100010101100001011001101100000
Most people work with the hexadecimal representation because it's
pretty easy to look two hashes and tell if they are the same or
different.
MD5 works by splitting the file up into lots of small pieces, and
then taking each of those chunks and performing hundreds of mathematical
operations that shuffle, invert, transpose, and otherwise process
the bits into an unrecognizable mess. The word "unrecognizable"
in this description is key. The fundamental requirement of a good
hash function is that it should be impossible to predict the fingerprint
of a file without actually going to the effort of computing that
fingerprint -there must be no short-cuts. If there were, you might
be able to run the hash function backwards and create a file that
had a specific hash-for example, the hash of another file. Indeed,
the entire security of hash functions falls apart utterly if it
is possible to generate two files that have the same hash.
The beauty of the hash function is that even a tiny modification
to the input produces a dramatic change in the output. Mathematically,
the functions are designed so that every bit in the output will
have a 50 percent chance of changing for every single bit changed
in the input.
Let's look at another MD5 hash, this one of a slightly different
representation of my name:
MD5("Simson L. Garfinkel")= df876e8e6f548d5be698fab7f06dd278
Merely adding "L." produces a completely different hash. If you
compare the two hashes bit-for-bit you'll find that 63 out of the
128 positions have changed from a 0-to-1 or a 1-to-0, and the other
65 have remained unchanged.
Unfortunately, the whole theory of cryptographic hash functions
has a huge problem. The use of these functions requires that there
be no so-called "collisions". Either accidentally or on purpose,
there should be no two files that have the same cryptographic fingerprint.
And as it turns out, this is an impossible requirement.
The reason is pretty simple. File fingerprints are a fixed size,
which means that there is a finite number of possible fingerprints.
Files, on the other hand, can be any size. Thus, there are more
possible files than fingerprints, and so there must be at least
one fingerprint that is the fingerprint of multiple files. The mathematical
term for this is the "pigeonhole principle." Indeed, even if you
restrict yourself to files that are just nine characters long, there
are still 256 times the number of possible files as the number of
possible fingerprints.
The reason that the pigeonhole principle doesn't render hash functions
completely pointless is that there are an astounding number of possible
fingerprints-far more, in fact, than the number of files on the
planet. (With MD5 there are 2128 possible fingerprints.
Now, the total number of computer hard drives that have ever been
manufactured is only around 229. If every hard drive
had a million unique files-a gross overestimation-there would still
be only 249 individual files. That's a much, much, much
smaller number than 2128.)
The SHA-1 Controversy
For tutorial purposes, I have used the MD5 hash function. But
these days MD5 is considered passé-instead most of the world is
moving over to the U.S. government's Secure Hash Algorithm, known
as SHA-1, a standard adopted by the National Institutes of Standards
and Technology (NIST) back in the early 1990s.
Today SHA-1 is a widely respected algorithm, but it has a troubled
history. Back in 1993, the U.S. government was trying to get industry
to adopt the so-called Clipper Chip-a secret encryption system designed
by the National Security Agency. During the so-called "crypto wars"
that raged around Clipper, NIST proposed that the U.S. government
adopt its own Secure Hash Algorithm as part of the Federal Information
Processing Standards. For technical reasons, hash functions should
have twice as many bits as the encryption algorithms that they work
with. Clipper was an 80-bit encryption algorithm, so the standard
was designed to produce a 160-bit fingerprint.
One might think that the governments' standard, with its 160-bit
fingerprint, would be more secure than the 128-bit MD5. But like
Clipper itself, SHA was designed by the National Security Agency-and
both NIST and the NSA declined to explain the principles that were
used in its design. Some people wondered if the NSA might have hidden
some kind of "back door" inside the algorithm so that the agency
could generate collisions on demand. Such a back door could be used,
for example, to produce faked digital signatures-something that
the Central Intelligence Agency might find useful. A fake digital
signature might be used, for example, to sign an electronic order
giving an U.S. spy access to a database in a foreign country.
Lots of cryptographers and other academics analyzed the SHA algorithm
and couldn't find anything wrong with it. On May 11, 1993, NIST
proclaimed SHA as the nation's Secure Hash Algorithm. But the ink
was barely dry on this decree when NIST announced that it had made
a mistake. For reasons that would not be revealed at the time, NIST
published a modified version of the Secure Hash Algorithm-the algorithm
that we now call SHA-1.
The conspiracy theorists in the cryptography community (and there
are many) had a field day. Was SHA so powerful that the NSA had
decided that it had to be "dumbed down?" Or had NSA perhaps planted
a back door in SHA-and somebody at NIST had found out? Were both
algorithms equally secure, and the cryptographers at the NSA were
just messing with people's minds?
In August 1998, the world more-or-less learned the answer to the
SHA vs. SHA-1 mystery. Florent Chabaud and Antoine Joux, two French
cryptographers, came up with a theoretical attack against the first
version of SHA-an attack against which SHA-1 just happened to be
secure. Almost certainly, the folks at NSA knew about this attack
and proposed SHA-1 as a countermeasure. What's interesting here
is that NSA's cryptographers probably didn't know about the attack
when SHA was first proposed in 1993-which means that the world's
top cryptographic agency was only five years ahead of the cryptographers
in academia.
Today hash functions are also commonly used to generate repeatable
but unpredictable "random numbers," for converting typed passwords
into values suitable for using as encryption keys. Instead of storing
passwords directly, many computer systems store the hash of a password.
This prevents somebody who breaks into a computer from learning
everybody's password.
Hash functions have been proposed as a way to fight spam and as
the basis for digital cash systems. Mathematician Peter Wayner published
a book called Translucent Databases a few years ago in which he
showed how hash functions could be used for storing information
in a database in a way that's protected by the organization that's
running the database. A college admissions department, for example,
could store student social security numbers in the database so that
these numbers could still be used as identifiers on applications,
but so that nobody in the admissions office could sit down at a
terminal and get a list of students and their numbers. So far, though,
none of those approaches have really gotten off the ground.
All in all, cryptographic hashes are one of the most interesting
and useful mathematical techniques that cryptographers have come
up with over the past 20 years-and we're still finding new uses
for them all the time.
Simson Garfinkel is the author of nine books on computing, including
Database Nation.
|