Free Republic
Browse · Search
News/Activism
Topics · Post Article

The New Internet Backup and the Multimillion-Dollar MD5 Digital Signature Gamble
Micro Resource Group ^ | Thursday, June 06, 2002 | Jeff Roehl

Posted on 06/06/2002 4:41:09 PM PDT by FoxPro

You probably don't back up your computer. I don't, and my job revolves around it. Basically, there is very little on my computer that I care about. My current projects/pictures/documents are all in known directories. Every once in a while I copy these directories to another computer, and I am all right. I know that if my drive crashes, it will take some time to become useful again. I am willing to take that risk versus being more vigilant.

Backing up computers over the Internet is slowly going through a rather interesting revolution, and I want to reveal exactly how this is happening, without being vague about the methodology (as I was in my previous postings).

The trick to backing up thousands of computers with a relatively small amount of bandwidth and storage space is that a good deal of what is on my computer is most likely on your computer too. If you were to take a thousand random computers and compare all their files to each other, it might be safe to say that the vast majority of those files would be exact duplicates. This means that if you backed up each file once, and never backed up a duplicate, you could back up thousands of computers very efficiently (of course, if you are producing huge numbers of CAD drawings, or have several very large and active databases, this wouldn't work well, but most computers don't). This is the foundation of the new Internet backup.

So in essence, this new backup is not completely backing up your computer; it is only backing up files that it hasn't found on other people's computers. If you used this service and your computer crashed, you may have actually backed up only a few dozen spreadsheets, pictures and financial files from your computer, yet these new backup services can restore your complete system. The reason they can do this is that they found, and acquired, the rest of your files before you ever started using the service. Their restoration is mostly made up of duplicate files from other computers using the same backup service.

How do they do this?

This is where things get interesting, and even almost paranormal.

The MD5 backup gamble.

Or is it a gamble? It depends on whether you buy into the technological underpinnings of MD5 hashing, often loosely called "Digital Signatures". Let me explain.

Professor Ronald Rivest, of MIT, developed the MD5 algorithm in 1991.

This algorithm takes any character string or "file" and produces a 128-bit "message digest" from the string, usually written as 32 hexadecimal characters. This "message digest" is treated as unique to the input string, because in practice no two different strings are expected to produce the same output. Let's look at a couple of examples.

A hash digest of the letter "a" = "0cc175b9c0f1b6a831c399e269772661"
A hash digest of the letter "b" = "92eb5ffee6ae2fec3ad71c777531578f"
A hash digest of my resume = "c815947623767d07da494270632ecb6a"
A hash of all 1389 pages of the King James Version of the Bible = "e9ced9b30c8d6d9211cdc78eac8c9cce"

If we were to take this text of the Bible, find the first mention of "King Herod" (Matthew 2:1) and replace the "H" in Herod with a lower case "h", as in "King herod", the Bible's text would hash to "906fd7ff9b23b62b08e91b6e58317689", a completely different result. So at least to this point, we can see how this tool can uniquely identify digital information, much as a bar code in your grocery store identifies a can of soup.
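
The two single-letter digests above are easy to check yourself. Here is a minimal sketch using Python's standard hashlib module. The sample sentence is made up for illustration; the resume and Bible digests cannot be reproduced without the exact files that were hashed.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as 32 hexadecimal characters."""
    return hashlib.md5(data).hexdigest()


digest_a = md5_hex(b"a")
digest_b = md5_hex(b"b")

# Hypothetical sample text: the two inputs differ only in the case of one letter.
upper = md5_hex(b"in the days of King Herod")
lower = md5_hex(b"in the days of King herod")

print(digest_a)
print(digest_b)
print(upper != lower)  # a one-letter change gives a different digest
```

Note that the digest is always the same length, no matter how large the input is.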

Why is this useful? For several reasons. When you type your password into MS Windows, it doesn't keep the password; it keeps its hash. The next time you type in your password, it must hash to the same value or it is rejected. Since a hash is irreversible (you can't deduce the password from the hash), this is thought to be very secure.
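
The store-the-hash idea can be sketched in a few lines. This is only an illustration of the principle, using MD5 to match the rest of this article. It is an assumption, not a description of what Windows actually does; real systems use their own hash functions and usually add salts. The function names are made up.

```python
import hashlib
import hmac


def hash_password(password: str) -> str:
    """Illustration only: hash the password so the plain text is never stored."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


def check_password(stored_hash: str, attempt: str) -> bool:
    """Hash the attempt and compare it with the stored hash."""
    # compare_digest takes the same time whether or not the strings match
    return hmac.compare_digest(stored_hash, hash_password(attempt))


stored = hash_password("correct horse")  # only this value is kept
print(check_password(stored, "correct horse"))
print(check_password(stored, "wrong horse"))
```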

When you download something from the Internet, you often get something called a certificate. This is built on a hash, and it assures you that the file you are downloading is exactly the same as the author's published file, so that when it is executed, it doesn't turn out to be some malicious virus or anything else undesirable.
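
In its simplest form, checking a download means hashing the file you received and comparing the result with the digest the author published. A rough sketch, assuming the author publishes a plain MD5 hex string (the function names are hypothetical):

```python
import hashlib


def file_md5(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, published_hex: str) -> bool:
    """True if the file's digest equals the published one (case-insensitive)."""
    return file_md5(path) == published_hex.strip().lower()
```

Usage would look like matches_published("setup.exe", digest_from_authors_site), where both the file name and the digest source are placeholders.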

To add validity to this, it seems that no two different bodies of binary data have ever hashed to the same 128-bit value, and it's not for lack of trying. Cryptology people have been trying to break this algorithm for years now. When two different files return the same value, it is called a hash "collision". If anybody can actually make a collision happen, and prove it, they will become instant celebrities in the field of cryptology.
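
To get a feel for how unlikely an accidental collision is, here is a back-of-the-envelope sketch using the standard birthday approximation. It assumes digests behave like uniformly random 128-bit values, which is an idealization, and it says nothing about someone deliberately constructing a collision.

```python
import math

DIGEST_BITS = 128  # size of an MD5 digest


def accidental_collision_probability(file_count: int, bits: int = DIGEST_BITS) -> float:
    """Birthday approximation: p ~= 1 - exp(-n * (n - 1) / 2**(bits + 1))."""
    exponent = file_count * (file_count - 1) / 2 ** (bits + 1)
    # expm1 keeps precision when the probability is extremely small
    return -math.expm1(-exponent)


# One billion distinct files, an arbitrary figure chosen for illustration
print(accidental_collision_probability(10**9))
```

Under that assumption, even a billion distinct files leave the chance of any two sharing a digest vanishingly small.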

So this is the clever foundation of the new backup: the ability to compare the complete contents of any file using just a string of digits like "c815947623767d07da494270632ecb6a". All the backup client has to do is go through the computer's drive, hash all the files into a database, and send the hashes to the server, which figures out which files it doesn't already have and backs those up. For the files the backup system already has, all it does is set a marker pointing to the stored copy, and keep the location and name of the file. It is really that simple.
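
The whole scheme fits in a short sketch. Everything here is an assumption made for illustration: BackupServer is a hypothetical in-memory stand-in for the real service, and nothing below reflects how any particular company implements it.

```python
import hashlib
import os


def hash_tree(root: str) -> dict:
    """Map every file path under root to the MD5 hex digest of its contents."""
    digests = {}
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            with open(path, "rb") as handle:
                digests[path] = hashlib.md5(handle.read()).hexdigest()
    return digests


class BackupServer:
    """Hypothetical server: stores each distinct file once, keyed by digest."""

    def __init__(self):
        self.blobs = {}    # digest -> file contents
        self.catalog = {}  # client name -> {path: digest}

    def missing(self, digests):
        """Return the digests the server has never seen."""
        return {d for d in digests if d not in self.blobs}

    def store(self, digest, data):
        self.blobs[digest] = data

    def record(self, client, path_to_digest):
        """Keep only names and markers for this client's files."""
        self.catalog[client] = dict(path_to_digest)


def back_up(client: str, root: str, server: BackupServer) -> int:
    """Upload only files the server lacks; return how many were uploaded."""
    digests = hash_tree(root)
    wanted = server.missing(digests.values())
    uploaded = 0
    for path, digest in digests.items():
        if digest in wanted:
            with open(path, "rb") as handle:
                server.store(digest, handle.read())
            wanted.discard(digest)  # identical files on this machine go up once
            uploaded += 1
    server.record(client, digests)
    return uploaded
```

A restore would walk the client's catalog entry and write out the stored contents for each digest, which is why files first uploaded by other users can rebuild your system.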

What happens if somebody finds two different files that hash to the same value? Will these multi-million dollar companies be in trouble? Assuming collisions are found, how many would it take to make you, and more importantly the general public, distrust these new backup systems? Could this technology become universally utilized (in security, communications and dozens of other uses) and then, if found to be broken, affect the technology economy? A lot of R&D is riding on this question, and only time will tell.

I am assuming that these companies don't want us talking about this potential problem, because they don't openly explain what they are doing and how they are doing it.

As the MD5 algorithm becomes increasingly important in everyday technological matters (whether obvious or not), it would be interesting to really put this system through its paces, and that is exactly what we are attempting to do. We have developed a system to derive an MD5 hash from all files on any computer system, and compare them to the hashes of all other computers processed. There are any number of reasons to do this, only one of which is to see if we can come up with two files that hash to the same value but have different sizes.
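
A cross-computer comparison like this can be sketched as an index keyed by digest. This is a guess at the general shape, not our actual system, and every record below is made up. Two entries with the same digest but different sizes would be proof of a collision, since identical files must have identical sizes.

```python
from collections import defaultdict


def build_index(records):
    """Group (computer, path, size, digest) records by digest."""
    index = defaultdict(list)
    for computer, path, size, digest in records:
        index[digest].append((computer, path, size))
    return index


def size_mismatches(index):
    """Digests seen with more than one file size, i.e. certain collisions."""
    return {
        digest: entries
        for digest, entries in index.items()
        if len({size for _computer, _path, size in entries}) > 1
    }


def files_seen_once(index):
    """Files whose digest appears on exactly one computer, exactly once."""
    return [entries[0] for entries in index.values() if len(entries) == 1]


# Made-up sample data: two machines share one system file
records = [
    ("pc1", "C:/WINDOWS/notepad.exe", 50960, "0cc175b9c0f1b6a831c399e269772661"),
    ("pc2", "C:/WINDOWS/notepad.exe", 50960, "0cc175b9c0f1b6a831c399e269772661"),
    ("pc1", "C:/docs/resume.doc", 24064, "c815947623767d07da494270632ecb6a"),
]
index = build_index(records)
print(size_mismatches(index))
print(files_seen_once(index))
```

The files seen only once are the ones worth backing up or investigating; everything else is common to several machines.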

The information that can be derived from doing this can be utilized to:

      • Back up your most important files
      • Update your operating system
      • Investigate what is happening on your computer, or any computer for that matter
      • Find lost/forgotten files
      • Find out what your kids are really up to
      • Help the FBI/CIA identify unique information on suspected terrorists' computers
      • Several other things we haven't thought of yet

Hopefully this will all culminate in the Uniquefile(“filespec”) function, a 32-bit ActiveX and OCX, COM object for use in your Visual C++, Visual Basic, Visual FoxPro, Delphi, and C++ applications.

To investigate this further, check out File Isolation. It is endlessly intriguing and free.



TOPICS: Business/Economy; Crime/Corruption; Technical
KEYWORDS: backup; digitalsignatures; internet; md5

1 posted on 06/06/2002 4:41:09 PM PDT by FoxPro
[ Post Reply | Private Reply | View Replies]

To: FoxPro
Your web page has a message board with a single introductory message, which includes the phrase: What is the business end of this? What makes the shareholders think their investment will make money?

The mathematics behind these is pretty solid. If others want more information on this, see for example Ronald Rivest. RFC 1321: The MD5 Message-Digest Algorithm. RSA Data Security Inc., April 1992, or Cryptographic hash functions.

That you post with a bit of F.U.D. (scare tactics in the choice of the words "the Multimillion-Dollar ... Gamble", for instance) and with no history or background of yourself or organization, no business model (though there apparently is one), no references and little to go on except an apparent enticement for something "free" that will be scanning my entire computer raises my guard. I'd recommend others be a bit on guard here, until more is known.

Finally, if other readers of this are interested in backup over the internet, I can recommend SystemSafe, by NetMass. I won't leave a link here - you can find it on Google.com if you search. There are likely other good choices in this product category as well.

2 posted on 06/06/2002 5:19:38 PM PDT by ThePythonicCow
[ Post Reply | Private Reply | To 1 | View Replies]

To: ThePythonicCow
I didn't say anything about making money. You should let me worry about the shareholders. We are not a one-product group; we are doing this for a multitude of different purposes. If you want to see my resume, go to goggle.com and type in "resume Roehl"
3 posted on 06/06/2002 5:33:28 PM PDT by FoxPro
[ Post Reply | Private Reply | To 2 | View Replies]

To: FoxPro
Advertising a commercial product on FreeRepublic? Sorry, I'm not impressed. Also, even with today's hardware the computing power required to calculate MD5's for every file on the typical system is pretty significant.

From I/O bound to CPU bound - the classic tradeoff.

4 posted on 06/06/2002 7:01:24 PM PDT by The Duke
[ Post Reply | Private Reply | To 3 | View Replies]

To: The Duke
calculate MD5's for every file

2 to 4 hours at night while you are sleeping.

Advertising a commercial product

We don't see it as a "commercial product". We had a lot of fun putting this together. I guess only the Government should do this.

5 posted on 06/06/2002 7:22:31 PM PDT by FoxPro
[ Post Reply | Private Reply | To 4 | View Replies]

To: FoxPro
The MD5 hash has 340,282,366,920,938,463,463,374,607,431,768,211,456 possible signatures, which is greater than the number of atoms in the universe.
6 posted on 06/06/2002 8:31:36 PM PDT by Huusker
[ Post Reply | Private Reply | To 5 | View Replies]

To: Huusker
Otherwise known as 2 ** 128. That doesn't mean that there isn't some practical way to manufacture a string to match a given number.

For example, let's say the md5 sum was computed by simply looking at the binary representation of the data as one big number, modulo 2**128. Then any data pattern ending in the same last 128 bits would have that same sum, and md5 would be worthless.

Or for a slightly more realistic example, consider the classic check sum -- gotten just by adding up each byte in the data. Any input data stream with the same bytes rearranged will have the same sum.

Which isn't to say that md5 is much of a risk for such. Just to say that it could be, and then the number of bits wouldn't necessarily help.
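
Both toy examples are easy to demonstrate. A small sketch follows; the function names are made up, and neither function has anything to do with how md5 really works. They only show that a large output range does not by itself prevent easy collisions.

```python
def low_128_bits(data: bytes) -> int:
    """Toy 'hash': the data read as one big number, modulo 2**128."""
    return int.from_bytes(data, "big") % 2**128


def byte_sum(data: bytes) -> int:
    """Classic checksum: add up every byte."""
    return sum(data)


tail = b"same-last-16byte"  # exactly 16 bytes, i.e. 128 bits

# Different data with the same last 128 bits collide under the toy hash
print(low_128_bits(b"first file " + tail) == low_128_bits(b"second file " + tail))

# Rearranged bytes collide under the checksum
print(byte_sum(b"listen") == byte_sum(b"silent"))
```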

7 posted on 06/06/2002 8:57:35 PM PDT by ThePythonicCow
[ Post Reply | Private Reply | To 6 | View Replies]

To: FoxPro
And I should worry about my data. I'm sure you had fun, that is clearly displayed on both your recent posts on this work. However I need some more context, before I even consider using this. It's a matter of trust. What we know about you so far is indistinguishable from someone up to something we might not find acceptable.
8 posted on 06/06/2002 9:01:49 PM PDT by ThePythonicCow
[ Post Reply | Private Reply | To 3 | View Replies]

To: FoxPro
Why not just install a mirroring controller ($85) and a second hard drive ($125) and then have 100% backup locally? Ok, if your house burns down, you're SOL. If your tv antenna gets hit by a bolt of lightning, you might be SOL. If you mung up your registry, you'd have to recover the data files once you re-installed the operating system. I just don't see the gain involved in flying my files across the internet when for $200 I have a real-time, identical backup on-site. You can get somebody to do this for you for less than $100 labor and know that even if one of your drives dies, life goes on - go to CompUSA and buy another drive. Off-site backup makes sense for businesses but ask your "friendly" EMC rep how much he wants to do it for you...
9 posted on 06/06/2002 9:14:34 PM PDT by agitator
[ Post Reply | Private Reply | To 1 | View Replies]

To: agitator
I do my own backup of my machines, to various tape drives. But I have ended up spending a lot of time and money on this over the years.

My wife's machine, which she used to track her small business, I have backed up via a web based service. One time, I set it up to know which files and directories to back up - certain accounting, tax and document stuff. Now everyday, without her even knowing, her files are backed up, offsite (far away offsite). Perhaps $20 is deducted from her business account each month. And when the day comes, the files will be there.

It's much like my machines at work. I used to back them up to tape, myself. Now my MIS department backs up thousands of machines to a big monster rig, and when a file is lost, it's there. Once you're working off a high-speed, always-on connection (a couple hundred kbits at least; most any cable or DSL line), the entire economics of it change.

10 posted on 06/06/2002 10:48:26 PM PDT by ThePythonicCow
[ Post Reply | Private Reply | To 9 | View Replies]

To: ThePythonicCow
For business purposes, offsite backup makes a lot of sense; you don't have control over the fire producing/theft proclivities of your commercial neighbors. On the other hand, for home purposes, $20 a month (plus the cable tv-like rate increases) doesn't make a lot of sense for me when hardware mirrored drives will stop 95% of home data loss.
11 posted on 06/06/2002 11:20:26 PM PDT by agitator
[ Post Reply | Private Reply | To 10 | View Replies]

To: agitator
It's more than just business. If one already has the higher speed interconnect, and if one isn't computer savvy, and if one data loss would be a serious bother, then the $20 may well be worth it. And for the non-computer savvy, it may be the only solution that works.
12 posted on 06/06/2002 11:30:01 PM PDT by ThePythonicCow
[ Post Reply | Private Reply | To 11 | View Replies]

To: Huusker
And that is MD5's potential weakness. It has a limited number of available combinations, yet it purports to uniquely identify any of an infinite number of combinations of these “atoms”. There is a logical problem right there.

I didn’t post this to discuss MD5 or computer backup. I like the new hash based backup systems. I was just explaining how it works. I would backup my computer using this method, and sleep well at night.

I think MD5 is the coolest computer thing I have ever seen, and use it everyday for all sorts of things.

13 posted on 06/07/2002 7:41:29 AM PDT by FoxPro
[ Post Reply | Private Reply | To 6 | View Replies]

To: ThePythonicCow
Well, you don’t know me, and don’t know if I am up to something evil. This I can understand. Would it help if we posted the source code? The only problem with source code is that I have built in mechanisms in the code that prevent hackers from spoofing the system by sending in globs of useless data that would clog up my database array. If I put up the source code, I would be showing how to thwart this safety check.

We are talking to some big companies to lend us legitimacy. We just put the web page up a couple of days ago so yes we are an unknown quantity. So is everything that is new or novel.

What we would really like is for the new homeland security people in Washington to look at what we are doing. We believe that if we can get the hashes from a few thousand computers, this would give security agencies a very useful tool for finding terrorists' messages embedded in graphics files (just to give one example of numerous possible applications for this).

Another good example: let's say we have millions of file hashes, from thousands of computers. Chandra Levy disappears here in Washington, DC. If the police could have run file isolation on her computer as soon as they got access to it, they would have been able to get a lot more information, a lot quicker than they did having to go through her system without it. I know personally how tedious this is. So we see this as a nice police investigative tool.

Also, this would be a nice clearinghouse for any number of hash-based, non-duplicative backup systems. There is not much sense in having several different database systems doing the same thing, when one system would be optimal.

We are just doing what we like to do. Will we be successful, probably not. Will that stop us from trying? Probably not.

14 posted on 06/07/2002 8:09:34 AM PDT by FoxPro
[ Post Reply | Private Reply | To 8 | View Replies]

To: agitator
I agree totally. If my house burned down, I think I would have a lot more things to worry about than what was on my PC.

I don’t back up over the Internet, because I think it is too expensive. When they can back up all my computers for $5 a month, I may consider it. I was just explaining the new backup methodologies, given that nobody has been able to break the MD5 hash, and hopefully nobody will.

15 posted on 06/07/2002 8:16:10 AM PDT by FoxPro
[ Post Reply | Private Reply | To 9 | View Replies]

To: FoxPro
We are just doing what we like to do. Will we be successful, probably not. Will that stop us from trying? Probably not.

I thought you said you weren't a commercial venture?

16 posted on 06/08/2002 7:45:28 AM PDT by The Duke
[ Post Reply | Private Reply | To 14 | View Replies]

To: FoxPro
I'm not an expert, but it seems there are quite a few good hashing algorithms out there. Cosets of many different nonsystematic error correction codes could be used, for example. It's not even necessary to understand the geometry of finite fields and elliptic curves to design a good hash, as far as I know.
17 posted on 06/08/2002 9:05:40 AM PDT by apochromat
[ Post Reply | Private Reply | To 1 | View Replies]

To: The Duke
What is the definition of "commercial venture" and why is this an issue?
18 posted on 06/08/2002 12:37:11 PM PDT by FoxPro
[ Post Reply | Private Reply | To 16 | View Replies]

Disclaimer: Opinions posted on Free Republic are those of the individual posters and do not necessarily represent the opinion of Free Republic or its management. All materials posted herein are protected by copyright law and the exemption for fair use of copyrighted works.

FreeRepublic, LLC, PO BOX 9771, FRESNO, CA 93794
FreeRepublic.com is powered by software copyright 2000-2008 John Robinson