Reversing CRC-32 values

Post by **machf** » Fri Sep 21, 2007 8:21 pm

So, you may have already seen my app for reversing the CRC-32 values used for the sound names in Trespasser. Here's a description of the algorithms used in it and how they work.

First, let's start by describing what a CRC is. CRC stands for "Cyclic Redundancy Code". In data transmissions, CRCs are used as a way to detect possible errors during the transmission (there are other error detection codes aside from CRCs, though). The CRC for a data packet is calculated at the source and transmitted together with the data packet, then at the receiving end the CRC is recalculated and compared to the transmitted one. If they don't match, it means either the data or the CRC was corrupted during transmission, and they need to be retransmitted. It is possible for an error to go undetected, though, if the data and the CRC are corrupted in such a way that the received CRC matches the CRC recalculated from the received data; the probability of this kind of undetected error happening depends on the CRC polynomial used. The simplest CRC is a single parity bit. Commonly used ones are called CRC-16 and CRC-32, though technically there are several possible variants of each. Take a look at this article on Wikipedia for some more details, I don't want to get too deep into it here.
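To make the send/recompute/compare cycle concrete, here is a minimal sketch in JavaScript. It assumes the common "reflected" CRC-32 variant (polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF) and works one bit at a time, without any table. The packet contents and variable names are made up for illustration.

```javascript
// Bitwise CRC-32 (assumed variant: reflected, polynomial 0xEDB88320).
function crc32(bytes) {
  let crc = 0xFFFFFFFF;                        // standard initial value
  for (const b of bytes) {
    crc ^= b;
    for (let i = 0; i < 8; i++) {
      // If the low bit is set, shift right and XOR in the polynomial.
      crc = (crc & 1) ? (crc >>> 1) ^ 0xEDB88320 : crc >>> 1;
    }
  }
  return (crc ^ 0xFFFFFFFF) >>> 0;             // final XOR, as unsigned
}

// Sender: compute the CRC and send it along with the packet.
const packet = Array.from("123456789", ch => ch.charCodeAt(0));
const sentCrc = crc32(packet);

// Receiver: recompute the CRC from what arrived and compare.
const intact = packet.slice();
const damaged = packet.slice();
damaged[3] ^= 0x04;                            // one bit flipped "in transit"

const intactOk = crc32(intact) === sentCrc;    // matches: packet accepted
const damagedOk = crc32(damaged) === sentCrc;  // mismatch: ask for a resend
```

A single flipped bit is always caught by this CRC; the undetected cases mentioned above need more elaborate corruption patterns.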

Andres James had determined some years ago that Trespasser used the standard CRC-32 algorithm to calculate hash values for texture and sound names, though he didn't give details. I ran some tests and found that the game first converts the strings to lowercase before calculating the standard CRC-32 value.
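As a rough sketch of that behaviour (lowercase first, then standard CRC-32), something like the following should reproduce the hashes. The helper name `nameHash` is hypothetical, and the code assumes the names contain only single-byte characters.

```javascript
// 256-entry table for the standard reflected CRC-32 (polynomial 0xEDB88320).
const CRC_TABLE = new Uint32Array(256);
for (let n = 0; n < 256; n++) {
  let c = n;
  for (let k = 0; k < 8; k++) {
    c = (c & 1) ? (c >>> 1) ^ 0xEDB88320 : c >>> 1;
  }
  CRC_TABLE[n] = c >>> 0;
}

// Table-driven CRC-32 over the characters of a string, 8 bits at a time.
function crc32(text) {
  let crc = 0xFFFFFFFF;
  for (let i = 0; i < text.length; i++) {
    const byte = text.charCodeAt(i) & 0xFF;    // assumption: single-byte chars
    crc = CRC_TABLE[(crc ^ byte) & 0xFF] ^ (crc >>> 8);
  }
  return (crc ^ 0xFFFFFFFF) >>> 0;
}

// Hypothetical helper: lowercase the name, then hash it.
function nameHash(name) {
  return crc32(name.toLowerCase());
}
```

With this, `nameHash("ABC")` and `nameHash("abc")` give the same value, which is the case-insensitivity the tests showed.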

So, why does Trespasser use CRC-32 hashes to store the sound names, instead of the actual names themselves? The answer appears to be: to save space in the audio data tables (the .tpa files). A CRC-32 value is only 4 bytes (32 bits) long, while apparently, Trespasser would use up to 32 characters for a sound name.

Some sound names themselves appear in the different levels where they are used. But, since Trespasser only stores the CRC-32 values in the tables and not all the sounds from the tables are used in the retail levels, we don't know the actual names of the unused sounds in the table. There have been trial-and-error attempts to find more sound names, which have been successful to some degree, but several still remain unknown.

That's when I started to analyze the CRC-32 algorithm to see if there was any way to obtain the original string back from the CRC-32 value. At first sight, the answer seemed to be "no", since the CRC-32 algorithm discards data during the process...

Taking the table-driven algorithm, which processes the data 8 bits at a time and uses a 256-entry table of precalculated 32-bit values to obtain the CRC faster, I noticed the topmost 8 bits of each entry in that table were unique. So, if you start from the final CRC value, you can find which table entry was used to obtain it, and as you go back from the last character, you can do the same for a total of 4 entries. After that, you obtain a 4-byte sequence which has the same CRC-32 value you started with - it's not necessarily the original string (unless it was 4 characters long), but it can be used in its place. Of course, if they aren't 'printable' characters, they won't be too useful for our Trespasser-specific purpose, but never mind... we'll deal with that later.
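The idea can be sketched like this. It is only an illustration under the usual assumptions (reflected CRC-32, polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF), not the code from the app, and the function names are invented. The table entry is found by scanning the table for the matching top byte.

```javascript
const CRC_TABLE = new Uint32Array(256);
for (let n = 0; n < 256; n++) {
  let c = n;
  for (let k = 0; k < 8; k++) {
    c = (c & 1) ? (c >>> 1) ^ 0xEDB88320 : c >>> 1;
  }
  CRC_TABLE[n] = c >>> 0;
}

function crc32(bytes) {
  let crc = 0xFFFFFFFF;
  for (const b of bytes) {
    crc = CRC_TABLE[(crc ^ b) & 0xFF] ^ (crc >>> 8);
  }
  return (crc ^ 0xFFFFFFFF) >>> 0;
}

// Scan the table for the one entry whose topmost 8 bits match.
function indexForTopByte(top) {
  for (let n = 0; n < 256; n++) {
    if ((CRC_TABLE[n] >>> 24) === top) return n;
  }
  throw new Error("no entry found");  // not reached while top bytes are unique
}

// Return 4 bytes whose CRC-32 equals `target`.
function fourBytesFor(target) {
  let reg = (target ^ 0xFFFFFFFF) >>> 0;       // undo the final XOR
  const idx = new Array(4);
  // Backwards: the top byte of the register identifies the entry used.
  for (let k = 3; k >= 0; k--) {
    idx[k] = indexForTopByte(reg >>> 24);
    // Remove that entry and undo the shift; the low byte is unknown,
    // but only the top byte is needed for the next step.
    reg = ((reg ^ CRC_TABLE[idx[k]]) << 8) >>> 0;
  }
  // Forwards from the initial value: pick each byte so that the
  // table entry found above is the one that gets selected.
  let crc = 0xFFFFFFFF;
  const out = [];
  for (let k = 0; k < 4; k++) {
    out.push((crc ^ idx[k]) & 0xFF);
    crc = CRC_TABLE[idx[k]] ^ (crc >>> 8);
  }
  return out;
}
```

For example, `fourBytesFor(crc32([116, 101, 115, 116]))` gives back the bytes of "test", since that input is exactly 4 characters long; for a longer original string the result is a different 4-byte sequence with the same CRC.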

The 'lookback' process can be made faster by using a second table, instead of having to look through the entries on the original table for one with the matching upper 8 bits. Here's some JavaScript code that generates both tables simultaneously:
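A minimal sketch of what such a generator can look like, assuming the standard reflected polynomial 0xEDB88320. The names `crcTable` and `revTable` are chosen here for illustration and are not necessarily the ones used in the app.

```javascript
const crcTable = new Uint32Array(256);  // forward table: index -> 32-bit entry
const revTable = new Uint8Array(256);   // reverse table: top byte -> index

for (let n = 0; n < 256; n++) {
  let c = n;
  for (let k = 0; k < 8; k++) {
    // Shift right one bit; XOR in the polynomial if a 1 bit fell off.
    c = (c & 1) ? (c >>> 1) ^ 0xEDB88320 : c >>> 1;
  }
  crcTable[n] = c >>> 0;
  // The top byte of every entry is unique, so it can serve as a key
  // that leads straight back to the index that produced the entry.
  revTable[c >>> 24] = n;
}
```

With the second table in place, finding the entry for a given top byte is a single lookup, `revTable[value >>> 24]`, instead of a scan over all 256 entries.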
