What should I know before poking around an unknown archive file?

A game I play stores all of its data in a .DAT file. Some people have already done work on examining the file. There are also existing tools, but I'm not sure about their current state. I think it would be fun to poke around in the data myself, but I have never tried to examine a file before, much less something like this.

Is there anything I should know about studying the file format for data extraction before diving into it?

EDIT: I would like some very general tips, as exploring file formats seems interesting. I would like to take some file X and learn how to approach the problem of figuring it out.

+4
5 answers
  • You will definitely need a hex editor before going too far. It lets you see the raw data as numbers rather than as large blank blocks in whatever font your text editor uses.
  • Try opening it in any archive extractor (e.g. zip, 7z, rar, gz, tar, etc.) to see if it's just a renamed archive format (.PK3 is something like this).
  • Look for the headers of well-known file formats somewhere in the file, which will help you find where certain pieces of data are stored (e.g., search for the PNG signature, the byte 0x89 followed by "PNG", to find any uncompressed PNG files inside).
  • If you find where a certain piece of data is stored, note its location and length, and see if you can find numbers equal to either of those values near the beginning of the file. Such numbers usually act as pointers to the actual data.
  • Sometimes you just have to guess, or intuit, what a certain value means. If you are wrong, well, keep moving. There is not much else you can do about it.
  • I found that http://www.wotsit.org is especially useful for well-known file formats, and for looking up headers you find inside a .dat file.
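
The signature search and the pointer hunt from the list above can be sketched in a few lines of Python. This is only a rough illustration: the helper names `find_all` and `find_pointer_candidates` are made up for this example, the sample bytes are synthetic, and it assumes offsets are stored as 32-bit integers within the first 4096 bytes, which may not hold for a real .dat file.

```python
import struct

# The 8-byte signature that starts every PNG file.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def find_all(data, needle):
    """Return every offset at which needle occurs in data."""
    hits = []
    pos = data.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = data.find(needle, pos + 1)
    return hits


def find_pointer_candidates(data, value, header_size=4096):
    """Search the first header_size bytes for value stored as a 32-bit integer.

    Both byte orders are tried, because we do not know which one the format uses.
    """
    if not 0 <= value <= 0xFFFFFFFF:
        return []  # does not fit in 32 bits, so this guess cannot apply
    header = data[:header_size]
    found = []
    for label, fmt in (("little-endian", "<I"), ("big-endian", ">I")):
        encoded = struct.pack(fmt, value)
        for pos in find_all(header, encoded):
            found.append((pos, label))
    return sorted(found)


# Synthetic "archive": a 4-byte offset, 12 bytes of padding, then a PNG signature.
blob = struct.pack("<I", 16) + b"\x00" * 12 + PNG_MAGIC + b"rest of image"

png_offsets = find_all(blob, PNG_MAGIC)
pointers = find_pointer_candidates(blob, png_offsets[0])
print("PNG signatures at:", png_offsets)
print("possible pointers to the first one:", pointers)
```

On a real file you would read the bytes with `open(path, "rb").read()` and expect some false positives, since small numbers turn up by coincidence all the time.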
+8

First, back up the file. Once you have limited the amount of damage you can do, just poke around, as Ed suggested.

+3

Looking at your reputation level, I assume that a basic primer on hexadecimal numbers, endianness, the representations of different data types, and so on would be a bit redundant. You will of course need a good tool that can display data in hexadecimal, as well as the ability to write quick scripts to test complex assumptions about the data's structure. All of this should be obvious to you, but it may help someone else, so I thought I'd mention it.
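
As an illustration of the kind of quick script meant here, the sketch below reads the same four bytes under several guesses at once. The helper name `interpretations` and the sample bytes are invented for this example; the set of guesses is an assumption about what is worth trying first, not a complete list.

```python
import struct


def interpretations(data, offset):
    """Show how the 4 bytes at offset read under a few common assumptions."""
    chunk = data[offset:offset + 4]
    if len(chunk) < 4:
        raise ValueError("need at least 4 bytes at this offset")
    return {
        "u32 little-endian": struct.unpack("<I", chunk)[0],
        "u32 big-endian": struct.unpack(">I", chunk)[0],
        "i32 little-endian": struct.unpack("<i", chunk)[0],
        "f32 little-endian": struct.unpack("<f", chunk)[0],
        "ascii": chunk.decode("ascii", errors="replace"),
    }


# Eight sample bytes, as you might copy them out of a hex editor.
raw = bytes.fromhex("00 00 01 00 2a 00 00 00")

for offset in (0, 4):
    print("offset", offset)
    for label, value in interpretations(raw, offset).items():
        print("   ", label, "=", repr(value))
```

A value that looks sensible under only one byte order (a small count or a plausible offset rather than a huge number) is a good hint about the endianness of the whole file.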

+3

One of the best ways to attack an unknown file format, when you have some control over the content, is a differential approach. Save the file, make a small, controlled change, and save again. Do a binary comparison of the two files to find the differences, preferably with a tool that can detect insertions and deletions. If you are dealing with an encrypted file, a small change will cause huge differences. If it is simply compressed, the differences will not be localized. And if the file format is trivial, a simple state change will result in a simple change in the file.
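
A minimal sketch of such a comparison, using `difflib.SequenceMatcher` from the Python standard library because it reports insertions and deletions as well as replacements. The two "saves" are made-up byte strings and `diff_regions` is a name chosen for this example. `SequenceMatcher` gets slow on large inputs, so for a big file you would probably compare it in chunks or use a dedicated binary diff tool.

```python
from difflib import SequenceMatcher


def diff_regions(old, new):
    """Return the opcodes describing where two byte strings differ.

    Each opcode is (tag, old_start, old_end, new_start, new_end), with tag
    being 'replace', 'insert' or 'delete'.
    """
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


# Two hypothetical saves: a counter goes from 3 to 4 and a name gets longer.
before = b"HEAD" + bytes([3, 0]) + b"NAME=Bob" + b"TAIL"
after = b"HEAD" + bytes([4, 0]) + b"NAME=Bobby" + b"TAIL"

for tag, i1, i2, j1, j2 in diff_regions(before, after):
    print(tag, "old", before[i1:i2].hex(), "-> new", after[j1:j2].hex())
```

A handful of small, localized regions like this suggests a plain format. If nearly every byte after the first change differs, compression or encryption is the more likely explanation.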

+3

Another thing to do is look at some of the common compression formats, notably zip and gzip, and learn their "signatures." Most of these formats are self-identifying, so that when a tool starts to decompress one, it can quickly check that what it is working on is in a format it understands.
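
A small sketch of checking those signatures in Python. The table holds the leading magic bytes of a few common formats; the function name `identify` is made up for this example, and the samples are generated on the spot with the standard `gzip` and `zipfile` modules rather than taken from any real game file.

```python
import gzip
import io
import zipfile

# Leading "magic" bytes of a few common compressed or archive formats.
SIGNATURES = {
    b"PK\x03\x04": "zip",
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"7z\xbc\xaf\x27\x1c": "7z",
    b"Rar!\x1a\x07": "rar",
    b"\xfd7zXZ\x00": "xz",
}


def identify(data):
    """Return the name of the format whose signature data starts with, or None."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


# Build two tiny samples in memory to try the check on.
gz_sample = gzip.compress(b"hello")

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.txt", "hello")
zip_sample = buffer.getvalue()

print(identify(gz_sample), identify(zip_sample), identify(b"plain text"))
```

To look for compressed blocks embedded in a larger file, the same table can be used with `bytes.find` at every offset instead of `startswith`. Short signatures such as the two-byte gzip one will produce false hits, so each candidate still needs to be verified by actually trying to decompress it.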

Barring encryption, an archive file format is basically some kind of indexing mechanism (a directory of sorts) plus a way to locate the items in the archive through pointers in the index.

With the ubiquity of standard compression algorithms, it is mostly a matter of finding where these blocks begin and trying to locate an index or table of contents.

Some formats keep the index in a single place (like a file system), while others simply precede each item in the archive with its identifying information. Either way, in the end there is information about the offsets from one block to another, information about data types (for example, if they store GIF files, GIFs also have a signature), and so on.

Those are the patterns you are trying to track down in the file.
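
To make the idea of an index concrete, here is a sketch that builds and then reads a toy archive with a single table of contents at the front. The layout is entirely hypothetical (a "DAT0" magic, an entry count, then fixed-size records of name, offset and size, all little-endian) and is not the format of any real game. The real file's layout is exactly what you would have to discover.

```python
import struct

# Hypothetical layout, invented for this example only.
HEADER = struct.Struct("<4sI")    # magic, number of entries
ENTRY = struct.Struct("<16sII")   # NUL-padded name, data offset, data size


def build_archive(files):
    """Pack a dict of {name: bytes} into the toy format."""
    data_start = HEADER.size + ENTRY.size * len(files)
    index = b""
    payload = b""
    for name, content in files.items():
        offset = data_start + len(payload)
        index += ENTRY.pack(name.encode("ascii"), offset, len(content))
        payload += content
    return HEADER.pack(b"DAT0", len(files)) + index + payload


def read_index(blob):
    """Return a list of (name, offset, size) tuples from the toy format."""
    magic, count = HEADER.unpack_from(blob, 0)
    if magic != b"DAT0":
        raise ValueError("not a toy archive")
    entries = []
    for i in range(count):
        raw_name, offset, size = ENTRY.unpack_from(blob, HEADER.size + i * ENTRY.size)
        entries.append((raw_name.rstrip(b"\x00").decode("ascii"), offset, size))
    return entries


blob = build_archive({"a.txt": b"hello", "b.gif": b"GIF89a"})
entries = read_index(blob)
for name, offset, size in entries:
    print(name, offset, size, blob[offset:offset + size])
```

When reverse engineering, the work runs in the opposite direction: you find the payloads first (by their signatures), then look for records near the start of the file whose numbers match those offsets and sizes, and only then write something like `read_index`.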

It would also help if you could somehow get your hands on two versions of the data in the same format. With a game, for example, you might be able to get the original version from the CD and a newer, patched version. Comparing them can really highlight the information you are looking for.

+3