What does data serialization do?

It's hard for me to understand what serialization does.

Let me simplify my problem. I have a struct info in my C/C++ program, and I can store this struct's data in save.bin or send it via a socket to another computer.

    struct info {
        std::string name;
        int age;
    };

    void write_to_file() {
        info a = {"Steve", 10};
        ofstream ofs("save.bin", ofstream::binary);
        ofs.write((char *) &a, sizeof(a)); // am I doing it right?
        ofs.close();
    }

    void write_to_sock() {
        // I don't know the socket API, but I assume writing a to a socket is similar to writing it to a file, isn't it?
    }

write_to_file just saves the struct info object a to disk, making this data persistent, right? And writing it to a socket is pretty much the same, isn't it?

In the code above I don't think I used data serialization, but the data in a still ends up stored in save.bin, right?

Question

  • Then what is the point of serialization? Do I need it here? If so, how do I use it?

  • I always think that any files, .txt/.csv/.exe/..., are bits (0s and 1s) in memory, which means they naturally have a binary representation, so can't we just send these files through the socket directly?

Sample code is welcome.

+7
6 answers

but the data is always stored in save.bin, right?

No! Your struct contains a std::string. Its exact implementation (and therefore the binary data you get from the cast to char*) is not defined by the standard, but the actual string data typically lives somewhere outside the object's own bytes, in heap-allocated memory, so writing the object this way does not save that data. With serialization done correctly, the string data is written right where the rest of the class's data ends, so that you can read it back from the file. That is what you need serialization for.

How to do it: you need to somehow encode the string, the easiest way is to write its length first, and then the string itself. When reading a file, first read the length, then read this number of bytes into a new string object.
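The length-prefix approach described above can be sketched roughly as follows. This is a minimal illustration, not the answerer's code: the helper names write_string, read_string, save and load are made up for the example, age is narrowed to a fixed 32-bit type, and the integers are written in the host's byte order, so the file is only readable on a machine with the same endianness.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>   // handy for in-memory round trips
#include <string>

struct info {
    std::string name;
    std::int32_t age;   // fixed width, so the on-disk size does not depend on int
};

// Write a 32-bit length, then the characters themselves (hypothetical helper).
void write_string(std::ostream& out, const std::string& s) {
    std::uint32_t len = static_cast<std::uint32_t>(s.size());
    out.write(reinterpret_cast<const char*>(&len), sizeof len);
    out.write(s.data(), len);
}

// Read the length first, then that many bytes into a new string.
std::string read_string(std::istream& in) {
    std::uint32_t len = 0;
    in.read(reinterpret_cast<char*>(&len), sizeof len);
    std::string s(len, '\0');
    in.read(&s[0], len);
    return s;
}

// Serialize member by member instead of dumping the whole object.
void save(std::ostream& out, const info& v) {
    write_string(out, v.name);
    out.write(reinterpret_cast<const char*>(&v.age), sizeof v.age);
}

info load(std::istream& in) {
    info v;
    v.name = read_string(in);
    in.read(reinterpret_cast<char*>(&v.age), sizeof v.age);
    return v;
}
```

The same two functions work with a std::ofstream or std::ifstream opened in binary mode; a production version would also check the stream state after each read.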

I always think that any files, .txt / .csv / .exe / ..., are bits (0s and 1s) in memory

Yes, but the problem is that it is not universally defined which bit represents which part of the data structure. In particular, there are little-endian and big-endian architectures; they store the bytes of a multi-byte value in a different order. If you naively read a file that was written on an architecture with the other byte order, you will obviously get garbage.

+6

Simply writing out binary images of memory is a form of serialization, and for trivial cases it works. In general, however, you need to solve a few more problems that a memory dump does not address:

1. Pointers

If the data contains any pointer, you certainly cannot simply dump it and load it back later, because the memory addresses held by the pointers become meaningless as soon as the program terminates and restarts. Many objects have “hidden” pointers: for example, there is no way to dump a std::vector from memory and reload it later correctly. sizeof on a std::vector does not include the size of the contained elements, and therefore any structure containing a std::vector cannot simply be dumped and reloaded. The same goes for std::string and all other std containers.
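A small sketch makes the sizeof point concrete. The function names handle_size and payload_size are invented for this illustration; the exact value of sizeof for a vector depends on the standard library implementation, so no specific number is claimed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// sizeof is a compile-time constant: it covers only the vector's own
// bookkeeping (typically a few pointers), never the heap-allocated elements.
std::size_t handle_size(const std::vector<int>& v) {
    return sizeof v;   // sizeof applied to a reference yields the size of the referred type
}

// The bytes that actually hold the data live elsewhere and scale with size().
std::size_t payload_size(const std::vector<int>& v) {
    return v.size() * sizeof(int);
}
```

For a vector with one element and a vector with a million elements, handle_size returns the same value while payload_size differs by a factor of a million, which is why writing sizeof(v) bytes starting at &v saves pointers rather than data.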

2. Portability

The layout of structs and classes in C and C++ is not defined in terms of the bytes they occupy in memory, and it is not portable. This means that another compiler, another version of the same compiler, or even the same version with different compilation options, can generate code in which the layout of a structure in memory is not the same.
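As a rough illustration of why the in-memory layout is not a file format, the sketch below measures compiler-inserted padding. The struct record and both helper functions are hypothetical names for this example; the results vary by compiler, target and packing options, which is exactly the point.

```cpp
#include <cassert>
#include <cstddef>

struct record {
    char tag;
    int  value;
};

// Bytes inside the struct that belong to no member (alignment padding).
constexpr std::size_t padding_bytes() {
    return sizeof(record) - (sizeof(char) + sizeof(int));
}

// Where value actually starts inside the struct; frequently not 1,
// even though tag occupies only a single byte.
constexpr std::size_t value_offset() {
    return offsetof(record, value);
}
```

A raw dump of record writes those padding bytes to disk as well, and a reader built with different alignment rules would look for value at a different offset.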

If you need serialization just to save and reload data within the same program, and the data does not have to live for long, then you can use a memory dump. But think about having millions of documents saved only by dumping structures, and now a new version of the compiler (which you are forced to use because it is the only one supported on the new OS version) uses a different layout, and you can no longer load those documents.

In addition to portability problems on the same system, note that even a single integer can have a different representation in memory on different systems. It can be larger or smaller; it can have a different byte order. Just using a memory dump means that what is saved cannot be loaded by a different system. Not even a single integer.

3. Versioning

If the stored data will have a long lifetime, then it is likely that you will change the structure as the program evolves: for example, you will add new fields, delete unused fields, or change the overall structure (for example, changing a vector into a linked list).

If your format is just a memory image of the current data structures, it will be quite difficult to add, for example, a color field to the polygon object in such a way that the program can still load old documents, assuming as the default value the color that was used in the previous version.

Even writing a conversion program will be difficult, because you will have old code that can load old documents and new code that can save new documents, but you can’t just “merge” these two and get a program that loads old documents and saves new ones (i.e., the source code of both programs will have a polygon structure, but with different fields; now what?).
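One common way around the versioning problem is to store an explicit version number in the data and let the loader fill in defaults for fields that older versions lacked. The sketch below assumes a made-up text format ("version sides [color]"), a hypothetical kDefaultColor, and invented function names; it is only meant to show the shape of the idea.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>

struct polygon {
    std::uint32_t sides;
    std::uint32_t color;   // field added in format version 2
};

// Assumption for the example: the color version 1 always drew with.
const std::uint32_t kDefaultColor = 0;

// Always write the newest format: "<version> <sides> <color>".
std::string save_polygon(const polygon& p) {
    std::ostringstream out;
    out << 2 << ' ' << p.sides << ' ' << p.color;
    return out.str();
}

// Accept both version 1 ("1 <sides>") and version 2 documents.
polygon load_polygon(const std::string& data) {
    std::istringstream in(data);
    std::uint32_t version = 0;
    polygon p{0, kDefaultColor};
    in >> version >> p.sides;
    if (version >= 2) in >> p.color;   // version 1 keeps the default color
    if (!in || version == 0 || version > 2)
        throw std::runtime_error("unsupported or corrupt polygon data");
    return p;
}
```

Because the format is defined independently of the in-memory struct, one program can load old documents and save new ones without needing two incompatible polygon definitions.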

+5

Your string will not be saved correctly. If you have different machines, their representations of integers may differ, and different programming languages will not have the same representation for strings, for example.

And when you have pointer members, you will save the address held by the pointer, not the object it points to, which means that you cannot get this data back from the file. What if your structure needs to change? All software using your data would have to change.

Yes, you can send files via a socket, but you will need some kind of protocol to make sure you know the file name and when you have reached the end of the file.
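A very small framing protocol of the kind mentioned above might look like the sketch below: each field is preceded by its length as a 4-byte big-endian integer, so the receiver knows the file name and exactly where the file content ends. The struct file_message and all function names are assumptions made for this example, and the actual socket calls are left out; the frame is just built and parsed in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

struct file_message {
    std::string name;
    std::string content;
};

// Append a 32-bit value in big-endian order, independent of the host.
void put_u32(std::string& out, std::uint32_t v) {
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(static_cast<char>((v >> shift) & 0xFF));
}

std::uint32_t get_u32(const std::string& in, std::size_t& pos) {
    if (in.size() - pos < 4) throw std::runtime_error("truncated frame");
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i)
        v = (v << 8) | static_cast<unsigned char>(in[pos++]);
    return v;
}

// A field is its length followed by its bytes.
void put_field(std::string& out, const std::string& field) {
    put_u32(out, static_cast<std::uint32_t>(field.size()));
    out += field;
}

std::string get_field(const std::string& in, std::size_t& pos) {
    std::uint32_t len = get_u32(in, pos);
    if (in.size() - pos < len) throw std::runtime_error("truncated frame");
    std::string field = in.substr(pos, len);
    pos += len;
    return field;
}

std::string encode(const file_message& m) {
    std::string frame;
    put_field(frame, m.name);
    put_field(frame, m.content);
    return frame;
}

file_message decode(const std::string& frame) {
    std::size_t pos = 0;
    file_message m;
    m.name = get_field(frame, pos);
    m.content = get_field(frame, pos);
    return m;
}
```

The sender would write the result of encode to the socket; the receiver reads until it has a complete frame and then calls decode. Length prefixes also keep binary content with embedded zero bytes intact.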

+3

Serialization does a lot of things. It supports persistence (the ability to leave the program, then return to it and get the same data) and communication between processes and machines. It basically means converting your internal data into a sequence of bytes, and to be useful you must also support deserialization: converting a sequence of bytes back into data.

When you do this, it is important to realize that inside a program, data is not just a sequence of bytes. It has a format and a structure: how a double is represented differs from one machine to the next, for example, and more complex objects like std::string are not even in contiguous memory. So the first thing you need to do when you serialize is define how each type is represented as a sequence of bytes. If you are communicating with another program, both programs must agree on this serial format; if it is only so that you can re-read the data yourself, you can use any format you want (but I would recommend using a predefined standard format like XDR, if only to simplify the documentation).

What you cannot do is simply dump the image of the object in memory. Complex objects like std::string will have pointers in them, and these pointers will be meaningless in another process. And even the representation of simple types like double can change over time. (The transition from 32 bits to 64 bits changed the size of long on most systems.) You must define the format, and then generate it byte by byte from the data that you have. For example, to write XDR, you can use something like this:

    #include <algorithm>
    #include <cassert>
    #include <iterator>
    #include <string>
    #include <vector>

    typedef std::vector<char> Buffer;

    // XDR stores a 32-bit integer as four bytes, most significant first.
    void writeUInt( Buffer& dest, unsigned value )
    {
        dest.push_back( (value >> 24) & 0xFF );
        dest.push_back( (value >> 16) & 0xFF );
        dest.push_back( (value >>  8) & 0xFF );
        dest.push_back( (value      ) & 0xFF );
    }

    void writeInt( Buffer& dest, int value )
    {
        writeUInt( dest, static_cast<unsigned>( value ) );
    }

    // A string is its length, its bytes, then zero padding to a multiple of 4.
    void writeString( Buffer& dest, std::string const& value )
    {
        assert( value.size() <= 0xFFFFFFFF );
        writeUInt( dest, static_cast<unsigned>( value.size() ) );
        std::copy( value.begin(), value.end(), std::back_inserter( dest ) );
        while ( dest.size() % 4 != 0 ) {
            dest.push_back( '\0' );
        }
    }
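The answer only shows the writing side. A matching reader could look roughly like the sketch below; readUInt and readString are names chosen here to mirror the writer, the pos cursor parameter is an assumption of this example, and like the writer it assumes unsigned is at least 32 bits wide.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

typedef std::vector<char> Buffer;

// Rebuild a 32-bit value from four big-endian bytes, advancing pos.
unsigned readUInt(Buffer const& src, std::size_t& pos) {
    if (src.size() - pos < 4) throw std::runtime_error("truncated input");
    unsigned value = 0;
    for (int i = 0; i < 4; ++i)
        value = (value << 8) | static_cast<unsigned char>(src[pos++]);
    return value;
}

// Read the length, then that many bytes, then skip the zero padding
// that rounds the string up to a multiple of 4 bytes.
std::string readString(Buffer const& src, std::size_t& pos) {
    unsigned length = readUInt(src, pos);
    if (src.size() - pos < length) throw std::runtime_error("truncated input");
    std::string value(src.data() + pos, length);
    pos += length;
    while (pos % 4 != 0 && pos < src.size()) ++pos;
    return value;
}
```

Because every value is reassembled from individual bytes with shifts, the reader gives the same result on little-endian and big-endian hosts.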
+3

You are playing a game. On a very hard difficulty. You reach the last level. You are happy. 2 days of non-stop play are paying off. The plot is coming to an end. You are about to find out the evil mastermind's motivation, how you are to become a hero, and to collect the sought-after epic artifact that waits behind this last door. And you got here without reloading once.

Behind the scenes is a game object that looks like this:

    class GameState {
        int level;
    };

And level is 25.

You have really liked the game so far, but you do not want to start all over again if the last boss kills you. So, intuitively, you press Ctrl+S. But wait, you get an error message:

 Sorry, saving is disabled. 

What? So I have to start all over again if I die? How can that be?

Drumroll

The developers, although brilliant (they managed to keep you hooked for two days in a row, right?), did not implement serialization.

When the game restarts, the memory is cleared. That important GameState object, whose level member you spent 2 days raising to 25, is destroyed.

How could you fix this? The memory is reclaimed by the OS when the game is closed. Where can you store the state? On an external server? (sockets) On disk? (write to a file)

Well, why not.

    class GameState {
    public:
        int level;
        void save(const std::string& fileName) { /* write level to file */ }
        void load(const std::string& fileName) { /* read game state from file */ }
    };

When you press Ctrl+S, the GameState object is saved to a file.

And, miraculously, when you load the game, the GameState object is read back from this file. You no longer have to spend 2 days to get back to this last boss. You are already here.
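The two placeholder bodies above could be filled in along these lines. This is only one possible sketch: the text format (a single decimal number), the extra stream-based overloads and the bool return values are assumptions added for the example, not part of the answer.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <ostream>
#include <sstream>   // lets the stream overloads be tried without touching the disk
#include <string>

class GameState {
public:
    int level = 1;

    // Core logic works on any stream: a file, a socket wrapper, a string.
    void save(std::ostream& out) const { out << level << '\n'; }

    bool load(std::istream& in) {
        int stored = 0;
        if (!(in >> stored)) return false;   // keep the old state on failure
        level = stored;
        return true;
    }

    // Convenience wrappers matching the file-name signatures above.
    bool save(const std::string& fileName) const {
        std::ofstream out(fileName);
        save(out);
        out.flush();
        return static_cast<bool>(out);
    }

    bool load(const std::string& fileName) {
        std::ifstream in(fileName);
        return load(in);
    }
};
```

Writing the level as text sidesteps integer size and byte order questions entirely, at the cost of a slightly larger file; that trade-off is usually fine for a save game.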

The real answer is:

Technically, writing serialization functions is quite difficult. I suggest you use a third-party library. Google's Protocol Buffers offer serialization that is cross-platform and even cross-language. Many others exist.

1. What is the point of serialization? Do I need it here? If so, how do I use it?

As explained above, it preserves state between runs or between processes (possibly on different machines). Whether you need it or not depends on whether you need to save the state and reload it later.

2. I always think that any files, .txt / .csv / .exe / ..., are bits (0s and 1s) in memory, which means they naturally have a binary representation, so can't we just send these files through a socket directly?

They are. But you do not want to modify .exe whenever you play a new game.

+2

Besides big-endian versus little-endian, there is the problem of how the data is packed for a given structure by a given program built with a given compiler. If you save the entire structure, you cannot use any pointers; you will have to replace them with a character buffer large enough for your needs. If the other machine has the same architecture, then with #pragma pack(1) there will be no gaps between the fields of your structure, and you can be sure that the data will be laid out as if it had been serialized, except without a size prefix for your string. You can skip the #pragma pack(1) if you are sure that the other program that will read the data uses exactly the same settings for exactly the same structure. Otherwise, the data will not line up.

If you serialize to memory first, you can speed up the serialization process. This can usually be accomplished with a buffer class and one template function for most types.

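    // Member of the buffer class; buf is a char* cursor into its storage.
    template<typename T>
    buffer& operator<<(T data)
    {
        std::memcpy(buf, &data, sizeof(T));  // memcpy (from <cstring>) avoids the unaligned store of *(T*)buf = data
        buf += sizeof(T);
        return *this;
    }

Obviously, you'll need specific overloads for strings and for larger data types. You can use memcpy for large structures and pass pointers to the data. For strings, you will need a length prefix, as mentioned earlier.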
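A fuller version of such a buffer class, including the string overload with a length prefix, might look like the sketch below. It departs from the fragment above in that it grows a std::vector<char> instead of advancing a raw pointer, and the member names bytes_ and bytes() are invented here; values are still stored in host byte order with host sizes, so the portability caveats discussed earlier apply.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <type_traits>
#include <vector>

class buffer {
public:
    // Generic case: copy the raw bytes of any trivially copyable value.
    template <typename T>
    buffer& operator<<(const T& data) {
        static_assert(std::is_trivially_copyable<T>::value,
                      "raw byte copy is only valid for trivially copyable types");
        const char* p = reinterpret_cast<const char*>(&data);
        bytes_.insert(bytes_.end(), p, p + sizeof(T));
        return *this;
    }

    // Strings get a 32-bit length prefix followed by their characters.
    // Note: a string literal passed directly would match the template
    // as a char array instead, so wrap literals in std::string.
    buffer& operator<<(const std::string& s) {
        *this << static_cast<std::uint32_t>(s.size());
        bytes_.insert(bytes_.end(), s.begin(), s.end());
        return *this;
    }

    const std::vector<char>& bytes() const { return bytes_; }

private:
    std::vector<char> bytes_;
};
```

The static_assert turns an accidental attempt to stream a std::vector or another pointer-holding type into a compile error rather than a silently broken file.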

For serious serialization needs, however, much remains to be considered.

+1
