How does Erlang hot-swap work in the middle of an action?

I am currently working on a live media server that will allow general consumers to send us live video. In our current environment, we have seen broadcasts sent to us that last for days, so the idea of fixing a bug (or adding a feature) without disconnecting users is extremely compelling.

However, as I was writing code, I realized that hot code swapping does not make sense unless I write every process so that all state is always handled inside a gen_server, and all external modules that the gen_server calls are as simple as possible.

Take the following example:

    -module(server_template).
    -behaviour(gen_server).

    -export([start/0]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
             terminate/2, code_change/3]).

    start() ->
        gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

    %% The server state is a pair of states owned by module1 and module2.
    init([]) ->
        {ok, {module1:new(), module2:new()}}.

    handle_call(_Message, _From, State) ->
        {reply, ok, State}.

    handle_cast(any_message, {State1, State2}) ->
        NewState1 = module1:do_something(State1),
        NewState2 = module2:do_something(State2),
        {noreply, {NewState1, NewState2}}.

    handle_info(_Message, State) ->
        {noreply, State}.

    terminate(_Reason, _State) ->
        ok.

    %% Let each dependent module migrate its own part of the state.
    code_change(_OldVersion, {State1, State2}, _Extra) ->
        NewState1 = module1:code_change(State1),
        NewState2 = module2:code_change(State2),
        {ok, {NewState1, NewState2}}.

From what I could find, when a new version of the code is loaded into a running system without using OTP, a process can switch to the current version of the code by calling into its module with an external (fully qualified) function call, so my_module:loop(State) rather than loop(State).
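As a minimal sketch of that mechanism, assuming a hypothetical module named counter_loop that is not part of the question, a hand-written process loop might look like this. The local call keeps running the version of the code the process is already in, while the ?MODULE:loop/1 call jumps to the newest loaded version.

```erlang
-module(counter_loop).
-export([start/0, loop/1]).

%% Hypothetical example module: a plain (non-OTP) process holding a counter.
start() ->
    spawn(?MODULE, loop, [0]).

loop(Count) ->
    receive
        increment ->
            %% Local call: stays in the version this process is running.
            loop(Count + 1);
        {get, From} ->
            From ! {count, Count},
            loop(Count);
        upgrade ->
            %% Fully qualified call: continues in the newest loaded
            %% version of this module, keeping the same state.
            ?MODULE:loop(Count);
        stop ->
            ok
    end.
```

Note that nothing here converts Count if the new version expects a different state format; that is exactly the gap code_change/3 fills in OTP behaviours.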

What I also see is that when a hot swap is performed, the code_change/3 function is called to upgrade the state, so I can use it to make sure that each of my dependent modules migrates the last state it gave me to the format the current version of the code expects. It can do this because the supervisor knows about the running process, which allows the process to be suspended so that the code change function can be called. All good.

However, if a call to an external module always runs the current version of that module, then things seem to break if a hot swap is performed in the middle of a function. For example, suppose my gen_server is in the middle of handling the any_message cast, say between running module1:do_something() and module2:do_something().

If I understand things correctly, module2:do_something() will now run the new current version of the do_something function, which could mean that I am passing non-upgraded data to the new version of module2:do_something(). This can easily cause problems if the data is a modified record, an array with an unexpected number of elements, or a map that is missing a value the code expects.

Am I misunderstanding how this situation works? If this is correct, it seems to mean that I must track some kind of version information for any data structure that can cross module boundaries, and every public function must check that version number and perform an on-demand migration if necessary.

This seems like a very tall order and insanely error prone, so I am wondering if I am missing something.

+7
erlang otp
2 answers

Yes, you are absolutely right. No one said hot code swapping is easy. I worked at a telecommunications company where all code upgrades were performed on a live system (so that users were not disconnected in the middle of their calls). Doing it right means carefully going through all the scenarios you mentioned and preparing code for each failure case, then testing, then fixing problems, testing again, and so on. To test it properly, you need a system running the old version under load (for example, in a test environment), then deploy the new code and check for failures.

For the specific example mentioned in your question, the easiest way to solve the problem is to write two versions of module2:do_something/1, one accepting the old state and one accepting the new state. The version for the old state then deals with it accordingly, for example by converting it to the new state.
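A minimal sketch of that idea follows. The state shapes are assumptions made up for illustration (the question never shows what module2 stores): the old version is taken to use {m2_state, Counter} and the new one {m2_state, Counter, History}.

```erlang
-module(module2).
-export([new/0, do_something/1]).

%% Hypothetical state shapes, assumed for this sketch only:
%%   old version: {m2_state, Counter}
%%   new version: {m2_state, Counter, History}

new() ->
    {m2_state, 0, []}.

%% Old-format state: convert it to the new format, then continue as usual.
do_something({m2_state, Counter}) ->
    do_something({m2_state, Counter, []});
%% New-format state: increment the counter and remember the previous value.
do_something({m2_state, Counter, History}) ->
    {m2_state, Counter + 1, [Counter | History]}.
```

With both clauses in place, a caller that still holds an old-format state gets back a new-format one, so the migration happens on the first call after the upgrade.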

To do this, you also need to make sure that the new version of module2 is deployed before any module can call it with the new state:

  • If the application containing module2 is a dependency of the other application, release_handler will update that module first.

  • Otherwise, you may need to split the deployment into two parts, first updating the common functions so that they can handle the new state, and then deploying new versions of the gen_servers and other modules that call module2.

  • If you are not using a release handler, you can manually specify in which order the modules are loaded.

This is also the reason why Erlang recommends avoiding circular dependencies in function calls between modules, for example when modA calls a function in modB that calls another function in modA.

For upgrades performed with the release handler, you can check the order in which release_handler will upgrade the modules on the old system in the relup file, which is generated by systools:make_relup from the old and new release versions. This is a text file containing all the upgrade instructions, for example: remove (to remove modules), load_object_code (to read a new module's object code from disk), load, purge, etc.
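For orientation, a relup for an upgrade from release 1.0 to 2.0 might look roughly like the fragment below. This is a hand-written sketch: the application name my_app, the version strings, and the exact instruction order are assumptions, and a generated relup for a real system will differ.

```erlang
%% Hypothetical relup: {NewVsn, [{UpFromVsn, Descr, Instructions}], DownList}
{"2.0",
 [{"1.0", [],
   [%% Read the new object code from disk without loading it yet.
    {load_object_code, {my_app, "2.0", [module2, server_template]}},
    point_of_no_return,
    %% Suspend processes that run server_template.
    {suspend, [server_template]},
    %% module2 is loaded before the module that calls it.
    {load, {module2, brutal_purge, brutal_purge}},
    {load, {server_template, brutal_purge, brutal_purge}},
    %% Ask the suspended processes to run code_change/3.
    {code_change, up, [{server_template, []}]},
    {resume, [server_template]}]}],
 %% Downgrade instructions are left out of this sketch.
 []}.
```

Reading the load instructions from top to bottom shows the order in which modules will be replaced on the running system.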

Please note that there is no strict requirement that all applications follow the OTP principles for hot code swapping to work; however, using gen_server and a proper supervisor makes the task easier for both the developer and the release handler.

If you are not using an OTP release, you cannot upgrade with the release handler, but you can still force modules to be reloaded on your system and so upgrade them to the new version. This works fine as long as you do not need to add or remove Erlang applications, because that requires changing the release definition, which cannot be done on a live system without release handler support.

+7

The release handler calls sys:suspend, which sends the gen_server a message. The server keeps handling requests until it reaches the suspend message, so it is never suspended halfway through a callback, and at that point it just sits and waits. Then the new version of the module is loaded into the system and sys:change_code is called, which tells the server to run its code_change callback to upgrade its state, after which the server sits and waits again. When the release handler calls sys:resume, it sends a message to the server telling it to get back to work and start processing incoming messages again.

The release handler does this for all servers that depend on the module at the same time. So first all of them are suspended, then the new module is loaded, then all of them are told to upgrade themselves, and finally all of them are told to resume work.
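The same suspend, load, change code, resume sequence can be tried by hand on a single server. The sketch below assumes a hypothetical module suspend_demo whose state simply counts how many times code_change/3 has run; it is an illustration of the sequence, not what release_handler does internally.

```erlang
-module(suspend_demo).
-behaviour(gen_server).

-export([start_link/0, version/0, upgrade/0]).
-export([init/1, handle_call/3, handle_cast/2, code_change/3]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Returns how many times code_change/3 has run in this server.
version() ->
    gen_server:call(?MODULE, version).

%% Manual version of the upgrade sequence for one server.
upgrade() ->
    %% The server finishes any callback in progress, then waits.
    ok = sys:suspend(?MODULE),
    %% Remove any leftover old version; processes still running it are killed.
    _ = code:purge(?MODULE),
    %% Load the current .beam file from the code path.
    {module, ?MODULE} = code:load_file(?MODULE),
    %% Makes the suspended server call code_change/3 on its state.
    ok = sys:change_code(?MODULE, ?MODULE, undefined, []),
    %% The server goes back to normal message handling.
    ok = sys:resume(?MODULE).

init([]) ->
    {ok, 0}.

handle_call(version, _From, Upgrades) ->
    {reply, Upgrades, Upgrades}.

handle_cast(_Msg, Upgrades) ->
    {noreply, Upgrades}.

code_change(_OldVsn, Upgrades, _Extra) ->
    {ok, Upgrades + 1}.
```

Messages sent to the server between suspend and resume stay in its mailbox and are handled afterwards against the upgraded state.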

+1