Benefits of the Erlang scheduler

Elixir has a distinct advantage over other actor systems when it comes to writing actor code. This advantage comes from running on top of Erlang’s BEAM VM. What’s more, it cannot be easily replicated when concurrency is based on system threads.

The situation in Ruby #

Let’s consider these two actors:

# I'm using concurrent-ruby running on JRuby for these Ruby samples.
# Actors live in the concurrent-ruby-edge gem:
require 'concurrent-edge'

# record_execution is a method that just atomically increments a named counter

class ActorA < Concurrent::Actor::RestartingContext
  def on_message(msg)
    record_execution(:actor_a)
    sleep(2) # slow operation
  end
end

class ActorB < Concurrent::Actor::RestartingContext
  def on_message(msg)
    record_execution(:actor_b)
    # do nothing, i.e. a fast operation
  end
end

Now let’s say we start these 2 actors and send 100 messages to each. After 10 seconds our execution counts will probably look something like:

{:actor_a=>6, :actor_b=>100}

Ok, now let’s start 20 of each and again send 100 messages to each of those. We should get 20x the number of executions, right?

{:actor_a=>30, :actor_b=>214}

Well, it seems we managed to exhaust the threads in our pool, as the speedup is lower than 20x. This is of course because I set the thread pool size to 5, but that’s beside the point: a limit will always exist, whether you set one or not. The interesting thing is that the ActorAs and ActorBs are not affected by exhausting this pool in the same way. In fact, it turns out that the faster operation is affected more. Why is that?

This all comes down to how message execution works in concurrent-ruby, or rather how many actors get multiplexed onto fewer threads. An actor is given a thread from the pool for the time it needs to handle one message: it picks a message from its queue, processes it, and surrenders the thread. It is then added back to the set of actors awaiting a thread. Now let’s consider what happens when there are 5 threads and 20 each of the slow and fast actors. The actors are given threads, let’s say randomly. The fast ones return almost immediately, and their threads become available for pickup again. The slow ones hog their threads for 2 seconds each. This goes on until slow actors hold all 5 threads, and then we have to wait up to 2 seconds before any actor can handle another message.

When we are in this state, the fast actors are waiting too: the slow ones need to complete their operations and return their threads to the pool before any progress can occur. This is where the principle of not putting slow operations in your actors comes from. We can partly resolve the problem by setting up 2 thread pools, one per type of actor. For example, when you give the slow actors 3 threads and the fast ones 2, you get execution counts like the following:

{:actor_a=>18, :actor_b=>2000}

The slow tasks are affected even more, but the fast tasks execute at full speed.

How’s the BEAM different? #

In Erlang’s (and so Elixir’s) virtual machine, all process scheduling is done by the VM itself. Erlang processes are a close analog of actors in other systems, but instead of being added by a library they are built into the core of the language. The main benefit is that the VM retains full control over what executes when. Under the covers it of course maps these processes onto some constantly-running system threads, but thanks to the extra level of abstraction the choice of which process to run next can be much more flexible.
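You can see this mapping from a running system; by default the BEAM starts one scheduler thread per CPU core (the count below is just an illustration):

# Each scheduler is an OS thread onto which the VM multiplexes its processes.
System.schedulers_online()
#=> 8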

Another key enabler is that the BEAM knows which operations are “slow”. The most notable one is waiting for a message to arrive. This is unlocked by having almost everything in the ecosystem conform to process semantics: communication with other processes always happens by sending messages, so when you sleep or perform IO you are actually sending a message and simply blocking until a reply arrives.
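To make that concrete: a sleep is nothing more than a receive that matches no messages and times out. Here’s a minimal sketch of the idea (not the actual :timer source):

defmodule SleepSketch do
  # A receive with only an `after` clause parks this process for `ms`
  # milliseconds; the scheduler is free to run other processes meanwhile.
  def sleep(ms) do
    receive do
    after
      ms -> :ok
    end
  end
end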

Let’s take a look at a comparable setup in Elixir:

defmodule ServerA do
  use GenServer

  # ...

  def handle_cast(:msg, state) do
    GenServer.cast(Counter, {:inc, :server_a})
    :timer.sleep(:timer.seconds(2))

    {:noreply, state}
  end
end

defmodule ServerB do
  use GenServer

  # ...

  def handle_cast(:msg, state) do
    GenServer.cast(Counter, {:inc, :server_b})
    {:noreply, state}
  end
end
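For reference, the driver producing the counts below could look something like this. It’s a sketch that assumes Counter is an already-running process tallying {:inc, key} casts:

defmodule Driver do
  def run do
    # Start 20 ServerA and 20 ServerB processes.
    servers =
      for mod <- [ServerA, ServerB], _ <- 1..20 do
        {:ok, pid} = GenServer.start_link(mod, nil)
        pid
      end

    # Cast 100 messages to each of them.
    for pid <- servers, _ <- 1..100 do
      GenServer.cast(pid, :msg)
    end

    # Wait out the 10-second measurement window used in the Ruby samples.
    Process.sleep(:timer.seconds(10))
  end
end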

Now when we start 20 of each and send 100 messages to each of those, we get the following execution counts:

%{server_a: 100, server_b: 2000}

This is of course all thanks to the scheduling mechanism described above. This benefit is not limited to obviously-slow things like :timer.sleep/1 or even performing IO. Even if you block on something like:

SlowServer.schedule_long_computation_and_wait_for_result

and it’s all happening in your custom process with a custom SlowServer, you’re still covered. All this communication uses message sends, and waiting for a result is just waiting for a message. So the VM can detect that the current process cannot make progress right now and switch to a different one.
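Conceptually, such a synchronous call boils down to a monitored send followed by a receive. Here’s a simplified sketch of the pattern GenServer.call is built on; the real OTP implementation handles more edge cases:

# Simplified sketch of a synchronous call: tag the request with a monitor
# ref, send it, then block in receive until the reply (or a :DOWN) arrives.
# While this process sits in receive, the scheduler runs other processes.
defmodule CallSketch do
  def call(server, request, timeout \\ 5000) do
    ref = Process.monitor(server)
    send(server, {:call, {self(), ref}, request})

    receive do
      {^ref, reply} ->
        Process.demonitor(ref, [:flush])
        reply

      {:DOWN, ^ref, :process, _pid, reason} ->
        exit(reason)
    after
      timeout ->
        Process.demonitor(ref, [:flush])
        exit(:timeout)
    end
  end
end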

The ultimate benefit comes in the form of more straightforward code. You no longer need to fear innocuous-looking calls like DB.fetch_me_a_record: the VM will yield execution to another process for you. And when you want to fetch something from the database and do some processing, I think it’s cleaner to do

record = DB.fetch_me_a_record
process(record)

than

future { DB.fetch_me_a_record }.
  then { |record| process(record) }
