rgerganov
Collaborator

Start a dedicated backend thread in the rpc-server and use a message-passing interface for submitting work to it. This will enable asynchronous backend operations and cross-server communication.

  • Self Reported Review Complexity:
    • Review Complexity : Low
    • Review Complexity : Medium
    • Review Complexity : High
  • I have read the contributing guidelines
@mofosyne mofosyne added the Review Complexity : Low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix label Jun 13, 2024
@slaren
Member

slaren commented Jun 16, 2024

I may be wrong, but I suspect that the async queue will need to be implemented in the client side instead.

@rgerganov
Collaborator Author

If we want to copy tensors across RPC servers then we need to handle at least two connections on the server side -- one from the scheduler and one from another RPC server. I considered the following options for implementing this:

  1. Using a single thread and async IO. I think this would be hard to implement in a cross-platform way without using 3rd party libraries.
  2. Using multiple threads and blocking IO. My assumption is that backend implementations are not guaranteed to be thread-safe, so we would need to add synchronization when accessing the backend from multiple threads.
  3. Using a single thread for all backend ops and submitting work to it via thread-safe message queue. No synchronization needed as backend is confined to a single thread.

I think option 3 brings less complexity than option 2, so I opted for it, but I am open to discussion.
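
To make option 3 concrete, here is a minimal sketch of the pattern, not the code in this PR. The name `backend_worker` and the `int`-returning task type are hypothetical and chosen only for illustration. Connection threads call `submit()`, and only the worker thread ever runs a task, so the backend itself is confined to one thread.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical illustration: one worker thread owns all backend access and
// executes tasks that other threads submit through a mutex-protected queue.
class backend_worker {
public:
    backend_worker() : thread_([this] { run(); }) {}

    ~backend_worker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_one();
        thread_.join(); // the worker drains remaining tasks before exiting
    }

    // Callable from any thread; the future is fulfilled by the worker thread.
    std::future<int> submit(std::function<int()> task) {
        std::packaged_task<int()> pt(std::move(task));
        std::future<int> result = pt.get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!stop_); // submitting after shutdown is a caller bug
            queue_.push(std::move(pt));
        }
        cv_.notify_one();
        return result;
    }

private:
    void run() {
        for (;;) {
            std::packaged_task<int()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !queue_.empty(); });
                if (queue_.empty()) {
                    return; // stop requested and nothing left to run
                }
                task = std::move(queue_.front());
                queue_.pop();
            }
            task(); // executed outside the lock, on the worker thread only
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::packaged_task<int()>> queue_;
    bool stop_ = false;
    std::thread thread_; // declared last so it starts after the other members
};
```

In this sketch a connection handler would call `submit(...)` and block on the returned future when it needs the result. Tasks run in submission order because there is a single consumer.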

> I may be wrong, but I suspect that the async queue will need to be implemented in the client side instead.

Could you please elaborate?

@slaren
Member

slaren commented Jun 17, 2024

I wouldn't say that the message queue doesn't require synchronization; it still locks a mutex for every message. Whether that's more efficient than the other methods, I don't know, but it is probably not going to be the bottleneck regardless. Another option could be using select/poll, which is still a single thread with blocking I/O.
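
For reference, the select/poll variant could look roughly like the sketch below. It assumes a POSIX system (Windows would need a different call), and `serve_one` is a hypothetical helper rather than anything in the rpc-server. One thread blocks in `poll()` on all connections and services whichever becomes readable first.

```cpp
#include <cassert>
#include <cstddef>
#include <poll.h>
#include <unistd.h>
#include <vector>

// Hypothetical helper: block until one descriptor is readable, then read a
// single byte from it. Returns the index of the descriptor that was served,
// or -1 on timeout, error, or end of stream.
int serve_one(const std::vector<int> & fds, int timeout_ms, char & out) {
    assert(!fds.empty());

    std::vector<pollfd> pfds(fds.size()); // value-initialized to zero
    for (std::size_t i = 0; i < fds.size(); ++i) {
        pfds[i].fd     = fds[i];
        pfds[i].events = POLLIN;
    }

    // A single blocking call covers every connection.
    if (poll(pfds.data(), pfds.size(), timeout_ms) <= 0) {
        return -1;
    }

    for (std::size_t i = 0; i < pfds.size(); ++i) {
        if (pfds[i].revents & POLLIN) {
            return read(fds[i], &out, 1) == 1 ? (int) i : -1;
        }
    }
    return -1;
}
```

A real server would read whole messages instead of one byte, but the control flow stays the same: no second thread and no mutex, since the backend is only touched between `poll()` calls.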

To implement the async interface of ggml-backend, my intuition is that it would be simpler to implement the queue on the client side, but I am not completely sure of that. I think it should be possible to create a generic adapter that sits on top of another backend and implements the asynchronous operations by running an asynchronous queue in a different thread. For APIs that support multi-device synchronization natively such as CUDA, it is still going to be more efficient to use the native implementation, but for other backends it should be possible to provide a generic implementation.
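
A rough sketch of what such a generic adapter might look like is below. Everything in it is hypothetical: `sync_backend` stands in for any blocking backend, and `async_adapter`, `enqueue`, and `synchronize` are illustrative names, not the ggml-backend interface.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical synchronous backend used only for this illustration.
struct sync_backend {
    int total = 0;
    void add(int x) { total += x; } // stands in for a blocking backend op
};

// Hypothetical adapter: queues operations and runs them on its own thread,
// giving a synchronous backend an async-style interface.
template <typename Backend>
class async_adapter {
public:
    explicit async_adapter(Backend & backend)
        : backend_(backend), thread_([this] { run(); }) {}

    ~async_adapter() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_all();
        thread_.join();
    }

    // Returns immediately; the op runs later on the adapter's thread.
    void enqueue(std::function<void(Backend &)> op) {
        assert(op != nullptr);
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ops_.push(std::move(op));
        }
        cv_.notify_all();
    }

    // Blocks until every op enqueued so far has finished.
    void synchronize() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return ops_.empty() && !busy_; });
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            cv_.wait(lock, [this] { return stop_ || !ops_.empty(); });
            if (ops_.empty()) {
                return; // stop requested and queue drained
            }
            auto op = std::move(ops_.front());
            ops_.pop();
            busy_ = true;
            lock.unlock();
            op(backend_); // the wrapped backend is only touched here
            lock.lock();
            busy_ = false;
            cv_.notify_all(); // wake any synchronize() waiters
        }
    }

    Backend & backend_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void(Backend &)>> ops_;
    bool busy_ = false;
    bool stop_ = false;
    std::thread thread_; // declared last so it starts after the other members
};
```

The caller enqueues work without waiting and calls `synchronize()` only when it needs the results, which is the part a native multi-device API such as CUDA would handle more efficiently on its own.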

@rgerganov
Collaborator Author

PR #8032 is based on this work, trying to make copying tensors across servers more efficient. However, I am observing performance degradation with TinyLlama and 2 CUDA servers running on localhost.

@slaren maybe we should close this PR and continue the discussion on PR #8032?
