The openvocs architecture is a traditional client-server architecture using two types of transmission protocols: one for signaling events, the other for media distribution. These two protocol types are reflected in the server architecture, which uses an HTTPS server and a Media Relay Server.
The network itself is undefined and may be either a routed network or a flat VLAN-based network. Both scenarios need to be supported.
The signaling and media protocols are the core definition of openvocs. These protocols transmit events and media between client and server.
To allow lightweight and simple client instances, the following prerequisites are set. Openvocs performs media mixing within the backend and provides the individually mixed media stream to each client over a single media connection. This allows common VoIP technologies to be reused for media distribution. In addition, it keeps the interface for client applications simple and thereby enables a wide range of potential client implementations.
Defining the protocols between Client and Server backend is enough to allow different vendors to implement different client as well as server instances. Our goal is a minimal definition of the environment to allow maximum flexibility in actual implementations.
Signaling messages are transmitted as JSON over websockets. Signaling is aligned to the events a user may initiate at the client. The supported events are defined within the openvocs API description. Media is transmitted via DTLS-SRTP, a secure standard that is used industry-wide for audio transmission.
The definition of the client-server interaction is the core of the openvocs protocol. It provides open connectivity for different kinds of client implementations and, as an open system definition, allows different kinds of system implementations. Nonetheless, this kind of definition is not enough to build a VoCS system. We therefore implemented a reference system that defines the system in considerably more detail.
Reference Architecture
To build the most lightweight client possible, openvocs uses a web client based on HTML5, CSS and JavaScript, with WebRTC-based media distribution.
Our reference implementation focuses on the most high-level implementation to build a VoCS system. The client is virtual: it is instantiated within the web browser. A client is always up to date, as it is basically a website which is loaded from an HTTPS server.
User interactions within the system, e.g. selecting participation states for Voiceloops or pressing PTT to transmit audio, are converted to events. These events are transmitted over the signaling channel to the server. The websocket protocol is used as the signaling channel, and each event is framed as a JSON structure.
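As an illustration of framing an event as JSON for the websocket channel, the following is a minimal sketch. The key names "event", "uuid" and "parameter" and the event name "switch_loop" are assumptions made for this example. The actual message layout is defined by the openvocs API description, which is not reproduced here.

```python
import json

# Hypothetical envelope keys; the real ones come from the openvocs API description.
REQUIRED_KEYS = ("event", "uuid", "parameter")


def encode_event(event, uuid, parameter):
    """Serialize one user-initiated event into a JSON text frame."""
    return json.dumps({"event": event, "uuid": uuid, "parameter": parameter})


def decode_event(frame):
    """Parse a received text frame and check the assumed envelope keys."""
    message = json.loads(frame)  # raises ValueError on malformed JSON
    if not isinstance(message, dict):
        raise ValueError("signaling frame must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in message]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return message


if __name__ == "__main__":
    # A user selects a Voiceloop for talk; the client would send this text
    # frame over its websocket connection.
    print(encode_event("switch_loop", "42", {"loop": "DEV1", "state": "talk"}))
```

The string returned by encode_event would be sent as a single websocket text frame, and the server side would call decode_event before dispatching on the event name.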
For media transmission, SRTP is selected. This is the secure version of the RTP protocol and is widely used within the telecommunications industry. On the client side, WebRTC is used to transmit the audio from the microphone to the backend over the SRTP channel.
The selected architecture is based on web technologies to support the high-level implementation of the reference architecture. Nonetheless, the protocol suite is quite simple and allows dedicated client implementations based on JSON over websockets as well as WebRTC-based communication channels.
Breaking down the architecture one step further, the backend needs to be defined in terms of functionality. The high-level description above outlines the client-server architecture that connects clients and backend, but to provide VoCS services a backend must also support the VoCS-specific building blocks.
The VoCS-specific building blocks are an authentication and authorization backend as well as a mixing backend.
For authentication and authorization, the openvocs reference implementation uses an in-memory database based on JSON values. This database supports multiple domains and is multitenant. It allows different projects within a domain and is therefore ready for multi-mission use.
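To make the idea of a JSON-based in-memory database with domains and projects more concrete, here is a minimal authorization sketch. The nesting (domains, projects, loops, users), the state names and the rule that a talk grant also covers monitoring are all assumptions for illustration. They are not taken from the reference implementation.

```python
import json

# Hypothetical in-memory database, held as plain JSON values.
DB = json.loads("""
{
  "domains": {
    "example.org": {
      "projects": {
        "mission1": {
          "loops": {
            "DEV1": {"users": {"alice": "talk"}},
            "A":    {"users": {"alice": "monitor"}}
          }
        }
      }
    }
  }
}
""")

# Assumption: a user granted "talk" on a loop may also monitor it.
RANK = {"none": 0, "monitor": 1, "talk": 2}


def granted_state(db, domain, project, loop, user):
    """Return the highest state granted to the user on the loop, or "none"."""
    try:
        loops = db["domains"][domain]["projects"][project]["loops"]
        return loops[loop]["users"][user]
    except KeyError:
        return "none"


def is_allowed(db, domain, project, loop, user, requested):
    """Check whether the requested loop state is covered by the grant."""
    if requested not in RANK:
        raise ValueError(f"unknown state: {requested}")
    granted = granted_state(db, domain, project, loop, user)
    return RANK[requested] <= RANK.get(granted, 0)
```

Because every lookup starts at the domain and then the project, the same user name or loop name can exist independently in several domains or projects, which is one simple way to model multitenancy.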
The mixing backend is built from microservice-based mixer instances. Each proxy connection uses a dedicated mixer to mix the audio streams a user has selected over its interface.
The microservice cloud uses multicast-based mixing of Voiceloops.
Media within the system is transmitted over multicast Voiceloops. Each Voiceloop uses a dedicated multicast IP, and all traffic for a specific Voiceloop is forwarded to that IP. Forwarding is implemented within the media proxy and is transparent to clients. Clients communicate only with the proxy, which forwards incoming and outgoing media on their behalf. When a Voiceloop is selected for talk, the media proxy forwards the client's audio to the multicast group of that Voiceloop.
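The mapping from Voiceloops to dedicated multicast addresses, and the forwarding step in the media proxy, can be sketched as follows. The class and function names, the port number and the addresses are hypothetical. The sketch only assumes a socket-like object with a sendto method and is not the reference implementation.

```python
import ipaddress


class LoopTable:
    """Maps each Voiceloop name to its own multicast (ip, port) pair."""

    def __init__(self):
        self._groups = {}

    def add(self, loop, ip, port):
        if not ipaddress.ip_address(ip).is_multicast:
            raise ValueError(f"{ip} is not a multicast address")
        if any(used_ip == ip for used_ip, _ in self._groups.values()):
            raise ValueError(f"{ip} is already used by another Voiceloop")
        self._groups[loop] = (ip, port)

    def group_for(self, loop):
        return self._groups[loop]


def forward_client_audio(sock, table, talk_loop, packet):
    """Forward one client media packet to the group of the talk loop.

    Returns False when no loop is selected for talk, in which case the
    packet is dropped.
    """
    if talk_loop is None:
        return False
    sock.sendto(packet, table.group_for(talk_loop))
    return True
```

With a real UDP socket created through socket.socket(socket.AF_INET, socket.SOCK_DGRAM), the same function would send the packet to the multicast group. The client never sees the group address, which matches the transparency described above.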
During login to the system, each client is associated with a dedicated mixing service. The mixing service is basically a multicast mixing node. All Voiceloops a user selects are mixed, and a single stream of audio is transmitted to the proxy, which forwards the audio back to the client.
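The text does not state how the mixer combines the selected Voiceloops. As one common approach, the sketch below sums time-aligned 16-bit PCM frames sample by sample and clips the result to the valid range. The function name and the frame representation (lists of integer samples) are assumptions for illustration.

```python
from typing import List, Sequence

INT16_MIN, INT16_MAX = -32768, 32767


def mix_frames(frames: Sequence[Sequence[int]]) -> List[int]:
    """Mix equally long PCM frames, one per monitored Voiceloop.

    Samples are added and then clipped so the result stays within the
    signed 16-bit range.
    """
    if not frames:
        return []
    length = len(frames[0])
    if any(len(frame) != length for frame in frames):
        raise ValueError("all frames must have the same number of samples")
    mixed = []
    for samples in zip(*frames):
        total = sum(samples)
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))
    return mixed
```

For example, mixing one frame from each monitored loop yields the single frame that would be sent on to the proxy, so the client receives only one stream regardless of how many loops it monitors.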
The image above shows the selection of Voiceloop DEV1, which is mapped to Multicast Group B. In addition, Loops A, C and D are selected for monitoring and mixed within the user’s mixer instance. The mixer instance forwards the stream to the proxy, which in turn forwards it to the client. Switching is implemented over an internal API, which is not shown here for simplicity. This mixing functionality implements the core of a VoCS system: multiparty multiconferencing. Using a media proxy to forward streams to and from the backend allows simple client implementations. A client connection is basically a (voice) call to the system, but instead of calling into a conference room, the call is connected to a custom multiconferencing backend.
Switching within the system is quite simple. The signaling proxy receives a command from the client, checks whether the user is allowed to perform the switch, and then switches either the media proxy or the media mixer, depending on the loop state. A monitor switch turns monitoring of a multicast group on or off, and with it the reception of that Voiceloop. Switching a Voiceloop to talk switches the media proxy's outgoing stream to the multicast group of the Voiceloop.
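The switching flow described above can be sketched as a single function in the signaling proxy. The command names ("monitor_on", "monitor_off", "talk"), the permission layout and the MediaProxy and Mixer classes are hypothetical stand-ins. The internal API of the reference implementation is not shown in this document.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class MediaProxy:
    # Multicast group that receives the client's outgoing audio, if any.
    talk_group: Optional[str] = None


@dataclass
class Mixer:
    # Multicast groups currently received and mixed for this client.
    monitored: Set[str] = field(default_factory=set)


def switch_loop(permissions: Dict[str, Set[str]], groups: Dict[str, str],
                proxy: MediaProxy, mixer: Mixer, loop: str, command: str) -> bool:
    """Apply one switch command; return False if it is rejected."""
    if command not in ("monitor_on", "monitor_off", "talk"):
        return False
    needed = "talk" if command == "talk" else "monitor"
    if loop not in groups or needed not in permissions.get(loop, set()):
        return False  # unknown loop or user not authorized
    group = groups[loop]
    if command == "talk":
        proxy.talk_group = group        # redirect the outgoing stream
    elif command == "monitor_on":
        mixer.monitored.add(group)      # start receiving the Voiceloop
    else:
        mixer.monitored.discard(group)  # stop receiving the Voiceloop
    return True
```

Whether a talk switch should also enable monitoring of the same loop is not stated in the text, so this sketch leaves the mixer untouched on talk.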
The reference architecture contains an HTTPS-capable server with a signaling proxy implementation enabled, combined with a media proxy server, a multicast-based backend network and a Mixer Cloud. The Mixer Cloud is a set of mixer implementations which register at the signaling proxy. Each mixer is able to serve one client, so the system scales with the number of mixer services. If a system needs to provide 100 positions in parallel, the cloud must be configured with 100 mixers. Signaling proxies are web servers with HTTPS and websocket support. Within the web server, a VoCS implementation instance is loaded, which provides all signaling event handling as well as user authentication and authorization capabilities.
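The one-mixer-per-client rule can be illustrated with a small pool that mixers register with and that hands out one mixer per login. The class and method names are hypothetical, and the real registration protocol between mixers and the signaling proxy is not described here.

```python
class MixerPool:
    """Tracks registered mixers and assigns exactly one to each client."""

    def __init__(self):
        self._free = []      # registered mixers not serving a client
        self._assigned = {}  # client id -> mixer id

    def register(self, mixer_id):
        """Called when a mixer service registers at the signaling proxy."""
        self._free.append(mixer_id)

    def acquire(self, client_id):
        """Assign a dedicated mixer at login; fail if the cloud is exhausted."""
        if client_id in self._assigned:
            return self._assigned[client_id]
        if not self._free:
            raise RuntimeError("no free mixer available")
        mixer_id = self._free.pop(0)
        self._assigned[client_id] = mixer_id
        return mixer_id

    def release(self, client_id):
        """Return the client's mixer to the pool at logout."""
        mixer_id = self._assigned.pop(client_id, None)
        if mixer_id is not None:
            self._free.append(mixer_id)

    def capacity(self):
        """Number of positions the system can serve in parallel."""
        return len(self._free) + len(self._assigned)
```

In this model a login beyond the registered capacity fails, which is why a system with 100 parallel positions needs 100 registered mixers.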
This setup is highly flexible and adaptable. For operational use cases we deploy two instances in parallel, and each client connects to both instances. Every service is thereby set up redundantly.
Our setup aims for maximum flexibility in providing VoCS services. It scales with the number of mixer instances used and can provide redundancy through parallel client connections.