VoRS uses the Opus codec with 20 ms frames of 48 kHz mono 16-bit signed little-endian audio. It uses libopus's native Packet Loss Concealment (PLC) feature when no more than 32 frames are lost. DTX (discontinuous transmission) is also enabled.

Each frame carries a single-byte stream identifier (the unique identifier of the participant), a 24-bit big-endian packet counter and a 24-bit big-endian audio frame counter. Reordered packets are dropped. At one frame per 20 ms, a 24-bit counter wraps only after about 93 hours, which is long enough for very long talk sessions. The audio frame counter is incremented every time 20 ms of data is read from the microphone. When a peer is muted, no packets are sent, but audio frames are still counted. This makes it possible to distinguish jitter and delay from the absence of audio transmission.

Each packet is encrypted with ChaCha20 and authenticated with SipHash24. Their keys are derived with BLAKE2b-XOF, which is fed with the binding value of the completed handshake. The keys are then shared with the other participants. The stream identifier together with the packet counter is used as a nonce.

The codec is tuned for a 24 Kbps bitrate. But remember that each packet carries an additional 63 bytes of overhead: an 8-byte MAC tag, a 7-byte VoRS header, an 8-byte UDP header and a 40-byte IPv6 header.

Each client handshakes with the server over a TCP connection using the Noise-NKhfs protocol pattern with curve25519, Kyber-1024, ChaCha20-Poly1305 and BLAKE2b algorithms.

=> Noise protocol framework
=> KEM-based hybrid forward secrecy

* Client sends "VoRS v4" to the socket. It is just a magic number.
* All subsequent messages are Netstring-encoded strings. Most of them contain a netstring-encoded sequence of netstrings if multiple values are expected: NS(NS(arg0) || NS(arg1) || ...)
=> Netstring
* Client sends the initial Noise handshake message with its username, room name and an optional BLAKE2b-256 hash of the room's password (or an empty string) as a payload: [USERNAME, ROOM, hash(PASSWD)].
* Server answers with the final Noise handshake message carrying either ["COOKIE", COOKIE] or an ["ERR", MSG] failure message.
  It may reject a client if there are too many peers, if its name is already taken, or if it provided an invalid room password.
* The 128-bit cookie is sent by the client over UDP to the server every second. If the UDP packets are lost, no connection is possible, and after a timeout the server drops the TCP connection. The cookie serves as:
  * confirmation of a successful handshake on the client side;
  * UDP hole punching through a stateful firewall or NAT;
  * proof that the client's UDP traffic is able to reach the server;
  * the way the server learns the client's UDP address (after passing NAT, its port may differ from the one known to the client).
* Server replies with ["SID", SID], where SID is the single-byte stream number the client must use.
* ["PING"] and ["PONG"] messages are then sent every ten seconds as a heartbeat.

    S <- C : e, es, e1, NS(NS(USERNAME) || NS(ROOM) || NS(hash(PASSWD)))
    S -> C : e, ee, ekem1, NS(NS("COOKIE") || NS(COOKIE))
    S <- C : UDP(COOKIE)
    S -> C : NS(NS("SID") || NS(SID))
    S <- C : NS(NS("PING"))
    S -> C : NS(NS("PONG"))
    S <> C : ...

Every second the client sends a UDP packet with its single-byte stream identifier, even if it is muted. That may help to punch holes in stateful firewalls.

Clients are notified about newly appeared peers with "ADD" commands, which tell their SIDs, usernames and keys. "DEL" notifies about leaving peers.

    S -> C : NS(NS("ADD") || NS(SID) || NS(USERNAME) || NS(KEY))
    S -> C : ...
    S -> C : NS(NS("DEL") || NS(SID))
    S -> C : ...

"MUTED" and "UNMUTED" notify about a peer toggling its mute state:

    S <- C : NS(NS("[UN]MUTED"))
    S -> C*: NS(NS("[UN]MUTED") || NS(SID))

"CHAT" broadcasts a message in the room:

    S <- C : NS(NS("CHAT") || NS(MSG))
    S -> C*: NS(NS("CHAT") || NS(SID) || NS(MSG))
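
The two wire formats described above, the NS(NS(arg0) || NS(arg1) || ...) command framing and the 7-byte per-frame header, can be sketched roughly as follows. This is only an illustration and not VoRS's actual code: all function names are hypothetical, and the header field order (stream identifier, then packet counter, then audio frame counter) is assumed from the order in which the text lists the fields.

```python
HEADER_LEN = 7  # 1-byte SID + 24-bit packet counter + 24-bit frame counter


def ns_encode(data: bytes) -> bytes:
    """Encode one netstring: decimal length, ':', payload, ','."""
    return str(len(data)).encode("ascii") + b":" + data + b","


def ns_decode(buf: bytes) -> tuple[bytes, bytes]:
    """Decode one netstring from the start of buf; return (payload, rest)."""
    head, sep, tail = buf.partition(b":")
    if not sep or not head.isdigit():
        raise ValueError("malformed netstring length")
    n = int(head)
    if tail[n:n + 1] != b",":
        raise ValueError("truncated or unterminated netstring")
    return tail[:n], tail[n + 1:]


def encode_args(*args: bytes) -> bytes:
    """NS(NS(arg0) || NS(arg1) || ...), as used for the commands above."""
    return ns_encode(b"".join(ns_encode(a) for a in args))


def decode_args(buf: bytes) -> list[bytes]:
    """Inverse of encode_args for a buffer holding exactly one message."""
    body, rest = ns_decode(buf)
    if rest:
        raise ValueError("trailing data after message")
    args = []
    while body:
        arg, body = ns_decode(body)
        args.append(arg)
    return args


def pack_header(sid: int, pkt_ctr: int, frame_ctr: int) -> bytes:
    """Build the header; the field order here is an assumption (see above)."""
    return (bytes([sid])
            + pkt_ctr.to_bytes(3, "big")
            + frame_ctr.to_bytes(3, "big"))


def unpack_header(pkt: bytes) -> tuple[int, int, int]:
    """Split a packet's first 7 bytes into (sid, pkt_ctr, frame_ctr)."""
    if len(pkt) < HEADER_LEN:
        raise ValueError("packet shorter than the header")
    return (pkt[0],
            int.from_bytes(pkt[1:4], "big"),
            int.from_bytes(pkt[4:7], "big"))
```

For example, under these assumptions encode_args(b"CHAT", b"hello") produces the bytes 15:4:CHAT,5:hello,, and decode_args() turns them back into the two arguments. Values that do not fit a field (a SID above 255 or a counter above 2^24 - 1) make pack_header() raise instead of silently wrapping.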