VoRS uses the Opus codec with 20 ms frames of 48 kHz, single-channel, 16-bit signed little-endian sound. It uses libopus's native Packet Loss Concealment (PLC) feature when the number of lost frames does not exceed 32. DTX (discontinuous transmission) is also enabled.

Each frame carries a single-byte stream identifier (the unique identifier of the participant), a 24-bit big-endian packet counter and a 24-bit big-endian audio frame counter. Reordered packets are dropped. A 24-bit counter is long enough for very long talk sessions. The audio frame counter is incremented every time 20 ms of data is read from the microphone. When a peer is muted, no packets are sent, but audio frames are still counted. That makes it possible to distinguish jitter and delays from the absence of audio transmission.

Each packet is encrypted with ChaCha20 and authenticated with SipHash24. Their keys are derived with HKDF from the handshake's state and are then shared with the other participants. The stream identifier together with the packet counter is used as a nonce.

VoRS is tuned for 24 Kbps bandwidth. But remember that each packet additionally carries 8 B of MAC tag, 7 B of VoRS header, 8 B of UDP header and 40 B of IPv6 header, that is 63 B of overhead.

Each client handshakes with the server over a TCP connection, using a protocol inspired by PQConnect, Noise and Chempat. It consists of a hybrid key exchange that uses the server's static Classic McEliece 6960-119 public key, a static X25519 key, an ephemeral X25519 key and an ephemeral Streamlined NTRU Prime 761 key, with HKDF as the KDF and SHAKE as the hash function.

=> PQConnect
=> Noise protocol framework
=> Chempat
=> Classic McEliece
=> Streamlined NTRU Prime
=> X25519
=> HKDF
=> SHAKE

* All messages are Netstring encoded strings. If multiple values are expected, the message contains a Netstring encoded sequence of netstrings:
      NS(NS(arg0) || NS(arg1) || ...)
  => Netstring
* The client sends NS("VoRS v5") to the socket. It is just a magic number.
* Then it performs [PQHS].
* The client sends the initial handshake message.
Its prefinish payload contains the client's username, the room name and an optional SHAKE256 hash of the room's password (or an empty string): [USERNAME, ROOM, hash(PASSWD)].
* The server answers with the final Noise handshake message carrying ["COOKIE", COOKIE], or with an ["ERR", MSG] failure message. It may reject a client if there are too many peers, if the client's name is already taken or if the client provided an invalid room password.
* The client sends the 128-bit cookie over UDP to the server every second. If the UDP packets are lost, no connection is possible, and after a timeout the server drops the TCP connection. The cookie serves as:
  * confirmation of a successful handshake on the client's side;
  * UDP hole punching through a stateful firewall or NAT;
  * proof that the client's UDP traffic is able to reach the server;
  * a way to learn the client's UDP address (after passing NAT, its port may differ from the one known to the client).
* The server replies with ["SID", SID], where SID is the single-byte stream number the client must use. TODO
* ["PING"] and ["PONG"] messages are then sent every ten seconds as a heartbeat.

    S <- C : hello
    S -> C : hello
    S <- C : finish, NS(NS(USERNAME) || NS(ROOM) || NS(hash(PASSWD)))
    S -> C : NS(NS("COOKIE") || NS(COOKIE))
    S <- C : UDP(COOKIE)
    S -> C : NS(NS("SID") || NS(SID))
    S <- C : NS(NS("PING"))
    S -> C : NS(NS("PONG"))
    S <> C : ...

Every second the client sends a UDP packet with its single-byte stream identifier, even if it is muted. That may help with punching holes in stateful firewalls.

Clients are notified about newly appeared peers with "ADD" commands, which tell their SIDs, usernames and keys. "DEL" notifies about peers that have left.

    S -> C : NS(NS("ADD") || NS(SID) || NS(USERNAME) || NS(KEY))
    S -> C : ...
    S -> C : NS(NS("DEL") || NS(SID))
    S -> C : ...

"MUTED" and "UNMUTED" notify about a peer toggling its mute state:

    S <- C : NS(NS("[UN]MUTED"))
    S -> C*: NS(NS("[UN]MUTED") || NS(SID))

"CHAT" broadcasts a message to the room:

    S <- C : NS(NS("CHAT") || NS(MSG))
    S -> C*: NS(NS("CHAT") || NS(SID) || NS(MSG))
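
The NS(NS(arg0) || NS(arg1) || ...) notation used for the TCP messages can be illustrated with a minimal Python sketch. It is not taken from the VoRS sources: the helper names ns, ns_seq, ns_decode and ns_seq_decode are hypothetical, and the only thing assumed is the standard netstring layout (decimal length, colon, payload, trailing comma).

```python
from typing import List, Tuple


def ns(data: bytes) -> bytes:
    """Encode one netstring: decimal length, colon, payload, comma."""
    return str(len(data)).encode("ascii") + b":" + data + b","


def ns_seq(*args: bytes) -> bytes:
    """Encode NS(NS(arg0) || NS(arg1) || ...)."""
    return ns(b"".join(ns(arg) for arg in args))


def ns_decode(buf: bytes) -> Tuple[bytes, bytes]:
    """Decode one netstring from the start of buf.

    Returns the payload and whatever bytes follow the netstring.
    """
    length, sep, rest = buf.partition(b":")
    if not sep or not length.isdigit():
        raise ValueError("malformed netstring length")
    size = int(length)
    if rest[size:size + 1] != b",":
        raise ValueError("netstring is truncated or lacks the trailing comma")
    return rest[:size], rest[size + 1:]


def ns_seq_decode(buf: bytes) -> List[bytes]:
    """Decode NS(NS(arg0) || NS(arg1) || ...) into a list of arguments."""
    payload, tail = ns_decode(buf)
    if tail:
        raise ValueError("trailing data after the outer netstring")
    items = []
    while payload:
        item, payload = ns_decode(payload)
        items.append(item)
    return items
```

With these helpers, ns(b"VoRS v5") yields b"7:VoRS v5," and ns_seq(b"PING") yields b"7:4:PING,,". A real implementation would also need to read netstrings incrementally from the TCP stream and to cap the accepted length, which this sketch leaves out.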
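
The 7-byte VoRS packet header and the per-packet overhead described at the beginning of this section can be checked with a small Python sketch. Several things here are assumptions rather than facts from the VoRS sources: the header fields are laid out in the order they are listed (stream identifier, packet counter, audio frame counter), the 24 Kbps figure is treated as a constant Opus payload bitrate, and all function and constant names are hypothetical.

```python
from typing import Tuple

HEADER_LEN = 7     # 1 B stream identifier + 3 B packet counter + 3 B frame counter
MAC_LEN = 8        # SipHash24 tag
UDP_LEN = 8        # UDP header
IPV6_LEN = 40      # IPv6 header
FRAME_MS = 20      # duration of a single audio frame
COUNTER_BITS = 24  # width of both counters


def pack_header(sid: int, packet_ctr: int, frame_ctr: int) -> bytes:
    """Build the header; the field order is an assumption of this sketch."""
    return bytes([sid]) + packet_ctr.to_bytes(3, "big") + frame_ctr.to_bytes(3, "big")


def unpack_header(buf: bytes) -> Tuple[int, int, int]:
    """Return (sid, packet_ctr, frame_ctr) parsed from the start of a packet."""
    if len(buf) < HEADER_LEN:
        raise ValueError("packet is shorter than the VoRS header")
    return buf[0], int.from_bytes(buf[1:4], "big"), int.from_bytes(buf[4:7], "big")


def overhead_bytes() -> int:
    """Bytes added to every audio frame: MAC tag plus VoRS, UDP and IPv6 headers."""
    return MAC_LEN + HEADER_LEN + UDP_LEN + IPV6_LEN


def payload_bytes(bitrate_bps: int = 24000) -> int:
    """Size of one frame, assuming a constant codec bitrate."""
    return bitrate_bps * FRAME_MS // (8 * 1000)


def wire_bitrate_bps(bitrate_bps: int = 24000) -> int:
    """Bitrate at the IPv6 level, headers and MAC tag included."""
    packets_per_second = 1000 // FRAME_MS
    return (payload_bytes(bitrate_bps) + overhead_bytes()) * 8 * packets_per_second


def counter_wrap_hours() -> float:
    """Hours of audio after which the 24-bit audio frame counter wraps."""
    return (1 << COUNTER_BITS) * FRAME_MS / 1000 / 3600
```

Under these assumptions a 60-byte frame travels as a 123-byte IPv6 packet, which is about 49.2 Kbps on the wire, and the audio frame counter wraps only after roughly 93 hours of continuous counting. With DTX enabled the real traffic is lower during silence.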