Thoughts on compressing Matrix

24/06/2019 — 3 Min Read — In Software

A not-entirely-invalid criticism frequently levelled at Matrix is that it's fairly inefficient, due to a combination of using HTTP, using long polling (sometimes misidentified as normal polling), using JSON, and having a rather verbose schema. Seldom is the fact that it triggers CORS preflights if used in a browser (unless you stick auth tokens in the querystring) mentioned, but there's that, too.

This is more because nobody's done the work to shrink it than because of any inherent limits.

Without considering the federation protocol, or changes that require altering what protocol the client thinks it's speaking (ie, can be done in a pair of proxies, one forward and one reverse), there's quite a bit of scope for trimming. I'm also not considering filters here, though a client which wished to minimise bandwidth at the cost of functionality could request the server send it much less data.

Websockets

First up, use websocket. There's already a draft spec (original, revised) for doing this, so some of the thinking could at least be reused if not implementations.

According to Firefox, a typical turn of the polling loop is a 416 byte OPTIONS request, then around 650 bytes of actual (long) poll, figures which appear to include the compression given by HPACK for headers, and gzip for bodies. Whether it counts the HTTP/2 framing I don't know, but either way, using websocket would save the entirety of that OPTIONS request (which also saves a RTT), and also sending sync tokens back to the server once the connection is working.

CBOR

This actually saves approximately nothing on its own! The exception is if you're sending lots of binary data around, which admittedly you might be in Matrix-land, on account of the end to end encryption facility. If you are, then those blobs don't expand by 4/3 from the base64 encoding.

After that, you need to do tricks with replacing constant strings with integers and the like; fortunately CBOR is suited to doing that to JSON-oriented schemas: property keys can be numbers, and the "tag" mechanism¹ allows labelling items whose literalness varies from the expected. This needs more negotiation if static tables are used, but those tables will seldom change compared to how often connections are established.

Here's a simple static approach: The two sides negotiate a shared list of strings, then any integers seen as keys in a CBOR object are indices into that list, while any integer values tagged with a particular tag are likewise (a schema-aware design would be able to omit a lot of these tags).

Compression

HTTP and Websocket both have provision for gzipping payloads, and Websocket provision for maintaining the compression context across messages. However, EXI points out that sometimes gains can be had by separating out the structure and content of a document prior to compression. WBXML has a related trick of moving all string literals to the top of the document, and only having pointers inline.

I don't know how effective this would be, but an avenue for research would be to try such tricks. To wit, arrange for the encoder to separate items such as strings and encrypted blobs out from the CBOR-encoded document. Left behind is a placeholder in the form of a tagged integer, specifying in which buffer the item got placed. Then the buffers, and the original document, can be compressed using separate contexts.

Here's an example. Suppose our document is:

{
  "type": "m.room.message",
  "sender": "@kythyria:berigora.net",
  "content": {
    "msgtype": "m.text",
    "body": "Is htop's memory readouts in kilobytes?"
  },
  "event_id": "$156134019913rdlXW:berigora.net",
  "origin_server_ts": 1561340199096,
  "unsigned": {
    "age": 620,
    "transaction_id": "m1561340197706.0"
  },
  "room_id": "!CbWGSqXrUpGnRUglfe:matrix.org"
}

We might create two extra buffers, thusly:

Buffer 0:

{
  "type": "m.room.message",
  "sender": buffer(1),
  "content": {
    "msgtype": "m.text",
    "body": buffer(2)
  },
  "event_id": buffer(1),
  "origin_server_ts": 1561340199096,
  "unsigned": {
    "age": 620,
    "transaction_id": "m1561340197706.0"
  },
  "room_id": buffer(1)
}

Buffer 1:

[
    "@kythyria:berigora.net",
    "$156134019913rdlXW:berigora.net",
    "!CbWGSqXrUpGnRUglfe:matrix.org"
]

Buffer 2:

[
    "Is htop's memory readouts in kilobytes?"
]

Buffer 0 can be further reduced by interning tricks, if we want to, or just assume gzip will do adequately in that regard. Or even use a more effective compressor, since we're no longer using the one built in to the websocket implementation. It would probably also be worth having a buffer for uncompressed binary blobs, since the biggest source of those is currently E2EE, which by definition makes random-looking and thus uncompressible data.

Whether it's worth moving ALL values out of buffer 0, leaving only arrays, objects, and keys, I'm not sure. It would need investigation on top of determining if this approach is worth it at all. On the one hand, more bytes. On the other hand, longer identical runs of bytes.

Tags are wrapped around another value, modifying its meaning. On the wire, they're simply prefixes, while in pseudo-JSON they're written similarly to a function application: tag_93(10). The name used may be symbolic rather than referring to the actual integer.↩