Why provide a non-compressed version at all? This is a new protocol, so there's no need for backwards compatibility. The dictionary could be baked into the protocol itself, fixed per version: e.g. protocol v1 uses the fixed v1 dictionary. That would also be useful for replaying stored events on both sides.
Jetstream isn't an official change to the Protocol, it's an optimization I made for my own services that I realized a lot of other devs would appreciate. The major driving force behind it was the bandwidth savings, but also making the Firehose a lot easier to use for devs who aren't familiar with AT Proto and MSTs. Jetstream is a much more approachable way for people to dip their toe into my favorite part of AT Proto: the public event stream.
It's impossible to use the compressed version of the stream without using a client that has the baked-in ZSTD dictionary. This is a usability issue for folks using languages without a Jetstream client who just want to consume the websocket as JSON. It also makes things like using websocat and unix pipes to build some kind of automation a lot harder (though probably not impossible).
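For what it's worth, consuming the stream as plain JSON really is just a websocket read loop. A minimal Go sketch, assuming the server's uncompressed mode and a gorilla/websocket client; the hostname and the `wantedCollections` filter are placeholders for whatever instance and collections you actually want:

```go
package main

import (
	"log"

	"github.com/gorilla/websocket"
)

func main() {
	// Placeholder endpoint; point this at a real Jetstream instance.
	url := "wss://jetstream.example.com/subscribe?wantedCollections=app.bsky.feed.post"
	conn, _, err := websocket.DefaultDialer.Dial(url, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	for {
		// Without compression, each websocket message is one JSON event.
		_, msg, err := conn.ReadMessage()
		if err != nil {
			log.Fatal(err)
		}
		log.Printf("%s", msg)
	}
}
```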
FWIW the default mode is uncompressed unless the client explicitly requests compression with a custom header. I tried using permessage-deflate, but support for it in the websocket libraries I was using was very poor, and it has the same problem as streaming compression in terms of CPU usage on the Jetstream server.
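And for clients that do opt in, initializing the decoder with the shared dictionary is only a few lines. A sketch using github.com/klauspost/compress/zstd, assuming you've saved the dictionary that ships with Jetstream to a local file:

```go
package main

import (
	"log"
	"os"

	"github.com/klauspost/compress/zstd"
)

// newDictDecoder builds a zstd decoder that knows the shared dictionary,
// so it can decode frames that reference it.
func newDictDecoder(dictPath string) (*zstd.Decoder, error) {
	dict, err := os.ReadFile(dictPath)
	if err != nil {
		return nil, err
	}
	return zstd.NewReader(nil, zstd.WithDecoderDicts(dict))
}

func main() {
	dec, err := newDictDecoder("zstd_dictionary") // assumed local copy
	if err != nil {
		log.Fatal(err)
	}
	defer dec.Close()

	// Each compressed websocket message is a standalone zstd frame;
	// DecodeAll turns it back into the JSON event bytes.
	decodeEvent := func(msg []byte) ([]byte, error) {
		return dec.DecodeAll(msg, nil)
	}
	_ = decodeEvent // wire into your websocket read loop
}
```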
A non-compressed version is almost certainly cheaper for anything local (e.g. self-hosting your own services that consume the firehose on the same machine, or for testing).
There's not really a good reason to do compression if the stream is just going to be consumed locally. Instead you can skip that step and broadcast over memory to the other local services.
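As a sketch of what "broadcast over memory" can look like for co-located consumers (the names here are illustrative, not from Jetstream's code):

```go
package main

import "fmt"

// fanOut delivers each event to every local subscriber over channels,
// skipping compression entirely since nothing leaves the machine.
func fanOut(events <-chan []byte, subscribers []chan []byte) {
	for ev := range events {
		for _, sub := range subscribers {
			sub <- ev
		}
	}
	for _, sub := range subscribers {
		close(sub)
	}
}

func main() {
	events := make(chan []byte)
	sub := make(chan []byte)
	go fanOut(events, []chan []byte{sub})

	events <- []byte(`{"kind":"commit"}`)
	close(events)

	for msg := range sub {
		fmt.Printf("%s\n", msg)
	}
}
```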
It could be a flag, disabled by default. Also, I'm not sure about the "cheaper" part: disk ops aren't free, so maybe decompressing zstd IS cheaper than writing and reading huge blobs from disk and exchanging them between apps.
Centralization on trusted servers is going to happen, but if they speak a common protocol, at least they can be swapped out. For Jetstream, anyone can run an instance, though it will cost them more.
It’s sort of like the right to fork in Open Source; it doesn’t mean people fork all the time or verify every line of code themselves. There’s still trust involved.
I wonder if some security features could be added back, though?
If you’re going to try data reduction and compression, always try compression first. It may reveal that the 10x reduction you were looking at is only 2x and not worth the trouble.
Reduction first may show that the compression is less useful than expected. Verbose, human-friendly protocols plus compression win out in maintenance tasks, and it's a marathon, not a sprint.
As a corollary, if you try to be too clever with your data reduction strategy, you might walk yourself into a dead end / local maximum by making the job of off-the-shelf compression algorithms more difficult.
> If everyone is on one server (remains to be seen)
Even internally, Bluesky runs a number of "servers." It's already internally federated. And you can transfer your account to your own server if you want to, though that is very much still beta quality, to be fair.
You're not really "on a server" in the same sense as other things. It's closer to "I create content addressable storage" than "The database on this instance knows my username/password."
Thanks for this! I'd never checked. Turns out I'm on https://morel.us-east.host.bsky.network. I do want to host my own PDS someday, but then I'd be responsible for keeping it up...
I'm also currently trying to understand the tradeoffs for did:plc more. It's unclear to me just how centralized it is. Will it always require a single central directory, or is it more like Certificate Transparency? Based on what I've heard about the recovery process, I believe it's the latter, but I still need to dig into it more.
I don't know myself, but given that the discussion there, from what I've heard, is along the lines of "should be moved into an independent foundation," my assumption is that it will always require a directory. But this is probably the part of the tech stack that I know the least details about.
Depends on what the right way is, to be honest. If the "right way" means moving the project into zero-knowledge-proof territory, it does push out a lot of the developers. It's not that cryptography in general pushes people out in that case; the ZKP complexity does.
The full Firehose provides two major verification features. First it includes a signature that can be validated letting you know the updates are signed by the repo owner. Second, by providing the MST proof, it makes it hard or impossible for the repo owner to omit any changes to the repo contents in the Firehose events. If some records are created or deleted without emitting events, the next event emitted will show that something's not right and you should re-sync your copy of the repo to understand what changed.
The "bring it all home" screenshot shows a CPU Utilization graph, and the units of measurements on the vertical axis appears to be milliseconds. Could someone help me understand what that measurement might be?
>Before this new surge in activity, the firehose would produce around 24 GB/day of traffic. After the surge, this volume jumped to over 232 GB/day!
>Jetstream is a streaming service that consumes an AT Proto com.atproto.sync.subscribeRepos stream and converts it into lightweight, friendly JSON.
So let me get this straight: if you did want to run Jetstream yourself, you'd still need to be able to handle the 232 GB/day of bandwidth?
This has always been my issue with Bluesky/AT Protocol. For all the talk about the protocol being federated, it really doesn't seem realistic for anyone to run any of the infrastructure themselves. You're always going to be reliant on a big player that has the capital to keep everything running smoothly. At this point I don't really see how it's any different than being on any of the old centralized social media.
> It really doesn't seem realistic for anyone to run any of the infrastructure themselves. You're always going to be reliant on a big player that has the capital to keep everything running smoothly
I run a custom Rust-based firehose consumer on the main firehose using a cheap DigitalOcean droplet and don't cross 15% CPU usage even during the peak 33 Mb/s bandwidth described in this article.
The core team seems to have put a lot of intention into making it possible to host the different parts of the network. The most resource-intensive of those is the relay, which produces the event stream that Jetstream and others consume. One of their devs did a breakdown of that and showed you could run it at $150 per month, which is pricey but not unattainable with grants or crowdfunding https://whtwnd.com/bnewbold.net/entries/Notes%20on%20Running...
Old social media never gave full access to the firehose so there’s a pretty big difference.
If you want large scale social networks, you need to work with a large scale of data. Since federated open queries aren’t feasible, you need big machines.
If you want a smaller scale view of the network, do a crawl of a subset of the users. That’s a perfectly valid usage of atproto, and is how ActivityPub works by nature.
>Old social media never gave full access to the firehose so there’s a pretty big difference.
That is good, but it's still a centralized source of truth.
>If you want large scale social networks, you need to work with a large scale of data. Since federated open queries aren’t feasible, you need big machines.
That's simply not true. ActivityPub does perfectly well without any bulky machine or node acting as a relay for the rest of the network. Every ActivityPub service only ever interacts with other discovered services. Messages aren't broadcast through a central firehose; they're sent directly to whoever needs to receive them. This is a fundamental difference in how the two protocols work. With ATProto you NEED to connect to some centralized relay that will broker your messages for you. With ActivityPub there is no middleman; instances just talk directly to each other. This is why ActivityPub has a discovery problem, by the way, but that's just a symptom of real federation.
> That is good, but it's still a centralized source of truth.
It's not. It's a trustless aggregator. The PDSes are the sources of truth, and you can crawl them directly. The relay is just an optimization.
> Messages aren't broadcast through a central firehose
ATProto works like the web does. People publish information on their servers, and then relays crawl them and emit their crawl through a firehose.
> ActivityPub does perfectly without the need of any bulky machine or node acting as a relay for the rest of the network
ActivityPub doesn't do large-scale aggregated views of the activity. The peer-wise exchanges mean that views get localized; this is why there's no network-wide search, metrics, or algorithms.
> This is why ActivityPub has a discovery problem by the way,
>The PDSes are the sources of truth, and you can crawl them directly. The relay is just an optimization.
This is such a massive understatement. The relay is the single most important piece in the entire Bluesky stack.
Let me ask you this: is it possible for me to connect to a PDS directly, right now, via the Bluesky app? Or is this something that will be possible in the future?
>ATProto works like the web does. People publish information on their servers, and then relays crawl them and emit their crawl through a firehose.
>ActivityPub doesn't do large scale aggregated views of the activity.
So are relays really just an optimization or an integral part of how ATProtocol is supposed to work?
ActivityPub doesn't require relays to function properly. This is why I say it's real federation. You can't truly be federated if you require centralization.
> Let me ask you this, is it possible for me to connect to a PDS directly, right now, via the bluesky app?
Well, yes, that's what you do when you log in. If you open devtools you'll see that the app is communicating with your PDS.
> So are relays really just an optimization or an integral part of how ATProtocol is supposed to work?
I think the issue here is that you're mentally slicing the stack in a different way than atproto does. You expect each node to be a full instance of the application, and the network to get partitioned into a bunch of applications exchanging peer-wise.
> You can't truly be federated if you require centralization.
I’m not so sure: isn’t the Certificate Transparency log a pretty good example of a federated group of disparate members successfully sharing a view of the world?
That requires some form of centralization to be useful (else it’s not really a log, more of a series of disconnected scribbles), and it’s definitely a true federated network.
> > If you want large scale social networks, you need to work with a large scale of data. Since federated open queries aren’t feasible, you need big machines.
> Thats just simply not true.
> [snip]
> This is why ActivityPub has a discovery problem by the way, but it's just a symptom of real federation.
You're actually agreeing with them! The "discovery problem" is because "federated open queries aren't feasible".
> With ATProto you NEED to connect to some centralized relay that will broker your messages for you.
You can connect to PDSs directly to fetch data if you want; this is exactly what the relays do!
If you want to build a client that behaves more like ActivityPub instances, and does not depend on a relay, you could do so (a rough sketch follows the list):
- Run your own PDS locally, hosting your repo.
- Your client reads your repo via your PDS to see the accounts you follow.
- Your client looks up the PDSs of those accounts (which are listed in their DID documents).
- Your client connects to those PDSs, fetches data from them, builds a feed locally, and displays it to you.
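As a rough Go sketch of steps 3 and 4 for a did:plc identity (the DID is a placeholder and the structs are trimmed to just the fields used here): resolve the DID document via plc.directory to find the account's PDS, then read its records straight from that PDS with com.atproto.repo.listRecords.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// didDoc is a trimmed DID document: we only need the service list.
type didDoc struct {
	Service []struct {
		ID              string `json:"id"`
		ServiceEndpoint string `json:"serviceEndpoint"`
	} `json:"service"`
}

// pdsFor resolves a did:plc DID via plc.directory and returns the
// account's PDS endpoint from its DID document.
func pdsFor(did string) (string, error) {
	resp, err := http.Get("https://plc.directory/" + did)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var doc didDoc
	if err := json.NewDecoder(resp.Body).Decode(&doc); err != nil {
		return "", err
	}
	for _, s := range doc.Service {
		if s.ID == "#atproto_pds" {
			return s.ServiceEndpoint, nil
		}
	}
	return "", fmt.Errorf("no PDS listed for %s", did)
}

func main() {
	did := "did:plc:example" // placeholder: a DID of an account you follow
	pds, err := pdsFor(did)
	if err != nil {
		log.Fatal(err)
	}
	// Fetch that account's posts directly from its PDS; no relay involved.
	fmt.Printf("%s/xrpc/com.atproto.repo.listRecords?repo=%s&collection=app.bsky.feed.post\n", pds, did)
}
```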
This is approximately a pull-based version of ActivityPub. It would have the same scaling properties as ActivityPub (in fact better, as you only fetch what you need, rather than being pushed whatever the origins think you need). It would also suffer from the same discovery problem as ActivityPub (you only see what the accounts you follow post).
At that point, you would not be consuming any of the _output_ of a relay. You would still want relays to connect to your PDS to pull data into their _input_ in order for other users to see your posts, but that's because those users have chosen to get their data via a relay (to get around the discovery problem). Other users could instead use the same code you're using, and themselves fetch data directly from your PDS without a relay, if they wanted to suffer from the discovery problem in exchange for not depending on a relay.
It doesn't change the fact that if someone were to do that, it wouldn't be supported by anyone, let alone the main Bluesky firehose. I think it's pretty disingenuous to just say "you can do it" when what you're suggesting is so far off the intended usage of the Protocol that it might as well be a brand-new implementation. As a matter of fact, people DO already do this: they use ActivityPub and talk to Bluesky using a bridge.
The core of the issue is that Bluesky's current model is unsustainable. The cost of running the main relay is going to keep rising, and the barrier to discovery keeps getting higher and higher. It might cost $150/month now to mirror the relay, but what's going to happen when it's $1,000?
By your own numbers it averages 2.7 MB/s (232 GB spread over 86,400 seconds). This is manageable with a good cable internet connection or a small VPS. This is a small number for 5.5 million active users.
What happens if it expands to 10 times its current active users? Who knows, maybe only the 354 million people around the world with access to gigabit broadband can run a full server at home and the rest of the people and companies that want to run full servers will have to rent a vps.
It has the same name, not the same functionality. I am not a downvoter... but it's probably because reading a few sentences of this blog post would reveal what it is.
It wasn't even a criticism, just an observation for anyone else who was thinking the same or was interested in another popular project with a similar name (and seemingly similar functions? didn't look too hard).
Naming things is hard and we all kinda share one global tech namespace, so this is gonna inevitably happen.
You're not the only one. I don't get why they couldn't have named it something that wasn't very similar to something already around for several decades, or at least insist on the shortened ATproto name (one word, lower case p). Sure, in practice, no one will actually confuse them, but that could be said for Java and JavaScript.
The 10x surge in traffic was us gaining 3.5M new users over the course of a week (growing the entire userbase by >33%) and all these users have been incredibly active on a daily basis.
Yes the actual record content on the network isn't huge at the moment but the firehose doesn't include blobs (images and videos) which take up significantly more space. Either way, yeah it's pretty lightweight. Total number of records on the network is around 2.5Bn in the ~1.5 years Bluesky has been around.
> The firehose is all public data going into the network right?
It's the "main subset" of the public data, being "events on the network": for the Bluesky app that's posts, reposts, replies, likes, etc. Most of those records are just metadata (e.g. a repost record references the post being reposted, rather than embedding it). Post / reply records include the post text (limited to 300 graphemes).
In particular, the firehose traffic does _not_ include images or videos; it only includes references to "blob"s. Clients separately fetch those blobs from the PDSs (or from a CDN caching the data).
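For a concrete picture, here is a trimmed sketch of the event shape; the field names follow my reading of Jetstream's JSON output and the atproto blob format, so treat them as an approximation rather than a spec. The point is that a commit carries the record itself, while an image shows up only as a CID reference plus MIME type and size, never the bytes.

```go
package events

import "encoding/json"

// Event is a trimmed sketch of a Jetstream event envelope.
type Event struct {
	Did    string  `json:"did"`
	TimeUS int64   `json:"time_us"`
	Kind   string  `json:"kind"` // e.g. "commit"
	Commit *Commit `json:"commit,omitempty"`
}

// Commit describes one record operation in an account's repo.
type Commit struct {
	Operation  string          `json:"operation"`  // "create", "update", "delete"
	Collection string          `json:"collection"` // e.g. "app.bsky.feed.post"
	RKey       string          `json:"rkey"`
	Record     json.RawMessage `json:"record,omitempty"` // post text, or a blob *reference*
}
```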
I wondered something similar when I clicked the link: "who is still using enough AT commands that a compressed representation would matter, and how would you even DO that?" But this is clearly something else.
And then I had to look up CBOR too, which at least is a thing Wikipedia has heard of. I mostly use compressed wire protocols and ignore the flavor of the month binary representations.
Bluesky is mentioned. The ATProto thing has been discussed often in relation to an open-source social media protocol, and all of this started from Jack Dorsey and Twitter. If it helps, it's thoroughly documented on Wikipedia.
For real. It's 2024, which in my millennial mind is squarely 'the future', and I'm writing bespoke UART drivers and command parsers for a protocol that's older than I am.
State what's not factual instead of just saying it's not factual. Otherwise I can easily say your statement is 100% top-down, in-and-out, all-around, unsubstantiated.
So what? Others simply don't attach the same importance to this that you do. Last time I looked he was all in on Nostr, which doesn't fit with your theory at all.
Where does Gaddafi come into this? That seems like a complete non-sequitur with the previous sentence talking about people involved in Start-ups in the early 2010s.
Gaddafi was overthrown as part of the Arab Spring. He wasn’t a guy involved in tech startups in the early 2010s hence my confusion as to how he was related to Aaron Swartz.
Server-Sent Events (SSE) with standard gzip compression could be a simpler solution -- or maybe I'm missing something about the websocket + zstd approach.
Well-configured zstd can save a lot of bandwidth over gzip at this scale without major performance impact, especially with the custom dictionary. Initialising zstd with a custom dictionary also isn't very difficult for the client side.
As for application development, I think websocket APIs are generally much better exposed and easier to use than SSE. I agree that SSE is the more appropriate technology here, but it's used so little that I don't think the tooling is good. Just about every language has a dedicated websocket client library, but SSE is usually implemented as a weird side effect of an HTTP connection you need to keep alive manually.
The stored ZSTD objects make sense, as you only need to compress once rather than compressing for every stream (as the author details). It also helps store the collected data more efficiently on the server side, if that's what you want to do.
I don't have an in-depth understanding of SSE, but one of the points the post argues for is compressing once (using the zstd dictionary) and sending that to every client.
The dictionary allows for better compression without needing a large amount of data, and sending every client the same compressed binary saves a lot of CPU time on compression. Streaming compression usually requires running the compressor separately for each client.
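A sketch of that compress-once/fan-out pattern with github.com/klauspost/compress/zstd (the dictionary file and the websocket plumbing are assumed): EncodeAll runs once per event, and every subscriber receives the identical frame.

```go
package main

import (
	"log"
	"os"

	"github.com/klauspost/compress/zstd"
)

// broadcast compresses an event once and hands the same frame to every
// subscriber, instead of running a compression stream per connection.
func broadcast(enc *zstd.Encoder, event []byte, subscribers []chan []byte) {
	frame := enc.EncodeAll(event, nil) // one compression pass, total
	for _, sub := range subscribers {
		sub <- frame // every client gets identical bytes
	}
}

func main() {
	// Assumed: the same shared dictionary the clients decode with.
	dict, err := os.ReadFile("zstd_dictionary")
	if err != nil {
		log.Fatal(err)
	}
	enc, err := zstd.NewWriter(nil, zstd.WithEncoderDict(dict))
	if err != nil {
		log.Fatal(err)
	}
	defer enc.Close()

	subs := []chan []byte{make(chan []byte, 16)}
	broadcast(enc, []byte(`{"kind":"commit"}`), subs)
	log.Printf("compressed frame: %d bytes", len(<-subs[0]))
}
```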
I find it baffling that the difference in cost of serving 41GB/day vs 232GB/day is worth spending any dev time on. We're talking about a whopping 21.4Mbps on average, which costs me roughly CAD$3.76/month in transit (and my transit costs are about to be cut in half for 2 x 10Gbps links thanks to contracts being up and the market being very competitive). 1 hour of dev time is upwards of 2 years of bandwidth usage at that rate.
A thing I have admired about the Bluesky development team is that they're always thinking about the future, not just the present. One area in which this is true is the system design: they explicitly considered how they would scale. Sure, at the current cost, this may not be worth it, but as Bluesky continues to grow, work like this will be more and more important.
Also, it's just a good look towards the community. Remember, ideally not all of ATProto's infrastructure is run by Bluesky themselves, but by a diverse set of folks who want to be able to control their own data. These are more likely to be individuals or small groups, not capitalized startups. While the protocol itself is designed to allow this, "what happens when Bluesky is huge and running your own infra is too expensive" is a current question that some folks are reasonably skeptical about: it's no good being a federated protocol if nobody can afford to run a node. By doing stuff like this, the Bluesky team is signaling that they're aware of and responsive to these concerns, so don't look at it as trying to save money on bandwidth: look at it as a savvy way to generate some good marketing.
The current relay firehose has more than 250 subscribers. It's served more than 8.5Gbps in real-world peak traffic sustained for ~12 hours a day. That being said, Jetstream is a lot more friendly for devs to get started with consuming than the full protocol firehose, and helps grow the ecosystem of cool projects people build on the open network.
Also, this was a fun thing I built mostly in my free time :)
Also, since this is some sort of streaming firehose with the same data for everyone (if I understood correctly from a quick read), I wonder whether it would make sense to do some kind of P2P/distributed transfer of it to ease the load...