Introducing S2 (s2.dev)
372 points by brancz 30 days ago | 195 comments



IANAL, but naming your product S2 and mentioning in the intro that AWS S3 is the tech you are enhancing is probably looking for a branding/copyright claim from Amazon. Same vertical & definitely will cause consumer confusion. I'm sure you've done the research about whether a trademark has been registered.

https://tsdr.uspto.gov/#caseNumber=98324800&caseSearchType=U...


Fun fact: S2 and EC2 sound exactly the same in Spanish - both are "ese dos". Add that to EC2 and S3 already being confusing to tell apart by ear


Only for you lot of Thouth-american thpanish thpeakerth.


not for non latin american speakers.


TBF, building something with the goal of enhancing S3 I would call it S4.


That's short-term thinking. You need to leapfrog everybody and go s∞


That’s actually a pretty cool name if you pronounce the first letter as its letter sound rather than as an initial: Sinfinity


Sounds more like a porn website...


Very responsive log porn. ;-)


My aging eyes just pronounced it "sooooo"


Too late, name's taken for something else: https://incubator.apache.org/projects/s4.html


And don't forget the other S4: http://www.supersimplestorageservice.com/

It's like S3, except better because, by focusing on being a write-only data store, they can manage much more throughput and efficiency, plus your data is far more secure at rest than it is in S3.


How about the other other S4 https://adv-r.hadley.nz/s4.html


I would really like to know how many people send money through that paypal link


Try clicking on it.


Ah, thank you :)


why not s11?


S3++ ? T4?

My company is a Fivetran client, and they named that company after a (bad) joke, but it's worth a fortune.


Fivetran is going to zero because they don’t offer anything of actual value and their CEO isn’t a good person.

[1] https://news.ycombinator.com/item?id=42434450


F3 - (Fast Furious Fail-Safe)


At least cloudflare’s R2 has an argument for the naming (IBM vs HAL, A Space Odyssey)


I'm not sure whether they consulted a bad trademark lawyer or didn't consult one at all, but it wouldn't have cost that much to do so. I say this having just recently started the process of filing a trademark - the cost is about the same as buying e.g. 's4.dev' according to the domain registry's website.

Having to rebrand your product after launching is a lot more painful than doing it before launching.


OR

Amazon just builds the same thing, calls it S3 Streams, and doesn’t care about S2.

Maybe they make a buyout offer.

I highly doubt they would sue.


Trademark law encourages companies to defend their marks. If they don’t, they may lose the trademark. So Amazon has to write these guys a letter if it wants to defend the s3 trademark.


Amazon might write a letter, but if the tech is solid, they’ll probably just work with them.


s3 (serverless stream store)


What could possibly be better than being sued by Amazon for some nitpicky naming issue?

That’s the kind of David vs. Goliath publicity one could only dream of …


98% of the time, lawsuits are just a money pit. There is zero publicity. A tiny number go viral. I don't think this is likely to be one of those times.

Most people would simply say "Amazon is right." Because Amazon is right. This is an intentional attempt to leverage their product branding to promote a new product. There is very little good here.

If this were open-source, academic, non-profit, or something like that, perhaps. A small venture trying to commercialize on some digital equivalent of Amazon's trade dress? I can't imagine anyone would care....

Even those times when someone is 100% right, usually, there is zero publicity. Right or wrong, most times I've seen, the small guy would settle with the big guy with the deep legal pockets and move on because litigating is too expensive.

In a situation like this one, your marketing spend / press coverage on the existing name is shot, links to your domain are shot, and perhaps you have an egg on your face, depending on how things play out.


Yep, letter S and a number is copyrighted, can't do that


1) we're talking about trademark law, not copyright law.

2) the problem here is that they're in the same business segment, and explicitly reference S3.


S3. But trademark law prevents subtle variations.

E.g. creating a product called “Gooogle”


This is a really good idea, beautiful API, and something that I would like to use for my projects. However I have zero confidence that this startup would last very long in its current form. If it's successful, AWS will build a better and cheaper in-house version. It's just as likely to fail to get traction.

If this had been released instead as a Papertrail-like end-user product with dashboards, etc. instead of a "cloud primitive" API so closely tied to AWS, it would make a lot more sense. Add the ability to bring my own S3-Compatible backend (such as Digital Ocean Spaces), and boom, you have a fantastic, durable, cloud-agnostic product.


(Founder) we do intend to be multi-cloud, we are just starting with AWS. Our internal architecture is not tied to AWS, it's interfaces that we can implement for other cloud systems.


It would be extra ironic if the whole thing already ran on top of AWS.

There's no end to startups which can be described as existing-open-source-software as a service, marketed as a cheaper alternative to AWS offerings.. who run on AWS.


People keep making the same argument against Aptible (https://aptible.com) and it is still a very successful PaaS over a decade later.


I had never heard of this company, so I took a look. The main pitch was compelling, but then I went to the pricing page and saw the pricing goes from $0 to $500 a month once you want to go to “production”. I’m clearly not the target market, which explains why I’ve never heard of it.


It’s popular for security sensitive (e.g. healthcare) stuff


If you do cloud infra stuff, AWS will try to undercut you on price but will never outdo you on D/UX. So I wouldn't let Beezus hold me back


They just did https://news.ycombinator.com/item?id=42211280 (Amazon S3 now supports the ability to append data to an object, 30 days ago). Azure has had the same with append blobs for a long time. It's still a bit more raw than S2, without the concept of record. The step for a cloud provider to offer this natively is very small. And with the concept of a record, isn't this essentially a message queue, where the competitor space is equally big? Likewise if you look into log storage solutions.


(Founder) Both S3 Express _One Zone_ appends and Azure's append blobs charge the regular PUT price for appends. It may work for you, but probably not if you want to do smaller writes.

Blob stores will also not let you do tailing reads, like you can with S2.

In AWS, S2's Express storage class takes care of writing to a quorum of 3 zonal buckets for regional durability.

I doubt object stores will go from operating at the level of blobs and byte ranges, to records and sequence numbers. But I could be wrong.


Amazon don't compete for price sensitive product offerings.

If anything, they normalise an expectation with a budget-aware base.


Help me understand - you build on top of AWS, which charges $0.09/GB for egress to the Internet, yet you're charging $0.05/GB for egress to the Internet? Sounds like you're subsidizing egress from AWS? Or do you have access to non-public egress pricing?


(Founder) We are not charging in preview. At the scale where it matters, we will work it out. Definitely some assumptions in here.


For what it's worth, there's zero chance I would do business with a company whose business plan is "we'll work it out". It gives one every reason to believe that in a couple years time you guys will either be out of business (because you didn't figure out the numbers to make a profit) or will pull the rug from under customers in the form of surprise price hikes. Obviously you have to do what you think is right, but I think that this approach is going to scare off a lot of customers for you.


(Founder) We are not charging during preview. If anything, I wanted to be transparent about our planned pricing. Our mission is to make streams a cloud storage primitive, and I worked backwards from there in terms of our costs and expected costs looking ahead once we can scale a bit - based on concrete data points about what kind of discounts can be unlocked. I realized it was premature based on the comments here, so the price for internet egress has been updated. Thank you for your feedback.


Just FYI, that doesn't give me confidence in the longevity of your service.


Cloud services offer giant discounts sometimes and the receiving party aren't allowed to talk about it concretely so that's probably what's happening here.


(Founder) I understand the concern. However, cloud discounts at scale can be very large, and we are going to share as much of it as we reasonably can.


Discounts require multi year commitment for minimum (and increasing) spend. Generally you need to be either profitable or a well funded startup to demonstrate why a vendor would trust your ability to pay (it's literally a debt on your books). How do they know you're good for it?

Plus multi cloud means less scale and less marketing incentive (can't talk about you as a x cloud customer).

I wish you the best, but would encourage you to not set your prices below your costs.


(Founder) Thank you for the advice. I hope we can offer better when the deals come into play, but for now setting our planned internet egress price to $0.08/GiB.


Do you plan to charge differently for bandwidth depending on whether the customer is in AWS or not? Would be nice if you pass on the cost savings.


(Founder) Yes, we will charge less for private connectivity. Pricing is transparent https://s2.dev/pricing - free during preview.


Doesn't AWS charge $0.01 intra region and $0.02 between regions, even without setting up private links? Can't you pass part of those savings (compared to the $0.05-$0.09 of egress) on? Or is it too difficult to detect if the remote IP qualifies?


(Founder) Unfortunately, if you access over a public IP, it is internet egress. Even if the client is in the same AWS cloud region. PrivateLink is the only option.


Peering is an option as well but it's a whole different ballgame of complexities to set up vs. private link


List pricing is $0.05 per GB after 150TB and at high volume it’s cheaper than that


They’re probably betting on most users being in AWS and only having to pay 1¢-2¢ transfer.


They're also banking on scale to PPA with a specific amendment for egress.


Nobody with sufficient scale will be paying retail for data transfer.


Looks like they changed it to $0.08/GB. Which loses them at most $300/month at 50TB, and makes money after that.
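The arithmetic behind that $300 figure can be checked with a short sketch. The tier sizes and prices below are assumptions: they follow the AWS list prices quoted in this thread ($0.09/GB at the low end, $0.05/GB past 150TB) plus the intermediate tiers as commonly published, using decimal terabytes and ignoring any free allowance or negotiated discount. `aws_egress_cost` and `margin` are hypothetical helper names.

```python
# Assumed AWS internet egress list pricing: (tier size in GB, price per GB).
TIERS = [
    (10_000, 0.09),         # first 10 TB
    (40_000, 0.085),        # next 40 TB
    (100_000, 0.07),        # next 100 TB
    (float("inf"), 0.05),   # everything past 150 TB
]


def aws_egress_cost(gb: float) -> float:
    """Cost of sending `gb` gigabytes to the internet at the assumed list prices."""
    cost = 0.0
    remaining = gb
    for size, price in TIERS:
        used = min(remaining, size)
        cost += used * price
        remaining -= used
        if remaining <= 0:
            break
    return cost


def margin(gb: float, resale_price: float = 0.08) -> float:
    """Monthly revenue minus cost when reselling egress at a flat per-GB price."""
    return gb * resale_price - aws_egress_cost(gb)


if __name__ == "__main__":
    for tb in (10, 50, 80, 200):
        print(f"{tb} TB/month: margin = {margin(tb * 1000):.2f} USD")
```

Under these assumptions the loss bottoms out at 50TB, because every gigabyte past that point costs $0.07 or less and is resold at $0.08, with break-even at roughly 80TB.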


strat is likely just get users, then offboard aws if the product works.


(Founder) No, we want to be in the same cloud regions as customers.


So is this basically WarpStream except providing a lower-level API instead of jumping straight to Kafka compatibility?

An S3-level primitive API for streaming seems really valuable in the long-term if adopted


(Founder) That somewhat summarizes yes :) We take a different approach than WarpStream architecturally too, which allows us to offer much lower latencies. No disks in our system, either.


These folks knowingly chose to spend the rest of their careers explaining that they are not, in fact, S3.


(Founder) well 50% of our name is different


I like it. I see it as ostensibly a product for engineers and so when I see a name like S2 it's immediately clear that it's a product led and conceived by engineers.

I also see that on your pricing page -

"We are building the S3 experience for streaming data, and that includes pricing transparency"

Love the simple and earnest copy. One can imagine what an LLM would cook up instead, I find the brevity way preferable.


(Founder) Thank you for the kind comment!

Yes we are not trying to confuse S2 with S3, we just think S3 is the best damn serverless experience out there, and we aspire to that greatness. We borrowed the structure of that name to reflect that aspiration, as have other services inspired by S3 like Cloudflare's object store R2.


I actually thought S2 is a Cloudflare service at first.


You should have gone with S4 tbh. The suits love bigger numbers. Super Simple Stream Store.


http://www.supersimplestorageservice.com/ exists and calls itself S4. It's a decent gag and it immediately came to mind when I heard S2 vs S3.


(Founder) I have definitely received that advice before :) - to not seem like a regression from S3. But as an abbreviation for Stream Store, it made sense.


Why not just use SS? There can’t possibly be any negative connotations there.


You could even make the s look kind of like a lightning bolt to emphasize how fast it is


Quite dangerous. Will look almost like the Schutzstaffel runic insignia. I'd better avoid this resemblance.


thatsthejoke.jpg


SS .. as in nazi?


Reserved by GM for the Super Sport


So that's why GM has been asking itself "Are we the baddies?" lately.


S3++?


How do you store a stream? Don’t they just spray around the internet here and there, and if you don’t catch them in the moment, they’re just gone?


(Founder) I thought you were joking but coming back it could well be serious :)

When we say stream, we really mean The Log that Jay Kreps has a famous blog about https://engineering.linkedin.com/distributed-systems/log-wha...

We say stream because we would rather not be confused with "logs" as in application logs, but rather associate with the world of streaming data where this primitive is very relevant. We don't mean stream as in a TCP stream or live stream.

You can, however stream Star Wars on S2 ;-) https://s2.dev/docs/quickstart#get-started-with-the-cli


Surely S3++? /s


Disagree. You have a marketing opportunity for a hipster character named "Stu" to be the spokesman.


Disco Stu don't advertise


Props to you for having a sense of humor about it. :D

If I could put in one request...a video which describes what it is and how to use it would make it easier for me to understand.


(Founder) Yes we should create a video, thanks for the feedback.

In the meantime, check out this quickstart, which will have you streaming Star Wars with the S2 CLI and give you a pretty good sense of things https://s2.dev/docs/quickstart#get-started-with-the-cli

(You will have to apply to join the preview, but we are approving quickly)


You could say that. Or, in binary ASCII, you could say your name is 93.75% the same (it flips only the last bit of 16).
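For anyone who wants to check the 93.75% figure, here is a small sketch that compares two equal-length ASCII strings bit by bit. `bit_similarity` is a hypothetical helper name, not something from the thread.

```python
def bit_similarity(a: str, b: str) -> float:
    """Fraction of identical bits between two equal-length ASCII strings."""
    x = a.encode("ascii")
    y = b.encode("ascii")
    if len(x) != len(y):
        raise ValueError("strings must have the same length")
    total_bits = 8 * len(x)
    # XOR leaves a 1 wherever the two bytes differ; count those bits.
    differing = sum(bin(p ^ q).count("1") for p, q in zip(x, y))
    return (total_bits - differing) / total_bits


if __name__ == "__main__":
    print(bit_similarity("S2", "S3"))
```

'2' is 0x32 and '3' is 0x33, so only the lowest bit of the second byte differs: 15 of 16 bits match.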


You're 66.66% (2/3) of the way there to the second character too. So I would say you're only 16.66% different across the two characters.


I would look much more into Levenshtein Distance ;) if I would like to be smart ass funny.


You're 50% of the way closer to 1st!


How many of these letter-number storage services are there now? S3, B2, R2, S2...


S3 isn't the name of the service - that's "Amazon Simple Storage Service". S3 is a nickname, short for "Simple Storage Service".


Nickname implies it's unofficial, but S3 is very much the product name too:

https://aws.amazon.com/s3/faqs/

"Simple storage service" is used once. "S3" is used throughout.


While you’re technically correct, for all intents and purposes it is called S3 even by AWS themselves.


When I was a student we had a Facebook group to share information, and one angry guy ranted that the correct shortening of "Mathematical Analysis" is not, in fact, "anal", as we used to say


and EC2 stands for "Elastic Compute Cloud". but no one remembers that.



Seems preferable to having to explain you're not a paramilitary organization responsible for unspeakable war crimes. Nothing funny about that.


Including potentially in court / to lawyers? IANAL, but isn't this just inviting Amazon to claim it's deliberately leveraging their 'S3' trademark and sowing confusion in order to lift their own brand? (Correctly, and even somewhat transparently in TFA, IMO.)


My issue is that 2<3 and for most people they will just assume its older/shittier S3 lol


It looks neat but, no Java SDK? Every company I've personally worked at is deeply reliant on Spring or the vanilla clients to produce/consume to Kafka 90% of the time. This kind of precludes even a casual PoC.


(S2 Team member) As we move forward, a Java/Kotlin and a Python SDK are on our list. There is a Rust SDK and a CLI available (https://s2.dev/docs/quickstart). Rust felt like a good starting point for us as our core service is also written in it.


Merely as a "for your consideration," writing an SDK in a very, very different language can surface "rust-isms" in the way your API works that might not be obvious when using a homogeneous tech stack

I think of that as the "Chinese wall" of shipping SDKs: can someone not familiar with your product use it effectively from a language you don't know


I do like this. The next part I'd like someone to build on top of this is applying the stream 'events' into a point-in-time queryable representation. Basically the other part to make it a Datomic. Probably better if it's a pattern or framework for making specific in-memory queryable data rather than a particular database. There are lots of ways this could work, like applying to a local SQLite, or basing on a MySQL binlog that can be applied to a local query instance and rewindable to specific points, or more application-specific apply/undo events to a local state.
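A minimal sketch of that idea, assuming events are simple key/value updates that carry the sequence number assigned by the log. The `Event` shape and the `state_at` helper are hypothetical illustrations, not part of any S2 API.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Event:
    seq: int    # position in the stream, assigned by the log
    key: str
    value: Any  # None is treated as "delete this key"


def state_at(events: List[Event], as_of_seq: int) -> Dict[str, Any]:
    """Fold every event with seq <= as_of_seq into a key/value snapshot."""
    state: Dict[str, Any] = {}
    for ev in events:
        if ev.seq > as_of_seq:
            break  # events are assumed to be ordered by seq
        if ev.value is None:
            state.pop(ev.key, None)
        else:
            state[ev.key] = ev.value
    return state


if __name__ == "__main__":
    log = [
        Event(0, "a", 1),
        Event(1, "b", 2),
        Event(2, "a", None),
        Event(3, "b", 5),
    ]
    print(state_at(log, 1))  # view as of seq 1
    print(state_at(log, 3))  # view as of seq 3
```

Replaying from the start on every query is the simplest possible form. A real implementation would more likely keep periodic snapshots and replay only the events after the nearest one.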


Roughly ten years ago, I started Gazette [0]. Gazette is in an architectural middle-ground between Kafka and WarpStream (and S2). It offers unbounded byte-oriented log streams which are backed by S3, but brokers use local scratch disks for initial replication / durability guarantees and to lower latency for appends and reads (p99 <5ms as opposed to >500ms), while guaranteeing all files make it to S3 with niceties like configurable target sizes / compression / latency bounds. Clients doing historical reads pull content directly from S3, and then switch to live tailing of very recent appends.

Gazette started as an internal tool in my previous startup (AdTech related). When forming our current business, we very briefly considered offering it as a raw service [1] before moving on to a holistic data movement platform that uses Gazette as an internal detail [2].

My feedback is: the market positioning for a service like this is extremely narrow. You basically have to make it API compatible with a thing that your target customer is already using so that trying it is zero friction (WarpStream nailed this), or you have to move further up to the application stack and more-directly address the problems your target customers are trying to solve (as we have). Good luck!

[0]: https://gazette.readthedocs.io/en/latest/ [1]: https://news.ycombinator.com/item?id=21464300 [2]: https://estuary.dev


(S2 Founder) Congrats on the success with Estuary! You are not the first person to tell me there is no/tiny market for this. Clearly _you_ thought there was something to it, when you looked to HN for validation. We may do a lot more on top of S2, like offering Kafka compatibility, but the core primitive matters. I have wanted it. It gets reinvented in all kinds of contexts and reused sub-optimally in the form of systems that have lost their soul, and that was enough for me to have this conviction and become a founder.

ED: I appreciate where you are coming from, and understand the challenges ahead. Thank you for the advice.


The market is gobsmackingly huge, it's just the go-to-market entry points which are narrow.

In my opinion, the key is to find a value prop and positioning which lets prospects try your service while spending a minimum of their own risk capital / reputation points within their own org.

That makes it hard to go after core storage, because it's such a widely used, fundamental, and reliable part of most every company's infrastructure. You and I may agree that conventions of incremental files in S3 are a less-than-ideal primitive for representing streams, but plenty of companies are doing it this way just fine and don't feel that it's broken.

WarpStream, on the other hand, leaned in to the perceived complexity of running Kafka and the share of users who wanted a Kafka solution with the operational profile of using S3. Internal champions can sell trying their service because the prospect's existing thing is already understood to be a pain in the butt.

For what it's worth, if I were entering the space anew today I'd be thinking carefully about the Iceberg standard and what I might be able to do with it.


Fair :) Yes, we are pretty hyped about the possibilities with Iceberg, especially now with S3 Table buckets.


This is a very useful service model, but I'm confused about the value proposition given how every write is persisted to S3 before being acknowledged.

I suppose the writers could batch a group of records before writing them out as a larger blob, with background processes performing compaction, but it's still an object-backed streaming service, right?

AWS has shown their willingness to implement mostly-protocol compatible services (RDS -> Aurora), and I could see them doing the same with a Kafka reimplementation.


(S2 team member here)

> I suppose the writers could batch a group of records before writing them out as a larger blob, with background processes performing compaction, but it's still an object-backed streaming service, right?

This is how it works essentially, yes. Architecting the system so that chunks that are written to object storage (before we acknowledge a write) are multi-tenant, and contain records from different streams, lets us write frequently while still targeting ideal (w/r/t price and performance) blob sizes for S3 standard and express puts respectively.
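To illustrate the general pattern (not S2's actual implementation), here is a rough sketch of a writer that batches records from many streams into one chunk and only acknowledges them after the chunk has been written. `ChunkWriter`, `put_blob`, `target_bytes`, and the newline framing are all hypothetical, and a real system would also flush on a timer so that quiet periods don't delay acknowledgments.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ChunkWriter:
    put_blob: Callable[[bytes], None]  # stand-in for an object store PUT
    target_bytes: int = 1024           # hypothetical flush threshold
    _pending: List[Tuple[str, int, bytes]] = field(default_factory=list, init=False)
    _pending_bytes: int = field(default=0, init=False)
    _next_seq: Dict[str, int] = field(default_factory=dict, init=False)

    def append(self, stream: str, record: bytes) -> List[Tuple[str, int]]:
        """Buffer a record; return (stream, seq) pairs acknowledged by this call."""
        seq = self._next_seq.get(stream, 0)
        self._next_seq[stream] = seq + 1
        self._pending.append((stream, seq, record))
        self._pending_bytes += len(record)
        if self._pending_bytes >= self.target_bytes:
            return self.flush()
        return []

    def flush(self) -> List[Tuple[str, int]]:
        """Write all buffered records as one multi-stream chunk, then ack them."""
        if not self._pending:
            return []
        # Naive illustrative framing: one "stream:seq:payload" line per record.
        blob = b"".join(
            f"{stream}:{seq}:".encode() + record + b"\n"
            for stream, seq, record in self._pending
        )
        self.put_blob(blob)  # acknowledgments happen only after this returns
        acked = [(stream, seq) for stream, seq, _ in self._pending]
        self._pending = []
        self._pending_bytes = 0
        return acked


if __name__ == "__main__":
    blobs: List[bytes] = []
    writer = ChunkWriter(put_blob=blobs.append, target_bytes=10)
    writer.append("orders", b"1234")
    writer.append("clicks", b"5678")
    print(writer.append("orders", b"90"))  # threshold reached: one PUT, three acks
```

Because the chunk mixes streams, the blob size depends on aggregate traffic rather than on any single stream, which is what makes frequent writes compatible with a reasonable object size.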


Wait, data from multiple tenants is stored in the same place. Do you have per-tenant encryption key, or how else are you ensuring no bugs allow tenants to read others data?


(Founder) We will be using authenticated encryption with per-basin (our term for bucket) or per-stream keys, but we don't have this yet. This is noted on https://s2.dev/docs/security#encryption


Seems like really cool tech. Such a bummer that it is not source available. I might be a minority in this opinion, but I would absolutely consider commercial services where the core tech is all released under something like a FSL with fully supported self-hosting. Otherwise, the lock-in vs something like kafka is hard to justify.


(Founder) We are happy for S2 API to have alternate implementations, we are considering an in-memory emulator to open source ourselves. It is not a very complicated API. If you would prefer to stick with the Kafka API but benefit from features like S2's storage classes or having a very large number of topics/partitions or high throughput per partition, we are planning an open source Kafka compatibility layer that can be self-hosted, with features like client-side encryption so you can have even more peace of mind.


Having a kafka compatible API and S3 storage would be something I would jump to, the savings over MSK would be huge.

If you had a (paid for) API that sat on top of an S3 API for on-prem, that would be fantastic as well.

Kafka is great, but the whole Java ecosystem and the lack of control of what is in the topics and the stuff about co-ordinating the cluster in zookeeper is a management PITA.


Check out WarpStream (recently acquired by Confluent)


> we are considering an in-memory emulator to open source ourselves

I'd suggest a persistent emulator, using something like SQLite (one row per record). Even for local development, many applications need persistence. And it'd be even enough to run a single node low throughput production server which doesn't need robust durability and availability. But it still has enough overhead and limitations not to compete with your cloud offering.

What's however important is being as close as possible to your production system, behavior wise. So I'd try to share as much of the frontend code (e.g. the gRPC and REST handlers) as possible between these.
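A rough sketch of what such a persistent emulator could look like, using Python's built-in `sqlite3` module with one row per record. The class name, table layout, and method signatures are hypothetical and do not mirror the real S2 API. It assumes a single process with a single connection.

```python
import sqlite3
from typing import List, Tuple


class SqliteStreamEmulator:
    """Single-node, single-process stream store: one SQLite row per record."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS records ("
            " stream TEXT NOT NULL,"
            " seq INTEGER NOT NULL,"
            " body BLOB NOT NULL,"
            " PRIMARY KEY (stream, seq))"
        )

    def append(self, stream: str, body: bytes) -> int:
        """Append a record and return the sequence number it was assigned."""
        with self.db:  # commits on success, rolls back if an exception is raised
            row = self.db.execute(
                "SELECT COALESCE(MAX(seq) + 1, 0) FROM records WHERE stream = ?",
                (stream,),
            ).fetchone()
            seq = row[0]
            self.db.execute(
                "INSERT INTO records (stream, seq, body) VALUES (?, ?, ?)",
                (stream, seq, body),
            )
        return seq

    def read(self, stream: str, start_seq: int = 0, limit: int = 100) -> List[Tuple[int, bytes]]:
        """Return up to `limit` (seq, body) pairs starting at `start_seq`, in order."""
        cur = self.db.execute(
            "SELECT seq, body FROM records"
            " WHERE stream = ? AND seq >= ? ORDER BY seq LIMIT ?",
            (stream, start_seq, limit),
        )
        return cur.fetchall()


if __name__ == "__main__":
    store = SqliteStreamEmulator()
    store.append("events", b"first")
    store.append("events", b"second")
    print(store.read("events", start_seq=1))
```

Passing a file path instead of `":memory:"` gives the persistence the comment asks for. The `(stream, seq)` primary key also rejects duplicate sequence numbers if the assumptions above are violated.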


First-class kafka compatibility could go a long way to making it a justifiable tech choice. When orgs go heavy on event streaming, that code gets _everywhere_, so a vendor off-ramp is needed.


(Founder) That makes sense. We would eventually host the Kafka layer too - and will be able to avoid a hop by inlining our edge service logic in there.


I look at the egress costs to internet and it doesn’t check out. It’s a premium product dependent on DX, marketed to funded startups.

But if I care about ingress and egress costs, which many stream heavy infrastructure providers do.. This doesn’t add up.

I wish them luck, but I feel they would have had a much better chance from the start by getting some funding and having a loss leader start, then organising and passing on wholesale rates from cloud providers once they’d reached critical mass.

Instead they’re going in at retail which is very spicy. I feel like someone will clone the tech and let you self host, before big players copy it natively.

It’s a commodity space and they’re starting with a moat of a very busy 2 weeks from some Staff engineers at AWS.


(Founder) Thanks for sharing your thoughts. We are early and figuring things out. I agree egress cost is going to be a big concern. We want to do the best we can for users as we unlock some scale. During preview, we are focused on getting feedback so the service is free (we will need to talk if the usage is significant though).


Just you wait, I am launching S1 next year!


Ok good, my startup S½ (also known as Ç) is still unique, phew


Dibs on S0


your incident lingo will be fun.


Wow, imagine Debezium offering native compatibility with this, capturing the changes from a Postgres database, saving them as delta or iceberg in a pure serverless way!


I wish more dev-tools startups would focus on clearly explaining the business use cases, targeting a slightly broader audience beyond highly technical users. I visited several pages on the site before eventually giving up.

I can sort of grasp what the S2 team is aiming to achieve, but it feels like I’m forced to perform unnecessary mental gymnastics to connect their platform with the specific problems it can solve for a business or product team.

I consider myself fairly technical and familiar with many of the underlying concepts, but I still couldn’t work out the practical utility without significant effort.

It’s worth noting that much of technology adoption is driven by technical product managers and similar stakeholders. However, I feel this critical audience is often overlooked in the messaging and positioning of developer tools like this.


(Founder) Appreciate the feedback. We will try to do a better job on the messaging. It is geared at being a building block for data systems. The landing page has a section talking about some of the patterns it enables (Decouple / Buffer / Journal) in a serverless manner, with example use cases. It just may not be something that resonates with you though! We are interested in adoption by developers for now.


I think they're saying that you should provide some example use-cases for how someone would use your service. High-level use-cases that involve solving problems for a business.

For what it's worth, I am already familiar with this design space well enough that I don't need this kind of example in order to understand it. I've worked with Kinesis and other streaming systems before. But for people who haven't, an example might help.

What kind of business problem would someone have that causes them to turn to your service? What are the alternative solutions they might consider and how do those compare to yours? That's the kind of info they're asking for. You might benefit from pitching this such that people will understand it who have never considered streaming solutions before and don't understand the benefits. Pitch it to people who don't even realize they need this.


(Founder) Yes I understand, and this could definitely do with work. I struggle with it personally because it is so obvious to me. I don't even know where to start? How do you pitch use cases for object storage? Stream storage feels just as universal to me.


If you ever figure it out, LMK. I don't think I've ever looked at logs more than about 24 hours old. Persistence and durability is not something I care about.

Errors, OTOH, I need a week or two of. But I consider these 2 different things. Logs are kind of a last resort when you really can't figure out what's going on in prod.


Here "log" means "append-only stream of small records". This isn't just about traditional logs (including http request logs and error logs). You could use it to store events for an event-sourced application, and even as the Write-Ahead-Log (WAL) for a database.

A distributed, but still consistent and durable log is a great building block for higher level abstractions.


That makes more sense. I suppose an audit log would also fit. I guess append-only backups wouldn't fit the "small" requirement though.


"Small" means 1MiB per record here. But a higher level abstraction could split one logical operation into multiple records. Just like FoundationDB has severe limits on its transaction size, while higher level databases built on top of it work around that limit.

The OP's blog post linked to this article, which explains some scenarios where this storage primitive would be helpful: https://engineering.linkedin.com/distributed-systems/log-wha...

This product offers two advantages over S3: 1) Appending a small amount of data is cheap 2) Writes are forced into a consistent order (so you don't need to implement Paxos or Raft yourself). Neither of these is useful for backups. Raw S3 already works well for that use case, especially now that Amazon added support for pre-conditions.


"Replace our MSK clusters and EBS storage with S3 storage costs."


1. Do you support compression for data stored in segments?

2. Does the choice of storage class only affect chunks or also segments?

To me the best solution seems like combining storing writes on EBS (or even NVMe) initially to minimize the time until writes can be acknowledged, and creating a chunk on S3 standard every second or so. But I assume that would require significant engineering effort for applications that require data to be replicated to several AZs before acknowledging them. Though some applications might be willing to sacrifice 1s of writes on node failure, in exchange for cheap and fast writes.

3. You could be clearer about what "latency" means. I see at least three different latencies that could be important to different applications:

a) time until a write is durably stored and acknowledged

b) time until a tailing reader sees a write

c) time to first byte after a read request for old data

4. How do you handle streams which are rarely written to? Will newly appended records to those streams remain in chunks indefinitely? Or do you create tiny segments? Or replace an existing segment with the concatenated data?


(Founder) Thanks for the deep questions!

1) Storage is priced on uncompressed data. We don't currently compress segments.

2) It only affects chunk storage. We do have a 'Native' chunk store in mind, the sketch involves introducing NVMe disks (as a separate service the core depends on) - so we can offer under 5 millisecond end-to-end tail latencies.

3) The append ack latency and end-to-end latency with a tailing reader are largely equivalent for us since latest writes are in memory for a brief period after acknowledgment. If you try the CLI ping command (see GIF on landing page) from the same cloud region as us (AWS us-east-1 only currently), you'll see end-to-end and append ack latency as basically the same. TTFB for older data is ~ TTFB to get a segment data range from object storage, so it can be a few hundred milliseconds.

4) We have a deadline to free chunks, so we PUT a tiny segment if we have to.
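
A rough sketch of the "deadline to free chunks" idea in answer 4, as I read it: buffered chunks get rolled into a segment either when enough bytes have accumulated or when the oldest chunk has waited too long, even if that makes for a tiny segment. The class name, both thresholds, and the in-memory representation are assumptions for illustration, not S2's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ChunkBuffer:
    """Holds small chunks and decides when to roll them into one segment."""
    max_bytes: int = 1024        # hypothetical size threshold
    max_age_s: float = 5.0       # hypothetical deadline for freeing chunks
    chunks: List[bytes] = field(default_factory=list)
    oldest_ts: Optional[float] = None

    def add(self, chunk: bytes, now: float) -> None:
        # Remember when the oldest buffered chunk arrived.
        if not self.chunks:
            self.oldest_ts = now
        self.chunks.append(chunk)

    def should_flush(self, now: float) -> bool:
        if not self.chunks:
            return False
        size = sum(len(c) for c in self.chunks)
        # Flush on size, or on deadline even if the segment would be tiny.
        return size >= self.max_bytes or now - self.oldest_ts >= self.max_age_s

    def flush(self) -> bytes:
        # Concatenate the chunks into one segment (the PUT is left out here).
        segment = b"".join(self.chunks)
        self.chunks.clear()
        self.oldest_ts = None
        return segment
```

With a rarely written stream, a single 3-byte chunk would sit in the buffer until the deadline passes and then be flushed on its own.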


> To me the best solution seem like combining storing writes on EBS (or even NVMe) initially to minimize the time until writes can be acknowledged, and creating a chunk on S3 standard every second or so.

Yep, this is approximately Gazette's architecture (https://github.com/gazette/core). It buys the latency profile of flash storage, with the unbounded storage and durability of S3.

An addendum is there's no need to flush to S3 quite that frequently, if readers instead tail ACK'd content from local disk. Another neat thing you can do is hand bulk historical readers pre-signed URLs to files in cloud storage, so those bytes don't need to proxy through brokers.
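
To make the pattern described above concrete, here is a toy model of a log that acknowledges appends once they land in a fast local tier, flushes to a bulk tier later, and serves reads from whichever tier holds the offset. It is only a sketch: plain Python lists and a dict stand in for local disk and S3, and none of the names come from Gazette or S2.

```python
from typing import Dict, List


class TieredLog:
    """Acks appends from a local tier; older data lives in a bulk tier."""

    def __init__(self) -> None:
        self.local: List[bytes] = []               # stand-in for EBS/NVMe
        self.local_base = 0                        # offset of local[0]
        self.remote: Dict[int, List[bytes]] = {}   # stand-in for S3: base offset -> records

    def append(self, record: bytes) -> int:
        # The write is acknowledged as soon as the local tier has it.
        self.local.append(record)
        return self.local_base + len(self.local) - 1

    def flush(self) -> None:
        # Move everything buffered locally into one remote file.
        if self.local:
            self.remote[self.local_base] = self.local
            self.local_base += len(self.local)
            self.local = []

    def read(self, offset: int) -> bytes:
        # Tailing readers hit the local tier; historical readers hit the bulk tier.
        if offset >= self.local_base:
            return self.local[offset - self.local_base]
        for base, records in self.remote.items():
            if base <= offset < base + len(records):
                return records[offset - base]
        raise IndexError(offset)
```

Because tailing readers never wait on the bulk tier in this model, how often `flush` runs only affects how much data is at risk on the local tier, not read latency.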


This is a very interesting abstraction (and service). I can’t help but feature creep and ask for something like Athena, which runs PrestoDB (map reduce) over S3 files. It could be superior in theory because anyone using that pattern must shoehorn their data stream (almost everything is really a stream) into an S3 file system. Fragmentation and file packing become requirements that degrade transactional qualities.


(Founder) There are definitely some interesting possibilities. Pretty hyped about S3 Table (Iceberg) buckets. An S2 stream could buffer small writes so you can flush decent-size Parquet files into the table, and avoid compaction costs.


My first thought: "introducing? The S2 has been out for a while!"

https://www.sunlu.com/products/new-version-sunlu-filadryer-s...



This is cool but I think it overlaps too much with something like Kinesis Data Streams from AWS which has been around for a long time. It’s good that AWS has some competition though


(Founder) We plan to be multi-cloud over time. Kinesis has a pretty low ordered throughput limit (i.e. at the level of a stream shard) of 1 MBps, in case you need higher. S2 will be cheaper and faster than Kinesis with the Express storage class. S2 also has a more serverless pricing model - closer to S3 - than paying for stream shard hours.


Thanks. You are right about those points. One thing to probably consider is whether serverless provides enough cost savings for most streaming ingest use cases, which need static provisioning since ingest volumes are unpredictable. Better messaging would be that your serverless model can handle bursts well. (For context: I used to sell KDA and KDS at AWS as part of AI solutions.)


In the long-term, how different do you want to be from Apache Pulsar? At the moment, many differences are obvious, e.g., Pulsar offers transactions, queues and durable timers.


(Founder) We want S2 to be focussed on the stream primitive (log if you prefer). There is a lot that can be built on top, which we mostly want to do as open source layers. For example, Kafka compatibility, or queue semantics.


so the naming convention for 2024-25 products seems to be <letter><number>.

o1, o3, s2, M4, r2, ...


In terms of a pitch, i'm not sure i understand how this differs from existing solutions. Is the core value proposition a simpler api?


(Founder) Besides simple API,

- Unlimited streams. Current cloud systems limit to a few thousand. With dedicated clusters, few hundred K? If you want a stream per user, you are now dealing with multiple clusters.

- Elastic throughput per stream (i.e. a partition in Kafka) to 125 MiBps append / 500 MiBps realtime read / unlimited in aggregate for catching up. Current systems will have you at tens. And we may grow that limit yet. We are able to live migrate streams in milliseconds while keeping pipelined writes flowing, which gives us a lot of flexibility.

- Concurrency control mechanisms (https://s2.dev/docs/stream#concurrency-control)


Forgot to mention storage classes to tune your latency vs cost tradeoff. That you can even reconfigure - soon we will make that a live migration.


Seems really good for IoT no? Been a while since I worked in that space, but having something like this would have been nice at the time.


(Founder) so many possibilities! That's what I love about building building blocks. I think we will create an open source layer for an IoT protocol over time (unless community gets to it first), e.g. MQTT. I have to admit I don't know too much about the space.


I had an idea like this a few years ago: basically exposing a stream interface to a cloud-based fs to enable random-access seeking on bytestreams. I envisioned it being useful for things like loading large files. It would be amazing for enabling things like cloud gaming, image processing and CAD.

Kudos for sitting down and making it happen!


Definitely a useful API but not super compelling until I could store the data in my own bucket


So is this a "serverless" named-pipe-as-a-service cloud offering? Or am I misreading?


Yep. Just tack "serverless" onto something that already exists and charge for it


(Founder) Named pipe that operates at the level of records, is durable regionally, you can read from any sequence, and lets you do concurrency control for writes if you need to.
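
One way to picture "a named pipe at the level of records that you can read from any sequence" is the toy in-memory model below. It is only an illustration of the semantics described in the comment: the class and method names are made up, and durability and concurrency control are left out entirely.

```python
from typing import List, Tuple


class RecordStream:
    """Append-only sequence of records addressed by sequence number."""

    def __init__(self) -> None:
        self._records: List[bytes] = []

    def append(self, batch: List[bytes]) -> Tuple[int, int]:
        # Records in a batch get consecutive sequence numbers.
        start = len(self._records)
        self._records.extend(batch)
        return start, len(self._records)   # half-open range [start, end)

    def read(self, start_seq: int, limit: int = 10) -> List[Tuple[int, bytes]]:
        # Unlike a pipe, reading does not consume; any reader can start anywhere.
        end = min(start_seq + limit, len(self._records))
        return [(seq, self._records[seq]) for seq in range(start_seq, end)]
```

The contrast with an ordinary pipe is that reads are non-destructive and positioned, so a late reader can start from sequence number 0 while another tails the end.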


How does this compare to https://github.com/deuxfleurs-org/garage ?

Seems like there are a lot more lightweight self-hosted S3 alternatives around nowadays. Why even use S3?


I really liked the landing page and the service, but it took me a while to realize it wasn't an AWS service with a snazzy landing page.


Apparently this is “S2, a new S3 competitor” not “S2, the spatial index system based on hierarchical quadrilaterals”.


How does this compare to Kafka? Is the primary difference that this is a hosted solution?


Is it possible to bring my own cloud account to provide the underlying S3 storage?


(Founder) Not currently! We want to explore this.


Really interesting service and bookmarked.

I'd really love this extending more into the event sourcing space not just the log/event streaming space.

Dealing with problems like replay and log compaction etc.

Plus things like dealing with old events. Under GDPR, removing personal information/isolating it from the data/events themselves in an event sourced system are a PITA.


(Founder) An S2 stream is a durable log and can be replayed! We do want to add compaction support. Event sourcing is a great use case for S2.
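
For readers less familiar with event sourcing, "replay" here means rebuilding current state by folding over the log from the start (or from a snapshot). A minimal sketch, assuming a made-up event shape of (kind, account, amount) and nothing about S2's actual record format:

```python
from typing import Dict, Iterable, Optional, Tuple

Event = Tuple[str, str, int]  # (kind, account, amount): hypothetical event shape


def replay(events: Iterable[Event],
           start_state: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Rebuild account balances by applying events in log order."""
    state = dict(start_state or {})
    for kind, account, amount in events:
        if kind == "deposit":
            state[account] = state.get(account, 0) + amount
        elif kind == "withdraw":
            state[account] = state.get(account, 0) - amount
    return state
```

Replaying the tail of the log on top of a snapshot gives the same result as replaying everything. Compaction would shorten such a log, but it is not shown here.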


Would this be like an alternative to Delta? Am I thinking about that right?


Pretty bad branding! It should have at least been S4!


S2 is, in my opinion, the sweet spot of PRS's lineup.


Related to an old comment of yours:

> I also kind of strongly dislike HtDP.

I'm researching programming pedagogy and I'm curious about your thoughts on this.


This would sell much better if it was an S5 or S6 next-level thing.

Wow man, are you still stuck on S3?


"Making the world a better place through streamable, appendable object streams"


Scribe aaS? ;)


Kafka as a service ?


(Founder) Nope! We have a FAQ for this ;)


Can someone tell me what this does? And why it's better?


(Founder) There is a table on the landing page https://s2.dev/ which hopefully gives a nice overview :) It's like S3, but for streams. Cheap appends, and instead of dealing with blocks of data and byte ranges, you work with records. S2 takes care of ordering records, and letting you read from anywhere in the stream.

This is an alternative to systems like Kafka which don't do great at giving a serverless experience.


Could you clarify the Kafka difference further?

Or more generally, when is it better to choose S2 vs services like SQS or Kinesis?

S2 sounds like an ordered queue to me, but those exist?


(Founder here) Managed cloud offerings for streaming limit ordered throughput pretty low, e.g. Kinesis at 1 MiBps, Redpanda serverless at 1 MiBps, Confluent's even higher-end clusters at 10-20 MiBps IIRC. If you really need ordering, this can indeed be a limit. S2 lets you push 125 MiBps currently, and we may grow that.

Another factor is how many ordered streams you can have. Typically a few thousand at most with those systems. We take the serverless spirit of S3 here, when did you have to worry about the number of objects in a bucket?

We are also able to offer latency comparable to disk-based streaming like Confluent's Kora and Kinesis, with our Express storage class (under 50 milliseconds end-to-end latency for client in the same cloud region) - while being backed by S3 with regional durability! Not a disk in the system.

We want people to be able to build safe distributed data systems on top of S2, so we also allow concurrency control mechanisms on the stream like fencing. Kafka or Kinesis won't let you do that. This is the approach AWS takes internally (https://brooker.co.za/blog/2024/04/25/memorydb.html), but they don't have that as a service. We want to democratize the pattern.

ED: on throughputs, to clarify, I am talking about ordered throughput, i.e. per Kafka partition or Kinesis shard. WarpStream also does well here because of their architectural approach to separate ordering, but at a latency cost.


Between your site copy and your early comments on this thread, it was this rundown that made the product click in my mind.

I’m sure that in this early preview you’re trying to reach mainly devs with existing domain expertise, but the way that, in this comment, you laid out existing constraints and what possibilities might lie beyond them—it really helped me situate your S2 product in the constellation of cloud primitives.

Just wanted to offer that feedback in the hope that the spirit of your comment here doesn’t get buried down-thread!


thank you for the feedback!


Hey congrats! Looks like a really cool idea.

Looks like you're pushing the throughput angle - that could be important, but IMO it's not often you come across devs who need this level of throughput without dealing with a large-scale problem. My feedback is that the lack of per-tenant encryption is a big deal breaker here, since you're mixing up data of different tenants within one object.

Plus your security section says very little about how you prevent cross-tenant data contamination - that's probably the first thing that popped up in my mind when I read about your data model. It makes me extremely uneasy - and I can't imagine adopting this for anything serious. I would encourage you to think about how you can communicate that angle to the customer as well, besides supporting per-tenant encryption keys.


(Founder) It's a number of dimensions. I get excited about the ordered throughput angle because I have personally cared about this in the past, and yeah a lot of folks may not need that :)

Simple API, reasonable pricing, latency flexibility, unlimited streams, _and_ elastic to high throughputs. All adding up to a great serverless experience.

Re: the data colocation. This is how most multi-tenant systems - including S3 itself AFAIU - operate. I understand there is a difference in level of trust vs a cloud provider, and the best we can do here while delivering a serverless experience is encrypting every single record at the edge of S2 where they transit in or out, with a tenant-specific key. We may even allow specifying it as part of the request, if clients want to manage the key themselves.

The best data security when leveraging any multi-tenant service is going to be client-side encryption, and we also want to make this super easy. With our planned Kafka layer, we plan on client-side encryption as a value add.


I failed to mention that we do want to support single-tenant cells for customers that need isolation.


@agallego Yes in aggregate both Confluent and Redpanda can push GiBps throughputs, and I know Redpanda has amazing perf. I was referring to Redpanda Serverless :) And per-partition i.e. ordered throughput.

ED: for some reason I wasn't seeing the reply link before on your comment, do see it now.


coo cool right on.


Redpanda cloud doesn’t limit tput. Most ppl get a bigger discount at high volumes. We have customers in 10s of GB/s. Confluent has those volumes too.


Sort of serverless Kafka, which natively uses object storage and promises better latencies than things like warpstream.


An interesting difference is the ability to have exclusive access to writes on the log (the fencing token). This allows you to use the logs as write-ahead logs.
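
A toy illustration of how a fencing token can give a log a single writer at a time: appends carry a token, and once a new writer installs its own token, appends from the old writer are rejected. This is a simplified sketch with invented names; how S2 actually sets and checks tokens is described in its docs and may differ.

```python
from typing import List, Optional


class FencedError(Exception):
    """Raised when an append carries a token that is no longer current."""


class FencedLog:
    def __init__(self) -> None:
        self.records: List[bytes] = []
        self.token: Optional[str] = None

    def fence(self, new_token: str) -> None:
        # A new writer takes over by installing its own token.
        self.token = new_token

    def append(self, record: bytes, token: str) -> int:
        # A stale writer (e.g. one that was partitioned away) is rejected here.
        if token != self.token:
            raise FencedError("stale fencing token")
        self.records.append(record)
        return len(self.records) - 1
```

This is what makes the write-ahead-log use case safe in principle: after a failover, the old primary cannot slip late writes into the log behind the new primary's back.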


It's a message queue on the cloud.

https://chatgpt.com/c/676703d4-7bc8-8003-9e5d-d6a402050439

Edit: Keep downvoting, only 5.6k to go!


Thank you


[flagged]


Indeed... we sure wish we could have nabbed that crate name, but it was not to be. Our Rust SDK is here https://lib.rs/crates/streamstore


Replying to this one since you apparently can't reply to a comment that has been flagged. Why was the grandparent flagged? Google's S2 library has been around for more than a decade and is the first thing I think of when I see "S2" in a tech stack.

And the flippant response from the parent here that they don't really care that they're muddying the waters and just want the crate name is irksome.


Serverless pricing to me is exactly like the ETH gas pricing !



