Correct, a common mistake people make is conflating these things. I wrote this several years ago about MongoDB:
One thing that helps is if people stop referring to things as SQL / NoSQL as what ends up happening is various things get conflated.
When talking about stores, it's important to be explicit about a few things:
1. Storage model
2. Distribution model
3. Access model
4. Transaction model
5. Maturity and competence of implementation
What happens is people talk about "SQL" as either an NSM or DSM storage model, over either a single node, or possibly more than that in some of the MPP systems, using SQL as an access model, with linearizable transactions, and a mature competent implementation.
NoSQL when most people refer to it can be any combination of those things, as long as the access model isn't SQL.
I work on database engines, and it's important to decouple these things and be explicit about them when discussing various tradeoffs.
You can do SQL the language over a distributed k/v store (not always a great idea) and other non-tabular / relational models and you can distribute relational engines (though scaling linearizable transactions is difficult and doesn't scale for certain use cases due to physics, but that's unrelated to the relational part of it).
Generally people talk about joins not scaling in some normalized form, but then what they do is just materialize the join into whatever they are using to store things in a denormalized model, which has its own drawbacks.
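To make the tradeoff concrete, here is a minimal sketch of the two approaches in plain Python. The `users` / `orders` tables and all function names are hypothetical, made up for illustration; no real engine works on dicts like this.

```python
# Normalized model: each fact lives in exactly one place.
users = {1: {"name": "Ada", "city": "London"}}
orders = [
    {"order_id": 100, "user_id": 1, "total": 30},
    {"order_id": 101, "user_id": 1, "total": 45},
]

def join_at_read(orders, users):
    """Join orders to users at query time; returns fresh copies of the rows."""
    return [{**order, **users[order["user_id"]]} for order in orders]

# Denormalized model: the join is materialized once, at write time.
materialized = join_at_read(orders, users)

def update_city_normalized(users, user_id, city):
    """One row changes, and every later join sees the new value."""
    users[user_id]["city"] = city
    return 1  # rows touched

def update_city_denormalized(rows, user_id, city):
    """Every materialized copy has to be found and rewritten."""
    touched = 0
    for row in rows:
        if row["user_id"] == user_id:
            row["city"] = city
            touched += 1
    return touched
```

Reads against `materialized` need no join, but an update to the user now fans out to every copied row, and until that happens the copies are stale. That is one of the drawbacks I mean.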
As to the comment above you, SQL vs NoSQL also doesn't have anything to do with the relative maturity of anything. Some of the newer non-relational engines have some operational issues, but that doesn't really have anything to do with their storage model or access method, it just has to do with the competence of the implementation. MongoDB is difficult operationally not because it's not a relational engine, but because it wasn't well designed.
Just like people put SQL over non-tabular stores, you can build non-tabular / relational engines over relational engines (sharding PostgreSQL etc.). In fact major cloud vendors do just that.
Cassandra is not a columnar database, columnar in this sense is about the storage layout. Values for a column are laid next to each other in storage, thus allowing for optimizations in compression and computation, at the expense of reconstructing entire rows. Postgres is a row store, meaning all the columns for a row are stored next to each other in storage, which makes sense if you need all of the values for a row (or the vast majority).
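A rough sketch of the difference, assuming a tiny hypothetical table held in Python lists. Real engines lay this out in pages and files on disk; the run-length encoding here just stands in for the kind of compression a column layout makes easy.

```python
from itertools import groupby

# Hypothetical table with columns (id, country, amount); values are illustrative.
rows = [
    (1, "us", 10),
    (2, "us", 12),
    (3, "de", 7),
    (4, "de", 9),
]

# Row layout: all the values of one row sit next to each other.
row_store = [value for row in rows for value in row]

# Column layout: all the values of one column sit next to each other.
column_store = {
    "id": [r[0] for r in rows],
    "country": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}

def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

def reconstruct_row(store, index):
    """Rebuild a full row from the column layout: one lookup per column."""
    return tuple(column[index] for column in store.values())

total_amount = sum(column_store["amount"])  # scans a single column only
encoded_country = run_length_encode(column_store["country"])
third_row = reconstruct_row(column_store, 2)
```

The aggregate and the compression only touch one contiguous column, while getting a whole row back costs a lookup in every column. In the row layout it is the other way around.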
Having used both TSDB and ClickHouse in anger I have some thoughts on this:
They are both fantastic engines, I really like that both have made very specific tradeoffs and can be very clear in what they are good and bad at. Having worked on database engines, I can appreciate the complexity that they are solving.
My most recent use is with ClickHouse, which is great and I think a complete game-changer for the company. However there's a lot of issues (that are being worked on, the core team is great, though there are a few personalities that are a bit frosty to deal with). All of these comments come with love for the system.
1. Joins really need some work, both in the kinds of algorithms (pk aware, merge joins that don't do a full sort etc.), and in query optimizer work to make them better. We have analysts that use our system, and telling them to constantly write subqueries for simple joins is a total PITA. Not having PK aware joins is a massive blocker for higher utilization at our company, which really loves CH otherwise.
2. Some personalities will tell you that not having a query optimizer is a feature, and from an operational standpoint, it is nice to know that a query plan won't change, or try and force the optimizer to do the right thing. However, given #1, making joins performant (we have one huge table with trillions of rows, and a few smaller ones with billions) is really rough.
3. The operations story really needs some work, especially the distribution model. The model of local tables with a distributed table over it is difficult to work with personally. It would be nice to just be able to plug servers in without a lot of work, like Scylla, and not have two tables that you have to keep schemas consistent with. There's also just some odd behavior, like if you insert async into a distributed table, and only have a few shards, it'll only use a thread per shard to move that data over. It would be nice if there wasn't as much to think about.
4. Following #3, there's just too many knobs, maybe if they had a tuning tool or something that would help, but configuring thread pools is difficult to get right. I suspect CH could use a dedicated scheduler like Scylla's, that could dispatch the work, instead of relying on the OS.
5. The storage system relies a lot on the underlying FS and settings on when to fsync etc. I suspect if they had a more dedicated storage engine (controlled by the scheduler above), things could be more reliable. I still don't fully trust data being safe with CH.
6. Deduplication - This is a hard problem, but one that is really difficult to solve in CH. We solve it by having our inserters coordinate so that they always produce identical blocks, using replacing merge trees to catch stragglers (maybe), but it isn't perfect. A suggestion if possible is to try and put the same keys into the same parts, so they'll always get merged out by the replacing merge tree (I understand this is difficult).
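On #6, a toy model of why the replacing merge tree only catches stragglers "maybe". This is my own simplification in Python, not ClickHouse code: a part is a list of hypothetical `(key, version, value)` rows, and a merge keeps the highest version per key.

```python
def merge_parts(*parts):
    """Merge parts, keeping only the row with the highest version per key."""
    latest = {}
    for part in parts:
        for key, version, value in part:
            if key not in latest or version >= latest[key][0]:
                latest[key] = (version, value)
    return sorted((key, version, value) for key, (version, value) in latest.items())

def scan(parts):
    """Read every row from every part, with no deduplication at read time."""
    return [row for part in parts for row in part]

# Key "a" was inserted twice (say, a retry) and landed in two different parts.
part_1 = [("a", 1, "first"), ("b", 1, "only")]
part_2 = [("a", 2, "retry")]
part_3 = [("c", 1, "other")]

# A merge that never pairs part_1 with part_2 leaves the duplicate in place.
partial = scan([merge_parts(part_1, part_3), part_2])

# Only once both copies end up in the same merge does the duplicate go away.
full = scan([merge_parts(part_1, part_2, part_3)])
```

In this model the duplicate for "a" survives until both copies take part in the same merge, which is why steering the same keys into the same parts would help.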
The CH team is great, and these will be fixed in time, but these were the problems we ran into with CH.
TSDB was really solid, but we never used it at a scale where it would tip over. Our use case is really aligned with Yandex's so a lot of the functionality they have built is useful to us in a way that TSDB's isn't. (Also, being able to page data to S3 is amazing).
Thanks for your excellent contribution to this discussion. As the post author I wholly agree with your approach: if a solution hits the sweet spot for you in the context of your requirements that's the one you choose. Thank you for considering TimescaleDB alongside ClickHouse in what was obviously a well thought through assessment of these two excellent technologies.
Thanks for the great, thoughtful feedback. We (Timescale) couldn't agree more that there is a lot to love about ClickHouse, especially where it truly excels.
Information like this is helpful for others currently in the "choose the right tool" part of the job and to the developers of the product. I can't imagine how different all of our offerings will look in a few more years! :-)
Generally these tend to be systems that work with machine generated data, my experience is with sensor data generated by automobiles (automated car efforts).
Naive solutions tend to either summarize the data, store as logs and then run batch processes to index in some form (or leave unindexed and just brute force the computation), or limit the incoming data rate to whatever could be indexed.
These can work for some use cases, but make it very difficult to operationalize these data sources (i.e., use them to make real-timeish decisions).
Even human generated data sources (fb / twitter etc.) can generate something close to that data rate.
I've considered something like that, but instead of trying to figure out crimes, it would produce a score for bills.
A corruption score for bills, almost like a facebook for bills "This bill is friends with Exxon". It would figure out who spent the most getting the bill passed, and who they bought off to get it.
Just a simple thing for people to point to when they say things are corrupt. Granted in today's environment, that score would be 100% most of the time, but it would be interesting to have some idea just who bought the bill.
Some of the coding-friendly news orgs like ProPublica have done one-off versions of this. It'd be great to have an ongoing tool to check the scores at any time.
If you are bootstrapping, and you are starting a company in an area that does not need the top tier of engineers (which is most of them, regardless of how they talk about hiring the best), I'd consider starting someplace low cost.
I think college towns in the midwest / rust belt are untapped resources, or areas that used to have strong technically focused companies that moved away. I've personally seen founders bootstrap and launch successful companies (B2B with real revenue, not Uber for cats) in areas with little tech presence.
The ability to not have to shell out huge salaries and equity was a real winner, there's also less distractions. You can get strong engineers (no, not SV / Seattle top end engineers, but people that can throw together a reasonable website) for < 100k in these areas, and they won't bounce around as much. You aren't competing with AmaFaceGoogSoft here, you're competing with HR companies and random consulting houses. There's disadvantages to being outside of the tech bubble, but advantages too. That being said, for an employee, you should at least try and do some time in SV / Seattle etc.
I share your bullishness about "untapped" cities, but:
> ...starting a company in an area that does not need the top tier of engineers...
As far as quality of talent, the "top tier of engineers" aren't all in SF, Seattle, or NYC. There are plenty of "top tier" engineers in all kinds of places.
If you mean the "top tier" in pay rates are in those cities, that's probably true. But those cities also have a ridiculous cost of living adjustment baked in.
I might be misreading what you're writing here, but it sounds like you're saying, "Try founding a place in Champaign, Ill. The engineers aren't top notch, but they're cheap enough to be worth it." I just wanted to push back against this meme.
Also: Even though there are many top-tier engineers in those hotspots, just how gettable are they if you're hiring in SV/Seattle/NYC?
Based on my purely anecdotal experience, engineers that would be easy no-hires in Chicago get scooped up into senior roles in the valley because everyone there is so desperate to hire anyone who can fizzbuzz. It's the best area to be in if you're one of the best... and also if you're one of the worst. Top-flight talent exists in these "hotspots" but they're all "Unobtanium" for anything that isn't a unicorn, IPO'd, or founded by one of their friends.
> Top-flight talent exists but they're all "Unobtanium" for anything that isn't a unicorn, IPO'd, or founded by one of their friends
I disagree; top talent is very obtainable, provided that you have:
* decent compensation
* great culture
* interesting problems to solve
* a great hiring process
We've grown our engineering team from 12 to 29 since opening up shop in Kansas City this February, and I could not be happier with the quality of people that we've found.
> Based on my purely anecdotal experience, engineers that would be easy no-hires in Chicago get scooped up into senior roles
I left a Chicago Loop job 3 years ago for a fully remote startup, moved to Tampa, moved from that fully remote startup to a Tampa-based org, and Chicago recruiters get in touch 2-3 times a week offering full relo expenses. There literally is an insane amount of demand and people with very little experience being pulled into Chicago roles.
Hello! :wave: If you're going to be at the St Pete Night Market tonight, happy to buy you a beverage (my email is in my HN profile)!
EDIT: I know I'm not allowed to complain about downvotes, but I'm doing it; why would someone downvote this comment? The poster I'm replying to is near me IRL, and I am just trying to be friendly and connect in person (and they have no contact info on their profile). No need to downvote or upvote.
It's easier to take a risk on a startup in a tech hub because you can likely find a job elsewhere if things don't work out. In a non-hub area, it's riskier to take the leap because fallback options are much more limited.
In my limited experience outside of one of those hubs I can say that's definitely a factor. I've spoken to a few good candidates who, when decision time came, got cold feet for fear that the local ecosystem wasn't strong enough to provide a security cushion in case something went poorly.
This is what ultimately led to my moving to the Bay Area despite the pay increase not being even close to the cost of living increase. In Nowheresville, USA, a job change usually means moving to a different city. In the Bay Area and other tech hubs, you have a menu of options to choose from that don't involve picking up and moving your family.
It depends on what you are doing. If you are trying to build something like a database kernel, you want people who have done it before, very specialized. There aren't lots of them outside of the tech hotspots.
I don't necessarily mean there aren't smart people outside of the hotspots, but there aren't the specialized skillsets that get built from working at the Microsofts / Googles etc. edit: In the number that you need.
Champaign obviously has lots of smart people (my parents went there), but you may have issues finding specialized senior people there.
Most of them I'd wager. The issue, and it was for me, is that moving out of an area where I can walk down the street and get another great job to an area where the only great job is the one that this hypothetical company is offering just wasn't worth the risk. I have a wife and kids, and so do most specialized senior engineers.
Sure, so minimal commute times and affordability of family homes in areas with good schools should be very appealing to many of the specialized senior engineers.
Moving the family is a pain, sure, but that just means the pitch needs to include why it's worth it for the family. And it means that senior engineers move cities less often. It doesn't mean they move less altogether.
Idle thought: this kind of risk (moving around too much) could be mitigated a bit by some sort of contractually guaranteed employment. A move to Princeton from SF would sound better if a four-year employment guarantee or a comparable cash buy-out was part of the employment agreement.
Honestly, if someone could mitigate the risk for this sort of thing, it's a huge deal.
Personally, I hate living in the tech hubs, I'd love to be able to move back to my hometown, it was a great place to raise kids, didn't have the same crushing traffic / monoculture and it was close to my family. I'm where I'm at due to the jobs, nothing else.
I wouldn’t value a guarantee of employment by a startup, especially not one measured in years. The nature of a startup is that it’s risky, and risky things fail. Unless they put up some type of bond or buy an annuity with the employee as a beneficiary, I don’t see how that works. And they wouldn’t be able to afford it anyway.
What if the family hates SF or Seattle? That doesn't seem to enter the equation when recruiting people in the other direction.
The job-is-terrible concerns can be addressed other ways. Like putting the candidate up in temporary digs for a month or three while they decide on a longer-term deal. A reverse contract-to-hire, if you will. That sounds complicated, but it's not more complicated than a normal contract-to-hire.
Probably a good amount of them. But if you want to hire them, and you don't want to be there, you have to think pretty hard about how you're going to lure them away. Just saying, "The cost of living here is lower!" is a start, but it takes a lot more than that.
A good measure of where the best engineers are is where the best engineers migrate to. In the smaller communities I’ve been in over time I’ve seen the top engineers in the community migrate away. Engineers that come back from effectively unsuccessful 1-yearish stints at Big X companies often become top engineers in the communities to which they return. I think I’d want a lot of evidence to agree with the claim that the top engineers that stay behind in communities that have substantial migration to the Valley are as good as the top engineers in the Valley (which hires engineering talent from all over the world). There may be a handful of people who are great and stay for various reasons. But there are very few teams in Champaign, Ill. where every team member is in the top 5% of engineers globally. There are quite a few teams like this in the Valley.
Unless you think money is being spent in a way that is anti-correlated with engineering quality the places providing the most savings are going to attract more of the best engineers. I think that is still the Bay Area as most analyses of savings I’ve seen exclude RSU’s which is a big factor here. The huge influx of talent seems to agree. To me it seems rather extraordinary to claim that an area with limited amounts of inward talent migration and fairly substantial outward talent migration is going to beat an area that consistently hires from all over the world. This is a statement about distributions and pools of talent: there are small numbers of extremely talented engineers everywhere. It’s a lot harder to find a critical mass of world class people.
> As far as quality of talent, the "top tier of engineers" aren't all in SF, Seattle, or NYC. There are plenty of "top tier" engineers in all kinds of places.
Surely depends on the metric used: the same people who, if brought up in one of the progressive teams at an SV giant or a well-funded startup might become celebrated code-fashionistas touring the conference speaker circuit, would likely have turned out very different had they joined the insurance company that was the only game in town. But the things they would have seen there! Not C-beams glittering in the darkness near the Tannhauser gate, but a lot of bad code plastered over and over with half-hearted attempts to clean up during decades of maintenance and extension. If there are wisdoms to be learned from working with that, one of our two hypotheticals will know them.
Just an irrelevant side note - what do you envision when you think of 'Uber for cats'?
- A cat sharing service?
- A 'cat on demand' service that allows you to reap the benefits of cat ownership without having to worry about acquisition and maintenance costs?
- A taxi service for cats?
I could envision paying for all of those, under the right circumstances. So such a company would likely have at least a few dollars of real revenue. :)
Obviously cat sharing, but you'll have to find a way to take a margin from what was nominally a goodwill activity.
Maybe you pay cat owners in cathours (pronounced cath-ours), and have a marketplace where people can buy and sell cathours. Even better, make Catcoin (CC), set up an ICO, and get to work.
It was such a ridiculously awesome service I wish they had it year-round. Our office ordered bunches of the service and there was a lot more demand than supply (obviously for that one day they offered it.)
But I thought the internet removed all boundaries? Why do we need to be in the Bay area to have access to top tier of engineers? They are also everywhere else and let's say they are in the Bay area, that's what remote is for right?
We aren't going to agree on this, around remote employees. I believe there is value to co-located teams, not everyone agrees and that's fine.
The other point is, most startups do not need that talented of engineers, they think they do, but they don't. You don't want to pay bay area salaries if you don't have to, and in most of these areas, you don't have to.
If you are somewhere else but the bay, and you are paying people bay area salaries, but you aren't in SV, so you aren't raising VC like SV companies, I don't think that's a winning move.
> you should at least try and do some time in SV / Seattle etc.
I'm curious to know what your current thoughts are behind this. As someone who intentionally has steered clear of both those areas in order to try to optimize financially I sometimes wonder whether I'm missing out on something. Obviously one can learn more from better engineers, but don't the brightest ideas from the brightest engineers wind up being written about online and/or presented at conferences and user groups and broadcast across the Internet? Or does having the opportunity to put time into a name-brand tech company for a while really increase lifelong salary or career prospects sufficiently to recover the money thrown away on rent there? Or is there really sufficient value in serendipitous collaboration/socialization to justify moving to one of these places? Is there some other question I've overlooked?
On your first question, there are many areas of deep, hardcore technical domain expertise in things like databases, high-scale systems, parallel computing, AI, etc where much of the existing knowledge and recent advancements are poorly represented in public literature, conferences, etc due to layers of operational secrecy. What you would learn from the public literature or is shown in conferences is often quite misrepresentative of the state-of-the-art.
This becomes tribal knowledge. The way most people become experts in these domains is by working with or around people that are already experts, which requires being in an engineering environment where you are likely to come into contact with some. This type of expertise is far from evenly distributed.
Having done a lot of hiring in SF, NYC, and Baltimore, it's not clear to me that optimizing financially means avoiding SF and NYC.
Market salaries for software engineers in Baltimore are quite a bit lower (like, 40-50% lower) than SF or NYC. Whereas for companies located in NYC/SF, they understand cost of living is high and are thus willing to pay a premium.
I'm not making a counterargument here, just saying that salaries in different areas aren't usually apples to apples.
Also, cost of living is relative: For 10 of my 17 years in NYC, I lived in a studio apartment with a 1k/month mortgage, and no car, no kids, and a lower cost of living than most of my friends in other cities. So it really depends on your family situation and what kind of environment you want to live in.
Caveat, this comment isn't directed at you (I agree with your comment), but rather the points around what you are saying.
One thing that helps is if people stop referring to things as SQL / NoSQL as what ends up happening is various things get conflated.
When talking about stores, it's important to be explicit about a few things:
1. Storage model
2. Distribution model
3. Access model
4. Transaction model
5. Maturity and competence of implementation
What happens is people talk about "SQL" as either an NSM or DSM storage model, over either a single node, or possibly more than that in some of the MPP systems, using SQL as an access model, with linearizable transactions, and a mature competent implementation.
NoSQL when most people refer to it can be any combination of those things, as long as the access model isn't SQL.
I work on database engines, and it's important to decouple these things and be explicit about them when discussing various tradeoffs.
You can do SQL the language over a distributed k/v store (not always a great idea) and other non-tabular / relational models and you can distribute relational engines (though scaling linearizable transactions is difficult and doesn't scale for certain use cases due to physics, but that's unrelated to the relational part of it).
Generally people talk about joins not scaling in some normalized form, but then what they do is just materialize the join into whatever they are using to store things in a denormalized model, which has its own drawbacks.
As to the comment above you, SQL vs NoSQL also doesn't have anything to do with the relative maturity of anything. Some of the newer non-relational engines have some operational issues, but that doesn't really have anything to do with their storage model or access method, it just has to do with the competence of the implementation. MongoDB is difficult operationally not because it's not a relational engine, but because it wasn't well designed.
Just like people put SQL over non-tabular stores, you can build non-tabular / relational engines over relational engines (sharding PostgreSQL etc.). In fact major cloud vendors do just that.
Correct, there's relatively easy ways to de-risk things. My wife and I dated for 3.5 years and lived together for 2 before we got married. We are in a high income bracket, and both of us thought quite a bit about the person we wanted to marry, and made sure we had good conflict resolution skills and respect for each other. We agreed on kids, religion, money and sex (make sure you agree on these or you are going to have a bad time).
Chances of our marriage ending are very low according to the statistics. People always view marriage as this binary thing where you are either being completely and utterly scientific about things, or you are just being naive and jumping in. I love/d my wife deeply when we got married, but I was also aware of her flaws and my own. If you don't have any empathy or like living your life for yourself, don't get married (and probably don't date), it won't end well for you.
At the higher end of the market (i.e you can get hired at the majors in a senior role or equivalent at a smaller company), I generally see three types of people:
I see people that truly understand their market value, have data backing it up, they are professional (I always tell people how I appreciate the time / offer, and I'm polite to recruiters), demanding about information (I have little tolerance for ambiguity in compensation, define everything, this is how I pay the mortgage and save for retirement / my kids college).
Some managers think that is spoiled (not that you are necessarily saying that), but this is a business transaction, I ask for market and turn it down if it's not that, or if the compensation is ambiguous.
Then I see those that don't know their value and just sort of take whatever because they don't like interviewing / negotiating. I don't see many of those at the high end.
The third group are who you are talking about, total babies. They complain about recruiters, or how much everyone wants them, place crazy demands on companies (I worked with a guy who literally had a rider like Van Halen in his LinkedIn profile, he ended up being a total prima donna who got fired because everyone hated him). Most people in this third group think they are in the first, and they aren't worth recruiting, even if they are strong technically.
Sometimes people believe this means startups / small companies can't compete, but the issue is, startups / small companies really aren't offering any sort of market comp, they don't give enough equity to employees to make the risk worth your while. You don't necessarily need to pay me what AmaGoogFaceSoft do, but you need to give me a significant chunk of equity and give me enough visibility into how that is cut up (preferences etc.) in the company for me to value it, otherwise it's worth 0, even if I believe that the company will be successful. If you can't trust me enough with that sort of transparency, then it's not gonna work out. With a huge public company, I know what my stock is worth roughly.
There is probably a 2bis category of people who know their value but do not like to interview and negotiate.
They will take or not what is given to them but it is a one shot proposition.
I usually tell companies that I really do not like to negotiate compensation because this is logistics to me and is not worth the effort. Either a company is good and gives the right pay or I pass. I tell this early enough so that there is no expectation of haggling.
So far it worked.