"Thankfully, the fact that these problems were solved decades ago has not stopped people from coming up with their own solutions, and we now all get to witness the resulting disasters."
I read the article: if these problems really were solved decades ago, that is quite unclear to me from reading it.
The only possible answers discussed are 'some sort of IP encapsulation', which is vague, and GRE, which is just a single solution. He doesn't seem to disapprove of VXLAN, so presumably something was missing in 'IP encapsulation and GRE'. Was 'all problems solved decades ago' merely hyperbole, or is there actually something to it?
Is GRE inadequate? A single solution that solves most cases, is codified in an RFC, and has mature, reliable, performant implementations sounds like a winner to me.
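For context, here is roughly what that decades-old option looks like in practice: a plain GRE tunnel between two container hosts is a handful of iproute2 commands. This is a hedged sketch, with the addresses and subnets purely illustrative, and the mirror-image commands run on the other host:

    # GRE tunnel from host A (192.0.2.10) to host B (192.0.2.20)
    ip tunnel add gre1 mode gre local 192.0.2.10 remote 192.0.2.20 ttl 255
    ip link set gre1 up
    ip addr add 10.42.1.1/24 dev gre1      # this host's container subnet lives here
    ip route add 10.42.2.0/24 dev gre1     # the remote host's container subnet goes through the tunnel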
I don't know if GRE is inadequate: if I knew that, I wouldn't need to ask these questions. The author doesn't disapprove of VXLAN, so there must be something insufficient about GRE.
Look, I'm just trying to understand the playing field here, for when the moment comes that I need that knowledge. I don't currently have a need for funky networking between Docker containers, but I do have Docker containers and can imagine a future need for funky networking. The article slams new technologies, but doesn't clearly explain the alternatives, which is what I'm interested in, so I'm asking follow-up questions. There is nothing rhetorical here.
L3 routing protocols like BGP, which have been around for decades, are one solution to the network connectivity problem. BGP powers the internet, so we know it can scale to millions of endpoints.
I read the article, and I don't understand why we're trying to solve the problem of "telling each ContainerOS to listen/respond to an additional IP address, and route it to the correct Container" by "creating another network layer and accompanying additional complexity".
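For what it's worth, the non-overlay approach described above (route an extra IP straight to the container) is just veth plumbing plus a /32 route. A rough sketch, with every name and address purely illustrative and the container assumed to live in a network namespace called web1:

    ip link add veth-web1 type veth peer name ceth0
    ip link set ceth0 netns web1
    ip addr add 10.0.5.1/32 dev veth-web1
    ip link set veth-web1 up
    ip netns exec web1 ip addr add 10.0.5.10/32 dev ceth0
    ip netns exec web1 ip link set ceth0 up
    ip netns exec web1 ip route add 10.0.5.1/32 dev ceth0
    ip netns exec web1 ip route add default via 10.0.5.1
    ip route add 10.0.5.10/32 dev veth-web1     # the host routes the container's IP to its veth
    sysctl -w net.ipv4.ip_forward=1
    sysctl -w net.ipv4.conf.eth0.proxy_arp=1    # host answers ARP for 10.0.5.10 on the LAN

No encapsulation and no extra MTU overhead; the trade-off is that something (static config, DHCP, or a routing protocol like BGP as mentioned above) has to tell the rest of the network where 10.0.5.10 lives.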
Weave has other issues... like they homebrewed their own ECDHE-PSK-based transport encryption protocol on top of NaCl. Homebrewing your own crypto, especially transport encryption which has to solve problems like key exchange, replay attacks, etc is generally the wrong answer.
Also, even if they were using a standard transport encryption like SSL/TLS or IPSEC, PSKs are generally frowned upon for anything other than point-to-point connections.
They describe the PSK as a "password", so what they really want is a PAKE algorithm, however they do not use a password hashing function, so weak "passwords" are susceptible to brute force attacks.
Anyway, all these things are why you should just stick to standard protocols like SSL/TLS or IPSEC.
Yeah, I feel a little guilty after writing this article, as the speed of the implementation is simply a detail. However, I feel no such guilt in condemning Weave's security. This is a conversation I had with @weave a while ago about their encryption
This "our project is open source, feel free to submit a patch" dismissal is so passive aggressive. If you mean "fuck you," then just say "fuck you."
That said, you shouldn't be saying "fuck you" in the first place: it's rude, it contributes to bad vibes in the OSS community, and it hurts you more than anybody. Try instead something like: "I'm having trouble understanding your argument, do you mind explaining in more depth in an email?" Even if you're dealing with a troll, this is still the best strategy.
I don't think anyone meant, or said, "fuck you". Why are you even implying such a thing?
Ultimately we can't work on even a fraction of the features that every person wants, and Laurie said his idea was simple to implement... so why not show how it's done? Honestly, it's not that sinister and it is certainly not rude.
To put it simply, if @monadic were receptive to @lclarkmichalek's ideas, why did he end the conversation?
But let's look at the Tweet in question:
"@lclarkmichalek @weavenetwork please, if it is so simple and robust you are very welcome to contribute a patch."
1. He says "please", which in this case is sarcastic.
2. Then he says "if it is so simple," which is a dismissive way of saying "you think that it's simple, but you're wrong – it's actually very complicated."
3. And also "if it so so ... robust," which is a dismissive way of saying "you think that it's (more) robust, but you're wrong – it isn't."
4. And finally "you are very welcome to contribute a patch," which, first, does not need to be said as presumably anybody knows that they are welcome to submit a patch to an open source project, and second, basically amounts to "so, I'm going to make my problem your problem."
This case is different from a feature request from a user, where "feel free to do it yourself" is slightly less inappropriate. There, what is meant is "this feature is not important enough to warrant our endorsement or any allocation of resources, but if you were to take on the burden entirely yourself, we would consider it."
In this case, what is meant is something more like "we believe that your charge that there is a fundamental issue with our software is false and we are not interested in discussing until you have actually done the work for us," or in other words "fuck you."
I admit, "fuck you" is quite strong for the general case, but the tone of the Tweet warrants that translation.
What you meant is clear to you. It's not clear to other people. You can choose to ignore them - maybe the people who misread the tweet are a small minority. Or you can choose to think about communication style and whether more people had the same "uncharitable" interpretation.
For what it's worth I lean more to the uncharitable reading of the tweet, although I'm not as firm about that reading as others appear to be.
Tweets are a hard medium for communication, so it's not particularly surprising when what you say, or what you think you said, doesn't match what other people think you said.
I don't; if anything, I personally find them too charitable and reasonable.
Just admit that you (if you're the one who wrote that response; it's not exactly clear who's who in this discussion...) were being a bit of a twit and move on. We all do it; I do it, the girl next door does it, my grandma even does it and she's the nicest person I know. There's no shame in being honest about it.
I think we'll have to agree to disagree on that one, then, though thanks for the clarification nonetheless.
In American English vernacular, "please" is frequently used in a sarcastic manner (e.g. "You think you can jump from the top of that building and not get hurt? Bitch, please." or "Oh please, like you know the difference between a grape and a grapefruit."). While this sarcastic usage is usually prefixed by some other word ("Oh please", "Bitch please", and so on), it's not uncommon to see a lone "please" used in this sense as well, and the tweet in question very closely resembled that usage.
"Fuck you" would have indeed been a more appropriate response to "how dare you try to implement crypto" FUD.
Crypto is hard, but it's no harder than a lot of other hard things. If you think someone's crypto is broken, you could point out why you think it's broken. I see no evidence that consigning crypto to a forbidden zone is going to improve real world security, and the old "mature" cruftpiles seem to manifest problems at least as often as newer systems designed with the benefit of hindsight. For example, any competent designer of a newer system would always authenticate before decrypting-- something that was not clearly understood when SSL was developed.
> Crypto is hard, but it's no harder than a lot of other hard things.
Crypto is a LOT harder than some other things.
3D graphics programmers don't have to worry about side-channel attacks that leak information through the timing of random numbers returned over an HTTP GET.
Physics simulations don't have to worry about tens to hundreds of millions of dollars of losses because Intel changed the L2 cache slightly in some revision of a processor and now it is possible to glean a couple bits of information about the entropy one uses.
Of all the projects I have worked on, maybe the C/C++ compiler had a set of worries close to what an encryption suite has.
You could toss me into almost any field of software engineering and after a few months I'd feel good. Some of them would have a longer ramp up time (order of months). Some of them might require me to go take a few online courses to learn the field (3D, physics sims, etc).
Encryption requires an entire life of devotion. New attacks are coming out all the time. There is so much financial incentive in the field that the competition is insanely fierce.
3D graphics techniques get pushed forward by publishers wanting the latest AAA game title.
Encryption gets pushed forward because its practitioners are trying to outrun either large international criminal organizations or entire governments.
You're right, though there are some things that are about as hard: compilers, language design, machine learning, databases with strong ACID guarantees, etc.
The problem I have with the "crypto should be a forbidden zone" line of reasoning is that the real world evidence shows that the old battle tested systems manifest flaws at least as often as competently designed newer systems do. Crypto, it turns out, is so hard that the probability of lurking issues with mature systems approaches or exceeds the probability of mistakes in new ones.
When I say competently designed, I mean a newer system that passes the sniff tests of experienced crypto engineers. An incomplete list: they're using a cipher that's been peer reviewed and is considered strong by modern standards, they're using that cipher correctly, they're authenticating before doing anything, they are using an IV (if needed), they are not sending anything secret in the clear, they're not branching on secret data, etc.
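To make the "authenticating before doing anything" item concrete, the usual discipline is encrypt-then-MAC: MAC the ciphertext with an independent key and verify that MAC before the ciphertext goes anywhere near the decryption code. A rough sketch with stock OpenSSL (keys, IV and filenames are illustrative; a real system would use an AEAD mode or a vetted library rather than shelling out):

    # independent random keys for encryption and authentication
    ENC_KEY=$(openssl rand -hex 32); MAC_KEY=$(openssl rand -hex 32); IV=$(openssl rand -hex 16)

    # sender: encrypt, then MAC the ciphertext
    openssl enc -aes-256-ctr -K "$ENC_KEY" -iv "$IV" -in msg.bin -out msg.enc
    openssl dgst -sha256 -hmac "$MAC_KEY" -binary msg.enc > msg.tag

    # receiver: recompute and compare the MAC (constant-time in real code) BEFORE decrypting
    openssl dgst -sha256 -hmac "$MAC_KEY" -binary msg.enc | cmp -s - msg.tag \
      && openssl enc -d -aes-256-ctr -K "$ENC_KEY" -iv "$IV" -in msg.enc -out msg.out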
It's also important to refrain from criticizing people for claims they are not making. As far as I know, Weave is not claiming to implement the entire feature set of IPSec. They're just claiming to offer basic but strong crypto and authentication. If you want more, you are likely running other protocols like SSL and SSH over the overlay network anyway.
Yes, that comes with a performance penalty, but it's also defense in depth. It's better to trust multiple layers of crypto with independent implementations at each layer so that a compromise of one does not destroy your entire security posture.
It all comes down to the question of how paranoid you are. No encryption will give you the best performance, but no security. If you want maximum security you can run SSL over IPSec over Weave with different sets of keys and different ciphers at each level. Bonus points for generating those keys on different air-gapped hardware, etc.
> compilers, language design, machine learning, databases with strong ACID guarantees, etc.
Compiler design isn't that hard. A perfect optimizing compiler is damn nearly impossible, but good enough is easy. LLVM makes good enough compilers pretty easy, see the number of new languages that get posted to HN every month!
Machine Learning is hard to do right for non-trivial cases, but the only risk is you waste your investor's money and your customer's time, you don't risk identity theft (unless you are using machine learning to write crypto code maybe? :) )
Databases, yeah, everyone screws those up.
The real key is how many people are trying to break your system though. Crypto is exposed; your DB layer can have protections put on it to sanitize your data and rate-limit how fast stuff comes in. You can clean up the data that goes into your machine learning algorithm (or use one that is resilient to x% of malicious data).
None of those defenses depend on defending against context-switch-timing attacks that leak data from a CPU released 2 years after your code shipped through QA!
(and yes new CPU designs can cause problems in any code! But Intel and AMD work hard to avoid the major cases!)
No, it was clearly understood when SSL was developed, as shown in http://web.cs.ucdavis.edu/~rogaway/papers/draft-rogaway-ipse.... The real reason why crypto is hard is that it requires an enormous amount of extremely specialized domain knowledge, and very subtle differences between two protocols can make the difference between success and failure. ISO regularly standardizes broken crypto.
I'm tired of all this talk about passive aggressiveness.
How about a different explanation: after answering time and time again on Twitter, he realized he had other things to do and played the "show me the code" card?
And yes noobs[0]: Show me the code is a valid card in programming discussions.
[0]: here I am purposefully rude; feel free to take offense if you think it helps - or feel free to think twice, or even laugh with me.
Open source and computing culture has to a certain degree been a safe haven based on technical skills. Let's try to keep it that way as long as possible, shall we?
I don't view this kind of comment as a "fuck you". I think this says more about your outlook than it does theirs.
They're letting you know that they have their own priorities, but they're still receptive to your ideas, which is absolutely fine. What is wrong with that?
"Submit a patch" may be "fuck you" but that's not what he said, it was more like: "It's not as easy as you think. You can prove me wrong by submitting a patch.". This second is not reasonably interpreted as "fuck you".
I find your attitude the ruder one. Users of paid products have the right to complain about stuff like that; it's literally what they paid for. Users of open source projects have no such right: if you know what to do, why not make yourself useful instead of bitching out someone who's volunteered their free time to make your life easier? I have very little patience with armchair pundits myself; if you submit a pull request we can happily have a conversation, but everybody's a critic and some of us are trying to get things done.
Nonsense. Users of free and open source software have every right to point out flaws in the design and implementation of that software. And this is an invaluable service to the authors and community. While finding and fixing an issue is nice, it's certainly not required, and not everyone capable of identifying issues has the time, ability, and inclination to fix those issues.
Furthermore, the conversation at issue was initiated by a community member asking why Weave's authors chose to implement their own security mechanism. The point of this kind of question is to assess whether the authors had good reasons, bad reasons, or no reason at all behind a questionable decision. This helps determine whether the effort to resolve the issue would be well-spent. If the authors aren't convinced that other solutions would be superior, they may be unwilling to accept a contribution, and you are potentially wasting your time producing a patch.
"why not make yourself useful instead of bitching out someone who's volunteered their free time to make your life easier"
Because as soon as you've found issues with more than, say, 3 things, you no longer have enough of your own free time to volunteer to solve the problem in a better way, let alone whatever you were already working on. Do you honestly believe that criticism has no value?
Complaining on twitter is not the same as finding an issue! Criticism has value, but not all commentary deserves equal weight or time before it is reasonable to request reciprocal effort.
It has vanishingly little and there's certainly no shortage of people handing it out for free. There's a reason for aphorisms such as "talk is cheap" and "my two cents". You seem to value your own time extremely highly; where's the respect for others?
You think security domain experts don't have the "right" to "bitch" (or perhaps, say, "inform") about potential security problems?
I understand you're trying to "get things done" but crypto is an area where you have to tread carefully, and talking down or ignoring people trying to inform you about security flaws is only encouraging the development of insecure software.
I'm pretty sure the point is more that - had Weave used a standard and already-vetted encryption method instead of rolling their own crypto - they could have put that free time into more useful things instead of now having to maintain yet another crypto implementation on top of their main project.
This isn't to say that there's never room for improvement in the crypto space - I personally disagree with the assertion that rolling one's own crypto is inherently bad in all cases, and instead believe that we need a maximum of innovation attempts now so that they can be evaluated and audited and identified as useful - but unless you're actually fixing a problem, Not-Invented-Here syndrome is dangerous and a waste of time better spent elsewhere.
We did not roll our own crypto. We used NaCl. The rationale is explained here - http://weaveworks.github.io/weave/how-it-works.html#crypto We agree that other approaches are possible, but this is the one we picked for our first version.
Tony, the team recently updated the crypto docs to clarify the rationale etc etc - http://weaveworks.github.io/weave/how-it-works.html#crypto ... Moreover the entire crypto code for weave is about 300 lines. Please please please if you are an expert then a thorough review would not only be welcome but also acted upon. Thank-you. Alexis.
You speak as if SSL/TLS hasn't been a rat's nest of problems.
If they did it competently, there is no reason they couldn't implement their own crypto encapsulation.
As far as why they rolled their own... have you ever actually tried to use IPSec? It's a usability nightmare. It's also problematic in containers due to container permission issues. I suppose they could have used DTLS (datagram TLS) but that'd probably add more overhead than what they did.
I see little real world evidence that this "let the pros handle it" attitude toward crypto is helping.
True. We did try ipsec, and couldn't find an implementation that was oss, demonstrably safe, and easy enough to pull into a first release. As weave matures, we'd love to work with experts to implement standard solutions, even if they are costly to put in place.
I would be extremely grateful if you could provide actionable advice (or help) on which other crypto libraries could fit our requirements for weave. Please note that in addition to functional requirements, any library must be open source, hard to misuse, easy to package, and demonstrably safe.
We are open to recommendations, help and overall insight from expert contributors in security. As many people on this thread have pointed out, it is a big and complex world..
as laurie mentions in his article, you can easily use off the shelf security solutions with weave... the point of the current crypto is to provide something basic that works. we chose nacl for ease of implementation primarily. happy to add more types of security in the future. help would be welcomed!
"The public key from the remote peer is combined with the private key for the local peer in the usual Diffie-Hellman way, resulting in both peers arriving at the same shared key. To this is appended the supplied password, and the result is hashed through SHA256, to form the final ephemeral session key."
What you're looking for here is usually referred to as "PBE" (Password Based Encryption) or a "KDF" (Key Derivation Function). There are a couple of extra concerns when transforming a human-readable password into a symmetric key. Hashing is the start, so it's great that your project already has that, but there's more to do, and this is a well-studied topic with lots of literature and pre-existing solutions. "PBKDF2", "HKDF", RFC 2898, and pretty much everything in https://en.wikipedia.org/wiki/Key_derivation_function is a good start for reading.
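As a feel for what a proper password-based KDF buys you, OpenSSL 1.1.1+ exposes PBKDF2 on the command line. This is only an illustration of the concept (a salted, deliberately slow derivation) rather than anything Weave would actually do, and the iteration count and filenames are made up:

    # derive the key from the password via PBKDF2 with a random salt and many iterations,
    # instead of a single unsalted SHA256
    openssl enc -aes-256-cbc -salt -pbkdf2 -iter 600000 -in secret.txt -out secret.enc
    # decryption must repeat the same KDF parameters (the salt is read from the file header)
    openssl enc -d -aes-256-cbc -pbkdf2 -iter 600000 -in secret.enc -out secret.txt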
Furthermore, it's not clear to me what the use of the diffie-hellman is actually for. Perhaps I'm misreading or the linked document is an oversimplification, but... It appears that the public DH key is transferred without any authentication.
If the public DH key is transferred without any authentication, it's trivially MITMable and serves no purpose whatsoever. It's true that mixing in the password later solves MITM at that point, but... yeah: All of the privacy and integrity you could produce with the system described is what comes from the password.
DH is used to ensure that every connection between every pair of weave router nodes that ever gets established uses a unique session key. Yes, the public key is exchanged in the clear, and yes, that is MITMable. But as you say, the subsequent combination of the result of the DH with the non-exchanged password solves that.
What is the weakness with this approach? Is your point that there's nothing gained from doing the DH - you may as well just generate a secret key randomly, exchange that between the peers in the clear, and still have them combine that with the password secret, and the level of information exposed is the same?
I think there is a difference. Say you have a captured stream of traffic and you want to decrypt it. With the simpler non-DH scheme, you already have the basic key, so all you need to do is guess the password, run that through sha256, and then you can decrypt the entire stream. Now that you have the password, you can decrypt every stream you capture from now on.

But with the DH scheme, you have to guess both the password and the ephemeral private keys, which are never exchanged (either that, or break Curve25519). Even once you've done that, you only get access to that one captured stream - sure, you now know the password, but every connection between weave routers will use different random private keys for their DH, so you'll still have to brute-force those for every stream you capture (or, again, break Curve25519). So ISTM there is a substantial difference there. That, to me, is the point of using the DH.

But maybe I'm missing something... I'm slightly wondering whether people here are considering the use of DH and public-key crypto by weave in the context of the usual "generate keys once and save them". This is just not the case in weave - weave generates fresh public and private keys for EVERY connection between routers.
There is no requirement for weave that the password is human readable. It can be supplied through a file, so you can happily dd if=/dev/random of=/my/weave/passwd bs=1k count=1 to create a suitable weave password
So the DH does give you some degree of ephemerality, yes... but as used, only if you're not already being MITM'd. (If you are subject to MITM, then it degenerates to the case you describe where peers are just picking a random number and exchanging it in the clear.) You could upgrade this to not have that weakness under MITM at all by doing the DH after the KDF: use the KDF output to key an HMAC over the exchange, thus preventing a MITM by anyone lacking the password at the time of the DH exchange. I don't think there are any additional expenses or drawbacks associated with doing this.
Just to clarify, the attack you want to protect against is that of an adversary being able to conclude the DH public key exchange with a bona fide weave peer, despite having no knowledge of the password. Correct?
But what can an adversary learn from doing so? All subsequent messages on the connection are encrypted with the secret key, which has the password mixed in.
> There is no requirement for weave that the password is human readable. It can be supplied through a file, so you can happily dd if=/dev/random of=/my/weave/passwd bs=1k count=1 to create a suitable weave password
Turns out that feature I was thinking about has been removed, so the above is not true.
> All of the privacy and integrity you could produce with the system described is what comes from the password.
That is correct. I guess calling this a 'password' is perhaps misleading in our docs, since it could be seen as implying human-readability and cryptographic weakness. As msackman says, the 'password' can in fact be as strong as you like.
the strong points of weave network, as it is right now, are ease of use (not to be sniffed at), and enormous flexibility. it is really quite easy to create an application involving containers, that runs anywhere and does not commit you to specific architectural choices...
typically though, one weave network might be used by one app, or just a few. but you might run a lot of weave networks
weave works very nicely with kubernetes - later I shall dig out a few links for this
in our own tests, throughput varies by payload size; we tend to think weave-as-is is best compared with using, for example, amazon cloud networking directly
for users with higher perf needs, we have a fast data path in the works, that uses the same data path that ovs implementations use... the hard problem to solve here is making that stuff incredibly easy and robust w.r.t deployment choices -- see above :-)
Correct! Once upon a time people said Amazon cloud was too slow. Then they said it wasn't suitable for large workloads. Then they said it did not make money ... etc etc. I'm not saying we are like Amazon, I'm just saying that making new stuff excellent in all dimensions at once is hard ;-)
I'd agree that Weave is great at what it does, which is providing just-enough-networking to make Docker applications as easy to design and construct as applications running on a normal LAN. This is a huge win, since I don't have to worry about portmappers and trying to discover what dynamically allocated ports are being used at runtime to connect two services.
I use Weave as the default SDN for Clocker [1] because of this simplicity, and also because it is server-less and will work on a single server Docker cloud or a cluster of tens of machines without having to think about architecture. Of course, Clocker supports pluggable SDN providers so if your networking demands are not met by Weave you can change to another provider.
I don't think Alexis (or any other Docker SDN provider) is suggesting that their software should be used for low-latency microsecond sensitive trading applications. You have to use the right tools for the job, and in this case Weave's sweet spot is its simplicity and reliability.
I REALLY wish that Clocker would support (and document how to use) something besides Weave - weave is intolerably slow for anything that requires some kind of more serious throughput between nodes. It is very, very unfortunate Clocker doesn't document how to not use Weave (for example, simply use whatever is already in place) as the rest of Clocker rocks, and the seemingly hard dependency on Weave makes it un-deployable for serious production use.
I did pop into their IRC channel a few times with a question around this, but over the space of 3 days all activity on the IRC channel was tumbleweed and crickets.
The latest version of Clocker does support Calico as well, now! But I agree there isn't much in the way of documentation on how to change this. I updated the README and the main page at http://clocker.io/ to reflect the changes in Clocker 0.8.1 but it's not immediately obvious. Try this:
that's great, thanks for answering! I have not yet looked at Calico - I have OVS-VXLAN between my worker nodes, which is a simple solution that works great. Is it possible to say "use whatever is already in place"?
It may be possible by writing a slightly modified YAML file for the Clocker blueprint itself. To avoid dragging this thread further off-topic, please drop me a line at the email address in my profile, and I can probably help you.
Wow. Is that a full 8Gbit/sec of data per host or a logical 8Gbit/sec of raw frames?
Either way, you're an ideal candidate for closed beta access to the performance focused container networking tools my startup is developing. I can contact you via email if you're interested.
on each host? what kind of physical network are you using, and how many containers per host? how many hosts? let me know if I should email about this instead.
I'm thinking replies like this should come from an individual, not an account representing a company. Having an individual behind the words (instead of a loose consensus mechanism for a company) goes a long way to establishing trust. Hoping to see a direct address of the observations raised in the post!
Using an account that I share with other people in the same team is not the same as 'marketing'. I am sorry if this somehow offends, but think of the handle as just one poster.
Just a casual observation, but you don't appear to be listening to me. You are making rationalizations of my statements instead of trying to see my point. For example, I didn't say it was the same as marketing, I said that it's 'fine for marketing', i.e. posting in unison when trying to spread interest would be a fine thing for a group account.
What I may be failing to communicate here is how trust is established between entities like companies and entities like individuals. When there is an issue with trust in individuals, as in the OP's post, it's best to establish 'point-to-point' communications with people you know so you can build up a trusted relationship. Doing that as a company doesn't really work well.
BTW, comments like "weave has lots of very happy users who find that weave is plenty fast enough for their purposes" are implicit trust statements based on bandwagon bias. What you are actually saying is there exist a group of people who don't feel the way the poster feels and are happy with the product's current state. The implication of your statement is that others should feel this way, but there's really no way to establish that unless we heard from all those people directly. This is yet another example of why consensus sucks when trying to establish the truth for an individual. (Bitcoin has figured this out, however.)
I think about trust a lot for work, so take my comments with a grain of salt. Nobody died here. :)
We're using weave a lot at Cloud 66; we find that for the majority use-case, inter-container comms at this throughput is sufficient; for high-throughput endpoints like DBs there is an argument for not putting them in containers in the first place… (but that's another discussion)
Except that people who can recognize when something is indeed "good enough" and then move on to the next most important thing are ultimately the only people who get things done and accomplish goals.
Paul, I think our work is good. We thought about our approach very carefully.. And we have built systems like this before. The main difficulty is to combine moving fast with limited resources, with delivering something supportable and improvable, without creating technical debt.
the author, who was part of the MS Azure team at the time, said: "as of today, this Weave/CoreOS tool and doc is the only way I was able to provision a Kubernetes cluster in Azure"
Well, the latency is far from negligible. However, the bandwidth really is impressive. Having userspace as the control plane while keeping the data plane firmly in the kernel has long been the way of things, and this blog post's not-so-subtle message is "your problem is not novel enough to justify the cost of your solution".
To be clear, there is nothing fundamental (to the best of my knowledge) that would prevent Weave implementing a VXLAN backend. It is simply a poor implementation at the moment, and even then, only in this regard. Who knows, maybe Weave's use of a gossip protocol vs flannel's CP etcd could make it more suited to some deployments. Though, that does need to be tested.
Sorry, yes, you're right - the latency overhead is not negligible (as a percentage, though in absolute terms it seems pretty tolerable).
Is this added latency just VXLAN overhead, or is flannel userspace still involved in per-packet-processing? I'd love to see a test comparing flannel/VXLAN to a manually configured VXLAN tunnel.
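For reference, a manually configured point-to-point VXLAN tunnel on a recent kernel is only a few commands, so such a comparison would be easy to set up. A sketch, with the VNI, device names and addresses illustrative and the mirror-image config on the peer:

    ip link add vxlan42 type vxlan id 42 dev eth0 remote 192.0.2.20 dstport 4789
    ip link set vxlan42 up
    ip addr add 10.42.1.1/24 dev vxlan42    # overlay addresses carried inside the VXLAN encapsulation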
Actually, userspace packet switching can be fast (10GbE linespeed-fast) thanks to approaches like DPDK [1], where a userspace process has zero-copy, direct access to NIC ring buffers.
The problem with DPDK is (as I understand it) that it doesn't handle multiplexing the connection between multiple cores/processes/containers (i.e. lack of polling makes scheduling hard/impossible). The nice thing about using the Linux kernel as your data plane is that you can still have all your bog standard routing, in addition to this fun VXLAN/GRE/etc stuff. That said, I haven't ever implemented anything using DPDK, so I may be talking out my arse.
You are correct however that userspace networking is awesome, and I look forward to it becoming more and more prevalent in applications that can benefit from it.
It depends on how you use DPDK. If you use it from the container directly to the NIC, you certainly do lose all of the kernel capabilities. However, we believe (but have not tested) that you can use a DPDK virtual interface in the container/vm (memnic or virtio) that connects to the DPDK driver in the kernel, so the path from the container/VM is 0 copy. The kernel then does its processing, and then another DPDK path could (potentially) be used to 0-copy the traffic to the NIC (really uncertain about that last stage). Basically, you are just using DPDK to save on the copy cost.
This is all academic until tested, btw. As of yet, we (on Calico) haven't had anyone stand up and say that they need more performance than what the native data-path we use today is capable of delivering.
Yep, DPDK is great for NIC-to-NIC traffic and maybe for VM-to-NIC now that vhost-user exists. But that's leading people to suggest DPDK as the answer for everything, which it isn't.
Is there any solution based on openvswitch? If using VXLAN to encapsulate containers' L2 frames, the openvswitch kernel module should be able to route packets very well, with much lower overhead than any userspace solution that does not access the NIC memory directly.
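Roughly what that looks like, as a hedged sketch: an OVS integration bridge with the containers' host-side veth ends as ports, plus a kernel-switched VXLAN port to each peer host. Bridge/port names, the VNI and the address are illustrative, and the container-side veth plumbing is omitted:

    ovs-vsctl add-br br-int
    ovs-vsctl add-port br-int veth-web1    # host end of a container's veth pair
    ovs-vsctl add-port br-int vx0 -- set interface vx0 type=vxlan \
        options:remote_ip=192.0.2.20 options:key=42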
OVN is 0.1 fairy dust and unicorn horns at the moment; it looks very promising, but you're most likely going to be waiting another 12 months before it's solving your problems.
SocketPlane was bought by Docker and is now DEAD, to become whatever Docker decides is good enough to silence their critics. "See we bought these network guys, we swear we're taking networking and plugins seriously!"
I'm not sure I even understand the problem that weave and Docker bridging/NAT solves for real world cases. IP allocation for containers isn't a problem for most networks, is it? Certainly AWS can give you up to 8 IPs per instance, and every datacenter I've ever worked in can give you even more, if you ask. All you have to do is spin up additional virtual NICs with virtual MACs and use DHCP to assign IP addresses to them.
Or is there something fundamental that I don't understand? Please edify me.
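Concretely, the "extra virtual NIC with its own MAC, addressed by DHCP" approach described above can be done with macvlan. A rough sketch, with the interface and namespace names illustrative:

    ip link add mvlan0 link eth0 type macvlan mode bridge
    ip link set mvlan0 netns web1
    ip netns exec web1 ip link set mvlan0 up
    ip netns exec web1 dhclient mvlan0    # the container gets a first-class address from the LAN's DHCP server

The usual caveats apply: the host cannot reach its own macvlan children through the parent interface, and some cloud networks (EC2 included) drop traffic from MAC addresses they didn't assign, which is presumably part of why overlays exist at all.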
Docker was designed to be very easy to get started on your laptop with one IP address and it looks like some people are getting stuck in that model.
I agree that if you are running on AWS VPC or some other overlay you should just use VPC for container networking. You shouldn't overlay your overlay. But there isn't any tooling that I know of to do that.
Everyone I know who runs Docker runs it in a virtual machine manager that has a built-in DHCP server and provides multiple virtual interfaces to the virtual machine. Certainly both VirtualBox and VMWare do.
Even if one runs a Docker bridge in his development VM, that doesn't mean one must do so in production as well.
Are we in this mess because production engineers don't understand networking?
what I appreciate about weave is that it solved the cross-host container networking problem easily (it's very very easy to use) and now, i.e. no waiting for promises of future solutions or fooling around with more complicated setups.
Here's where I got burned: I set up an Elasticsearch cluster using containers and weave and life was great, but it then grew to need another node. Upon setting up the new host with docker and weave, it turned out the new node couldn't talk to the old nodes because they were using different versions of the weave protocol. That was disappointing and a pain; I stopped experimenting with weave at that point.
Hi, I work on Weave; it may well have been one of my commits that broke the protocol compatibility for you. We've changed things over time to improve performance and resilience.
It should be fairly straightforward to deploy the same version on every host, but maybe that wasn't explained well enough, or didn't work for you. We'd welcome more feedback.
Lastly, I appreciate the positive comments. "Very very easy to use" is exactly what we aimed for.
flannel appears to be what kubernetes is using as well, and I know it is what RedHat is using for their OpenShift platform on top of k8s. It seems like the obvious path forward.
Hi Jeff - if there was a single solution that fit all requirements, then that would be obvious :) However, experience has pointed out that there are differing sets of requirements. Some folks may need the flexibility (say disjunct fabrics) that encap provides, while others need the scale (say 1000's or 10K's of servers) or simplicity (those sort-of go hand-in-hand) that a non-encap data-path provides.
The big question that a system architect needs to ask, if they are designing a system at scale, is not "should I use this technique" but "do I NEED to use this technique." We can always add more complexity and technology/layers than we need because we "may need it" in the future, and we almost always end up with a Jenga tower when we are done.
So, when laying out your infrastructure, be sure to know what your actual requirements are, and don't add a lot of extraneous capabilities that you have to maintain and trouble-shoot later.
I want to make one edit, I should have said "need the flexibility" instead of "may need the flexibility". Both the scale and the flexibility can be hard requirements. I don't want folks to think that I am saying that scale trumps flexibility/disjunct fabrics. They both are equal, if that is the environment that you operate in. Again, full disclosure, I'm with the project calico team.
Flannel relies on etcd. I had stability issues with v1 of etcd which meant flannel couldn't route packets. Since then, the idea of using an immature SDN atop an immature distributed key value store fills me with dread.
I gave up on SDNs and fell back to doing what anyone does without an SDN: published ports to the host interface and advertised the endpoint <host_ip>:<container_port> to etcd. Note this wasn't with kubernetes but with a similar system. Still reliant on etcd, which I wasn't happy with, but one less cog to go wrong.
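A minimal sketch of that pattern, assuming the etcd v2 etcdctl client, with the key layout, names and TTL made up for illustration:

    HOST_IP=192.0.2.10
    HOST_PORT=$(docker port web1 8080 | cut -d: -f2)    # host port Docker published for the container
    etcdctl set /services/web/web1 "${HOST_IP}:${HOST_PORT}" --ttl 60
    # the container (or a sidecar) re-sets the key before the TTL expires;
    # consumers watch /services/web/ so dead containers age out of the list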
I have to agree... I do hope to see something close to a drop-in coreos cluster on a number of networks that becomes easier to use in practice and on in-house servers... The reliability is still a concern, and when you are using a PFM (pure f*ing magic) solution, it needs to be rock solid. I think within 1-2 years CoreOS will be a choice solution... unfortunately I can't wait for that, and don't have the time to become intimate with every detail therein.
Although etcd had a lot of issues even when I was in the evaluation stage... I set up a dev cluster, having a node running on half a dozen machines (developer workstations included) in the office. The etcd cluster broke 3-4 times before I abandoned the thought of using it. With the 2.x release, my test network (3 virtual servers) was a lot more reliable, but I decided I didn't need it that much, opting to use my cloud provider's table solution for maintaining my minimal configuration...
For my deployment, the cluster is relatively small, 3 nodes mainly for redundancy... our load could pretty easily be handled by a single larger server. That said, I decided to just use dokku-alt for deployments directly to each server (each service with two instances on each server). To make deployments easier to use, I have internal DNS set up with wildcards, using v#-#-#.instance-#.service.host.domain.local as a deploy pattern for versioned targets and instance-#.service.host.domain.local for default instances. I have two nginx servers set up for SSL/SPDY termination and caching, configured to relay requests to the dokku cluster for user-facing services. This is turning out to be simpler than implementing etcd/flannel for that communications layer.
Each instance is passed the relevant host/service information to report its own health to table storage, and internal communication uses ZeroMQ req/rep interfaces to minimize communication overhead and allow for reconnects. This is working a little better than relying on WebService/REST interfaces internally.
Agree. I may try some benchmarks, as I have recently added support for Calico (as well as Weave, which was my original choice due to its simplicity) as another swappable SDN provider for Clocker [1].
Networking is a disaster. Kernel level networking is nice, but that requires access to the kernel which you can't provide in a container. Doing so means containers no longer contain.
IPv6 was supposed to solve a lot of this by having an address space so huge you could easily give every VM host a few billion IPs. But nobody uses it.
Networking isn't a disaster. It's just that the current container ecosystem on linux hasn't yet resulted in any real domain-specific improvements. For "default" systems, networking is pretty damned reliable and performant.
A lot of the problems with regards to containment have already been solved in different systems. I believe Solaris or OpenSolaris had Crossbow [1]. Any system that aims to provide connectivity needs to do the least amount of encapsulation possible and probably be a kernel module.
For whatever it's worth, the Crossbow technology remains at the core of our Triton container stack[1] -- and we have extended it significantly by adding VXLAN and virtual layer 2 networking.[2][3] One of the comments we have heard from many early Triton adopters is that they enjoy having IP addresses, networking stacks and bare-metal performance. ;)
"the least amount of encapsulation possible" doesn't sound like much of a solution if what you are looking for is networking with strong encapsulation.
What does that term even mean? Are you talking about encryption? If not, there is no 'strength' to encapsulation. Something is either encapsulated efficiently or it's not.
Hey some of us believe that ipv6 is a good solution for containers. There are even products that support it (like Calico http://calicoproject.org/ ) and even a few cloud providers (not digital ocean who think that 16 ipv6 addresses is ipv6).
If you want a real ip/ipv6 stack for your container, try unikernels.
Thanks for the IPv6 love on Project Calico, Justin. Have you been testing Calico's v6? If so, we'd love to talk to you (disclosure: I'm on the project calico team).
Setting up networking requires root but so does creating the container so I don't see what the problem is. Using networking doesn't require any special permissions.
Virtual Machines expose a smaller attack surface to the host, so they do have a chance of containing malicious code.
Kernels were supposed to contain processes, providing them with a virtual private memory space, virtual storage access so they can't read or write any file you don't want them to, etc... but only openbsd is really trying to make that work.
It is not a disaster. It's possible to build a performant and easy networking solution for containers. Except the vendors are in a hurry to get some hacky software out that turns the clock back 20 years.
Container networking is not special or different from VM networking. There are tons of proven and widely used technologies available in the open source ecosystem.
We have been building out a series of networking tutorials at flockport, from the basics (static, public and private IPs, NAT, bridging, etc.) to multi-host container networking with GRE, L2TP, VXLAN and IPSEC, focused on LXC but these will work with VMs and containers in general.
They don't need any special tools, just IP tools and the kernel and deliver performance and security. A lot of the Docker centric networking projects use these under the hood but they are easy enough to use on their own.
There are ways to make networking easy and performant at the same time without resorting to these user space hacks. Wait for much better products to show up (including one from us).
In the terminal snippets, latency is specified in "us", which I'm guessing is µs, microseconds.
In the table, those latency numbers are specified in "ms". Also microseconds? Couldn't possibly be milliseconds, for two VMs that should only be a couple dozen metres away from each other, right?
yes, microseconds surely - we estimate that the cost of switching from kernel to user space, and back again, several times, adds up to 200-300 microseconds. in return for this, you gain a lot of flexibility and ease of use. for apps that need lower latency, weave is currently working on implementing a fast data path option..
Spot on