bahamat's comments

There’s nothing in GPL that prevents or prohibits closing later versions. This is also true for the CDDL. The reason Oracle was able to close Solaris after it had been open is that Sun had required a contributor license agreement that assigned the copyright for your code to Sun before they would accept your changes. This would work even for the GPL.

The CLA is actually what initially prompted the illumos fork even before Oracle closed the gate.

Joyent initially had a CLA on Node.js for business reasons that (as far as I know) everyone in engineering disagreed with. When we were finally able to make Triton (née SmartDataCenter) open source we also eliminated the CLA for node.

We now have contributions from many people under the MPLv2 in Triton, and we are no longer the exclusive copyright holder, which means that it is pretty much impossible* for Samsung to close it again.

* We would have to either rip out all those commits or get every contributor to relicense or assign copyright to Joyent.


You may want to try it again. We (Joyent) are using some of the latest Supermicro hardware in our cloud.


Most of the application-specific images have stopped receiving updates. They were time-consuming to create and validate, and were mostly just the base-64 image with whatever package pre-installed via pkgin. This also happened around the time of the rise of Docker, so most people were opting for Docker images produced by upstream maintainers.

In most cases, you can just take a base-64 image and `pkgin in` whatever package you want, and it's pretty much the same thing.
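For example, a minimal sketch of what that looks like inside a fresh base-64 zone (the package and SMF service names here are illustrative and may differ by pkgsrc release):

```shell
# Refresh the pkgsrc repository metadata, then install nginx.
pkgin -y update
pkgin -y in nginx
# Enable the service via SMF (the service name may vary).
svcadm enable nginx
```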

The Prometheus stuff is heavily used by us internally, and while it's usable, it's pretty experimental (i.e., changing quickly). I don't see any pull requests or issues that are obviously from you, so if you point me at something I can take a look at it.


1. CDDL is fully BSD compatible. The license is file-based, so it's non-infecting, and binaries can be re-licensed. Win/win.

2. Most already do. Some even believe that (1) makes the CDDL compatible with the GPL as well, and so ship binaries.

3. Patent protection in CDDL is extremely strong. Rumor has it that Oracle wanted to kill illumos via litigation, but never went ahead with it because they knew they'd never win because of the CDDL.


3. So why can't Linux black-box re-implement it?


In my opinion, reimplementing ZFS from scratch would cost at least $100 million and take 5 to 10 years of development by many talented people. Given that we already have the source code under the CDDL and it is good enough, no one is willing to spend that kind of money. Why would they? There is no business case.

If anything, Oracle's software patents are a case against it because they could sue a clean room implementation like they did with Android's Java implementation. They would have a stronger case too due to the hundreds of patents covering ZFS. That is the elephant in the room with btrfs that no one discusses. :/

Anyway, I see no need to reimplement ZFS from scratch after consultation with attorneys of the SFLC and others.


Linux has a black-box reimplementation of DTrace (with different design sensibilities, but essentially all the same functionality) under the name of eBPF, and bcc for the userspace bits. Linux is trying to do that for ZFS under the name of btrfs, but it's not as good.
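As a rough illustration of the overlap, here is the same question asked of both tools (DTrace on illumos, bpftrace on Linux; bpftrace compiles its scripts down to eBPF):

```shell
# Count system calls by process name.
# DTrace (illumos/SmartOS):
dtrace -n 'syscall:::entry { @[execname] = count(); }'
# bpftrace (Linux / eBPF):
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
```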


Oracle also ported DTrace to Oracle Linux; it was available if you purchased support from Oracle.


I don't see how it could. Except that maybe we'll see some more Solaris refugees coming in.

illumos has been a thriving project for over six years, fully independent from Oracle. There has been zero code sharing, and little interaction of any kind.


The significant difference here is that it's running unmodified Linux binaries. Compile on Linux, run on Windows.

You should be able to compile functioning Linux binaries on Windows and run them on Linux as well.


After Oracle bought Sun. It was never a part of Solaris 11. I don't know if it's still part of Solaris 10 or not, but even if it is, it's only barely usable.

But it's alive and well (and awesome) in SmartOS, with active work going on to merge it into OmniOS; eventually it will be upstreamed to illumos-gate.


As has been pointed out on Twitter,

> this is funny because his main beef is actually due to systemic problems inside Intel, not Illumos or FOSS

https://twitter.com/kevinbowling1/status/714352429333086208


Since I can elaborate more on here vs twitter, I have a senior engineer spending 100% of his time fixing Intel drivers at my company for coming up on 2 years. The FreeBSD driver is somewhat related to the Linux driver but Intel has obligations to put out a copyfree implementation. I believe this was then ported to Illumos a couple years ago.

ixgbe works relatively well on Linux and FreeBSD right now, but there are still occasional surprises and it is less efficient than other options. What's astonishing is that you can get better HW with _much_ better drivers from other companies for cheaper. The Chelsio T520-SO-CR is unequivocally better than the Intel X540 and costs less. Chelsio, Mellanox, and SolarFlare are all good choices for Linux and FreeBSD. I think Chelsio has a Solarish driver for Illumos; not sure about the others.

Inside Intel, the Windows, Linux, and FreeBSD teams do not talk to each other. They appear not to be able to talk to the HW team either. Several large and influential companies have been trying to force Intel to clean up their FreeBSD drivers. They have taken action, and that action has been pretty disappointing. The Linux driver and commit logs are also illuminating, since that driver presumably has massive market share.

Intel's 40G parts have been fraught with issues at the HW level. The drivers were barely able to outperform 10G at release. This kind of slop is not normal and should not be rewarded in the market. I had an uphill battle convincing old-timers that Intel NICs had gone so far downhill from the good old days, but after a lot of analysis we have totally written them off for the next two years.


Is it truly hardware? I had to deal with radios for years, and the vendor-supplied driver was always poor. It came from some SOM company, who got it from the radio designer. It was always a demo driver, intended to show off the chip features but no effort put into performance or error recovery.

It seemed nobody in the 'chain of custody' of a driver had any incentive to make it work well in a commercial setting. At best, it was consumer-quality. By that I mean it worked until it didn't. For a radio, that meant if roaming jammed up, you just took the radio dongle out and put it in again. In a commercial device (like a forklift touchpad) with the radio sealed behind a panel, that made it junk.

So I had to fix features, performance, bugs, timing, power management, the works. E.g. to get a radio driver fit for a WalMart distribution center forklift going 15mph, it had to roam in milliseconds and choose between 60 APs in radio range. And run for a 12 hour shift without recharging. The chipmaker driver was never, ever good enough.


Dollar for dollar, the Chelsio and Mellanox HW is better in terms of features. Better drivers and vendor support at lower purchase price make it a no brainer. I don't know about SolarFlare pricing but it is also better HW.

igb and ixgbe HW seem to be fair, but you can go look at the HW errata and judge for yourself. Intel had to recall the XL710 due to silicon issues. They also had a firmware incident that fundamentally changed the driver interface, so a particular driver will not work across different FW revs.


Chelsio told me they had Solarish drivers, I haven't ever personally tested them.

Solarflare had a 10GbE Solarish driver which they released at one point under CDDL without much engagement, and has recently been proposed for mainline inclusion with support for their 40GbE adapters as well [1].

Mellanox explicitly does not have any Illumos support for newer chips and my understanding is that they have not been receptive to any queries from the community. [2] [3]

Intel's 40g parts didn't arrive before my day job stopped involving high-speed networking every day, but their 10G stuff, as you say, worked relatively consistently across different platforms (though Illumos had some fun with the first few 10G HW revs after they started pulling common code).

I'd guess Intel's 40G parts were either a last blip or a first gasp of attempting to integrate their high-speed Ethernet bits and the "Intel Omni-Path Architecture" bits that are descended from their QLogic IB purchase.

[1] - https://github.com/gdamore/illumos-core/tree/sfxge-merge

[2] - http://lists.omniti.com/pipermail/omnios-discuss/2014-Januar...

[3] - http://omnios-discuss.omniti.narkive.com/FHhYvBui/mellanox-4...


To be honest, I think they are probably still better than Realtek or "Killer" NICs.


Yes, and the "big and gnarly" issues that he alludes to are in fact a driver issue that has been seen only by him and brought up exactly once by him on the mailing list -- and that was a year and a half ago.[1] There was lots of discussion at the time, the conclusions being that (1) he was advocating changes that were deemed unsafe and (2) that his most serious problems were seen on hardware known to be bad. The driver that he's referring to (ixgbe) is in very widespread production on illumos (albeit likely more frequently over optics than the copper that he has deployed); to the degree that there's a driver issue here at all (and that's not a foregone conclusion!), it seems likely that there is something specific to his environment that is inducing it. Certainly, with no one else seeing the issue and without better information from him (e.g., a kernel crash dump that indisputably shows an ixgbe-level issue), it's hard to see how anyone could expect any real progress to be made on this issue -- illumos or otherwise, open source or otherwise.

tl;dr: This in no way represents the "limits of open source" -- but it does highlight the limits of relying on other people to magically solve your problems for you.

[1] https://www.listbox.com/member/archive/182179/2014/10/search...


Are you sure your summary of that thread is really accurate?

To me it reads as Chris providing an exceptionally detailed bug report (including the exact code paths triggering the problem, and statistics from lockstat and dtrace on the lock in question). Nobody in the thread asks for more information (why would you want a "crash dump" for a non-crashing bug anyway?). Everyone seems to agree that the drivers are in fact taking spinlocks for long periods of time, while holding other locks. Nobody talks about "hardware known to be bad". What is talked about is how it's been too long since the drivers were last synced with upstream.


It is detailed, but in all the wrong ways: instead of describing the problem that he's seeing and offering data, he has jumped to a code path that he believes is inducing it -- without much in the way of supporting evidence. And yes, he talks about bad HW ("access to the second port on the card currently fails to acquire swfw sync"). The ensuing discussion is more of a desultory wandering than it is a deliberate investigation into his problem -- which isn't surprising, because he hasn't described a problem but merely an observed artifact in the system. (Long lock hold times can easily be misleading; when exploring latency bubbles, one needs to be very careful about tying observed behavior to the latency outliers, lest one discover problems without discovering "the" problem.)
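For what it's worth, a sketch of the kind of measurement that ties lock hold times to actual outliers on illumos (flags per the lockstat man page):

```shell
# Record adaptive-lock hold times kernel-wide for 10 seconds,
# with 8 frames of stack so the call path is attributable.
lockstat -H -s 8 sleep 10
```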

So yes, I stand by my summary of the thread.


I'm using ixgbe (with copper) under SmartOS on three machines right now and have had zero issues.


So he's not willing to generate a kernel crashdump? Is that what you are saying? Not an attack, genuinely curious.


I'm saying that even if someone wanted to debug the problem out of the goodness of their heart, they lack the necessary data to debug it. A kernel crash dump may or may not be required; the discussion too quickly jumped to his (unverified) hypothesis as to the root of the problem to even know what data would be required.


He mentioned "several" issues. One was Intel Ethernet, but the other was a serious issue with the kernel:

> on a server with 128 GB of RAM, over 70 GB of RAM was being held down by the kernel and left idle for an extended time. As you can imagine, this didn't help the ZFS ARC size, which got choked down to 20 GB or so.

That's a big issue. Is he supposed to periodically reboot his NFS servers to free up the idle RAM?
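For what it's worth, you can at least see where the memory is sitting on illumos without rebooting (a sketch; the mdb command needs root):

```shell
# Current ARC size and target, in bytes:
kstat -p zfs:0:arcstats:size zfs:0:arcstats:c
# Kernel memory breakdown (kernel, ZFS, anon, page cache, free):
echo ::memstat | mdb -k
```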


Here too we have data that borders on the anecdotal: he comes to conclusions first and then seems to look for data to backfill them. Should, as Garrett suggested, we adopt high water marks on magazines? Perhaps. Would this issue have addressed his situation? Unknown, because (once again) we have an issue that isn't being seen widely, represents suboptimal behavior (in contrast to fatal behavior), and lacks the hard data to be able to allow it to be definitively root-caused. Not a recipe for success, in any community.


I could be wrong, but isn't this the normal memory management in Solaris and its derivatives? Even if it's not ARC, memory that was used and is now marked as inactive is kept by the kernel in a different "accounting bucket," but it's still usable. What I suspect the OP is ticked off by is actually ARC back pressure and L1 ARC eviction, which can be horridly slow.


> The only thing I disagree with in the article is debugging vs. restarting. In the old model, where you have a sys admin per box, yes you might want to log in and manually tweak things. In big distributed systems, code should be designed to be restarted (i.e. prefer statelessness). That is your first line of defense, and a very effective one.

But if you never understand why it was a bad state in the first place you're doomed to repeat it. Pathologies need to be understood before they can be corrected. Dumping core and restarting a process is sometimes appropriate. But some events, even with stateless services, need in-production, live, interactive debugging in order to be understood.


> But some events, even with stateless services, need in-production, live, interactive debugging in order to be understood.

The question then becomes whether it is reproducible, since "debuggable when not running normally" seems to be the common thread of unikernels, such as being able to host the runtime in Linux directly rather than on a VM.

I think if you use a low-level language these kinds of things are going to bite you, but a fleshed-out unikernel implementation could be interesting for high-level languages, since they typically don't require low-level debugging steps in the actual production environment.

In either case unikernels have a lot of ground to cover before they can be considered for production.


You don't need to be able to log in to be able to support a remote debugging stub, though.


That same web page lists Linux as harmful.

http://harmful.cat-v.org/software/operating-systems/linux/

