
cross-unit replication tuning and the load balancing / multipathing.

That's going to be interesting. At those speeds (1.5GB/s, 850k IOPS) you must be bumping[1] into all kinds of DRBD bottlenecks?

Also I'm very curious about your approach to redundancy/multipath. Dual-primary or failover?

[1] http://blog.gmane.org/gmane.comp.linux.drbd/page=41


I have no illusions that I'll be able to maintain such performance with DRBD replication and iSCSI overheads.

My current feeling on load balancing is to keep it simple: have half the clients treat node A as their active storage and the other half treat node B as theirs. If node A becomes unavailable, failover will occur and node B will serve all clients.
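Roughly, that split looks like this from the client side - a minimal sketch only, with made-up portal addresses and a hypothetical connect helper, nothing final:

    # Minimal sketch of the "half prefer A, half prefer B" split described
    # above. Portal addresses are made-up placeholders.
    import socket

    PORTALS = ["10.0.0.11", "10.0.0.12"]    # node A, node B (hypothetical)
    ISCSI_PORT = 3260                       # standard iSCSI portal port

    def preferred_order(client_id: int) -> list[str]:
        """Even-numbered clients prefer node A, odd-numbered prefer node B."""
        return PORTALS if client_id % 2 == 0 else list(reversed(PORTALS))

    def connect_with_failover(client_id: int, timeout: float = 2.0) -> socket.socket:
        """Try the preferred portal first; fall back to the other node."""
        last_err = None
        for addr in preferred_order(client_id):
            try:
                return socket.create_connection((addr, ISCSI_PORT), timeout=timeout)
            except OSError as err:
                last_err = err          # node unreachable, try the other portal
        raise last_err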

As far as multipathing goes - my approach is that I honestly have no idea. That's something that's going to be learnt along the way, and I'm uncertain whether it'll make it into the final build or not.

Edit: I can't get that link to load on my connection at present, but I've bookmarked it to look at tomorrow. If you have any experience / advice I'm always very open to assistance!


I was quite amused to see this posted as we built nearly the exact same thing about 6 months ago (minus the NVMe drives, which weren't available). We are using ZFSonLinux on top of DRBD instead of mdadm, which has been surprisingly successful.

The biggest frustration we ran into was trying to use LACP to get a better data rate out of DRBD. Regardless of the LACP mode, we were not able to get DRBD to scale to the full bandwidth (20Gbps) and had to settle for using it for redundancy only.

I imagine with your setup (read: much faster peak write capability), you might actually be able to bottleneck DRBD pretty well over a single 10GbE pipe if you were writing at peak capacity. I'll be interested to see if you happen upon a workaround.
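Back-of-the-envelope, the line-rate arithmetic alone suggests it (pure link capacity, before TCP/IP or DRBD framing is counted):

    # Can a single 10GbE link keep up with the quoted ~1.5 GB/s peak writes?
    # Line rate only; protocol overhead would lower the usable figure further.
    link_gbps = 10
    link_bytes_per_s = link_gbps * 1e9 / 8        # = 1.25 GB/s

    peak_write_bytes_per_s = 1.5e9                # ~1.5 GB/s quoted above

    shortfall = peak_write_bytes_per_s - link_bytes_per_s
    print(f"Link capacity : {link_bytes_per_s / 1e9:.2f} GB/s")
    print(f"Peak writes   : {peak_write_bytes_per_s / 1e9:.2f} GB/s")
    print(f"Shortfall     : {shortfall / 1e9:.2f} GB/s")   # ~0.25 GB/s short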

Load balancing as you've suggested is an intriguing compromise. Have you tried a two-way heartbeat setup where both boxes have their own active IP and can fail over to each other?


I wonder if the lack of help from LACP was because DRBD uses only one TCP session, which would get hashed to just one of the member links?


Might be a case where MPTCP (http://www.multipath-tcp.org/) would help with aggregating multiple links.


Correct me if I'm wrong, but I'm not sure MPTCP would help, as LACP is generally SRC/DST MAC based and uses a single IP?


You are right in that LACP bonding methods cannot increase the throughput of a single flow beyond that of any single link in the group (the balancing method can utilise src/dst MAC, IP and sometimes L4 ports).
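As a rough illustration (a simplified model of a layer3+4 hash policy, not the actual bonding driver code), a single 4-tuple always lands on the same member link:

    # Simplified model of layer3+4 flow hashing on a 2-link LACP bond.
    def pick_link(src_ip, dst_ip, src_port, dst_port, num_links=2):
        return hash((src_ip, dst_ip, src_port, dst_port)) % num_links

    # One DRBD replication session is one TCP connection, so every packet of
    # it maps to the same link (7789 is a typical DRBD resource port):
    assert pick_link("10.0.0.11", "10.0.0.12", 49152, 7789) == \
           pick_link("10.0.0.11", "10.0.0.12", 49152, 7789)

    # Many distinct flows, by contrast, spread across the members:
    print({pick_link("10.0.0.11", "10.0.0.12", p, 3260) for p in range(49152, 49252)})
    # -> typically {0, 1}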

MPTCP establishes multiple subflows across individual IP paths and can load balance or fail over across all of them. Applications do not need to be rewritten to take advantage of it, but I'm not sure whether that extends to kernel modules like DRBD. I suppose someone needs to find out :)
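For userspace applications the switch really is that small; a minimal sketch, assuming a Linux kernel with MPTCP support and Python 3.10+ for socket.IPPROTO_MPTCP (peer address/port are hypothetical):

    # Open an MPTCP socket instead of a plain TCP one; whether extra subflows
    # are actually used depends on the host's MPTCP endpoint/path configuration.
    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_MPTCP)
    sock.connect(("10.0.0.12", 7789))    # hypothetical peer
    sock.sendall(b"data that may be spread over multiple subflows")
    sock.close()

DRBD opens its connections in-kernel rather than through a userspace socket, so it would need the equivalent change inside the module - which is exactly the open question.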


I could have a play and report back if I get time?


Hi FlyingAvatar, I've worked with some rotational storage solutions in the past built on ZFS on BSD, with mixed success - the features were great but we had quite a few issues with performance. If you've been down the DRBD performance road, I wonder if you'd be open to having a chat with me about your findings / experience?
