Confluent, a company for Apache Kafka and realtime data (linkedin.com)
108 points by sbilstein on Nov 7, 2014 | 30 comments



Kafka is a Big Deal, and it's great to see it get first-class support.

Just as Hadoop is starting to feel outdated while HDFS doesn't seem to be going anywhere, I think we'll see a lot of innovation in stream processing frameworks in the next few years, but Kafka will just keep on going.


Good luck to the Confluent guys!

There's a missing piece in the realtime puzzle, at least one that I haven't been able to find, for which Kafka is overkill. Perhaps someone here knows of a solution:

I have tens of system endpoints connected through unreliable links (Line-of-sometimes-occluded-sight, 2G and 3G WWAN; some are in vehicles, so connections are intermittent).

I just want to consistently tail their logs in a bandwidth-efficient, connection-drop resistant way; and I can't find any standard thing that does this.

Kafka would fit the bill in general, but would require a lot of work (reading textual logs into Kafka, querying Kafka for new stuff across the connection, reading from Kafka and writing to text files) - and I'm not sure how well it deals with dropped connections.

My existing solution is to rsync the log directories (--append, --inplace) as infrequently as I can from an operational view, which is 1 minute. It is relatively bandwidth efficient (although could be much better), robust with respect to connection issues, and generally works.

However, it is less efficient than it could be: if directories have a lot of files, like /var/log often does, there's a lot of sync overhead. The delay is 1 minute instead of a couple of seconds (which is what you would get with a simple "tail -f" through a TCP connection), and it doesn't play well with common log rotation schemes (though that's relatively easy to work around).

Does anyone have a better solution, kafkaesque or otherwise?


Sounds like a job for logstash: http://logstash.net

If you don't mind using a 3rd party service, you could look into using Papertrail, Loggly, etc.


3rd party is not an option. I'll have to look at logstash, thanks.


How about a tiny forwarder on each device? It fopens the file and reads to the end; when the file is appended to, it will be able to continue reading. Packet up each line and forward it to your server using zeromq PUSH/PULL. If you are disconnected, the lines queue up and are sent when zeromq automagically reconnects.
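A minimal sketch of the forwarder idea above, not a tested production tool. It assumes the pyzmq bindings are installed; the endpoint `tcp://collector.example:5557`, the log path, and the helper names `read_new_lines` and `forward` are all hypothetical. It polls the file rather than using inotify, and it only forwards complete lines.

```python
import time
from typing import BinaryIO, Callable, Iterator, Optional


def read_new_lines(f: BinaryIO) -> Iterator[str]:
    """Yield complete lines appended since the last call.

    A trailing partial line is left unread so it can be picked up
    once the writer finishes it.
    """
    while True:
        pos = f.tell()
        raw = f.readline()
        if not raw:
            return  # reached the current end of the file
        if not raw.endswith(b"\n"):
            f.seek(pos)  # partial line: rewind and try again later
            return
        yield raw[:-1].decode("utf-8", errors="replace")


def forward(path: str, send: Callable[[str], None],
            poll_seconds: float = 1.0,
            max_polls: Optional[int] = None) -> None:
    """Read the file to its end, then keep polling for appended lines."""
    with open(path, "rb") as f:
        polls = 0
        while max_polls is None or polls < max_polls:
            for line in read_new_lines(f):
                send(line)
            time.sleep(poll_seconds)
            polls += 1


def main() -> None:
    # pyzmq is imported here so the helpers above work without it.
    import zmq

    sock = zmq.Context().socket(zmq.PUSH)
    sock.connect("tcp://collector.example:5557")  # hypothetical collector
    forward("/var/log/syslog", sock.send_string)  # hypothetical log path
```

Calling `main()` on a device would start forwarding. Anything queued lives in the zeromq socket's in-memory buffer, so it does not survive a restart of the forwarder process.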


Since logstash was already mentioned, it's worth putting https://github.com/elasticsearch/logstash-forwarder here, as the logstash team built it as a solution for logstash-type needs on systems that may not be able to support logstash itself.


Can your endpoints log to syslog so they can be forwarded on to a central collector?


Probably. Do you know of a syslog/forwarder that deals properly with intermittent connections? (i.e., available for an hour, then gone for an hour, then available again, etc.)


Not sure about logstash, but a lot of its users use NXLog (http://nxlog.org) as a shipper, since it has a much lower resource footprint: there is no Java or Ruby in there. (I'm affiliated with NXLog.)


Syslog-NG PE works well for this. It can be configured to use a per-destination disk buffer, so that if the destination goes offline, messages will queue until they can be sent again.


autossh + ssh -C + tail -F ?

Kafka sounds like enormous overkill. If you want to store the logs locally while tailing, just add in a tee.


Both this solution (and easytiger's above) work if disconnects happen rarely and for short times.

But in my case, I have "30 minutes on, 5 minutes off, 90 minutes on, 90 minutes off, 2 minutes on" kinds of situations, in which anything that doesn't track what was already transferred and what wasn't will lose data. (zeromq's buffers also have limited capacity and/or are tied to a process on the other side - if it restarts, the buffers are gone.)
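To make "track what was already transferred" concrete, here is a rough sketch using only the Python standard library. It is an illustration under stated assumptions, not any existing tool: the names `load_offset`, `save_offset`, and `ship_pending` are hypothetical, and `send_chunk` stands in for whatever transport is used and must return True only once the receiver has acknowledged the bytes. Rotation handling is deliberately naive: a file smaller than the saved offset is assumed to have been truncated or rotated.

```python
import os
from typing import Callable


def load_offset(state_path: str) -> int:
    """Return the last acknowledged byte offset, or 0 if none is saved."""
    try:
        with open(state_path) as f:
            return int(f.read().strip() or 0)
    except FileNotFoundError:
        return 0


def save_offset(state_path: str, offset: int) -> None:
    """Persist the offset via write-then-rename so a crash cannot corrupt it."""
    tmp = state_path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(offset))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, state_path)


def ship_pending(log_path: str, state_path: str,
                 send_chunk: Callable[[int, bytes], bool],
                 chunk_size: int = 64 * 1024) -> int:
    """Send bytes past the saved offset and return the new offset.

    The offset advances only after send_chunk reports success, so a
    dropped connection resumes from the same place on the next call.
    """
    offset = load_offset(state_path)
    if os.path.getsize(log_path) < offset:
        offset = 0  # assumption: the file shrank, so it was rotated or truncated
    with open(log_path, "rb") as f:
        f.seek(offset)
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # caught up with the writer
            if not send_chunk(offset, chunk):
                break  # link is down; retry from this offset later
            offset += len(chunk)
            save_offset(state_path, offset)
    return offset
```

Calling `ship_pending` in a loop, every few seconds, gives at-least-once delivery: if an acknowledgement is lost, the same chunk is resent with the same offset, so the receiver can use the offset to discard duplicates.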


It is surprising how long it has taken for distributed databases to become common practice. I'm glad that people are starting to move in that direction and open source their work. This is a win for our industry overall.

We're working on a database/cache/messaging system too: http://github.com/amark/gun. It is dedicated to removing the pain that I and other JavaScript/Node.js developers had when it came to managing/debugging databases (devops and sysadmin work is frustrating).


Good luck to these guys. It's certainly an up-and-coming area with some overlap with an existing, mature market. Tibco (the information bus company) is a $4b company with relatively dated technology... Informatica is another legacy solution in this area with tons of revenue. Either way, the market is certainly poised for some disruption.


In case anyone else is wondering why the article is linked to LinkedIn, Jay Kreps, co-founder/CEO was previously with LinkedIn. LinkedIn is also an investor of Confluence.


I doubt anybody is wondering that. The very first section is titled "Origin at LinkedIn."


Sounds somewhat like Confluence, this will get confusing. Already I can see joshmn has made the mistake in his comment.


I'm a bit confused. The more I read the doc, the more I feel it looks like Just Another Queue system. I don't really see the difference between Kafka and, let's say, RabbitMQ, Celery, or even statsd...


There are differences, the biggest of which is throughput. Kafka can handle incredible load. The messaging semantics are also a bit different. Here's a pretty good comparison:

http://www.quora.com/RabbitMQ-vs-Kafka-which-one-for-durable...


Congratulations guys and good luck on this next adventure! How does Samza fit into this new venture?


Samza's role in this is a question I had as well. But, certainly Hadoop needs some polish first.


I love the true engineers' launch, with no logo (twitter account is just the egg).


Does anyone have a mirror for those of us without LinkedIn accounts?



Thank you!



You don't need an account to read the post.


Are you sure? This is what I'm being redirected to when I click on the link:

http://i.imgur.com/w9YxLvA.png


Weird, that shouldn't be there. I double-checked in an incognito window and things seemed fine. LinkedIn blog posts are supposed to be visible with or without an account, so this must be a bug of some kind.


It doesn't seem to be doing it anymore. Strange! Maybe whatever it was got fixed.



