I know it's old hat on HN, but I just wanted to point out how close to science fiction this article is. This technique to edit genomes is only a decade old. This startup is able to run sub-second searches without requiring any of their own infrastructure.
While reading, I just kept wondering why this search needs to be in the cloud at all. Finding 20 byte strings in 3GB can be done on a laptop very quickly.
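For a sense of scale, an exact scan really is trivial on one machine. A minimal sketch (the genome and 20-base guide below are made-up placeholders; a real run would stream a ~3 GB FASTA file instead):

```python
# Toy sketch of an exact scan: the genome and guide are invented for
# the demo; a real search would read the genome from disk.

def find_guide(genome: str, guide: str) -> list:
    """Return every offset where the guide occurs exactly."""
    hits = []
    pos = genome.find(guide)
    while pos != -1:
        hits.append(pos)
        pos = genome.find(guide, pos + 1)
    return hits

guide = "GACGTTAGCTAGGATCCTAA"              # 20-base query, invented
genome = "ACGT" * 1000 + guide + "TGCA" * 1000
hits = find_guide(genome, guide)            # [4000]
```

`str.find` is a fast C-level substring search, so even a naive loop like this chews through gigabytes per second.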
But in this case, it does seem incredibly advantageous to be able to scale up to any number of parallel searches, and to be able to search arbitrary new genomes.
Alignment is a lot harder than that. Keep in mind that there are natural errors and variation, so what you're looking for is the most likely spot in the genome.
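A toy illustration of why this is harder than an exact search: with errors and variation you have to score candidate positions rather than just `find` the string. The sketch below uses plain Hamming distance; real aligners use much smarter algorithms (e.g. seed-and-extend against an index), and the sequences here are invented:

```python
# Brute-force approximate matching: score every offset by mismatch
# count (Hamming distance) and keep the best. O(genome * guide) time,
# so this is for intuition only, not for a 3 GB genome.

def best_hit(genome: str, guide: str):
    """Return (offset, mismatches) for the best-matching position."""
    k = len(guide)
    best_offset, best_mm = None, k + 1
    for i in range(len(genome) - k + 1):
        mm = sum(1 for a, b in zip(genome[i:i + k], guide) if a != b)
        if mm < best_mm:
            best_offset, best_mm = i, mm
    return best_offset, best_mm

guide = "ACGTACGTACGTACGTACGT"                      # invented 20-base query
genome = "TTTT" + "ACGTACGTACGTACGTACGA" + "GGGG"   # one variant base in the target
offset, mismatches = best_hit(genome, guide)        # (4, 1)
```

An exact `find` would miss this hit entirely because of the single variant base.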
The ability to search multiple genomes is also useful. Perhaps you don't have access to any genomes locally: use this service to scan a heap of human (or mouse, or whatever) genomes and find an appropriate match.
AWS Lambda is great for inconsistent atomic workloads. However, I had a fairly disappointing experience with Lambda when I tested it just last week.
For example, you cannot send dynamic response headers using AWS API Gateway (the complementary service used to expose HTTP endpoints). In my case I wanted to change the MIME type depending on a JSON vs. JSONP response.
It's also not possible to connect Lambda directly to ElastiCache; mostly you are expected to work with S3 or DynamoDB (Amazon's proprietary JSON store, and the service largely responsible for the recent outage in US East). ElastiCache would allow easy persistence, which is why it's surprising it can't be connected to, given that it's an AWS service. (You can reach it by creating an EC2 proxy, but that would defeat the purpose of a serverless architecture.)
Some other oddities: API Gateway sniffs the response body to set HTTP headers, as opposed to just allowing your Lambda function to set the headers directly, and it fully parses the JSON response as opposed to doing a regex match.
I've been playing around with API Gateway & Lambda a bit lately, and it definitely feels like these services are sometimes built by teams that don't talk to each other.
API Gateway tries really hard to HIDE things from you. For instance, you can't see what the requested URL was without using a fair bit of VTL to put it back together from other variables. And only lately can you get a full list of query parameters without having to specify them at the time of API creation. In fact, it seems like most of the work on API Gateway since its release has been to let end users have more access to data it hid in the first place.
I'm a huge fan of CRISPR. I've been following it closely since I heard Radiolab's podcast about it.
I'm also the founder of the JAWS framework, which is an open-source application framework built entirely on AWS Lambda and AWS API Gateway: https://github.com/jaws-framework/JAWS
I would LOVE to grab a coffee with you or anyone on your team some time, and chat about lambda or CRISPR, or anything really :) I live in Oakland and my email address is austen[at]servant.co
Also, will you be at Re:invent? I'm doing a breakout session on JAWS and I'll be there all week.
Yes, I've used it to call out to phantomjs, imagemagick, basically anything that will run on the AWS-EC2-Linux environment. Node or Java can be simple wrappers.
We've also been hacking away in the same serverless space with StackHut (http://www.stackhut.com).
We're adding support for more languages (JS & Python currently), but we also have an entire build chain that lets you specify any Linux OS and language packages you want, and it's also trivial to embed your binaries and shell out to them if needed.
I'd be nervous about java - the startup time of the JVM plus the slow execution at the beginning until everything gets JITed makes me think overhead could easily trump execution.
To improve performance, AWS Lambda may choose to retain an instance of your function and reuse it to serve a subsequent request, rather than creating a new copy. Your code should not assume that this will always happen.
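That reuse behavior is exactly what makes in-memory caching work on Lambda: module-level state survives between invocations only when the container is reused, so it has to be treated as a best-effort cache. A sketch of the pattern (the handler shape and genome name are hypothetical):

```python
# Module-level state lives as long as this container instance does;
# a cold start gets a fresh, empty cache.

_cache = {}

def load_genome(name):
    """Stand-in for an expensive fetch, e.g. pulling a chunk from S3."""
    return "<genome data for %s>" % name

def handler(event, context=None):
    name = event["genome"]
    hit = name in _cache                  # warm container: skip the fetch
    if not hit:                           # cold container: fetch and cache
        _cache[name] = load_genome(name)
    return {"data": _cache[name], "cache_hit": hit}

cold = handler({"genome": "GRCh38"})      # first call in this container
warm = handler({"genome": "GRCh38"})      # reused container: cache hit
```

The key discipline, per the quoted docs, is that the handler must still produce a correct result when the cache is empty.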
"Our old server infrastructure cost thousands of dollars each month just for server costs.
Using the new Lambda infrastructure, we pay for the number of Lambda invocations, the total duration of the requests, and the number of S3 requests. This comes out to $60/month for hundreds of thousands of CRISPR searches!"
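A back-of-envelope check of that bill shape, using Lambda's list prices at the time ($0.20 per 1M requests, $0.00001667 per GB-second). The workload numbers below are invented placeholders, not Benchling's actual figures:

```python
# Assumed Lambda list prices; the request volume, duration, and memory
# size below are made-up illustration values.
PER_MILLION_REQUESTS = 0.20
PER_GB_SECOND = 0.00001667

def lambda_monthly_cost(invocations, avg_seconds, memory_gb):
    requests = invocations / 1e6 * PER_MILLION_REQUESTS
    compute = invocations * avg_seconds * memory_gb * PER_GB_SECOND
    return requests + compute

# e.g. 500k searches/month, averaging 1.5 s each at 1 GB of memory:
monthly = lambda_monthly_cost(500_000, 1.5, 1.0)   # roughly $12.60, before S3 costs
```

Even with generous assumptions, the duration charge dominates and the total lands in the tens of dollars, which is consistent with the figure quoted above once S3 requests are added.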
Well, how much of that money was spent on EBS storage for your copies of the genome data?
EC2 instances could read from S3 directly, as Lambda does; maybe that could alleviate the cost a lot.
Using S3-backed AMI instances could save a lot too.
My friend is refactoring an app at his company right now, using only Lambda via JAWS and we ran some numbers on the cost savings. He's retiring 2 EC2 c3.large instances which were costing $2.97/day. On Lambda the app will cost $0.05/day.
We don't hear about it nearly enough yet, but the cost savings of building apps on Lambda are huge. Then you add in the time saved on devops... and you realize how seriously disruptive this tech is.
Yes, this is the key point of microservices: it's not the modularity, nor cross-language nor "webscale" etc.
It's cheaper, because more efficient use of resources, because finer-grained. Each component of an app only gets what it needs; and the vendor can sell that unused capacity to someone else.
Geometrically speaking, finer grains pack tighter, wasting less space.
Of note, the latest thing in reference genomes is representing them as a graph data structure, which importantly allows variation to be incorporated. Some of the newest methods for mapping short DNA fragments (the kind that come out of the most common type of sequencers) take this approach. They use a genome index though, which takes a lot of computational effort to build beforehand.
Anyway, Benchling wants to avoid genome indexes, from the sounds of it, in case users upload their own genomes. Having said that, if someone is doing multiple searches, it would quickly become more efficient to just index the genome. I would have thought most people seriously concerned about off-target CRISPR hits would be using high-quality reference genomes, though.
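For what it's worth, even a naive k-mer index is a single pass to build and then makes each repeat search a hash lookup plus verification. A toy sketch (the value of k and the sequences are illustrative only):

```python
# Toy k-mer index: build once, then seed each search with the guide's
# first k-mer and verify the full match at each candidate offset.
from collections import defaultdict

def build_index(genome, k=8):
    """Map every k-mer to the list of offsets where it occurs."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def search(index, genome, guide, k=8):
    """Look up the seed k-mer, then check the full guide at each hit."""
    return [i for i in index.get(guide[:k], [])
            if genome[i:i + len(guide)] == guide]

genome = "TTTTACGTACGTGGGGACGTACGTCCCC"    # invented sequence
index = build_index(genome)                # built once, reused per search
hits = search(index, genome, "ACGTACGT")   # [4, 16]
```

The trade-off is exactly the one described above: the index costs memory and build time up front, which only pays off across many searches of the same genome.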
I recently started looking harder at Lambda after realizing that you can use 1M requests/month for free indefinitely. I just worry about vendor lock-in with services like this: if for whatever reason you want to move away, it's a rewrite at best. If Amazon were to open-source the Lambda implementation, allowing me to run my services somewhere else with a config change, I'd probably buy into it completely and never move away.
I write/test locally and then deploy to AWS with a single command. The lock-in is helped by the fact that (a) the touchpoints and interface/interaction of AWS Lambda are pretty simple and (b) I could spin up a production version of node-lambda too.
The code deployment side is helped by using S3 as an interim place to upload and deploy packages from. The CLI makes that nice and easy once set-up.
If Amazon open-sourced the Lambda engine but didn't make it super easy to stand up your own standalone Lambda service (as in a one-button deploy; nothing intentionally obstructive, just not turnkey), would that be better?
On the projects I've open sourced, we mostly did it to make debugging and extension authoring easier. I've gotten comments from people that this makes them feel better about vendor lock-in, but honestly, I haven't seen many people try to stand up their own service. Would you say that matches your own expectations?
We’ve also been hacking away in the same space with StackHut (http://www.stackhut.com) - build stateless & ephemeral microservices in Python and JS that are deployed as cloud APIs for access over JSON-RPC; with simple type-checking, dependencies, shelling out to any other code/binaries, and more.
Most code is open-sourced at http://www.github.com/StackHut and we're working on making it easily deployable on your own hardware.
Usually you use one Lambda for one specific need: get the request, do the work, and return the response. I don't see how that gets you locked in. The concern for me is the limit on uploaded file size.
Check out JAWS, it has Lambda optimization built in. It browserifies and minifies your code before deploying it. You will regularly see 20 MB files shrink to 100 KB.
Crispr is a homing system. It allows you to address a specific part of (genetic) memory. It is a single component in a much larger system. A required, and otherwise missing component. But it is just a component.
They might be able to save a bit on costs by caching locally. Lambda instances can be reused if TPS is high enough. I think the limit is 500MB in the /tmp directory.
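A sketch of that /tmp caching idea (the helper and key names are mine, not from the article; the cached file survives only as long as the container is reused):

```python
# Cache downloaded objects in /tmp between invocations. On a cold
# container the file is missing and we fetch (e.g. from S3); on a warm
# container we reuse the local copy and skip the network round trip.
import os

CACHE_DIR = "/tmp"  # Lambda's writable scratch space

def cached_fetch(key, fetch):
    """Return cached bytes from /tmp, or call fetch() and cache the result."""
    path = os.path.join(CACHE_DIR, key)
    if os.path.exists(path):            # warm container: reuse the download
        with open(path, "rb") as f:
            return f.read()
    data = fetch()                      # cold container: do the real fetch
    with open(path, "wb") as f:
        f.write(data)
    return data
```

Since /tmp space is capped, anything bigger than the limit (a full genome, say) would still need to be chunked or streamed from S3.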
How are you getting such quick responses from S3? In our own testing using Java, it was taking over 500 ms just to initiate the connection to S3 from Lambda.
It costs them less than $100 a month.
It was written by an intern.