I know it's old hat on HN, but I just wanted to point out how close to science fiction this article is. This technique to edit genomes is only a decade old. This startup is able to run sub-second searches without requiring any of their own infrastructure.
While reading, I just kept wondering why this search needs to be in the cloud at all. Finding 20 byte strings in 3GB can be done on a laptop very quickly.
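For a sense of scale, an exact scan really is trivial on one machine. A minimal sketch (the genome and 20-base guide below are made-up placeholders; a real run would stream a ~3 GB FASTA file instead):

```python
# Toy sketch of an exact scan: the genome and guide are invented for
# the demo; a real search would read the genome from disk.

def find_guide(genome: str, guide: str) -> list:
    """Return every offset where the guide occurs exactly."""
    hits = []
    pos = genome.find(guide)
    while pos != -1:
        hits.append(pos)
        pos = genome.find(guide, pos + 1)
    return hits

guide = "GACGTTAGCTAGGATCCTAA"              # 20-base query, invented
genome = "ACGT" * 1000 + guide + "TGCA" * 1000
hits = find_guide(genome, guide)            # [4000]
```

`str.find` is a fast C-level substring search, so even a naive loop like this chews through gigabytes per second.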
But in this case, it does seem incredibly advantageous to be able to scale up to any number of parallel searches, and to be able to search arbitrary new genomes.
Alignment is a lot harder than that. Keep in mind that there are natural errors and variation, so what you're looking for is the most likely spot in the genome.
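A toy illustration of why this is harder than an exact search: with errors and variation you have to score candidate positions rather than just `find` the string. The sketch below uses plain Hamming distance; real aligners use much smarter algorithms (e.g. seed-and-extend against an index), and the sequences here are invented:

```python
# Brute-force approximate matching: score every offset by mismatch
# count (Hamming distance) and keep the best. O(genome * guide) time,
# so this is for intuition only, not for a 3 GB genome.

def best_hit(genome: str, guide: str):
    """Return (offset, mismatches) for the best-matching position."""
    k = len(guide)
    best_offset, best_mm = None, k + 1
    for i in range(len(genome) - k + 1):
        mm = sum(1 for a, b in zip(genome[i:i + k], guide) if a != b)
        if mm < best_mm:
            best_offset, best_mm = i, mm
    return best_offset, best_mm

guide = "ACGTACGTACGTACGTACGT"                      # invented 20-base query
genome = "TTTT" + "ACGTACGTACGTACGTACGA" + "GGGG"   # one variant base in the target
offset, mismatches = best_hit(genome, guide)        # (4, 1)
```

An exact `find` would miss this hit entirely because of the single variant base.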
The ability to search multiple genomes is also useful. Perhaps you don't have access to any genomes locally: use this service to scan a heap of human (or mouse, or whatever) genomes and find an appropriate match.
AWS Lambda is great for inconsistent atomic workloads. However, I had a fairly disappointing experience with Lambda when I tested it just last week.
For example, you cannot send dynamic response headers using AWS API Gateway (the complementary service used to expose HTTP endpoints). In my case I wanted to change the MIME type depending on a JSON vs. JSONP response.
It's also not possible to connect Lambda directly to ElastiCache; mostly you are expected to work with S3 or DynamoDB (Amazon's proprietary JSON store, and the service largely responsible for the recent outage in US East). ElastiCache would allow easy persistence, which is why it's surprising it can't be connected to, given that it's an AWS service. (You can reach it by creating an EC2 proxy, but that would defeat the purpose of a serverless architecture.)
Some other oddities: API Gateway sniffs the response body to set HTTP headers, as opposed to just allowing your Lambda function to set the headers directly, and it fully parses the JSON response as opposed to doing a regex match.
I've been playing around with API Gateway & Lambda a bit lately, and it definitely feels like these services are sometimes built by teams that don't talk to each other.
API Gateway tries really hard to HIDE things from you. For instance, you can't see what the requested URL was without using a fair bit of VTL to put it back together from other variables. And only lately can you get a full list of query parameters without having to specify them at the time of API creation. In fact, it seems like most of the work on API Gateway since its release has been to let end users have more access to data it hid in the first place.
I'm a huge fan of CRISPR. I've been following it closely since I heard Radiolab's podcast about it.
I'm also the founder of the JAWS framework, which is an open-source application framework built entirely on AWS Lambda and AWS API Gateway: https://github.com/jaws-framework/JAWS
I would LOVE to grab a coffee with you or anyone on your team some time, and chat about lambda or CRISPR, or anything really :) I live in Oakland and my email address is austen[at]servant.co
Also, will you be at Re:invent? I'm doing a breakout session on JAWS and I'll be there all week.
Yes, I've used it to call out to phantomjs, imagemagick, basically anything that will run on the AWS-EC2-Linux environment. Node or Java can be simple wrappers.
We've also been hacking away in the same serverless space with StackHut (http://www.stackhut.com).
We're adding support for more languages (JS & Python currently), but we also have an entire build chain that lets you specify any Linux OS and language packages you want, and it's also trivial to embed your binaries and shell out to them if needed.
I'd be nervous about java - the startup time of the JVM plus the slow execution at the beginning until everything gets JITed makes me think overhead could easily trump execution.
To improve performance, AWS Lambda may choose to retain an instance of your function and reuse it to serve a subsequent request, rather than creating a new copy. Your code should not assume that this will always happen.
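That reuse behavior is exactly what makes in-memory caching work on Lambda: module-level state survives between invocations only when the container is reused, so it has to be treated as a best-effort cache. A sketch of the pattern (the handler shape and genome name are hypothetical):

```python
# Module-level state lives as long as this container instance does;
# a cold start gets a fresh, empty cache.

_cache = {}

def load_genome(name):
    """Stand-in for an expensive fetch, e.g. pulling a chunk from S3."""
    return "<genome data for %s>" % name

def handler(event, context=None):
    name = event["genome"]
    hit = name in _cache                  # warm container: skip the fetch
    if not hit:                           # cold container: fetch and cache
        _cache[name] = load_genome(name)
    return {"data": _cache[name], "cache_hit": hit}

cold = handler({"genome": "GRCh38"})      # first call in this container
warm = handler({"genome": "GRCh38"})      # reused container: cache hit
```

The key discipline, per the quoted docs, is that the handler must still produce a correct result when the cache is empty.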
"Our old server infrastructure cost thousands of dollars each month just for server costs.
Using the new Lambda infrastructure, we pay for the number of Lambda invocations, the total duration of the requests, and the number of S3 requests. This comes out to $60/month for hundreds of thousands of CRISPR searches!"
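A back-of-envelope check of that bill shape, using Lambda's list prices at the time ($0.20 per 1M requests, $0.00001667 per GB-second). The workload numbers below are invented placeholders, not Benchling's actual figures:

```python
# Assumed Lambda list prices; the request volume, duration, and memory
# size below are made-up illustration values.
PER_MILLION_REQUESTS = 0.20
PER_GB_SECOND = 0.00001667

def lambda_monthly_cost(invocations, avg_seconds, memory_gb):
    requests = invocations / 1e6 * PER_MILLION_REQUESTS
    compute = invocations * avg_seconds * memory_gb * PER_GB_SECOND
    return requests + compute

# e.g. 500k searches/month, averaging 1.5 s each at 1 GB of memory:
monthly = lambda_monthly_cost(500_000, 1.5, 1.0)   # roughly $12.60, before S3 costs
```

Even with generous assumptions, the duration charge dominates and the total lands in the tens of dollars, which is consistent with the figure quoted above once S3 requests are added.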
Well, how much of that money was spent on EBS storage for your copies of the genome data?
EC2 instances could read from S3 directly, as Lambda does; maybe that could alleviate the cost a lot.
Using S3-backed AMI instances could save a lot too.
My friend is refactoring an app at his company right now, using only Lambda via JAWS and we ran some numbers on the cost savings. He's retiring 2 EC2 c3.large instances which were costing $2.97/day. On Lambda the app will cost $0.05/day.
We don't hear about it nearly enough yet, but the cost savings of building apps on Lambda are huge. Then you add in the time saved on devops... and you realize how seriously disruptive this tech is.
Yes, this is the key point of microservices: it's not the modularity, nor cross-language nor "webscale" etc.
It's cheaper, because more efficient use of resources, because finer-grained. Each component of an app only gets what it needs; and the vendor can sell that unused capacity to someone else.
Geometrically speaking, finer grains pack tighter, wasting less space.
Of note, the latest thing in reference genomes is representing them as a graph data structure, which importantly allows variation to be incorporated. Some of the newest methods for mapping short DNA fragments (the kind that come out of the most common type of sequencers) take this approach. They use a genome index though, which takes a lot of computational effort to build beforehand.
Anyway, Benchling wants to avoid genome indexes, from the sounds of it, in case users upload their own genomes. Having said that, if someone is doing multiple searches, it would quickly become more efficient to just index the genome. I would have thought most people seriously concerned about off-target CRISPR hits would be using high-quality reference genomes, though.
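For what it's worth, even a naive k-mer index is a single pass to build and then makes each repeat search a hash lookup plus verification. A toy sketch (the value of k and the sequences are illustrative only):

```python
# Toy k-mer index: build once, then seed each search with the guide's
# first k-mer and verify the full match at each candidate offset.
from collections import defaultdict

def build_index(genome, k=8):
    """Map every k-mer to the list of offsets where it occurs."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def search(index, genome, guide, k=8):
    """Look up the seed k-mer, then check the full guide at each hit."""
    return [i for i in index.get(guide[:k], [])
            if genome[i:i + len(guide)] == guide]

genome = "TTTTACGTACGTGGGGACGTACGTCCCC"    # invented sequence
index = build_index(genome)                # built once, reused per search
hits = search(index, genome, "ACGTACGT")   # [4, 16]
```

The trade-off is exactly the one described above: the index costs memory and build time up front, which only pays off across many searches of the same genome.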
I recently started looking harder at Lambda after realizing that you can use 1M requests/month for free indefinitely. I just worry about vendor lock-in with services like this: if for whatever reason you want to move away, it's a rewrite at best. If Amazon were to open-source the Lambda implementation, allowing me to run my services somewhere else with a config change, I'd probably buy into it completely and never move away.
I write/test locally and then deploy to AWS with a single command. The lock-in is helped by the fact that (a) the touchpoints and interface/interaction of AWS Lambda are pretty simple and (b) I could spin up a production version of node-lambda too.
The code deployment side is helped by using S3 as an interim place to upload and deploy packages from. The CLI makes that nice and easy once set-up.
If Amazon open-sourced the Lambda engine but didn't make it super easy to stand up your own standalone Lambda service (as in a one-button deploy; nothing intentionally obstructive, just not turnkey), would that be better?
On the projects I've open sourced, we mostly did it to make debugging and extension authoring easier. I've gotten comments from people that this makes them feel better about vendor lock-in, but honestly, I haven't seen many people try to stand up their own service. Would you say that matches your own expectations?
We’ve also been hacking away in the same space with StackHut (http://www.stackhut.com) - build stateless & ephemeral microservices in Python and JS that are deployed as cloud APIs for access over JSON-RPC; with simple type-checking, dependencies, shelling out to any other code/binaries, and more.
Most code is open-sourced at http://www.github.com/StackHut and we're working on making it easily deployable on your own hardware.
Usually you use one Lambda for one specific need: get the request, do the work, and return the response. I don't see how that gets you locked in. The concern for me is the limit on uploaded file size.
Check out JAWS, it has Lambda optimization built in. It browserifies and minifies your code before deploying it. You will regularly see 20 MB files shrink to 100 KB.
Crispr is a homing system. It allows you to address a specific part of (genetic) memory. It is a single component in a much larger system. A required, and otherwise missing component. But it is just a component.
They might be able to save a bit on costs by caching locally. Lambda instances can be reused if TPS is high enough. I think the limit is 500MB in the /tmp directory.
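A sketch of that /tmp caching idea (the helper and key names are mine, not from the article; the cached file survives only as long as the container is reused):

```python
# Cache downloaded objects in /tmp between invocations. On a cold
# container the file is missing and we fetch (e.g. from S3); on a warm
# container we reuse the local copy and skip the network round trip.
import os

CACHE_DIR = "/tmp"  # Lambda's writable scratch space

def cached_fetch(key, fetch):
    """Return cached bytes from /tmp, or call fetch() and cache the result."""
    path = os.path.join(CACHE_DIR, key)
    if os.path.exists(path):            # warm container: reuse the download
        with open(path, "rb") as f:
            return f.read()
    data = fetch()                      # cold container: do the real fetch
    with open(path, "wb") as f:
        f.write(data)
    return data
```

Since /tmp space is capped, anything bigger than the limit (a full genome, say) would still need to be chunked or streamed from S3.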
How are you getting such quick responses from S3? In our own testing using Java, it was taking over 500 ms just to initiate the connection to S3 from Lambda.
It costs them less than $100 a month.
It was written by an intern.