
Trello is hiring for a bunch of different roles: Front-End and Back-End (Node), Internal Tools Developer, and Site Reliability Engineer are all listed at https://trello.com/jobs . Anywhere in the US.


We've also got a lot of Product roles coming up, specifically PM positions and some product design roles on the horizon. Feel free to reach out to me (Lydia M: lydia@trello.com) or my colleague Carrie (Carrie Marvin: carrie@trello.com).


Trello | New York, NY or REMOTE

Trello ( https://trello.com ) is a visual collaboration tool that creates a shared perspective on any project.

We're a remote-friendly, venture-backed startup headquartered in NYC.

We're growing quickly, but not the kind of quickly where we're hiring just for headcount numbers; we're hiring for quality. We're currently 92 employees total. Joining us at this stage empowers you to help define our future processes and what it means to be on the team, and lends itself to lots of exciting career growth.

We’ve got four roles open right now that might be of special interest to the HN crowd:

Site Reliability Engineer – you’ll work on keeping everything running efficiently as we scale the infrastructure supporting over 10 million Trello users on the way to our goal of 100 million users. ( https://trello.com/jobs/site-reliability-engineer )

Developer Advocate – it will be your mission to make it easy and fun to build on top of our API and new Power-Ups platform. ( https://trello.com/jobs/developer-advocate )

Growth Engineer – you'll be working to build features and run experiments to improve acquisition, engagement, retention, revenue, and virality. ( https://trello.com/jobs/growth-engineer )

IT Specialist (NYC) – you'll own and manage our cloud-based collaboration tools across the company and support our local and distributed employees for all things IT-related. ( https://trello.com/jobs/it-specialist )

Some other jobs and some of our perks can be seen at https://trello.com/jobs .




Works for me...


Yeah looks like it's back now.


The handshake agreement presented here makes me very happy.

Were many revisions done on it, or was it pretty clear to write?


I wrote it in one take 15 minutes before hitting Go.


Yes.

In the final analysis, this question is equivalent to "Is it difficult to make a site that can have any piece of it fail to load or be removed at any time?", since website developers have no control over what may or may not make it onto an ad blocker's list.


JavaScript has had try/catch since forever though?


Ad blockers don't just block third-party things, though. Or even necessarily sane things. Our site is a classified ad aggregator. At the time, at least, we learned that Adblock Plus would just blithely delete any element with 'ad' as part of its class. (A major section of our results page had the class 'ad block' (ironically) and was blocked.)

Yes, these things can be worked around, but not every site maintainer will bother. (Especially since those adblock users likely don't earn them any direct revenue.)


The most egregious site breakage from ad blockers is when sites can't handle the absence of their Adobe or Google JS. Some sites have large amounts of content that is only loaded via three levels of third-party-JS indirection. Others will try to fire an event to Google Analytics on every click, but if GA doesn't load, an error is thrown that prevents the click's default action.

I was making a $1000-range purchase on an ecommerce site that was utterly broken by my refusal to be tracked in intimate detail, and their product manager's response was basically "eh." So I worked around it by catching the errors for them in the browser inspector. This is the type of breakage that site developers absolutely should be fixing.


>when sites can't handle the absence of their Adobe

When sites that require Flash break, isn't that a win for the user?


Adobe makes more than Flash. They have a number of user tracking, profiling, and A/B testing systems, plus an email delivery and segmentation system. Some or all of those things started as Omniture. Google also has Google Tag Manager, which is a third-party JS for injecting more third-party JS. When half your site's content is loaded by Adobe into "mbox"es, triggered by JS loaded by Google Tag Manager, it's a nightmare for real developers and end users alike even with everything working.



If you're using Webpack, there is also a great and simple loader module here: https://github.com/deepsweet/autopolyfiller-loader

Works like a charm.


It looks like this is related to 'bot detection' services from http://www.whiteops.com/ , as White Ops is listed as the contact on tagsrvcs.com.

The HTML served back includes a script tag containing

  require(['http://static01.nyt.com/bi/js/tagx/tagx.js'], function () {


  http://static01.nyt.com/bi/js/tagx/tagx.js

loads

  http://s.tagsrvcs.com/2/818492/analytics.js?<bunch of query params>

which defines an 'encryption key' that is later used to 'encrypt' POST parameters back to tagsrvcs, but it looks a lot like a page load identifier, including being used to namespace 'postbacks' later on.

  tagsrvcs.com/.../analytics.js
loads http://s.tagsrvcs.com/2/4.10.0/loaded.js

  tagsrvcs.com/../loaded.js
posts back to URLs like http://s.tagsrvcs.com/2/4.10.0/818492/L0an5rrDUnnex1s8dQ4fGu...

(Note the 'L0an5rr...', which is the same as the 'encryption key' returned earlier in the 'analytics.js' file.)

This collects some fun things like WebRTC addresses, scripts loaded on the page, and much, much more, and sends them out as they become available or as more events happen.

If you'd like to investigate this information yourself, the best place to set a breakpoint is in 'loaded.js', inside the function 'e.prototype.emitAsCORSXHR', which is responsible for 'encrypting' the payload before transmission.



The difficulty of Levenshtein automata is highly overstated. When I read that Lucene blog post I wrote an implementation in an hour, and prior to reading that post I hadn't even heard of Levenshtein automata.


Unless you are the mythical 100x programmer, I doubt that you wrote a full implementation of general Levenshtein automata in an hour. I read the paper that introduced them ( http://link.springer.com/article/10.1007/s10032-002-0082-8 ) and they are quite the complex beast. Not to mention that the paper is very technical and you need to keep a dozen definitions in your head.

That said, there seems to be a fairly readable implementation at https://github.com/universal-automata/liblevenshtein

I'm currently working on implementing fast Levenshtein queries in C++ with a friend, and we intend to implement the paper I linked in my original post. So far, our dynamic programming Levenshtein already beats Lucene++ (C++ implementation of Lucene), which is a bit embarrassing [1]. If you're interested, more advanced stuff will hit https://github.com/xhochy/libfuzzymatch when we get around to implementing it.

[1] Lucene++ spends more time converting strings between UTF-8 and UTF-32 than it does computing Levenshtein distances, says the profiler.
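For reference, the dynamic programming approach mentioned above is just the textbook two-row algorithm. A minimal Python sketch (an illustration only, not our actual C++ code):

    def levenshtein(a, b):
        # Classic Wagner-Fischer, keeping only two rows of the matrix.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # skip a char of a
                               cur[j - 1] + 1,             # skip a char of b
                               prev[j - 1] + (ca != cb)))  # match/substitute
            prev = cur
        return prev[-1]

    assert levenshtein("banana", "cabana") == 2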


I'm not a 100x programmer, I just did a couple of things that drastically reduced the time:

1. I didn't follow that paper. Even trying to understand that paper would have taken way more time, so after 5 minutes of trying to understand it I gave up on that approach. See this comment for what I did do: https://news.ycombinator.com/item?id=9699870 That saved maybe 20x.

2. I used Python instead of C++ or Java. This saved 5x.

3. The code was throwaway quality code. This saved 2x.

Together that's 200x, but I'm at least a 2x worse programmer than them, so that gives you the 100x ;-)


(see my other comment as well)

An algorithmicist would say that all this saved you a constant factor of work for a linear slowdown ;)


That's a nice soundbite, but it's not correct. The worst-case performance with the DFA is linear, the same as theirs.


No, that's just not true. Your step function takes time linear in the length of the string. For example, `newstate = [0 for x in state]` takes Θ(|state|) time, and because you initialise the state with `range(len(string)+1)`, that's linear in the string length.


Now you're talking about the cost of constructing the DFA, not searching the index with the resulting DFA. The cost of constructing the DFA is irrelevant, and even then you can construct the DFA in O(n) with my method for a fixed max edit distance and fixed alphabet. Same as that paper.


I'd like to implement the same paper. Perhaps I'm missing something, but I'm not sure how the residual strings are created. Do you have a link to an implementation or a description of the residual strings?

I get that a residual string is the original string with a deletion, incrementing the deletions until you hit edit distance d. What I'm not sure about is if it's all permutations of possible deletions.


The residual strings are all subwords where exactly d letters were deleted. For d=1 and the word "Levenshtein", that would be {"evenshtein", "Lvenshtein", "Leenshtein", "Levnshtein", "Leveshtein", "Levenhtein", "Levenstein", "Levenshein", "Levenshtin", "Levenshten", "Levenshtei"}.

The paper does not specify how to generate those efficiently, and I haven't given it any thought yet. I don't know of any implementations of the paper, but this aspect of it should be common enough.

EDIT: sorry, didn't read your comment fully. I'm not sure what you mean by "all permutations of possible deletions". The d-deletion neighbourhood of w contains all subwords of w that you obtain by deleting any d letters from w. For d=2, take any two letters and remove them. N₂(jamra) = {jam, jar, jaa, jma, jmr, jra, ama, amr, ara, mra}.
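Generating that neighbourhood naively is only a few lines with itertools, for what it's worth (the function name is mine, and this is the brute-force construction, not an efficient scheme):

    from itertools import combinations

    def deletion_neighbourhood(word, d):
        # All subwords of `word` obtained by deleting exactly d letters.
        n = len(word)
        return {"".join(word[i] for i in range(n) if i not in cut)
                for cut in combinations(range(n), d)}

    # deletion_neighbourhood("jamra", 2) gives the ten words listed above.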

Does that make it clearer?


Yes, that makes it much clearer. I also found a FastSS implementation, which uses the same d-deletion neighbourhood. Here it is: http://fastss.csg.uzh.ch

I am looking at a Python implementation for examples.


Nice, that seems to be based on a similar idea to the paper I mentioned (but earlier and less refined).


The paper you mentioned reduces memory consumption hugely and averages out the query and insertion time. It's a good improvement.


I... have a hard time believing this.

Unless you're talking about a very simplified case (N=1 or something, maybe even for a specific word)?

On the other hand: Maybe the Lucene guys and I are just bad. :/


It was for the general case. The reason that I was able to do this is because I was less persistent than them. I tried to understand the paper but after 5 minutes I realized that even understanding the paper was going to take WAY more time than ignoring the paper and implementing it my own way. Here's how I did it.

If we are looking for

    string = "banana"
then we can represent the state of the automaton as the last row of the matrix that you get when you compute the Levenshtein distance between two fixed strings. The initial state is (in Python):

    def init():
        # Row 0 of the matrix: the distance of each prefix of `string`
        # from the empty input.
        return list(range(len(string)+1))
Then to take a step in the automaton:

    def step(state, char):
        # Compute the next row of the edit-distance matrix after
        # consuming one more character of the input.
        newstate = [0 for x in state]
        newstate[0] = state[0]+1
        for i in range(len(state)-1):
            if i < len(string) and string[i] == char:
                newstate[i+1] = state[i]              # match: free diagonal move
            else:
                newstate[i+1] = 1 + min(newstate[i],  # skip a char of `string`
                                        state[i],     # substitute
                                        state[i+1])   # skip a char of the input
        return newstate
We step like this:

    s0 = init()
    s1 = step(s0, 'c')
    s2 = step(s1, 'a')
    s3 = step(s2, 'b')
    s4 = step(s3, 'a')
    s5 = step(s4, 'n')
    s6 = step(s5, 'a')
Now we can compute a lower bound on the Levenshtein distance by taking min(s6). In this case it's 2. This means that whatever comes after "cabana", it will always have at least distance 2 to "banana". With this info we can prune away a search path in the full text index if that value is larger than our maximum edit distance.
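To make the pruning concrete, here is a rough sketch of intersecting the step() automaton with a trie-shaped index (the trie API with node.children and node.is_word is hypothetical):

    def fuzzy_search(root, n):
        # Walk the index trie, pruning any subtree whose lower bound
        # (min of the state) already exceeds the max edit distance n.
        results = []
        def walk(node, prefix, state):
            if min(state) > n:
                return                   # no extension can get within n
            if node.is_word and state[-1] <= n:
                results.append(prefix)   # state[-1] is the exact distance
            for char, child in node.children.items():
                walk(child, prefix + char, step(state, char))
        walk(root, "", init())
        return results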

That handful of lines of code is all you need to do fuzzy string search in practice. This represents the automaton as a step procedure. If you want, you can also generate a DFA from this (though it's probably not necessary in practice).

If your maximum edit distance is n, then if one of the numbers in the state is greater than n it doesn't matter what it is. In the above example s6 = [6, 5, 4, 3, 2, 3, 2]. If n = 3 then s6 = [4, 4, 4, 3, 2, 3, 2] is equivalent, because in the end it only matters whether a number is >3 or not. So you might as well cap the numbers at 4. Replace:

    newstate[i+1] = 1 + min(newstate[i],state[i],state[i+1])
with:

    newstate[i+1] = min(n+1, 1 + min(newstate[i],state[i],state[i+1]))
where n is the maximum edit distance (and cap the first entry the same way, newstate[0] = min(n+1, state[0]+1), otherwise it grows without bound). Now the state space of the automaton is finite, and you can generate a DFA from it by just exploring all the states with the step() function.

One more optimization is to not generate the DFA for the full alphabet. If your search word is "banana", then for the purposes of the automaton the letter 'x' is equivalent to the letter 'y', because both are not equal to any letter in "banana". So instead of creating a DFA for the full ASCII alphabet (or worse, the full Unicode alphabet), you can instead work with the reduced 4-letter alphabet (b, a, n, X), where X represents any letter other than b, a, n.
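For what it's worth, a rough sketch of that exploration, with the capped step folded in as a local function so the state space is guaranteed finite:

    def build_dfa(string, n):
        # Breadth-first exploration of all reachable states. Every entry
        # is clamped to at most n+1, so there are finitely many states.
        def capped_step(state, char):
            new = [min(state[0] + 1, n + 1)]
            for i in range(len(string)):
                if string[i] == char:
                    new.append(state[i])
                else:
                    new.append(min(1 + min(new[i], state[i], state[i + 1]),
                                   n + 1))
            return tuple(new)
        # Reduced alphabet: the letters of `string` plus '\0' standing in
        # for "any other letter" (assuming '\0' never occurs in real words).
        alphabet = sorted(set(string)) + ['\0']
        start = tuple(min(i, n + 1) for i in range(len(string) + 1))
        transitions = {}                 # (state, char) -> next state
        todo, seen = [start], {start}
        while todo:
            state = todo.pop()
            for char in alphabet:
                nxt = capped_step(state, char)
                transitions[(state, char)] = nxt
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
        return start, transitions

    # A state is accepting when state[-1] <= n.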

You could also do a hybrid where you generate the DFA lazily.

I don't know if that made sense, it's a bit difficult to explain in a short HN comment.


I'm not sure if I can follow - I'll give it some more thought and time. That said: you're doing something completely different as far as I can tell. You build an *automaton* based on an input word. That's not what the paper does, and not what I struggled with. The paper describes a general automaton plus a 'vector' created from the input word that you use for the steps.

At the moment I don't see how you could handle transpositions either.

I'm not saying that your approach is bad. But I do think that the 'I did it in an hour' comment was quite a bit misleading, if you basically ignored the paper and did something that is different in most ways.

The tradeoffs are immensely different - the whole point of the paper is that you're precomputing a lot of stuff so that the lookup is fast.


Computing the step() is extremely fast. And if that's not fast enough then do the DFA construction as I described. The approach is different, yes, but that's the point.


I commented on your step function in another comment so I'm going to skip that.

Your DFA construction, while a bit incomplete (you don't say how you do the transitions), achieves roughly the same thing as Levenshtein automata do. But you spend significantly more time to construct it. The point of the original paper was not to show that DFAs can be used to compute Levenshtein distance, but to show how to do it quickly and efficiently.


I replied to this in the other thread to avoid splitting the conversation in two.


You implemented an automaton that computes Levenshtein distances. However, Levenshtein automata are quite different from what you describe. Your automaton executes the basic Wagner-Fischer / Needleman-Wunsch / ... algorithm.

Btw, see also https://news.ycombinator.com/item?id=9698785 for another discussion on basically the same problem.


This is not correct. The end result of the step() based automaton is the same: it prunes exactly the same search paths as any Levenshtein automaton would. And the part where I described how to build the DFA gives you exactly a Levenshtein automaton DFA. The approach is different, yes, that's the point: it's much simpler and still does the job.


It's not nearly as efficient though: your step function requires O(len(string)) time no matter how well you prune. Since you have len(query) many steps, that gets you to O(len(string)*len(query)), aka quadratic time if they're roughly the same length. Levenshtein automata can do this in linear time because they spend time building the automaton first (preprocessing). So yes, you implemented an algorithm using automata that computes the same result. But you didn't implement Levenshtein automata.


In practice with Lucene, len(string) and len(query) are like 10. So it's totally irrelevant. Furthermore computing the step is extremely fast: you're just doing a handful of min()'s. Even a single cache miss is going to completely dominate that, let alone a disk seek. What matters is that you don't scan a 10 million word index, instead you want to prune your search to hit, say, 50 words in the index.

That's just the step() approach. After that I described how to build the DFA, which gets you the same optimal linear time that you want which does not depend on the query string size.

Reply to other comment:

> Your DFA construction, while a bit incomplete (you don't say how you do the transitions), achieves roughly the same thing as Levenshtein automata do. But you spend significantly more time to construct it. The point of the original paper was not to show that DFAs can be used to compute Levenshtein distance, but to show how to do it quickly and efficiently.

Why is it incomplete? You just follow the step() automaton and construct a DFA that does the same. Every time you hit a new state you create a new node in the DFA, and if you hit an old state you just point to a previously constructed DFA node. You can even do the DFA construction lazily.

> But you didn't implement Levenshtein automata.

A Levenshtein automaton for a word W is a finite state automaton that accepts all words V within edit distance n from W. That's exactly what I built. The point here is that you turn a 30 second search over the whole index into a 0.02 second search by pruning. If you can then optimize that to 0.015 by making the automaton more efficient that's nice but you can hardly claim that what I did is not a Levenshtein automaton because it's a bit slower (and it's not even clear to me that it would be slower).


You can argue about the lengths of these strings and the simplicity of your step function all day long; at the end of the day it's still Θ(n²) and no better than the simple dynamic programming approach. In fact, even for that it's a relatively bad implementation, because it constructs a new list in every step (newstate in your implementation), whereas you could just reuse the same two lists all the time.

An L1 cache miss, L2 hit costs around 10 cycles, and the L2 cache is more than sufficiently large. Even your initialisation loop takes more than that. The min loop has 11 instructions (branchless) per iteration for the minimum calculations in C++: https://goo.gl/wjhRtb - not taking into account superscalarity.

You have not shown how you prune the search, so I can't say anything about that. Of course that's the entire point of having an index.

Your DFA construction again is massively slower than what is done in the paper. The authors show how to construct the entire automaton in time O(|string|), whereas each step in your implementation takes that much time.

Whether or not you built Levenshtein automata is a pointless discussion. You say you built a DFA for Levenshtein distance (true). I'm saying that you didn't implement the paper. Both are correct.

Look, I'm not claiming you did anything wrong. I'm just pointing out that your implementation, while it was fast to write, is also much, much slower than their algorithm, and you shouldn't compare the two as if they were the same.


> You can argue about the lengths of these strings and the simplicity of your step function all day long; at the end of the day it's still Θ(n²) and no better than the simple dynamic programming approach.

It's not Θ(n²), but I forgot to mention that. You don't need to calculate the full state, only O(k) entries per step when your maximum edit distance is k, because the other entries are going to be greater than k anyway. So it's actually O(nk), which is linear in n (that paper assumes that the max edit distance and the alphabet size are fixed, so by their assumption O(nk) = O(n)).

But I think you are missing what a Levenshtein automaton is. The point of a Levenshtein automaton is not to calculate the edit distance. In fact, a Levenshtein automaton does not need to calculate the edit distance at all. What a Levenshtein automaton with max distance k to a word "banana" does is this: given a prefix like "caba", it determines whether there is a way to extend that prefix such that the distance to "banana" is at most k. The dynamic programming algorithm does not do this. They are two different (but of course related) problems.
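To illustrate the O(nk) point, here is a sketch of a banded step in the same style as before (my own formulation; the row is kept sparse, and entries outside the band are implicitly capped at k+1):

    def step_banded(state, char, t, k):
        # state: sparse row as {column: value}; t: number of characters
        # consumed before this step. Since row t+1 satisfies
        # d[t+1][j] >= |t+1-j|, only columns within k of the diagonal
        # can still be <= k, so each step touches O(k) cells.
        new = {}
        for j in range(max(0, t + 1 - k), min(len(string), t + 1 + k) + 1):
            if j == 0:
                new[0] = min(t + 1, k + 1)
                continue
            cost = 0 if string[j - 1] == char else 1
            new[j] = min(state.get(j - 1, k + 1) + cost,  # match/substitute
                         state.get(j, k + 1) + 1,         # skip input char
                         new.get(j - 1, k + 1) + 1,       # skip string char
                         k + 1)                           # cap at k+1
        return new

    # Initial row: {j: j for j in range(min(k, len(string)) + 1)}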

> In fact, even for that it's a relatively bad implementation because it constructs a new list in every step, whereas you could just reuse the same two lists (called score in your implementation) all the time.

Of course, it's a proof of concept, not a fast implementation. Plus it's not as simple as keeping two lists when you're searching a text index, because from the same state you will need to calculate multiple successor states. If you destroy the state s when stepping, you can't step from that same state s with another letter.

> An L1 cache miss, L2 hit costs around 10 cycles, and the L2 cache is more than sufficiently large

A normal full text index does not even fit in L3. Perhaps it does not even fit in memory.

> You have not shown how you prune the search, so I can't say anything about that. Of course that's the entire point of having an index.

Pruning the search works exactly the same way given any implementation of a Levenshtein automaton. It's just intersecting the automaton with the index.

> Whether or not you built Levenshtein automata is a pointless discussion. You say you built a DFA for Levenshtein distance (true). I'm saying that you didn't implement the paper. Both are correct.

I never claimed to implement the algorithm in the paper. Whether or not it's a Levenshtein automaton is not a pointless discussion. Lucene improved their fuzzy text search times by 100x by pruning the search. You can do that with my method too. That's why the goal here is "implement a Levenshtein automaton" not "implement a Levenshtein automaton with the algorithm described in paper XYZ".

> I'm just pointing out that your implementation, while it was fast to write, is also much much slower than their algorithm, and you shouldn't compare the two as if they were the same.

Even if my algorithm for constructing the DFA is 10x slower than theirs (which I HIGHLY doubt), it still wouldn't matter, because you can still use it to get their ~100x speedup: it does exactly the same pruning, and the cost of the automaton is just noise.


Yup, that's the blog post I was referring to; I thought it had been on some official Lucene blog. Thanks for linking it.


Perl is dynamic enough to support this.

See http://search.cpan.org/~ether/Moose-2.1403/lib/Moose/Manual/... and http://search.cpan.org/~ether/Moose-2.1403/lib/Moose/Manual....

Competing type systems, competing object systems, and enough internal run-time rewriting to do whatever it is you want.

See also http://search.cpan.org/~smueller/Filter-Simple-0.91/lib/Filt... for a simple rewriting filter, and the Acme namespace for all kinds of fancy experiments.

