When Data Do Not Conform to Rows and Columns: Diagnosing Disease by T-Cells (github.com/jostmey)
79 points by jostmey on Feb 25, 2020 | hide | past | favorite | 18 comments



Datasets of T-cell receptors cannot be arranged as numbers into rows and columns, representing what we call non-conforming features. By forcing ourselves to develop methods to cope with non-conforming features, we have developed a new approach we call Dynamic Kernel Matching for classifying complex, non-conforming data. We think this approach may be useful for other problems.


In your README, you state:

> we implement the Needleman-Wunsch algorithm in TensorFlow.

What sort of performance do you get with your implementation on TensorFlow vs other implementations of Needleman-Wunsch?

What scores did you use for gap opens and extensions?


> What sort of performance do you get with TensorFlow vs other implementations of Needleman-Wunsch?

TensorFlow runs > 10,000 instances of the Needleman-Wunsch algorithm in parallel, but I believe each instance of the algorithm runs serially. This is very efficient when all the sequences are roughly the same length, allowing millions of sequences to be processed each second.

Here's my implementation of the Needleman-Wunsch algorithm: https://github.com/jostmey/dkm/blob/master/antigen-classific...

> What scores did you use for gap opens and extensions?

Another great question! There are two gap penalties. One for skipping weights and one for skipping features. If I know there are always more features than weights, then I set the penalty for skipping a feature to zero. This does not penalize the model for having more features than weights. I then set the penalty for skipping weights to a value approximating negative infinity. This ensures every weight is used and is not wasted.
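To make that penalty scheme concrete, here's a minimal NumPy sketch of Needleman-Wunsch scoring with separate gap penalties for skipping a feature versus skipping a weight (the repo's actual version is batched in TensorFlow; this is just an illustration, and the function name and defaults are my own):

```python
import numpy as np

def needleman_wunsch_score(similarity, gap_feature=0.0, gap_weight=-1e9):
    """Alignment score between a sequence of features (rows) and a set of
    weights (columns). similarity[i, j] scores matching feature i to weight j.
    gap_feature: penalty for skipping a feature (0, per the comment above).
    gap_weight: penalty for skipping a weight (~ -infinity, so every weight is used).
    """
    n, m = similarity.shape
    dp = np.zeros((n + 1, m + 1))
    # Borders: skipping only features (first column) or only weights (first row)
    dp[1:, 0] = np.cumsum(np.full(n, gap_feature))
    dp[0, 1:] = np.cumsum(np.full(m, gap_weight))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i, j] = max(
                dp[i - 1, j - 1] + similarity[i - 1, j - 1],  # match feature i to weight j
                dp[i - 1, j] + gap_feature,                   # skip feature i (free)
                dp[i, j - 1] + gap_weight,                    # skip weight j (forbidden)
            )
    return dp[n, m]
```

With gap_feature=0 and gap_weight near negative infinity, extra features are skipped at no cost while every weight must be matched, as described above.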


Why can they not be arranged as numbers into rows and columns, and why does not being able to represent them as numbers matter? Stats and ML people have been dealing with non-numeric data for a long time, after all.


Many approaches for dealing with symbols reduce the information to numbers. Consider regression on categorical variables. The categorical variables are usually converted into a vector of numbers using a one-hot encoding. Now, if you are classifying a sequence, the number of categorical variables will vary with the sequence length. Some sequences will be longer than others, resulting in an irregular number of features. Consequently, you will not have a fixed number of columns for your regression model.
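A quick sketch of the problem (a toy example with a hypothetical 4-symbol alphabet, not code from the project): one-hot encoding two sequences of different lengths yields feature vectors of different widths, so they cannot share a fixed design matrix.

```python
import numpy as np

ALPHABET = "ACGT"  # example symbol set

def one_hot(sequence):
    """One-hot encode a sequence: one row of len(ALPHABET) numbers per symbol."""
    encoding = np.zeros((len(sequence), len(ALPHABET)))
    for i, symbol in enumerate(sequence):
        encoding[i, ALPHABET.index(symbol)] = 1.0
    return encoding.flatten()  # flatten into a single feature vector

# Two sequences of different lengths produce different numbers of columns:
x1 = one_hot("ACG")    # 3 symbols * 4 -> 12 features
x2 = one_hot("ACGTA")  # 5 symbols * 4 -> 20 features
```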

Statistics and ML are vast, so there are lots of methods that cannot be pigeonholed like this. If you mention a specific method, we can contrast DKM to that method.


So, I don't know much about these methods, but if you don't have a fixed number of columns, the entity-attribute-value approach is clunky but can definitely model an N+ set of "columns".


You're definitely right that there are a number of simple-ish ways to represent the data, but in a traditional regression model the 102nd attribute being the letter A could mean something totally different from the 103rd attribute equaling A.

So the authors faced the challenge of having sequences of different lengths. In the past, one workaround that people have used with some success is taking a fixed-length summary (say, the proportion of As, Cs, Gs, and Ts). Then you control the order and avoid having to consider a possibly intractable number of sub-sequence comparisons, but you also lose information that might be important.


How would this compare to common types of graph neural nets?

Afaict many use random walks (=traces) to learn their model, except they are typically then used for a classifier over individual nodes/edges instead of paths/ graphs. I'm not sure of the natural formulation to reuse them here. Likewise, it is unclear if node reuse on a path here is meant to be meaningful, which would also seem to change natural encodings.


I've only just started looking into classifying graphs. I'm not really qualified to answer this question, but I will try anyway. Let's assume we are talking about undirected graphs, using deep-learning jargon.

DKM would not be a random walk through the graph. Features would be assigned to weights using a graph alignment algorithm, which is NP-hard. There are probably multiple ways to define graph alignment, so you have to pick an alignment strategy that makes sense given the problem. For example, we may want to reuse the same weights whenever the graph forks, ensuring each branch is treated the same way. Once we have assigned features to weights, we can classify the graph with only a single neuron. We can also use a deep model, if we choose.

Some graph neural networks reduce the graph in a feed-forward manner. These models, which can only represent directed interactions between nodes, are inappropriate for undirected graphs.

Some graph neural networks require multiple neurons to represent the graph, in contrast to DKM, which can use only a single neuron.

There are types of graph neural networks that I have yet to understand, so I cannot compare these models to DKM.

Feel free to email (https://news.ycombinator.com/user?id=jostmey) if you want to discuss this more. I would be happy to provide insight if you want to implement a graph-DKM model.


If I understand this, this model (dkm, "dynamic kernel matching") is to sequences as convolutional networks are to images?


Yeah, that's right. Just as a convolutional network picks a patch from an image using a max-pooling operation, dynamic kernel matching picks symbols in a sequence using a max-alignment operation.
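The analogy can be sketched with toy code (brute force for clarity, and entirely my own construction — the alignment here is the constrained "use every weight, in order" case that Needleman-Wunsch computes efficiently): max-pooling only considers contiguous windows of features, while max-alignment considers every order-preserving assignment of features to weights.

```python
from itertools import combinations

def max_pool_score(similarity_rows):
    """CNN analogy: only contiguous windows of features are considered."""
    n, m = len(similarity_rows), len(similarity_rows[0])
    return max(
        sum(similarity_rows[i + j][j] for j in range(m))
        for i in range(n - m + 1)
    )

def max_alignment_score(similarity_rows):
    """DKM analogy: pick the best ordered subset of features to match the
    weights (brute force over alignments; NW computes this efficiently)."""
    n, m = len(similarity_rows), len(similarity_rows[0])
    return max(
        sum(similarity_rows[i][j] for j, i in enumerate(idx))
        for idx in combinations(range(n), m)  # every monotone assignment
    )

# similarity_rows[i][j] = score for matching feature i to weight j.
# Alignment can skip the middle feature; a contiguous window cannot.
sim = [[1, 0], [0, 0], [0, 1]]
```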


> Datasets of T-cell receptors cannot be arranged as numbers into rows and columns

Is this true? The title seems to imply that T-cell receptor data can't be represented with sets or manipulated with relational algebra. I'm a little skeptical.


EDIT: I think I see the confusion. This project is about statistical classification, which generally assumes the data is in rows and columns — here, the data is not. Of course the data can be represented as a set of sequences. The challenge is how to build statistical classifiers for that data, which is the scope of this project.

> The title seems to imply that T-cell receptor data can't be represented with sets

There are two datasets, one containing labelled sequences and the other containing labelled sets of sequences. The problem is that the number of features is irregular, resulting in some rows with more columns than others. Also, the information in each column does not line up. It's as if one column represents patient age, but because the number of features is irregular, in some rows that column suddenly holds the patient's weight instead.

> manipulated with relational algebra

Can you be more specific? I'm not sure what you mean


> EDIT: I think I see the confusion. This project is about statistical classification, which generally assumes the data is in rows and columns.

If I may come at this from the other direction, do you mean the rows and columns of a matrix? Coming from a relational background, I hear "data", "row" and "column" and I'm thinking of something that is essentially meant to be a set of relations.

> Can you be more specific? I'm not sure what you mean

If your data is structured in a normal form, you can project it back out in a number of different ways. This is not true of most data structures, which entangle the structure of the world with the pattern of data access. Relational algebra allows a database to distinguish between the model of the world and how data may be arranged and accessed.

The flipside of this is that relational algebra is performed over classical sets and, depending on one's views about sets, they can be used to represent anything that can be represented.


/Pedantic

That’s a common issue for many datasets. You can map that just fine in a relational DB either using Nulls on some columns or breaking that column data into a separate table. EX: (CustomerID: 50, ColumnID : 6852, Value : “Mike”)


Hmm... I think I see the confusion with my title, which I had to shorten to meet the space requirements. It should read "Statistical Classification When Data Do Not Conform to Rows and Columns: … ". The problem is that statistical classification generally requires the data to be in tidy rows and columns, which is not the case for the T-cell data.


Except this isn't how statistical methods on vectors and matrices work.

In an ML data set, the value "Mike" may actually be one-hot encoded to one of 500 columns (one column value is 1, everything else is 0) - because you have 500 different names in your dataset, so each VALUE gets its own column.

It's a very different problem/solution than NULLs in databases.


I was referring to jacques_chester’s comment. You’re bringing up a related and still well explored topic as this situation is very common. https://link.springer.com/article/10.1007%2Fs00180-013-0468-...



