*Why* can they not be arranged as numbers into rows and columns, and why does no...

jostmey · on Feb 25, 2020

Many approaches for dealing with symbols reduce the information to numbers. Consider regression on categorical variables. The categorical variables are usually converted into a vector of numbers using a one-hot encoding. Now, if you are classifying a sequence, the number of categorical variables will vary depending on the sequence length. Some sequences will be longer than others, resulting in an irregular number of features. Consequently, you will not have a fixed number of columns for your regression model.
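
To make the column problem concrete, here is a minimal sketch. It assumes a four-letter DNA alphabet and a hypothetical helper `one_hot`, neither of which comes from the paper; it only shows that one-hot encoding each position gives a feature count that grows with sequence length.

```python
ALPHABET = "ACGT"  # assumed alphabet, chosen only for illustration


def one_hot(sequence):
    """Flatten a sequence into one 0/1 feature per (position, symbol) pair."""
    features = []
    for symbol in sequence:
        # Each position contributes len(ALPHABET) indicator features.
        features.extend(1 if symbol == letter else 0 for letter in ALPHABET)
    return features


short_features = one_hot("ACG")
long_features = one_hot("ACGTT")

# Different sequence lengths give different numbers of "columns".
print(len(short_features), len(long_features))
```

The two feature vectors cannot be stacked into one design matrix without padding or truncating, which is the irregularity described above.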

Statistics and ML are vast, so there are lots of methods that cannot be pigeonholed like this. If you mention a specific method, we can contrast DKM to that method.

hobs · on Feb 25, 2020

So, I don't know much about these methods, but if you don't have a fixed number of columns, the entity-attribute-value approach is clunky but can definitely model an N+ set of "columns".
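
A rough sketch of what that might look like, assuming each sequence is an entity, each position is an attribute, and the symbol is the value (the function name `to_eav` and the sample IDs are made up for illustration):

```python
def to_eav(sequences):
    """Turn {entity_id: sequence} into a list of (entity, attribute, value) rows."""
    rows = []
    for entity_id, sequence in sequences.items():
        for position, symbol in enumerate(sequence):
            # The attribute is the position; the value is the symbol found there.
            rows.append((entity_id, position, symbol))
    return rows


rows = to_eav({"seq1": "ACG", "seq2": "ACGTT"})
for row in rows:
    print(row)
```

Every sequence fits in the same three-column table no matter how long it is, but the table only stores the data; a model still has to decide what to do with a variable number of rows per entity.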

jrumbut · on Feb 25, 2020

You're definitely right that there are a number of simple-ish ways to represent the data, but in a traditional regression model the 102nd attribute being the letter A could mean something totally different than the 103rd attribute being A.

So the authors faced the challenge of handling sequences of different lengths. In the past, one workaround that people have used with some success is taking some fixed-length summary (say, the proportions of As, Cs, Gs, and Ts). Then you can control the order and avoid having to consider a possibly intractable number of sub-sequence comparisons, but you also lose information that might be important.
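
As a minimal sketch of that kind of summary (the helper `base_proportions` is hypothetical and the ACGT alphabet is assumed), every sequence maps to four numbers regardless of length:

```python
from collections import Counter


def base_proportions(sequence, alphabet="ACGT"):
    """Return the fraction of each letter, always in the same fixed order."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    return [counts[letter] / len(sequence) for letter in alphabet]


a = base_proportions("AACG")
b = base_proportions("GCAA")  # same letters in a different order

print(a)
print(a == b)  # the summary cannot tell these two sequences apart
```

The second comparison shows the information loss: two sequences with the same composition but different ordering collapse to the same feature vector.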