Ask HN: APL family instead of R for data analysis?
62 points by wrp on June 25, 2015 | 54 comments
I've read a lot about the APL languages, but never used them because I felt they were too specialized. Recently, I've been roped into some projects using R. I fully understand why R is so popular in statistics departments, but programming in R has been the most migraine-inducing experience I've ever had with a language.

Since R is an array language, I thought maybe an APL family language would work as a replacement. It wouldn't have the multitude of libraries that R has, but would have a clean design. Note that I am not thinking of this as an alternative for the typical R user, who is a non-programmer and for whom R is their first and possibly last language. I'm thinking of a moderately experienced programmer who is faced with a data analysis task.

Some APL derivatives, like J and Dyalog APL, have had their capabilities extended well beyond simple array juggling. How are they, compared to R, as data analysis environments?




I'm having a blast playing with the free 32-bit version of Q/K/Kdb+ these days in the context of a medium-sized website that requires a good deal of data analysis.

As you probably know, Q/K are in the lineage of APL languages, with strong inherent vector-oriented capabilities, but Q uses an SQL-like dialect of words to replace APL's symbols.
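For a quick taste, a qsql query over a toy table looks roughly like this (table and column names are just made up for illustration):

    trade:([] sym:`ibm`msft`ibm`msft; price:10 20 30 40f; size:100 200 300 400)
    select avg price, sum size by sym from trade    / average price and total size per sym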

I've always found the idea of APL exciting, but it seemed to be a very isolated platform. "How do I use this on my site?"

Q speaks SQL natively (using s.k) so it's easy to work into my MySQL-based flow, though not exactly the same as MySQL's syntax.

It's got great built-in facilities for bulk data loading and storing, which saves you a lot of time with the boring "getting my data set up" step when banging out little helper scripts.

I've begun using Q as a cache in place of memcache/redis because I much prefer being able to ask flexible questions with a dynamic query language. Example: A MySQL query on a 1m row table that was often showing up in my slow query log at 1sec+ took less than 50ms in Q.
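To give a flavor of the kind of flexible question I mean, a hypothetical lookup (schema invented, not my actual tables) might look like:

    select count i by user from events where date=2015.06.25, action=`click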

Obviously Q and MySQL are apples to oranges, but it is still a handy adjunct when you need serious speed without full loss of flexibility.

And writing Q directly is a really interesting exercise. My PHP and Node code has gotten better as a result.


I have just started working in kdb+/q, and it's my first hands-on experience with any APL. It was a bit overwhelming at first to see a script written in such an alien-looking language, but now I am quite comfortable understanding what a script is doing in general, though I still frequently get stuck on the deeper parts. The q gods' inclination to shorten and obfuscate everything (even the C API) makes it harder to grasp things at first, but later you also find guilty pleasure in writing one-liners that couldn't be done without multiple lines of SQL. It's a very interesting language.
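For example, something like a per-group running total, which in plain SQL needs window functions or a self-join, is a one-liner in q (toy table and column names):

    update running:sums qty by sym from trade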


I have wondered about K/Q, because I gather that K is actually a simplification relative to APL, to make it more focused and easier to use. I thought it might be missing useful facilities for data munging. Like, I don't think it has regexes, does it?


It has a limited version of regular expressions available in the "like" and "ss" functions: http://code.kx.com/wiki/Reference/like
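Roughly, from memory:

    q)("report.csv";"notes.txt";"data.csv") like "*.csv"
    101b
    q)"hello world" ss "o"
    4 7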

There's also a port of re2 available: http://code.kx.com/wiki/Cookbook/regex

The C API seems pretty simple once you get over the insane function (and macro!) names, so I think a pcre port could be done.

In practice, though, I think the Q gods prefer a split-based, vector-oriented approach when it comes to string chomping. Q's built-in verbs and adverbs function much like regex operators do, but at an abstraction level tied fundamentally to your data.
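For instance, rather than a regex capture you'd often just split on a delimiter with vs (toy examples):

    "." vs "code.kx.com"    / ("code";"kx";"com")
    "-" vs "2015-06-25"     / ("2015";"06";"25")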


Your experience with Q is really a satisfying read! Should I abandon my tepid attempts to get into J and throw myself into Q, despite the hefty price tag for the 64-bit version? Would you be able to do as much with Kona instead of Q?


Glad to hear my useless observations benefited someone!

Getting into J: I like J a lot. I like how professional the software is, their excellent docs, and their amazing community on IRC. But for some darn reason, the syntax just doesn't make sense to me. I know that sounds silly when we're talking about Q or K which are pretty unreadable, but I kinda-sorta grok the latter a lot more easily (the difficult part is adverbs, etc). I get why they chose the ascii chars they chose; with J, I feel totally lost. "NB."?! Cmonnnn.. :)

Kona is based on an earlier version of K, the language that underlies Q, and though I am sure you could learn a ton from it, I need something that's been vetted a bit more because I intend to put it into production (crazy as that sounds), and Kona is still developing much of its HTTP and database support. I also wanted to learn the latest iteration, Q, which Kona hasn't implemented a dialect of just yet. Still a cool project that I follow actively on GitHub.

I'm struggling with the 32-bit vs 64-bit thing now. I know I couldn't ever justify the 64-bit version with my pathetic budget. The great thing about the 32-bit version is that you can easily spawn multiple copies of the interpreter, and IPC is incredibly fast (1 million sync messages to localhost in 52 sec on my Surface Pro 3, 607 msec async). So theoretically you could easily build "microservices" that each live in 4 GB and pass messages back and forth[1].

At least in theory. I'll be documenting my experiments soon if all goes as planned. Already seeing big benefits.

[1] That sounds hacky as hell coming from most platforms, but IPC in K/Q is baked in as a fundamental, powerful primitive, a bit like how Erlang's messages underlie the entire system. For instance, their Tick system (http://code.kx.com/wiki/Startingkdbplus/tick) used for insane volume is an ingenious way to feed out tons of data to tons of clients (firehose) and let them pick their own views of it.
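If anyone is curious, the IPC itself is only a few lines (port number arbitrary, and the payload here is just a stand-in):

    / start a listener elsewhere with:  q -p 5000
    h:hopen `::5000      / open a handle to localhost:5000
    h"2+3"               / sync call, returns 5
    neg[h]"0N!`ping"     / async: negate the handle to fire-and-forget
    hclose h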


Had a kind of opposite experience with K.

J seems quite reasonable, intuitively, after reading the very good "J for C Programmers" (http://jsoftware.com/help/jforc/contents.htm). And the symbols J chose for most operations also seem good, at least after a little use. "NB." immediately reads as "nota bene" (I guess knowing some Latin doesn't hurt; it's "note well", though I always read it as "note that"). Don't some other languages use this abbreviation for comments as well?

What is IMO particularly good about K is the small size of the interpreter. J is "for all of math", while K is "for the practical 99% of cases", or at least it feels that way. A small interpreter that fits in the CPU cache can potentially run faster than hand-written assembly, because K code is packed so densely that less time is spent waiting on slow memory.


I'm in.

In a Previous Age, there were "rumors that there would be interest in opening up the language if there would be enough community response around it"; I wonder if they amounted to anything [1]. Interest in k only seems to be growing.

[1] http://www.kuro5hin.org/story/2002/11/14/22741/791


The 32-bit version of k4/q from Kx (the original K people) is now free from http://kx.com

There's a GPL implementation of K3 called Kona at https://github.com/kevinlawler/kona ; Kevin Lawler also has a language called Kerf (not open source) that is based on the same principles but with a different syntax (JSON+SQL+more).

John Earnest has a Javascript implementation of K5 called oK at https://github.com/JohnEarnest/oK

Andrey Zholos has a JITing implementation of a K-like language at https://github.com/zholos/kuc


R is an... _interesting_ language. Its main power really comes from the comprehensive libraries that are available. I think there is an interesting option if other languages can figure out how to make the R libraries easily accessible in their environments, which is not that easy, because even the libraries can use the _interesting_ R design features :-).

That said, in F# (which is a functional-first language with some solid programming language design background), one can call R functions in a fairly nice way using a type provider: http://bluemountaincapital.github.io/FSharpRProvider/.

This probably cannot replace R for typical statisticians, but it is a nice option for programmers...


Exactly! There are many good languages for data analysis (Julia is one example) but no other language (not even Python, which is BTW leaps and bounds above all the other alternatives) can match the library support and statistical maturity that R has.

And R is not so bad as a language. Its OO is really bad, but it has functions as first-class members and great support for NAs. Combined with the Hadleyverse and the pipe operator (magrittr), it makes for very readable code.


I think the value of all the libraries available for R is overrated (even ignoring the buggy ones). Most of the time, you don't need that kind of choice.

I just finished reading _Data Analysis_ by Peter Huber. One chapter is about programming languages for statistics. In the language that he developed (ISP), he included only a small core of statistical facilities because in his experience that's all people really need. David Hand has also written observations along the same line.


It really depends on what you're doing. If you're a developer tapping into data analysis, sure. But I'm building predictive models for a living and I evaluated quite a few different frameworks and languages.

Sophisticated algorithms for missing data, sparse solutions, latent modelling: you can find many industrial-grade algorithms in Python, Spark, Julia, Weka, etc., but for stats-heavy data science/machine learning R is unmatched. Short of writing your own implementations from pseudocode, which is really not an effective use of your time for prototyping.

EDIT: And even for simple stuff, the strength of R packages is easy to prove since they are either directly ported to other languages (ggplot) or heavily influence the implementation (pandas).


With PyMC3, you can code a Bayesian generalization of most of those packages pretty easily. For Bayesian models, it is easier than calling into C++ with Stan in R.

For everything else, one can call arbitrary R packages with Rpy2, albeit with clunkier syntax.

For agent-based modeling, there is no comparison. Python, with its better and faster class system plus Numba, is in another league.

Finally, Python has distributed and out-of-core lists, arrays, and dataframes: http://dask.pydata.org/en/latest/


I think it's the 80-20 rule at play. You can solve 80% of a problem with the same 20% of the options. But the last 20% is different and you need a very large class of packages to cover the long-tail of needs.


This is nonsense. The majority of people might not need sem or mlogit, but the majority of the budget in market research most definitely does need them.


I don't think R's OO is great, but it's mostly misunderstood because it's fundamentally different from most popular modern languages. Interestingly, Julia uses exactly the same model as R (i.e., generic functions, not message passing).


My main criticism is around two things.

One, "clunkiness" of class creation (which I guess is subjective); I don't think it's very readable, but maybe that's bias from other languages.

Second, the S3/S4/R5 mess.


I think it's useful to make a big qualification to the question. Of course, if you need to use a library written in R, then you use R. Same with Java or anything else. So the question is assuming your choice isn't governed by need for a particular library.


I am sure I have seen Python libraries that are wrappers around R. As someone who's never used R, I would try that first.


Are you using Hadley Wickham's libraries in R? I think those go a long way toward fixing the problems you may have encountered -- i.e. API consistency, usability, orthogonality, naming, etc.

I also suggest reading some of his papers like "Tidy Data".

R is weird in that you NEED a large set of CRAN libraries to do work, whereas in Python you can do a fair amount with the standard library, and PyPI is relatively weak in comparison.

I'm also a programmer who learned R for data analysis. It does take quite a while and some pain. Part of the difficulty is that R is weird and has warts, but an even bigger difficulty is that you are learning new concepts and programming abstractions (fundamental difficulty vs accidental difficulty). R passes the test of a language that changes the way you think.

R has its warts but is likely the most practical choice. It has the full set of things you need -- data preparation and cleaning, exploratory analysis, model building, and visualization.


May I venture to add that R-based applications may be more acceptable to clients, as they can be sure that there is a large community that can at least understand how the application links into the environment, even if client employees might not fully understand the logic of the algorithm.

Disclaimer: I'm a civilian, not a programmer


I use both. R is slow as a dog, Q/KDB+ is (almost) as quick as C. R is ugly, Q/K is beautiful and elegant. R has a wealth of open-source packages, Q not so much.

Best solution is to do most of your work in Q and call into R when you want to use a package. The R integration works well, see http://code.kx.com/wiki/Cookbook/IntegratingWithR.

The reality is that if you are using (and paying for) Q/K, you are likely doing so because you're dealing with billion+ point datasets, at least. R just melts and falls apart at that kind of size (without using native code, which isn't really R then, is it?). So I often end up just translating R functions into their K equivalents, and using R only for the nice-to-have features like latex/brew/ggplot/etc.
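A lot of the basic stats I used to reach for R for turn out to be built-in one-liners in q anyway, e.g. (x and y being numeric vectors):

    avg x      / mean
    med x      / median
    dev x      / standard deviation
    x cor y    / correlation
    5 mavg x   / 5-period moving average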


Good comments. The way I have viewed the role of APL languages is that you use a general-purpose language to do everything up to the point where you have a clean data array, then send it to APL for array juggling, then take the result back into the GP language for all follow-up tasks. Does this sound like the workflow for Q/K and other APL language users? My reading of IBM's APL2 literature is that they intended this workflow with Tcl as the GP language.

The two big things I'm trying to get here are a comparison of how APL-family compares to R in that role and in reaching beyond that role to handle the stuff before and after.


You could just blow everyone's mind and sneak APL into R:

http://statistics.ucla.edu/system/resources/BAhbBlsHOgZmSSIp...


Interesting. He implemented APL array functions in R. Since there is some mismatch in how APL and R handle things, I wonder if it doesn't make coding even hairier.


Given a choice between 'clean design' (which is certainly not a truth universally acknowledged) and 'getting stuff done with libraries that somebody else did the hard work on', I'll choose R any day of the week.


J has a pretty good, if maybe small, community on the Jsoftware forums. J libraries are few but ever-growing. Are there particular examples of R libraries that are sorely missing in J?

I should admit I had been asking myself the same question as this topic.


J is also amazingly active on Rosetta Code. Right now it's in 4th place with 789 implemented challenges. Only Tcl, Racket, and Python are ahead.


"Almost as popular as tcl and racket" is one hell of a selling point.


Here it means "more popular than Lisp, Java, C++..." - so context matters.


The existence of an index in which j is #4 behind tcl and racket should not lead one to think 'wow, j is really catching on' but rather 'wow, that's an unusual index.'


I don't think you should draw conclusions about selling points from Rosetta Code :). Rather, you can use RC as a repository of J programs, one which happens to have many examples compared with the repositories for other languages on RC.

If you're familiar with RC, you know that it encourages participation in unusual languages, as it's generally hard to find lots of example code for them. RC is a code chrestomathy site, but for Java and the like it's easy to find many good examples elsewhere. So rare languages comparatively shine on RC.


You might give Kerf a try: https://github.com/kevinlawler/kerf


I was going to suggest that Kona would be a better fit, but you would probably know which is best.

Thanks also for adding a linux binary today, looking forward to playing with it. Can't say I'm as excited about the 1 month timer though.


If you want an unrestricted version send me an email at k.concerns@gmail.com (this goes for anyone here)


The advantage of R is its robust community. The array-processing-language family [APL, J, K] has a radically smaller community.


It really doesn't ease the pain of dealing with an awkward language. I think the general lack of technical sophistication in the R community just adds to the aggravation. So as far as community experience goes, I might actually rate the APL-family higher.


As a J fanboi, I can't say I don't sympathize, and J has perhaps the best documentation of any language I've come across. Yet, R is probably a better choice for general collaboration simply due to its ubiquity.


Oh yeah, I wouldn't even consider APL for collaborating with non-programmers. I'm thinking of it just as a personal tool.

With R, you can do data preparation, but I find it easier to do it with Perl or awk and then import the cleaned data into R. Is the standard workflow with J pretty much the same? I thought that with PCRE and other libs available in J, it might be suitable for more than just juggling cleaned data.


There are some other alternatives that I suspect you'll be happier with if your main issue is the R language. Python + pandas or Julia for instance. Smaller communities around those but less aggravating languages for sure.


Python is definitely an alternative that seems to be taking the hard sciences by storm. I just didn't mention it because I don't need any help in evaluating that option.


I did a lot of work with device models and SPC, while working in VLSI Design. We had SAS and APL2 (with a companion stats package called GraphStat). The APL2/GraphStat combo was awesome, and I always found myself going to APL2 when I really needed to do some analysis. But, APL2 withered on the vine long ago.

After some recent forays into Clojure, Haskell, and R, I find myself getting re-interested in languages in the APL family.

I had read about Q, K and Kdb+ a while ago. Guess I'm going to have to kick the tires.


Sadly Q and K are both proprietary.


Really, you're going to call the typical R user a non-programmer? R has become so popular because there are many phenomenal programmers working in R.


I use Julia (http://julialang.org/) and have found that it brings the best of multiple worlds into one neat platform. There are the native Julia functions, but there are interfaces to C, Python, and R. Plus, programming in the language is a pleasure.


You haven't found it to be a little rough around the edges?


For huge datasets, Python has distributed and out-of-core data structures: http://dask.pydata.org/en/latest/ https://github.com/ContinuumIO/blaze

This is pretty unique, and works better than Spark for out-of-core work on one machine... (and is easier to set up).

For stats not in the statsmodels and scikit-learn packages, you can easily whip up a Bayesian generalization in PyMC3.

Then if there is another R package you need, you can use Rpy2 to call it.

Not sure if this would be relevant to your use case.


There are probably plenty of good options, but have you considered F#? It's got some good support for this type of work, including an R Type Provider (and I'm not talking about something that swallows quarters!). Here's a useful summary of some of the key features F# has access to for data science work: http://fsharp.org/guides/data-science/index.html


I am slightly familiar with SML, OCaml, and F#. I like F# and might use it sometimes if I were interested in developing on .NET/Mono. However, while F# is good for numerical analysis, it is not so good for the kind of interactive, exploratory work that Matlab and R are designed for.


Just in case: if someone is really interested in an APL expert (Dyalog APL) role and/or a remote job, please contact me.


How? There is no contact information in your profile.


Hm. I filled in the email field and duplicated it in the description.




