tlmr · on Dec 12, 2014

What about just a smallish, single-machine corpus of documents (1000)?

nl · on Dec 12, 2014

The size of the corpus isn't the issue (apart from processing time of course).

The key factor in estimating how big a job it is, is how complex your entity extraction and inference rulesets are.

tlmr · on Nov 30, 2014

Check out Continuum's blaze project for python. It provides table and array abstractions for a variety of backends, including SQL solutions.

tlmr · on Nov 21, 2014

Of modules or entire programs?

one-more-minute · on Nov 21, 2014

I don't know exactly the current status or roadmap (if there is one), but the idea is to have both eventually. Packages and files will be cached automatically so that they load much more quickly, and it will be possible (eventually) to output standalone executables.

tlmr · on Nov 21, 2014

How would you compare this to using python with an easy JIT like numba?

curiouslearn · on Nov 21, 2014

Whenever I have tried to use Numba, I have found it slower than standard python. OTOH, Julia just works and gives me the same performance as C.

Fede_V · on Nov 21, 2014

Actually, both in terms of speed and in terms of writing code, they are remarkably similar approaches.

numba forces you into certain ways of writing code to achieve significant speed ups, but usually that's not too restrictive.
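As a rough illustration of that style (a sketch, not code from this thread): numba tends to do best on plain functions with explicit loops over numeric values. The function name `sum_of_squares` is made up for the example, and the fallback decorator is an assumption added only so the snippet still runs, without any speed-up, when numba isn't installed.

```python
try:
    # Real numba decorator: compiles the function in nopython mode.
    from numba import njit
except ImportError:
    # Assumption for this sketch only: a no-op stand-in so the code
    # runs as ordinary python when numba is not available.
    def njit(func):
        return func


@njit
def sum_of_squares(n):
    # Explicit loop over scalars, with one consistent numeric type for
    # the accumulator: the kind of code numba can compile well.
    total = 0.0
    for i in range(n):
        total += i * i
    return total


result = sum_of_squares(5)
```

The first call includes compilation time when numba is present, so any timing comparison should be done on later calls.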

If you are used to writing numerical python code, you'll feel right at home in Julia - especially with the amazing PyCall module.

msravi · on Nov 21, 2014

I haven't used numba or benchmarked it, so I'm not really qualified to comment. I do however vastly prefer the MATLAB-like syntax of julia to that of python.

tlmr · on Nov 21, 2014

Is this going to hamper it from being the next python for both general and data science programming? Do you think I should invest in learning julia, or continue with python?

johnmyleswhite · on Nov 21, 2014

I often say this, but I think if you have doubts about whether you should learn Julia or Python, then you should certainly learn Python.

tlmr · on Nov 21, 2014

What do you think of the CMU course vs this text book in question?

capnrefsmmat · on Nov 21, 2014

The CMU course is very traditional. It covers basic exploratory data analysis (summary statistics, plotting data), basic probability, and hypothesis testing and estimation. There's no programming, nothing Bayesian, and only brief discussion of regression.

(I taught 36-201, the intro stats course that was used to build the OLI course, this summer.)

Statistical Inference, on the other hand, seems to take a Bayesian perspective and is very much not your standard intro stats class. It looks interesting and I'll have to skim through some of it.

tlmr · on Nov 10, 2014

What about Julia?

FraaJad · on Nov 11, 2014

Yeah, what about Julia?

which makes as much sense as asking

- What about R?
- What about GNU DAP?
- What about FORTRAN?

for writing a TCP server.

tlmr · on Nov 9, 2014

How does the upcoming and even current numba+blaze power duo not allow library designers to easily write fast code?

vegabook · on Nov 9, 2014

Well that was the point of my original message. I am using numbapro and easily getting to c-speed with R-like vectorized convenience right now. It's why I question the "speed" argument as a non-argument when compared with Python. And I haven't even started using the cuda approach... You allude to another point though: Python is not standing still. Python and its environment is a mighty high mountain for Julia to climb if it's not going to move the game forward significantly so that its big ecosystem disadvantage is compensated. Julia cannot just do incremental improvement - it doesn't have enough momentum to make that a winning strategy. It needs to leapfrog to take on Python.

I should add one more point though. The post mentions Matlab 15 times and Python only 7. It's possible this whole Julia effort will be successful with the Matlab crowd which, up to now, has been watching with horrified fascination from the sidelines as open source ate its lunch.

tlmr · on Nov 10, 2014

Nice. Do you find yourself having to contort code to work with numbapro?

tlmr · on Nov 9, 2014

You mean like blaze in python?: http://blaze.pydata.org/docs/v_0_6_5/index.html

infinite8s · on Nov 9, 2014

Blaze has a bit of a broader focus than what I was talking about, since blaze mostly offloads the actual computation to a particular backend. But a combination of blaze and a custom lowering of the computation into machine code using numba would be similar (although without the type safety for guaranteeing that certain optimizations are possible)

tlmr · on Nov 10, 2014

I believe that the numba compilation is planned for the coming year.

tlmr · on Oct 20, 2014

I don't think so. Have you checked out seaborn, blaze and bokeh?

tdaltonc · on Oct 20, 2014

Not OP, but I've also been looking for a way to leave ggplot2 behind and make the jump to 100% python for data analysis. These look neat.

Seaborn - http://web.stanford.edu/~mwaskom/software/seaborn/

Blaze - http://blaze.pydata.org/docs/v_0_6_5/index.html

Bokeh - http://bokeh.pydata.org/

hadley · on Oct 20, 2014

There's also the confusingly named ggplot (for python): https://github.com/yhat/ggplot

tdaltonc · on Oct 20, 2014

I've used that library, and I really don't like it at all. They tried to bring the R syntax to python, which ends up looking awful and misses the point of the Graphical Grammar. In the same way that every language has its own way of expressing control flow, every language should have its own way to express the Graphical Grammar. We don't need R's GGplot2 in python, we need a pythonic way to express the Graphical Grammar.

If I had stronger python-fu I would love to build "GGPy".

tlmr · on Oct 20, 2014

Isn't GGPy seaborn?

tdaltonc · on Oct 20, 2014

Maybe. The plots look a lot like GGplot2 plots, and the syntax looks like python, but I haven't dug into it to see if it builds plots using the Graphical Grammar under the hood.

elliott34 · on Oct 20, 2014

We've shifted to plot.ly for visualization and moved away from expensive BI tools. We set up python/pandas scripts on cron to output real-time data to our local plot.ly web server, which makes a local copy of the data and updates the chart. You simply embed the chart as an iframe wherever you want internally, and BOOM, you've got a real-time chart (beautiful, I may add).

www.plot.ly

jackgolding · on Oct 21, 2014

$60 per user per month is a bit steep...

elliott34 · on Oct 21, 2014

Considering enterprise BI software/platforms these days can cost upwards of 4k per month, this is roundoff error.

jackgolding · on Oct 22, 2014

I agree, it just doesn't fit the projects I wish to use it for.