NumPy Exercises for Data Analysis in Python

amch · on Feb 27, 2018

Does anyone know of any similar resources for Pandas?

I've found the following to be quite helpful but would love to know if anyone knows of other resources in a similar vein: https://pandas.pydata.org/pandas-docs/stable/cookbook.html

00ajcr · on Feb 27, 2018

I started writing a '100-pandas-puzzles' set of exercises here: https://github.com/ajcr/100-pandas-puzzles

There's also pandas_exercises by Guilherme Samora (https://github.com/guipsamora/pandas_exercises) which is very good - it's split across multiple notebooks and is more extensive than my repo.

selva86 · on Feb 27, 2018

Nice stuff!

why5s · on Feb 27, 2018

Not a list of exercises, but Julia Evans's pandas cookbook has been incredibly helpful for me.

jofer · on Feb 27, 2018

This is very similar in spirit to https://github.com/rougier/numpy-100/blob/master/100%20Numpy.... In fact, now that I look at it a bit more, it seems like all of this post's examples are reworded versions of Nicolas Rougier's "numpy 100"...

amelius · on Feb 27, 2018

There's also Rosalind for bioinformatics problems to be solved in Python.

http://rosalind.info/problems/locations/

res0nat0r · on Feb 27, 2018

This looks interesting. Thanks for the link.

selva86 · on Feb 27, 2018

Look a little bit more, please.

jofer · on Feb 27, 2018

You're right. There's a fair number of original examples there as well. Still, there's a lot of overlap.

selva86 · on Feb 27, 2018

Well, that's bound to happen when framing questions on the same topic.

00ajcr · on Feb 27, 2018

There are some nice exercises here, good work.

For question 48 it might be simpler to just write

  np.sort(a)[-5:]

instead of using argsort() and then using fancy indexing. Better yet, use

  np.partition(a, kth=-5)[-5:]

which scales linearly with the size of the array.
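A minimal sketch comparing the three approaches, assuming a 1-D integer array (the data below is made up for illustration). Note that np.partition only guarantees that the five largest values land in the last five slots, not that they are in order there.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; illustrative data only
a = rng.integers(0, 1000, size=50)

# argsort followed by fancy indexing (the approach being replaced).
top5_argsort = a[np.argsort(a)[-5:]]

# Full sort, then slice off the last five values.
top5_sorted = np.sort(a)[-5:]

# Partial partition: the five largest values end up in the last five
# positions, in no guaranteed order among themselves.
top5_part = np.partition(a, kth=-5)[-5:]

# Sort the partition result only if ordered output is needed.
print(top5_sorted, np.sort(top5_part))
```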

Also, the one-hot encoding puzzle (51) would be more efficiently solved using

  (arr[:, None] == np.unique(arr)).view(np.int8)

In general, `for` loops over NumPy arrays should be avoided where at all possible.
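The one-hot expression above can be sketched as follows, assuming a small 1-D array of integer labels (the values are illustrative, not taken from the exercise). The comparison broadcasts a column of labels against a row of unique values, so no Python loop is needed.

```python
import numpy as np

arr = np.array([2, 3, 2, 2, 2, 1])  # illustrative labels

# arr[:, None] has shape (6, 1); np.unique(arr) has shape (3,).
# Broadcasting the comparison gives a boolean matrix of shape (6, 3).
mask = arr[:, None] == np.unique(arr)

# bool and int8 are both one byte wide, so view() reinterprets the
# same memory as 0/1 integers without making a copy.
one_hot = mask.view(np.int8)

print(one_hot)
```

The columns follow the sorted order of the unique values, here 1, 2, 3.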

selva86 · on Feb 27, 2018

Thanks for the suggestion, I will factor those in.

glup · on Feb 27, 2018

One thing that is holding me back in numpy is not knowing the runtime complexity of operations—of course I can profile code, but I should have better awareness when writing code in the first place. Without an algorithms background, I don't have strong intuitions on the runtime complexity of the primitives (np.unique). Any suggestions?
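One rough way to build that intuition is to time a primitive at a few input sizes and watch how the time grows. The sketch below does this for np.unique; the helper name time_unique and the chosen sizes are hypothetical, and the numbers will vary by machine and NumPy version, so treat it as a probe rather than a proof of complexity.

```python
import timeit

import numpy as np


def time_unique(sizes, repeats=3, number=5):
    """Return the best per-call time of np.unique for each input size."""
    rng = np.random.default_rng(0)
    results = {}
    for n in sizes:
        data = rng.integers(0, n, size=n)
        # Take the minimum over several repeats to reduce timing noise.
        best = min(timeit.repeat(lambda: np.unique(data), number=number, repeat=repeats))
        results[n] = best / number
    return results


timings = time_unique([10_000, 20_000, 40_000])
for n, t in timings.items():
    print(f"n={n:>6}: {t * 1e6:8.1f} us per call")
```

If doubling n roughly doubles the time, the operation is behaving close to linearly over that range; noticeably more than double suggests something like n log n or worse.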

stabbles · on Feb 27, 2018

Switch to Julia! Hit @edit unique([1,2,3,2]) in the REPL and you see the implementation.

dman · on Feb 27, 2018

Nice! I've been meaning to try out Julia for a while now. Is the numpy equivalent in Julia largely written in Julia itself? (as opposed to C/Fortran)

mindB · on Feb 28, 2018

Julia's numpy equivalent is basically the standard Array type from the standard library, which I'm 99% sure is native Julia.

thebooktocome · on Feb 28, 2018

If one is working on small (<= 15 by 15) matrices, the StaticArrays module [1] is also native Julia and is much faster than Base.Array. Since a StaticArray's size is known after type inference, it is allocated on the stack, which is nice.

One downside is that unless you're doing BLAS-style operations, writing non-trivial transformations of StaticArrays always seems to require generated functions.

Anyway, I think this is a feature that numpy doesn't provide.

[1] https://github.com/JuliaArrays/StaticArrays.jl

meken · on Feb 28, 2018

You can do the same thing in IPython/jupyter with ?? e.g.

np.unique??

Sean1708 · on Feb 28, 2018

That only works for functions written in pure Python though, right? Although having said that I'm not sure how many of the functions that you'd actually want to look at the source for are written in C/Cython/Fortran.

enedil · on Feb 27, 2018

What other library tells you about complexity? And as you say, if you don't know algorithms well, I'm pretty sure your implementations won't have better complexity.

hermitdev · on Feb 28, 2018

C++'s standard library containers & algorithms have strict algorithmic complexity requirements & guarantees.

For example from std::vector::insert [1]:

  Complexity
  1-2) Constant plus linear in the distance between pos and end of the container.
  3) Linear in count plus linear in the distance between pos and end of the container.
  4) Linear in std::distance(first, last) plus linear in the distance between pos and end of the container.
  5) Linear in ilist.size() plus linear in the distance between pos and end of the container.

[1] http://en.cppreference.com/w/cpp/container/vector/insert

edit: formatting

cjbillington · on Feb 28, 2018

Well, often there are multiple ways of using numpy operations to do what you want, so it's good to have an idea of what numpy is doing under the hood so you can use the right functionality for the job at hand.

For example, np.einsum, for all its greatness, has historically not been faster than np.tensordot, though it is more flexible. One can tell einsum to try to use the same underlying BLAS functions that tensordot uses (which can parallelise the computation) where applicable, and this optimisation will likely become the default for einsum once the devs iron out some bugs. But for now, it pays to know how the two methods differ.
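A small sketch of the two routes for the same contraction, assuming a plain matrix product as the example (shapes and data are illustrative). The optimize=True argument is the einsum option that asks it to optimise the contraction; whether it is faster depends on the NumPy version and the problem size.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((30, 40))
b = rng.standard_normal((40, 50))

# Contract axis 1 of a with axis 0 of b using tensordot.
via_tensordot = np.tensordot(a, b, axes=([1], [0]))

# The same contraction in einsum's index notation.
via_einsum = np.einsum("ij,jk->ik", a, b)

# Ask einsum to optimise the contraction before evaluating it.
via_einsum_opt = np.einsum("ij,jk->ik", a, b, optimize=True)

print(via_tensordot.shape, via_einsum.shape, via_einsum_opt.shape)
```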

bllguo · on Feb 27, 2018

what an odd hangup. how does this have anything to do with numpy specifically?

the suggestion that jumps out is to just learn about algorithms.

ngould · on Feb 28, 2018

Not odd at all, actually. Numpy might implement certain functions differently from other libraries. With a background in algorithms, you could make an educated guess as to complexity, but without knowing the exact implementation it's still a guess.

dotancohen · on Feb 27, 2018

After a quick peruse, about half the exercises included new material for me. Anybody learning NumPy would do well to review this. Bookmarked!

loganzk · on Feb 27, 2018

Working through them and noticed a few small things.

For #3, you can make a boolean array with np.ones/np.zeros with the same dtype arg, saves a little bit of space.

ie np.ones((3,3), dtype=bool)

For #14, you can make use of the same compound boolean statements as you can in pandas to make it a bit simpler.

ie a[(a > 5) & (a < 10)]

For #15, this is a built in numpy function.

np.maximum(a,b).

That's as far as I've made it, but I'm really enjoying them.
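The three suggestions above, put together as a runnable sketch. The input arrays are made up for illustration and are not necessarily the ones used in the exercises.

```python
import numpy as np

# #3: a 3x3 array of True values, created directly as booleans.
all_true = np.ones((3, 3), dtype=bool)

# #14: keep values strictly between 5 and 10. The parentheses are
# required because & binds more tightly than the comparisons.
a = np.arange(15)
between = a[(a > 5) & (a < 10)]

# #15: elementwise maximum of two arrays.
x = np.array([5, 7, 9, 8, 6, 4, 5])
y = np.array([6, 3, 4, 8, 9, 7, 1])
pair_max = np.maximum(x, y)

print(all_true.dtype, between, pair_max)
```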

selva86 · on Feb 28, 2018

Thanks for the No.14 man!

However, for No. 15, that is not the point of the exercise.

shrx · on Feb 28, 2018

> 13. How to get the positions where elements of two arrays match?

  > Desired Output:
  > #> (array([1, 3, 5, 7]),)

Why is (array([1, 3, 5, 7]),) the desired output, and not array([1, 3, 5, 7]) ?
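Assuming the exercise's answer calls np.where with a single condition, the tuple comes from np.where itself: in that form it returns one index array per dimension, so a 1-D input yields a 1-tuple. A sketch with illustrative arrays chosen to reproduce the positions above:

```python
import numpy as np

a = np.array([1, 2, 3, 2, 3, 4, 3, 4, 5, 6])
b = np.array([7, 2, 10, 2, 7, 4, 9, 4, 9, 8])

# One index array per dimension, wrapped in a tuple.
result = np.where(a == b)

# Two ways to get a bare array for the 1-D case.
positions = result[0]
flat = np.flatnonzero(a == b)

# With a 2-D input the tuple form is what makes unpacking work.
m = np.array([[1, 0], [0, 1]])
rows, cols = np.where(m == 1)

print(result, positions, rows, cols)
```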

longqzh · on Feb 28, 2018

For #15, if the number of elements is large, it will be slower than expected, since the maxx function is written in pure Python. But in my experience, it is still much faster than a for loop in pure Python.
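A sketch of what is going on, assuming the exercise wraps a scalar maxx function with np.vectorize (the function body and arrays here are illustrative). np.vectorize is a convenience wrapper: it still calls the Python function once per element, so it gives the same result as an explicit loop, while np.maximum does the work in compiled code.

```python
import numpy as np


def maxx(x, y):
    """Scalar function: return the larger of two numbers."""
    return x if x >= y else y


# Wrap the scalar function so it accepts arrays. Under the hood it
# still invokes maxx once per pair of elements.
pair_max = np.vectorize(maxx, otypes=[float])

a = np.array([5, 7, 9, 8, 6, 4, 5])
b = np.array([6, 3, 4, 8, 9, 7, 1])

via_vectorize = pair_max(a, b)
via_loop = np.array([maxx(x, y) for x, y in zip(a, b)], dtype=float)
via_ufunc = np.maximum(a, b).astype(float)

print(via_vectorize)
```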

blt · on Feb 28, 2018

oh wow I wish I knew about r_ and c_ a few months ago! I'm still annoyed with numpy for being more clunky than Matlab for linear algebra, but resources like this are good for verifying that I'm doing stuff in a numpy-ic way. Thanks!

(Also numpy has some really nice features over Matlab, like [None,:] broadcasting and being able to index a parenthesized expression or function output without naming it. Ok, the latter is not really a feature, more of an example of how Matlab is broken as a language)
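For anyone who has not met these yet, a short sketch of r_, c_, None-indexing for broadcasting, and indexing an unnamed expression (the arrays are illustrative):

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# r_ concatenates along the first axis; c_ stacks 1-D inputs as columns.
stacked = np.r_[a, b]   # shape (6,)
columns = np.c_[a, b]   # shape (3, 2)

# Indexing with None inserts a new axis, which sets up broadcasting.
row = a[None, :]        # shape (1, 3)
col = b[:, None]        # shape (3, 1)
outer = col * row       # shape (3, 3), outer[i, j] == b[i] * a[j]

# Index the result of an expression directly, without naming it first.
first_of_sum = (a + b)[0]

print(stacked, columns.shape, outer, first_of_sum)
```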

cosmosa · on Feb 27, 2018

Looks good, I will definitely use this as a reference!

boruto · on Feb 28, 2018

Are there similar exercises for Java-8 streams?

budadre75 · on Feb 27, 2018

Does anyone know of anything similar for matplotlib?