NumPy Exercises for Data Analysis in Python (machinelearningplus.com)
466 points by selva86 on Feb 27, 2018 | 33 comments



Does anyone know of any similar resources for Pandas?

I've found the following to be quite helpful but would love to know if anyone knows of other resources in a similar vein: https://pandas.pydata.org/pandas-docs/stable/cookbook.html


I started writing a '100-pandas-puzzles' set of exercises here: https://github.com/ajcr/100-pandas-puzzles

There's also pandas_exercises by Guilherme Samora (https://github.com/guipsamora/pandas_exercises) which is very good - it's split across multiple notebooks and is more extensive than my repo.


Nice stuff!


Not a list of exercises, but Julia Evans's pandas cookbook has been incredibly helpful for me.


This is very similar in spirit to https://github.com/rougier/numpy-100/blob/master/100%20Numpy.... In fact, now that I look at it a bit more, it seems like all of this post's examples are reworded versions of Nicolas Rougier's "numpy 100"...


There's also Rosalind for bioinformatics problems to be solved in Python.

http://rosalind.info/problems/locations/


This looks interesting. Thanks for the link.


Look a little bit more, please.


You're right. There's a fair number of original examples there as well. Still, there's a lot of overlap.


Well, that's bound to happen when framing questions on the same topic.


There are some nice exercises here, good work.

For question 48 it might be simpler to just write

  np.sort(a)[-5:]
instead of using argsort() and then using fancy indexing. Better yet, use

  np.partition(a, kth=-5)[-5:]
which scales linearly with the size of the array.

Also, the one-hot encoding puzzle (51) would be more efficiently solved using

  (arr[:, None] == np.unique(arr)).view(np.int8)
In general, `for` loops over NumPy arrays should be avoided where at all possible.
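
A minimal sketch of the three suggestions above. The arrays `a` and `arr` are made up here for illustration and are not taken from the exercises:

```python
import numpy as np

# Hypothetical inputs, chosen only for illustration.
a = np.array([3, 9, 1, 7, 5, 8, 2, 6, 4, 0])
arr = np.array([2, 3, 2, 2, 2, 1])

# Five largest values via a full sort: O(n log n).
top5_sorted = np.sort(a)[-5:]

# Five largest values via partial selection: linear time, but the
# last five entries are not guaranteed to come back sorted.
top5_partition = np.partition(a, kth=-5)[-5:]

# One-hot encoding without a Python loop: compare every element against
# each unique value, then reinterpret the boolean result as int8.
one_hot = (arr[:, None] == np.unique(arr)).view(np.int8)
```

Note that the partition version returns the right five values in arbitrary order, so sort that small slice afterwards if the order matters.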


Thanks for the suggestion, I will factor those in.


One thing that is holding me back in numpy is not knowing the runtime complexity of operations. Of course I can profile code, but I should have better awareness when writing code in the first place. Without an algorithms background, I don't have strong intuitions on the runtime complexity of the primitives (e.g. np.unique). Any suggestions?


Switch to Julia! Hit @edit unique([1,2,3,2]) in the REPL and you see the implementation.


Nice! I've been meaning to try out Julia for a while now. Is the numpy equivalent in Julia largely written in Julia itself? (as opposed to C/Fortran)


Julia's numpy equivalent is basically the standard Array type from the standard library, which I'm 99% sure is native Julia.


If one is working on small (<= 15 by 15) matrices, the StaticArrays module [1] is also native Julia and is much faster than Base.Array. Since a StaticArray's size is known after type inference, it can be allocated on the stack, which is nice.

One downside is that unless you're doing BLAS-style operations, writing non-trivial transformations of StaticArrays always seems to require generated functions.

Anyway, I think this is a feature that numpy doesn't provide.

[1] https://github.com/JuliaArrays/StaticArrays.jl


You can do the same thing in IPython/jupyter with ?? e.g.

np.unique??


That only works for functions written in pure Python though, right? Although having said that I'm not sure how many of the functions that you'd actually want to look at the source for are written in C/Cython/Fortran.
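
Outside IPython, the standard-library `inspect` module offers a similar lookup. A rough sketch, where `try_source` is a hypothetical helper name and not part of NumPy or IPython:

```python
import inspect
import numpy as np

def try_source(obj):
    """Return the Python source of obj, or None if it cannot be retrieved."""
    try:
        return inspect.getsource(obj)
    except (TypeError, OSError):
        # Raised for objects with no Python source, e.g. compiled code.
        return None

# np.add is a ufunc implemented in C, so no Python source is available.
print(try_source(np.add) is None)

# Whether a given NumPy function such as np.unique yields source depends
# on how that NumPy version wraps it, so no result is assumed here.
src = try_source(np.unique)
print("source available" if src else "no Python source found")
```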


What other library tells you about complexity? And as you say, if you don't know algorithms well, I'm pretty sure your own implementations won't have better complexity.


C++'s standard library containers & algorithms have strict algorithmic complexity requirements & guarantees.

For example from std::vector::insert [1]:

  Complexity
  1-2) Constant plus linear in the distance between pos and end of the container.
  3) Linear in count plus linear in the distance between pos and end of the container.
  4) Linear in std::distance(first, last) plus linear in the distance between pos and end of the container.
  5) Linear in ilist.size() plus linear in the distance between pos and end of the container.
[1] http://en.cppreference.com/w/cpp/container/vector/insert

edit: formatting


Well, often there are multiple ways of using numpy operations to do what you want, so it's good to have an idea of what numpy is doing under the hood so you can use the right functionality for the job at hand.

For example, np.einsum, for all its greatness, was historically no faster than np.tensordot, though it was more flexible. One can tell einsum to try to use the same underlying BLAS functions that tensordot uses (which can parallelise the computation) where applicable, and this optimisation will likely become einsum's default once the devs iron out some bugs. But for now, it pays to know how the two methods differ.
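
To make the comparison concrete, here is a small sketch of the same contraction written both ways. The array shapes are arbitrary, and whether `optimize=True` actually reaches BLAS depends on the expression and the NumPy build:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((4, 5))
b = rng.random((5, 3))

# tensordot contracts the last axis of a with the first axis of b.
via_tensordot = np.tensordot(a, b, axes=1)

# The same contraction written in einsum's index notation.
via_einsum = np.einsum("ij,jk->ik", a, b)

# optimize=True lets einsum search for a better contraction strategy,
# which may dispatch to BLAS-backed routines where applicable.
via_einsum_opt = np.einsum("ij,jk->ik", a, b, optimize=True)
```

All three produce the same (4, 3) result; the difference is only in how the work is carried out.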


what an odd hangup. how does this have anything to do with numpy specifically?

the suggestion that jumps out is to just learn about algorithms.


Not odd at all, actually. Numpy might implement certain functions differently from other libraries. With a background in algorithms, you could make an educated guess as to complexity, but without knowing the exact implementation it's still a guess.


After a quick perusal, about half the exercises included new material for me. Anybody learning NumPy would do well to review this. Bookmarked!


Working through them and noticed a few small things.

For #3, you can make a boolean array with np.ones/np.zeros with the same dtype arg, saves a little bit of space.

ie np.ones((3,3), dtype=bool)

For #14, you can make use of the same compound boolean statements as you can in pandas to make it a bit simpler.

ie a[(a > 5) & (a < 10)]

For #15, this is a built in numpy function.

np.maximum(a,b).

That's as far as I've made it, but I'm really enjoying them.
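
The three suggestions above, put together as a runnable sketch. The input arrays are made up for illustration:

```python
import numpy as np

# #3: a 3x3 boolean array built directly with dtype=bool.
all_true = np.ones((3, 3), dtype=bool)

# #14: combine conditions with & (the parentheses are required, since &
# binds more tightly than the comparison operators).
a = np.array([2, 6, 1, 9, 10, 3, 27])
between = a[(a > 5) & (a < 10)]

# #15: element-wise maximum of two arrays.
x = np.array([5, 7, 9, 8, 6, 4, 5])
y = np.array([6, 3, 4, 8, 9, 7, 1])
pair_max = np.maximum(x, y)
```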


Thanks for the No.14 man!

However, for No. 15, that is not the point of the exercise.


> 13. How to get the positions where elements of two arrays match?

  > Desired Output:
  > #> (array([1, 3, 5, 7]),)
Why is (array([1, 3, 5, 7]),) the desired output, and not array([1, 3, 5, 7]) ?
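
One likely explanation, assuming the exercise's solution calls np.where with only a condition (the quoted text does not show the solution): in that form np.where returns a tuple with one index array per dimension, even for 1-D input. A sketch with hypothetical arrays chosen to reproduce the quoted positions:

```python
import numpy as np

# Hypothetical inputs chosen so the matching positions are 1, 3, 5 and 7.
a = np.array([1, 2, 3, 2, 3, 4, 3, 4, 5, 6])
b = np.array([7, 2, 10, 2, 7, 4, 9, 4, 9, 8])

# With only a condition, np.where returns a tuple holding one index
# array per dimension, even for 1-D input.
as_tuple = np.where(a == b)

# Two ways to get the bare index array in the 1-D case.
first_axis = as_tuple[0]
flat = np.flatnonzero(a == b)
```

The tuple form can be passed straight back as an index, e.g. `a[as_tuple]`, which is convenient for multi-dimensional arrays.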


For #15, if the number of elements is large, it will be slower than expected, since the maxx function is written in pure Python. In my experience, though, it is still much faster than a plain Python for loop.
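
A sketch of the trade-off described above, assuming the exercise wraps a scalar function with np.vectorize. The body of `maxx` here is a hypothetical stand-in, not copied from the exercise:

```python
import numpy as np

def maxx(x, y):
    """Scalar-only maximum; a hypothetical stand-in for the exercise's function."""
    return x if x >= y else y

# np.vectorize wraps the scalar function so it accepts arrays, but it
# still calls maxx once per element from Python.
pair_max = np.vectorize(maxx, otypes=[float])

a = np.array([5, 7, 9, 8, 6, 4, 5])
b = np.array([6, 3, 4, 8, 9, 7, 1])

wrapped = pair_max(a, b)

# The built-in ufunc performs the same comparison in compiled code.
builtin = np.maximum(a, b)
```

Both give the same values; np.vectorize is a convenience for reusing scalar code, while np.maximum is the better fit when this particular operation is all that is needed.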


oh wow I wish I knew about r_ and c_ a few months ago! I'm still annoyed with numpy for being more clunky than Matlab for linear algebra, but resources like this are good for verifying that I'm doing stuff in a numpy-ic way. Thanks!

(Also numpy has some really nice features over Matlab, like [None,:] broadcasting and being able to index a parenthesized expression or function output without naming it. Ok, the latter is not really a feature, more of an example of how Matlab is broken as a language)
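
For anyone who has not met these features yet, a short sketch with made-up arrays:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# np.r_ concatenates along the first axis.
joined = np.r_[a, b]

# np.c_ stacks 1-D arrays as the columns of a 2-D array.
columns = np.c_[a, b]

# Indexing with None inserts a length-1 axis, so a (3, 1) column and a
# (1, 3) row broadcast to a (3, 3) table of pairwise sums.
pairwise = a[:, None] + b[None, :]

# Indexing a parenthesized expression directly, without naming it first.
last_two = (a + b)[-2:]
```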


Looks good, I will definitely use this as a reference!


Are there similar exercises for Java-8 streams?


Does anyone know of anything similar for matplotlib?



