In my journey through data analytics, what helped me most was fighting with real datasets. Lectures are fine, but you don't really grasp the little details needed to do a proper job until you've had to deal with messy datasets, very large datasets, text in a non-English language, etc.
That's the most useful stuff, in my opinion. Courses and lectures come with sample data that never puts you in the position of having no option but to optimize your workflow because your box can't handle it in a reasonable time.
Or the position of going crazy because you can't perform some analysis, something somewhere is wrong, your debugger can't help you, and you just want to punch someone in the face.
That's how I discovered that cleaning and preparing data is about 90% of the job, that it's better to avoid CSV for non-numeric data and use SQLite instead when possible, that KNIME is a godsend, etc.
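To illustrate the CSV-vs-SQLite point, a minimal sketch (table and file names are made up) of round-tripping text through SQLite with pandas; embedded commas, quotes and newlines survive intact, with no CSV quoting rules to debug:

    import sqlite3
    import pandas as pd

    # Hypothetical messy, non-numeric data: free text with commas,
    # quotes and newlines that routinely trip up CSV parsers.
    df = pd.DataFrame({
        "id": [1, 2],
        "comment": ['said "hello", then left', "line one\nline two"],
    })

    con = sqlite3.connect("scratch.db")  # single-file database
    df.to_sql("comments", con, if_exists="replace", index=False)

    # Read it back; the text comes through unchanged.
    back = pd.read_sql("SELECT * FROM comments WHERE id = 1", con)
    print(back)
    con.close()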
By real datasets you mean company-specific ones? Or do you happen to have some examples that are openly available which helped you a lot?
I definitely concur with your first point, since I've had the same experience, specifically when working with company-specific datasets.
In my experience, people also underestimate how much time cleaning up the data takes; there are quite a few steps you need to go through before you can really start to analyze a dataset.
I happen to scrape a lot of large websites (mostly forums currently) and that's messy enough to force you into learning tricks.
I haven't stumbled upon any (tabular, at least) dataset that wasn't very curated.
Keep in mind that I studied sociology, so stuff that is a given for most HN people isn't for me. I had to learn a lot of CSS (for selectors), regex (still hate it), what OLAP is and how to take advantage of it (DuckDB), and a lot of stuff I'm not even aware of now.
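For anyone wondering what "taking advantage of OLAP" looks like in practice, a minimal DuckDB sketch (the file name and columns are invented for illustration):

    import duckdb

    # DuckDB is a columnar (OLAP) engine: aggregations scan only the
    # columns they need, so group-bys over big files stay fast without
    # loading everything into RAM first.
    con = duckdb.connect()  # in-memory database

    # Hypothetical scraped forum posts exported to Parquet.
    result = con.execute("""
        SELECT author, COUNT(*) AS n_posts, AVG(LENGTH(body)) AS avg_len
        FROM 'posts.parquet'
        GROUP BY author
        ORDER BY n_posts DESC
        LIMIT 10
    """).fetchdf()
    print(result)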
But I remember taking courses at my university, and later on in R and Python. It was interesting, but no matter how deep I went into the rabbit hole of weird models, it felt... IDK, shallow?
Imagine yourself pulling data out of a company ERP, full of human-entered data. It won't be a walk in the park where you just fit some logit models and call it a day. You'll spend a lot of time trying to understand what's going on, and only then do you fit the models or build a dashboard.
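To make the human-entered-data point concrete, this is the kind of normalization pass you end up writing before any logit model sees the data (the column name and variants are invented for illustration):

    import pandas as pd

    # Hypothetical ERP export: the same department typed a dozen ways.
    df = pd.DataFrame({"dept": [" Sales", "sales ", "SALES", "Sls", None]})

    canonical = {"sales": "sales", "sls": "sales"}  # mapping built by hand

    df["dept_clean"] = (
        df["dept"]
        .str.strip()
        .str.lower()
        .map(canonical)  # unmapped variants become NaN for manual review
    )
    print(df)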
Scraping websites can be quite the messy business, since some websites change their document structure more often than others.
Nonetheless, it's still a very instructive activity and you can build quite the pipeline around it (scraping multiple websites, joining datasets, efficiently storing the data, etc.).
Yeah, when data piled up I had to think about how to store it, about RAM, and about a bunch of other things I never had to consider with sample data. RAM in particular, and how to transform data without needing so much of it, was a concern for some time.
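For what it's worth, the trick that eventually clicked for me was streaming instead of loading. A sketch, assuming a hypothetical events.csv with a user_id column:

    import pandas as pd

    # Aggregate a file far bigger than RAM by processing it in chunks
    # and keeping only the running totals in memory.
    totals = {}
    for chunk in pd.read_csv("events.csv", chunksize=100_000):
        for user, n in chunk.groupby("user_id").size().items():
            totals[user] = totals.get(user, 0) + n

    print(len(totals), "distinct users")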
Learning CSS selectors and HTML structure, inspect element and the other dev tools built into your browser, and something like BeautifulSoup (for static, non-JS-heavy pages) or Selenium (for JS and other complicated pages) is pretty key, imo. My background in web dev helped me with the HTML stuff. Basically, you fire up the page in a browser, use inspect element to see how you can uniquely identify the data with CSS selectors, then use BeautifulSoup or Selenium to parse and interact with the DOM. That covers most web scraping cases.
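A minimal sketch of that workflow for a static page, using requests plus BeautifulSoup (the URL and the CSS selector are placeholders; a JS-heavy page would need Selenium instead):

    import requests
    from bs4 import BeautifulSoup

    # 1. Fetch the page (fine for static pages; JS-heavy sites need Selenium).
    html = requests.get("https://example.com/forum/thread/123").text

    # 2. Parse the DOM and pull elements matching the CSS selector you
    #    found with the browser's inspect-element tool.
    soup = BeautifulSoup(html, "html.parser")
    for post in soup.select("div.post > p.body"):
        print(post.get_text(strip=True))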
It's always great to see well-crafted Python resources. It's so easy to get started in Python, and you can get pretty far without knowing the best ways to do things, so I'm glad there are things like this for newbies.
Maybe the statistics portion could be expanded in the future. While I'm grateful for all this information, it is rather odd to leave out Bayesian stuff.
As an aside, HN comments that have nothing to say except CSS nitpicks are so shameful. Imagine collecting all this information, giving away this catalogue for free, and having someone nitpick some silly sidebar zoom functionality. It's honestly despicable how often it happens. I hope the author knows how much this resource helps people out.
This is an awesome resource but the general Python section could use some work.
I am assuming that the target audience is scientists with a modicum of programming knowledge.
The list section, and especially the dictionary section, is a bit bare.
In the optimization section, have a discussion of when to use lists, dictionaries, tuples and sets (for example, the difference between "needle" in my_list and "needle" in my_set), and of when to use something from collections and when to use an ndarray (the short answer being: it depends).
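A quick way to see why that list-vs-set difference matters, sketched with arbitrarily chosen sizes:

    import timeit

    my_list = list(range(1_000_000))
    my_set = set(my_list)

    # Membership in a list is a linear scan; in a set it's a hash lookup.
    print(timeit.timeit(lambda: 999_999 in my_list, number=100))  # O(n) scan
    print(timeit.timeit(lambda: 999_999 in my_set, number=100))   # O(1) average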
So... where do I learn statistics in the first place? Let me rephrase the question: what is the most efficient way to learn the minimum viable amount of statistics?
> What is the most efficient way to learn the minimum viable amount of statistics?
You need to add three constraints to the question:
1. What is your starting point and current knowledge of mathematics and statistics?
2. Minimum viable for what? What do you need the statistics knowledge for?
3. How much effort can you afford to put into this over what period of time?
Then the answer ranges from "here are a couple of good youtube videos" to "here is how to design your own degree in statistics using freely available material".
Thank you for the recommendation. I build admin dashboards using stock (double entendre?) charting libraries, and recently I have been building my own d3.js visualizations with dynamic content. At this point, I might as well start to delve into data science, and I have been investing time developing math skills with calculus and linear algebra. I would also like to take some time to learn basic statistics concepts. I want to level up a little bit but don't see the point of getting a Ph.D. in machine learning. I only need the basics to start from.
The section "how does python compare to other solutions" is a bit lackluster, and heavily biased at the same time. It would be more useful if this section was written by proponents of each of the other "solutions".
This is the best place to become familiar with the tools and to set the stage for your journey. Then find a problem you want to solve, and find more domain specific resources. Some people learn best from tutorials, some from video, some from courses, some from just banging their heads against a wall till they figure it out.
This, by the way, should not be called "science". Science is a methodology for establishing aspects of truth (via reproducible experiments).
What it should accurately be called is "modeling": mostly oversimplified and plainly wrong (like the Bayesian sect, or any kind of predictive modeling; look how all the COVID models and simulations missed everything).
So it is data modeling, not data science, and it is important to realize and understand the difference.
* Pandas: https://pandas.pydata.org/docs/getting_started/index.html
* DSP: https://greenteapress.com/thinkdsp/html/index.html
* Numpy: https://www.labri.fr/perso/nrougier/from-python-to-numpy/
* Data Carpentry: https://datacarpentry.org/lessons/
* Data science path: https://github.com/ossu/data-science