In my journey through data analytics, what helped me most was fighting with real datasets. Lectures are fine, but you don't really grasp the little details needed to do a proper job until you've had to deal with messy datasets, very large datasets, text in a non-English language, etc.
That's the most useful stuff, in my opinion. Courses and lectures come with sample data that never puts you in the position of having no option but to optimize your workflow because your box can't handle it in a reasonable time.
Or the position of going crazy because you can't perform some analysis, something somewhere is wrong, your debugger can't help you, and you just want to punch someone in the face.
That's how I discovered that cleaning and preparing data is about 90% of the job, that it's better to avoid CSV for non-numeric data and use SQLite instead when possible, that KNIME is a godsend, etc.
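To illustrate the CSV-vs-SQLite point, a minimal sketch (table and file names are made up) of round-tripping text through SQLite with pandas; embedded commas, quotes and newlines survive intact, with no CSV quoting rules to debug:

    import sqlite3
    import pandas as pd

    # Hypothetical messy, non-numeric data: free text with commas,
    # quotes and newlines that routinely trip up CSV parsers.
    df = pd.DataFrame({
        "id": [1, 2],
        "comment": ['said "hello", then left', "line one\nline two"],
    })

    con = sqlite3.connect("scratch.db")  # single-file database
    df.to_sql("comments", con, if_exists="replace", index=False)

    # Read it back; the text comes through unchanged.
    back = pd.read_sql("SELECT * FROM comments WHERE id = 1", con)
    print(back)
    con.close()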
By real datasets you mean company-specific ones? Or do you happen to have some examples that are openly available which helped you a lot?
I definitely concur with your first point, since I've had the same experience, specifically when working with company-specific datasets.
In my experience, people also underestimate how much time cleaning up the data takes; there are quite a few steps you need to go through before you can really start to analyze a dataset.
I happen to scrape a lot of large websites (mostly forums currently) and that's messy enough to force you into learning tricks.
I haven't stumbled upon any (tabular, at least) dataset that wasn't very curated.
Keep in mind that I studied sociology, so stuff that is a given for most HN people isn't for me. I had to learn a lot of CSS (for selectors), regex (still hate it), what OLAP is and how to take advantage of it (DuckDB), and a lot of stuff I'm not even aware of now.
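For anyone wondering what "taking advantage of OLAP" looks like in practice, a minimal DuckDB sketch (the file name and columns are invented for illustration):

    import duckdb

    # DuckDB is a columnar (OLAP) engine: aggregations scan only the
    # columns they need, so group-bys over big files stay fast without
    # loading everything into RAM first.
    con = duckdb.connect()  # in-memory database

    # Hypothetical scraped forum posts exported to Parquet.
    result = con.execute("""
        SELECT author, COUNT(*) AS n_posts, AVG(LENGTH(body)) AS avg_len
        FROM 'posts.parquet'
        GROUP BY author
        ORDER BY n_posts DESC
        LIMIT 10
    """).fetchdf()
    print(result)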
But I remember taking courses at my university, and later on in R and Python. It was interesting, but no matter how deep I went into the rabbit hole of weird models, it felt... IDK, shallow?
Imagine yourself pulling data out of a company ERP, full of human-entered data. It won't be a walk in the park where you just fit some logit models and call it a day. You'll spend a lot of time trying to understand what's going on, and only then do you fit the models or build a dashboard.
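To make the human-entered-data point concrete, this is the kind of normalization pass you end up writing before any logit model sees the data (the column name and variants are invented for illustration):

    import pandas as pd

    # Hypothetical ERP export: the same department typed a dozen ways.
    df = pd.DataFrame({"dept": [" Sales", "sales ", "SALES", "Sls", None]})

    canonical = {"sales": "sales", "sls": "sales"}  # mapping built by hand

    df["dept_clean"] = (
        df["dept"]
        .str.strip()
        .str.lower()
        .map(canonical)  # unmapped variants become NaN for manual review
    )
    print(df)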
Scraping websites can be quite the messy business, since some websites change their document structure more often than others.
Nonetheless, it's still a very instructive activity and you can build quite the pipeline around it (scraping multiple websites, joining datasets, efficiently storing the data, etc.).
Yeah, when data piled up I had to think about how to store it, about RAM, and about a bunch of other things I never had to consider with sample data. RAM in particular, and how to transform data without needing so much of it, was a concern for some time.
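For what it's worth, the trick that eventually clicked for me was streaming instead of loading. A sketch, assuming a hypothetical events.csv with a user_id column:

    import pandas as pd

    # Aggregate a file far bigger than RAM by processing it in chunks
    # and keeping only the running totals in memory.
    totals = {}
    for chunk in pd.read_csv("events.csv", chunksize=100_000):
        for user, n in chunk.groupby("user_id").size().items():
            totals[user] = totals.get(user, 0) + n

    print(len(totals), "distinct users")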
Learning CSS selectors and HTML structure, inspect element and the other dev tools built into your browser, and something like BeautifulSoup (for static, non-JS-heavy pages) or Selenium (for JS and other complicated pages) is pretty key, imo. My background in web dev helped me with the HTML stuff. Basically, you fire up the page in a browser, use inspect element to see how you can uniquely identify the data with CSS selectors, then use BeautifulSoup or Selenium to parse and interact with the DOM. That covers most web scraping cases.
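A minimal sketch of that workflow for a static page, using requests plus BeautifulSoup (the URL and the CSS selector are placeholders; a JS-heavy page would need Selenium instead):

    import requests
    from bs4 import BeautifulSoup

    # 1. Fetch the page (fine for static pages; JS-heavy sites need Selenium).
    html = requests.get("https://example.com/forum/thread/123").text

    # 2. Parse the DOM and pull elements matching the CSS selector you
    #    found with the browser's inspect-element tool.
    soup = BeautifulSoup(html, "html.parser")
    for post in soup.select("div.post > p.body"):
        print(post.get_text(strip=True))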
It's always great to see well-crafted Python resources. It's so easy to get started in Python, and you can get pretty far without knowing the best ways to do things, so I'm glad there are things like this for newbies.
Maybe the statistics portion could be expanded in the future. While I'm grateful for all this information, it is rather odd to leave out Bayesian stuff.
As an aside, HN comments that have nothing to say except CSS nitpicks are so shameful. Imagine collecting all this information, giving away this catalogue for free, and having someone nitpick some silly sidebar zoom functionality. It's honestly despicable how often it happens. I hope the author knows how much this resource helps people out.
This is an awesome resource but the general Python section could use some work.
I am assuming that the target audience is scientists with a modicum of programming knowledge.
The list section, and especially the dictionary section, is a bit bare.
In the optimization section, have a discussion of when to use lists, dictionaries, tuples and sets (for example, the difference between "needle" in my_list and "needle" in my_set), and of when to use something from collections and when to use an ndarray (the short answer being: it depends).
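A quick way to see why that list-vs-set difference matters, sketched with arbitrarily chosen sizes:

    import timeit

    my_list = list(range(1_000_000))
    my_set = set(my_list)

    # Membership in a list is a linear scan; in a set it's a hash lookup.
    print(timeit.timeit(lambda: 999_999 in my_list, number=100))  # O(n) scan
    print(timeit.timeit(lambda: 999_999 in my_set, number=100))   # O(1) average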
So... where do I learn statistics in the first place? Let me rephrase the question: what is the most efficient way to learn the minimum viable amount of statistics?
> What is the most efficient way to learn the minimum viable amount of statistics?
You need to add three constraints to the question:
1. What is your starting point and current knowledge of mathematics and statistics?
2. Minimum viable for what? What do you need the statistics knowledge for?
3. How much effort can you afford to put into this over what period of time?
Then the answer ranges from "here are a couple of good youtube videos" to "here is how to design your own degree in statistics using freely available material".
Thank you for the recommendation. I build admin dashboards using stock (double entendre?) charting libraries, and recently I have been building my own d3.js visualizations with dynamic content. At this point, I might as well start to delve into data science, and I have been investing time developing math skills with calculus and linear algebra. I would also like to take some time to learn basic statistics concepts. I want to level up a little bit but don't see the point of getting a Ph.D. in machine learning. I only need the basics to start from.
The section "how does python compare to other solutions" is a bit lackluster, and heavily biased at the same time. It would be more useful if this section was written by proponents of each of the other "solutions".
This is the best place to become familiar with the tools and to set the stage for your journey. Then find a problem you want to solve, and find more domain specific resources. Some people learn best from tutorials, some from video, some from courses, some from just banging their heads against a wall till they figure it out.
This, by the way, should not be called "science". Science is a methodology for establishing aspects of truth (via reproducible experiments).
What it should accurately be called is "modeling": mostly oversimplified and plainly wrong (like the Bayesian sect, or any kind of predictive modeling; look how all the COVID models and simulations missed everything).
So it is data modeling, not data science, and it is important to realize and understand the difference.
* Pandas: https://pandas.pydata.org/docs/getting_started/index.html
* DSP: https://greenteapress.com/thinkdsp/html/index.html
* Numpy: https://www.labri.fr/perso/nrougier/from-python-to-numpy/
* Data Carpentry: https://datacarpentry.org/lessons/
* Data science path: https://github.com/ossu/data-science