In my journey through data analytics, what helped me most was wrestling with real datasets. Lectures are fine, but you don't really grasp the little details needed to do a proper job until you've dealt with messy datasets, very large datasets, text in a non-English language, and so on.
That's the most useful stuff in my opinion. Courses and lectures come with sample data that never puts you in the position of having no option but to optimize your workflow because your box can't handle it in a reasonable time.
Or when you go crazy because you can't get some analysis to run, something somewhere is wrong, your debugger can't help you, and you just want to punch someone in the face.
That's how I discovered that cleaning and preparing data is about 90% of the job, that it's better to avoid CSV for non-numeric data and use SQLite instead when possible, that KNIME is a godsend, etc.
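To illustrate the CSV point with a minimal sketch (table and column names are made up): free text with embedded commas, quotes, and newlines round-trips cleanly through SQLite, while in CSV the same data needs careful quoting and dialect handling.

    import sqlite3

    rows = [
        ("ana", 'She said "hi",\nthen left'),  # embedded quotes, comma, newline
        ("José", "café, naïve, résumé"),       # non-ASCII text
    ]

    con = sqlite3.connect("notes.db")
    con.execute("CREATE TABLE IF NOT EXISTS notes (author TEXT, body TEXT)")
    con.executemany("INSERT INTO notes VALUES (?, ?)", rows)
    con.commit()

    # Read it back: the text comes out exactly as it went in.
    for author, body in con.execute("SELECT author, body FROM notes"):
        print(repr(author), repr(body))
    con.close()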
By real datasets you mean company-specific ones? Or do you happen to have some examples that are openly available which helped you a lot?
I definitely concur with your first point, since I've had the same experience, specifically when working with company-specific datasets.
In my experience, people also underestimate how much time cleaning up the data takes; there are quite a few steps you need to go through before you can really start to analyze a dataset.
I happen to scrape a lot of large websites (mostly forums at the moment), and that's messy enough to force you to learn some tricks.
I haven't stumbled upon any openly available (tabular, at least) dataset that wasn't very curated.
Keep in mind that I studied sociology, so stuff that is a given for most HN people isn't for me. I had to learn a lot of CSS (for selectors), regex (still hate it), what OLAP is and how to take advantage of it (DuckDB), and a lot of stuff I'm not even aware of now.
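For the OLAP/DuckDB part, something like this is what I mean (the file name and columns here are invented): you point DuckDB at a file and run aggregate-heavy SQL over it directly, without loading everything into RAM first.

    import duckdb

    con = duckdb.connect()  # in-memory database
    top = con.execute("""
        SELECT author,
               date_trunc('month', created_at) AS month,
               count(*) AS posts
        FROM 'posts.parquet'   -- DuckDB scans the file directly
        GROUP BY author, month
        ORDER BY posts DESC
        LIMIT 10
    """).fetchdf()             # returns a pandas DataFrame
    print(top)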
But I remember taking courses at my university, and later on with R and Python. It was interesting, but no matter how deep I went into the rabbit hole of weird models, it felt... IDK, shallow?
Imagine pulling data out of a company ERP, full of human-entered data. It won't be a walk in the park where you just fit some logit models and call it a day. You'll spend a lot of time trying to understand what's going on, and only then do you fit the models or build a dashboard.
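To give a taste of what I mean (everything below is invented for illustration): the same customer shows up typed three different ways, and the numbers arrive as text, so you have to normalize before any group-by makes sense.

    import pandas as pd

    df = pd.DataFrame({
        "customer": ["ACME Corp.", "acme corp", "  Acme  Corp ", "Globex"],
        "amount":   ["1,200.50", "950", "n/a", "300.00"],
    })

    # Normalize free-text keys before any join or group-by.
    df["customer"] = (
        df["customer"]
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)
        .str.rstrip(".")
        .str.title()
    )

    # Coerce pseudo-numeric text: drop thousands separators, map "n/a" to NaN.
    df["amount"] = pd.to_numeric(
        df["amount"].str.replace(",", "", regex=False), errors="coerce"
    )

    print(df.groupby("customer")["amount"].sum())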
Scraping websites can be quite the messy business, since some websites change their document structure more often than others.
Nonetheless, it's still a very instructive activity and you can build quite the pipeline around it (scraping multiple websites, joining datasets, efficiently storing the data, etc.).
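One rough trick against structure changes (the selectors here are made up): keep a list of candidate CSS selectors per field, so the scraper degrades gracefully instead of crashing when a site tweaks its markup.

    from bs4 import BeautifulSoup

    # Old and new layouts of the same hypothetical forum post.
    FIELD_SELECTORS = {
        "title":  ["h1.post-title", "div.subject > h2"],
        "author": ["span.author-name", "a.user-link"],
    }

    def extract(html: str) -> dict:
        soup = BeautifulSoup(html, "html.parser")
        record = {}
        for field, selectors in FIELD_SELECTORS.items():
            for sel in selectors:
                node = soup.select_one(sel)
                if node:
                    record[field] = node.get_text(strip=True)
                    break
            else:
                record[field] = None  # flag for review instead of crashing
        return record

    print(extract('<h1 class="post-title">Hello</h1><a class="user-link">bo</a>'))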
Yeah, when the data piled up I had to think about storage, RAM, and a bunch of other things that never came up with sample data. RAM in particular, and how to transform data without needing so much of it, was a concern for quite a while.
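The basic trick I ended up with, roughly (file and column names are placeholders): stream the data in chunks and keep only a running aggregate, so only one chunk is resident in memory at a time.

    import pandas as pd

    counts = {}
    for chunk in pd.read_csv("events.csv", chunksize=100_000):
        # Only this chunk is in RAM; the running totals stay small.
        for user, n in chunk["user_id"].value_counts().items():
            counts[user] = counts.get(user, 0) + n

    top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]
    print(top)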
Learning CSS selectors and HTML structure, inspect element and the other dev tools built into your browser, and something like BeautifulSoup (for static/non-JS-heavy pages) or Selenium (for JS and other complicated pages) is pretty key imo. My background in web dev helped me with the HTML stuff. Basically, you fire up the page in a browser and use inspect element to see how you can uniquely identify the data with CSS selectors; then parsing and interacting with the DOM via BeautifulSoup or Selenium will cover most web scraping cases.
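Condensed into code, that workflow looks roughly like this for a static page (the URL and selectors are placeholders; a JS-heavy page would need Selenium to render first):

    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://example.com/forum/thread/123", timeout=10)
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")
    for post in soup.select("div.post"):      # one node per post
        author = post.select_one(".author")
        body = post.select_one(".post-body")
        if author and body:                   # skip posts that don't match
            print(author.get_text(strip=True), ":",
                  body.get_text(" ", strip=True)[:80])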