In my journey through data analytics, what helped me most was wrestling with real datasets. Lectures are fine, but you don't really grasp the little details needed to do a proper job until you've dealt with messy datasets, very large datasets, text in a non-English language, and so on.
That's the most useful stuff in my opinion. Courses and lectures come with sample data that never puts you in the position of having no option but to optimize your workflow because your box can't handle it in a reasonable time.
Or when you go crazy because you can't get some analysis to run, something somewhere is wrong, your debugger can't help you, and you just want to punch someone in the face.
That's how I discovered that cleaning and preparing data is about 90% of the job, that it's better to avoid CSV for non-numeric data and use SQLite instead when possible, that KNIME is a godsend, etc.
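To illustrate the CSV point with a minimal sketch (table and column names are made up): free text with embedded commas, quotes, and newlines round-trips cleanly through SQLite, while in CSV the same data needs careful quoting and dialect handling.

    import sqlite3

    rows = [
        ("ana", 'She said "hi",\nthen left'),  # embedded quotes, comma, newline
        ("José", "café, naïve, résumé"),       # non-ASCII text
    ]

    con = sqlite3.connect("notes.db")
    con.execute("CREATE TABLE IF NOT EXISTS notes (author TEXT, body TEXT)")
    con.executemany("INSERT INTO notes VALUES (?, ?)", rows)
    con.commit()

    # Read it back: the text comes out exactly as it went in.
    for author, body in con.execute("SELECT author, body FROM notes"):
        print(repr(author), repr(body))
    con.close()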
By real datasets you mean company-specific ones? Or do you happen to have some examples that are openly available which helped you a lot?
I definitely concur with your first point, since I've had the same experience, specifically when working with company-specific datasets.
In my experience, people also underestimate how much time cleaning up the data takes; there are quite a few steps you need to go through before you can really start to analyze a dataset.
I happen to scrape a lot of large websites (mostly forums at the moment), and that's messy enough to force you to learn some tricks.
I haven't stumbled upon any openly available (tabular, at least) dataset that wasn't very curated.
Keep in mind that I studied sociology, so stuff that is a given for most HN people isn't for me. I had to learn a lot of CSS (for selectors), regex (still hate it), what OLAP is and how to take advantage of it (DuckDB), and a lot of stuff I'm not even aware of now.
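For the OLAP/DuckDB part, something like this is what I mean (the file name and columns here are invented): you point DuckDB at a file and run aggregate-heavy SQL over it directly, without loading everything into RAM first.

    import duckdb

    con = duckdb.connect()  # in-memory database
    top = con.execute("""
        SELECT author,
               date_trunc('month', created_at) AS month,
               count(*) AS posts
        FROM 'posts.parquet'   -- DuckDB scans the file directly
        GROUP BY author, month
        ORDER BY posts DESC
        LIMIT 10
    """).fetchdf()             # returns a pandas DataFrame
    print(top)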
But I remember taking courses at my university, and later on with R and Python. It was interesting, but no matter how deep I went into the rabbit hole of weird models, it felt... IDK, shallow?
Imagine pulling data out of a company ERP, full of human-entered data. It won't be a walk in the park where you just fit some logit models and call it a day. You'll spend a lot of time trying to understand what's going on, and only then do you fit the models or build a dashboard.
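To give a taste of what I mean (everything below is invented for illustration): the same customer shows up typed three different ways, and the numbers arrive as text, so you have to normalize before any group-by makes sense.

    import pandas as pd

    df = pd.DataFrame({
        "customer": ["ACME Corp.", "acme corp", "  Acme  Corp ", "Globex"],
        "amount":   ["1,200.50", "950", "n/a", "300.00"],
    })

    # Normalize free-text keys before any join or group-by.
    df["customer"] = (
        df["customer"]
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)
        .str.rstrip(".")
        .str.title()
    )

    # Coerce pseudo-numeric text: drop thousands separators, map "n/a" to NaN.
    df["amount"] = pd.to_numeric(
        df["amount"].str.replace(",", "", regex=False), errors="coerce"
    )

    print(df.groupby("customer")["amount"].sum())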
Scraping websites can be quite the messy business, since some websites change their document structure more often than others.
Nonetheless, it's still a very instructive activity and you can build quite the pipeline around it (scraping multiple websites, joining datasets, efficiently storing the data, etc.).
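One rough trick against structure changes (the selectors here are made up): keep a list of candidate CSS selectors per field, so the scraper degrades gracefully instead of crashing when a site tweaks its markup.

    from bs4 import BeautifulSoup

    # Old and new layouts of the same hypothetical forum post.
    FIELD_SELECTORS = {
        "title":  ["h1.post-title", "div.subject > h2"],
        "author": ["span.author-name", "a.user-link"],
    }

    def extract(html: str) -> dict:
        soup = BeautifulSoup(html, "html.parser")
        record = {}
        for field, selectors in FIELD_SELECTORS.items():
            for sel in selectors:
                node = soup.select_one(sel)
                if node:
                    record[field] = node.get_text(strip=True)
                    break
            else:
                record[field] = None  # flag for review instead of crashing
        return record

    print(extract('<h1 class="post-title">Hello</h1><a class="user-link">bo</a>'))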
Yeah, when the data piled up I had to think about storage, RAM, and a bunch of other things that never came up with sample data. RAM in particular, and how to transform data without needing so much of it, was a concern for quite a while.
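The basic trick I ended up with, roughly (file and column names are placeholders): stream the data in chunks and keep only a running aggregate, so only one chunk is resident in memory at a time.

    import pandas as pd

    counts = {}
    for chunk in pd.read_csv("events.csv", chunksize=100_000):
        # Only this chunk is in RAM; the running totals stay small.
        for user, n in chunk["user_id"].value_counts().items():
            counts[user] = counts.get(user, 0) + n

    top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]
    print(top)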
Learning CSS selectors and HTML structure, inspect element and the other dev tools built into your browser, and something like BeautifulSoup (for static/non-JS-heavy pages) or Selenium (for JS and other complicated pages) is pretty key imo. My background in web dev helped me with the HTML stuff. Basically, you fire up the page in a browser and use inspect element to see how you can uniquely identify the data with CSS selectors; then parsing and interacting with the DOM via BeautifulSoup or Selenium will cover most web scraping cases.
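Condensed into code, that workflow looks roughly like this for a static page (the URL and selectors are placeholders; a JS-heavy page would need Selenium to render first):

    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://example.com/forum/thread/123", timeout=10)
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")
    for post in soup.select("div.post"):      # one node per post
        author = post.select_one(".author")
        body = post.select_one(".post-body")
        if author and body:                   # skip posts that don't match
            print(author.get_text(strip=True), ":",
                  body.get_text(" ", strip=True)[:80])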