Web Scraping Indeed for Key Data Science Job Skills (jessesw.com)
75 points by jonbaer on March 30, 2015 | 21 comments



As a little side project, I built a website at: http://skill.report which instantly does this for any job title. Go try it! I'd love to hear your feedback :)

It works by sampling job ads from Indeed, then applying some information extraction/retrieval/NLP algos to extract and weight the presence of identified skills and qualities. There are occasional glitches in the algo (I need to fix some of the disambiguation data), but it usually gives reasonable results.

I was thinking of focusing the algorithm on giving really in-depth feedback for improving your resume for a specific job. Now if only I could finish my PhD thesis I might actually have the time to do more with it...


Love the design and my first query "developer" turned up some good results. Some feedback:

- If I enter "c#" as a query it simply refreshes the page.

- A lot of the "skills" I am getting back are simply rephrased job titles (e.g. "web developer" returned "web applications, web development, web services, mobile application development, support, responsibility, web design, javascript, project and software developer." for the skills list)

Definitely has a lot of promise though if you can reliably filter out skills from job descriptions.


> look through Indeed's pages of job results and click on all of the job links, but only in the center of the page where all of the jobs are posted (not on the edges).

I wrote a toolkit to help solve this problem [1]. An issue with hard-coding the result pattern to scrape is that it breaks when the page changes. E.g., the author's code has:

  page_obj.find(id = 'resultsCol')
If Indeed ever changes that ID, the program won't work. In that respect, it's better to dynamically figure out where the results are.
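To give the rough flavour of that in the article's own Python/BeautifulSoup setup (just a sketch of the idea, not how ScreenSlicer does it; the "jk=" link pattern is my assumption about Indeed's job URLs):

    from bs4 import BeautifulSoup

    def find_results_container(html, preferred_id="resultsCol"):
        soup = BeautifulSoup(html, "html.parser")
        # Try the hard-coded ID first, since it's cheap when it still works.
        container = soup.find(id=preferred_id)
        if container is not None:
            return container

        # Fallback heuristic: the smallest <div>/<td> holding the most
        # job-style links ("jk=" in the href is an assumed Indeed pattern).
        def job_links(el):
            return len(el.find_all("a", href=lambda h: h and "jk=" in h))

        candidates = [el for el in soup.find_all(["div", "td"]) if job_links(el)]
        if not candidates:
            return None
        most = max(job_links(el) for el in candidates)
        # Prefer the tightest wrapper that still contains all of those links.
        return min((el for el in candidates if job_links(el) == most),
                   key=lambda el: len(str(el)))

It's still a heuristic, but one that survives an ID rename.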

And as far as cleaning up unicode and HTML entities, I like the "he" project [2]. HTML parsing libraries don't handle entities inside text fields very well, so unfortunately this extra parsing is necessary. And properly stripping all duplicate whitespace sometimes means getting very familiar with control characters (and corrupted control characters from bad encoding) as well as left-behind HTML tags/entities [3].

1. https://github.com/MachinePublishers/ScreenSlicer

2. https://github.com/mathiasbynens/he

3. https://github.com/MachinePublishers/ScreenSlicer/blob/maste...
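For a Python pipeline like the article's, that extra cleanup step might look roughly like this (a generic sketch, not the "he" library, which is JavaScript):

    import html
    import re
    import unicodedata

    def clean_text(raw):
        text = html.unescape(raw)             # decode &amp;, &#8217;, etc.
        text = re.sub(r"<[^>]+>", " ", text)  # drop left-behind tags
        # Strip control/format characters (Unicode categories Cc/Cf),
        # which is where corrupted encodings tend to hide.
        text = "".join(ch for ch in text
                       if ch in "\n\t"
                       or unicodedata.category(ch) not in ("Cc", "Cf"))
        return re.sub(r"\s+", " ", text).strip()  # collapse duplicate whitespace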


Could you please elaborate on what you mean by "dynamically figure out where the results are"? Or how to go about doing it?

Edit: I see your first link sorta answers that. And correct me if I'm wrong, but when I went there, it seemed that the library caters more towards automatic searching + paging rather than extracting results?


It handles extraction too: it tries to find where the results are and then extracts the title/summary/url/date individually.

To elaborate on the general approach I used: take each node in the web page, get stats about each of them (e.g., position on the page, amount of free text, etc.), and plug those stats into a neural net.

I worked on a different project some years ago that took the approach of looking for repeating tag patterns in the page, focusing especially on structural tags (as opposed to ones that are purely for formatting).

Another possible approach might be to just plug the whole result page into something such as Boilerpipe (https://code.google.com/p/boilerpipe/) and look at the set of urls in the text block it identifies.
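If you just want the gist of the repeating-pattern idea in Python (a toy heuristic, not the neural-net version and not Boilerpipe): find the parent whose direct children most often share the same tag/class signature, which is usually where a result list lives.

    from collections import Counter
    from bs4 import BeautifulSoup

    def most_repetitive_parent(html):
        soup = BeautifulSoup(html, "html.parser")
        best_parent, best_repeats = None, 0
        for parent in soup.find_all(True):
            # Signature = (tag name, class list) of each direct child.
            signatures = Counter(
                (child.name, tuple(child.get("class", [])))
                for child in parent.find_all(True, recursive=False)
            )
            if signatures and max(signatures.values()) > best_repeats:
                best_parent, best_repeats = parent, max(signatures.values())
        return best_parent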


I wish there was a system of web scrapers where the scraping logic is user-contributed and decentralized at the same time. Because it would be decentralized, there would be no way for website owners to stop anyone from scraping, and because it would be user-maintained, the logic would get updated quickly whenever the original website's HTML template changes.


Assuming that someone has created such a system but has not released it, and that this person asked you what you might do with her system, what would you answer?

Do you envision that users would want to run such a system, e.g., if there was a public benefit to such information sharing?

What if the implementation was a group of small programs written in C that communicated with each other, and no browser extensions or scripting languages were required?

What if the system required attachment of dedicated hardware to the user's LAN, e.g., a $25 single board computer?


> Assuming that someone has created such a system but has not released it, and that this person asked you what you might do with her system, what would you answer?

Build stuff that takes information and uses it in new, interesting, creative, and useful ways. Right now there is a lot of extremely useful data that is trapped inside the interfaces of websites and apps that could be used in amazing ways but unfortunately there's no easy way to get at the data.

I don't think hogging information and intellectual property will last very long as a means of creating value. We as a society need to think of better business models and better ways to define progress than this.

> Do you envision that users would want to run such a system, e.g., if there was a public benefit to such information sharing?

Sure, if they are getting something out of it too. For example, the new ways of accessing information could be made usable only by those who participate in running the system.

> What if the implementation was a group of small programs written in C that communicated with each other, and no browser extensions or scripting languages were required? What if the system required attachment of dedicated hardware to the user's LAN, e.g., a $25 single board computer?

All this sounds good to me. I'd want the full hardware and software stack to be open-source though if it's going to be plugged into my home network, so that there's no chance of it violating my privacy.

One hurdle will be how to enforce that users MUST contribute a piece of their bandwidth in order to use the fruits of the system (e.g. you need to help others make scraping API calls before you can issue calls yourself). Napster did something like this for music, but as with any centralized system, it eventually got sued and shut down.

In order to decentralize this I think a cryptographic currency similar to Bitcoin will be needed: you get points for offering bandwidth, and you spend points in order to make calls on other people's bandwidth.
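Stripped to the bone, the accounting is just a balance per peer; the hard part this sketch deliberately ignores is making the ledger shared and tamper-proof (the Bitcoin-like part). Names are made up:

    class ScrapeCredits:
        """Toy, centralized stand-in for the points ledger described above."""

        def __init__(self):
            self.balances = {}

        def earn(self, peer, amount=1):
            # Credit a peer for serving someone else's scraping request.
            self.balances[peer] = self.balances.get(peer, 0) + amount

        def spend(self, peer, amount=1):
            # A peer must have served requests before issuing its own.
            if self.balances.get(peer, 0) < amount:
                raise ValueError("not enough credits: serve some requests first")
            self.balances[peer] -= amount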


This is very cool. I've been kicking around a similar approach to feed data into a recommendation engine: collect job listings, filter against a list of stop words, then see if ones with similar words turn out to be similar jobs.
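One minimal way to sketch that with scikit-learn, using made-up listings (TF-IDF plus cosine similarity is just one choice of weighting and distance):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    listings = [
        "Python developer to build data pipelines and dashboards",     # made up
        "Data engineer, Python, ETL pipelines, reporting dashboards",  # made up
        "Registered nurse for night shifts at a regional hospital",    # made up
    ]

    # Stop-word filtering plus TF-IDF weighting; high off-diagonal cosine
    # scores suggest two listings describe similar jobs.
    matrix = TfidfVectorizer(stop_words="english").fit_transform(listings)
    print(cosine_similarity(matrix))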

One of the problems with this approach is (in my country at least) the heavy industry presence of recruitment firms means every job is listed up to four times: once by the employer, and once by each firm competing to find the hire.


If you don't like to code scrapers, you can always use something like http://import.io

The 'magic' API works on a lot of list websites: https://magic.import.io/?site=http:%2F%2Fwww.indeed.co.uk%2F...


    for script in soup_obj(["script", "style"]):
            script.extract() # Only need these two elements from the BS4 object

That comment is a bit misleading. extract() will remove those tags from the tree; 'script' and 'style' are the two tags you __don't__ need.

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#extrac...
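A tiny self-contained example of what extract() actually does (the HTML here is invented):

    from bs4 import BeautifulSoup

    html = "<body><p>Job text</p><script>track()</script><style>p{}</style></body>"
    soup = BeautifulSoup(html, "html.parser")

    # extract() pulls the matched tags *out* of the tree, so script/style
    # are the parts being discarded, not the parts being kept.
    for tag in soup(["script", "style"]):
        tag.extract()

    print(soup.get_text())  # -> "Job text"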


Nice work. Although it's not in Python, http://trendyskills.com has a broader set of skills and more countries, with an open API for everyone.


Perhaps offtopic: Is web scraping a desired skill in the job market? Asking because unemployed.


Sort of. There are tons of people offering crawling-as-a-service, where you can do your own scraping. Then there are tools like import.io which let you point-and-click data right into your database. There's Scrapy and similar frameworks for Python. I'm sure node has more than a few npm packages to deal with dynamically rendered (e.g. infinite scroll) stuff. In short, it's a useful tool to have in your toolbelt, but not the hammer you'll use every day.


My guess is that web scraping isn't something you're going to see on job postings specifically, but if you can write a web scraper with your favorite programming language, that would be something useful to discuss and maybe even land you an interview.


The page is not available for me.


re: python vs. R

I haven't done Python at all, but from reading bits and pieces online, it seems to me that Python is a lot more about ML than statistics.

Also, how does one decide what a "data scientist" is? Is it only people who do Stats+ML+IT? What about a researcher in economics or biology? Or marketing. Are those included? They do a lot of stats and increasingly a lot of ML. Not so much information technology though (they're less concerned with data storage and retrieval, because they have other people to do that for them). I'd venture to say that researchers like that are a far bigger group than pure-play data scientists, and it would be interesting to see the technical skills for them. I bet SPSS and SAS would look a lot more in demand.


You can take a look at the book Analyzing the Analyzers. The authors surveyed data scientists, asking about their experiences and how they viewed their own skills and careers. It answers your question of what a "data scientist" is.

http://www.oreilly.com/data/free/files/analyzing-the-analyze...


That's a great paper. I wonder how I missed it. I'm happy to say that it partially confirms my own take on the situation. I thought of it in only 2 categories - comp sci/math guys vs. applied statistics guys (mostly from the humanities). Reality seems more nuanced than that (as always). Very nice read!


Just my personal biased perspective: you are correct, there is no good definition of "Data Science". Like any other buzzword, it is largely devoid of meaning but useful for conveying a general idea to certain audiences. Regarding R vs. Python, I personally use Python to process my data up to the point where I can hand it over to R in a convenient df-esque shape, and use R for stats and plotting. Even if I don't require R for stats, I still use ggplot2 for plotting, though I'd be happy to see Seaborn or Bokeh evolve to a point where they can rival it.


The answer to what a data scientist does is very different when you ask a self-proclaimed data scientist and a person who hires a "data scientist". This can create disappointment on both sides.

It ranges (from the hiring side) from "good if you know Excel and you will report figures to management" to "please only apply if you have 2 Ph.D.s and 20 years of experience in C". So it is a bit wide...

The self-proclaimed "data scientists" probably range from "knows a programming language / visualisation / statistics" up to "X Ph.D.s and Y years of experience where Y > 20".



