Show HN: Turn any website into an API (for those who miss Kimono) (simplescraper.io)
332 points by welanes on Nov 1, 2019 | hide | past | favorite | 74 comments


This is very cool! I love how you brought back the original Kimono UI with the checkmark and Xs for adding and removing data tags.

We built WrapAPI (https://wrapapi.com) back in the day, before we ended up starting Wanderlog (https://wanderlog.com), our current travel planning Y Combinator startup. This definitely is still an unsolved problem.

However, from a business point of view, we found that it was rather difficult to make a business out of an unspecialized scraping tool. The Kimono founders expressed a similar sentiment: ultimately, scraping is a solution looking for a problem.

Developers can often roll their own solution too, which limits your customer base and how much you can charge. Instead, vertical-specific tools that target particular industries seem to be the way to go (see Plaid as an example!)

Alternatively, you have to be good at enterprise and B2B sales. This is a product where you need to get the word out, find a champion, and do customer success, since it has a substantial learning curve. We weren't good at that, which is why we chose to focus on other projects instead.

Best of luck, and feel free to get in touch if you'd like to chat more


Thanks! Yeah the checkmark confirmation just feels effortless. Haven't got it perfected yet, but soon.

Really appreciate the insights.

You're right that much depends on mapping the solution to a particular problem. Are you selling yet another scraping tool or are you freeing data to drive better decisions / save time / yada yada.

With the right frame, a sensible price point, and as much complexity abstracted away as possible, there may exist a business model - there seem to be many opportunities hiding in plain sight.

Will reach out soon for sure. Best of luck with Wanderlog


I tried your site and am curious why, for Ko Pha Ngan, there is only one recommended resource. Shouldn't there be more?

FYI, on my mobile device on Brave for iOS, entering the date in the calendar was janky: I had to click another text box to keep my date selection and make the calendar widget disappear so I could submit the form.


Very insightful comments!


Curious, what comparison are you making with Plaid here?


Plaid, Yodlee, and others abstract away extracting data from various banks and financial services providers, so they're providing a solution built on top of the same data extraction techniques that this tool uses


Oh, interesting. I thought they just provided secure authentication to an app’s end users’ bank accounts for things like payments (an alternative to someone like PayPal doing two microtransactions, then having you confirm the amounts as a way of validating it’s your account). It’s not like Plaid is scraping financial data though, right?


Scraping is incredibly common with banking apps like that, because many banks do not have APIs (and are only changing slowly).


Hey HN, I posted this in a comment thread the other day and (to my surprise) it got a positive reception so added a few more updates and decided to post it proper.

The idea is to be able to choose a website, select the data you want, and make it available (as JSON, CSV or an API) with as little friction as possible.

Kimono was the gold standard for a while, so I did yoink some of their ideas while doing other things differently.

Still needs some work, but as an MVP I'd appreciate any feedback. Cheers.
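For readers wondering what the output side of a tool like this looks like, here's a rough sketch of consuming a recipe's JSON endpoint. The URL scheme and response shape are hypothetical, not the actual API:

```python
import json
import urllib.request

# Hypothetical endpoint for a saved recipe -- the real URL scheme may differ.
RECIPE_URL = "https://api.example.com/recipes/hn-front-page/run?format=json"

def fetch_recipe_results(url):
    """Fetch a recipe's results and return them as a list of records."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

# The general shape you'd expect back: one object per selected element.
sample_response = '[{"title": "Show HN: ...", "points": "332"}]'
records = json.loads(sample_response)
print(records[0]["title"])
```

The appeal of this kind of tool is that everything before that JSON response is point-and-click.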


>would appreciate any feedback

Any option for a firefox build?


Yes, working on it now.


also will try when on FF


When I saw this service last week, I think you had a section about a paid service where you do the scraping on a server and send the results. Do you offer that? How do you get around anti-scraping technology, if it exists?


Yeah, that's offered although it's currently free.

No particular tricks to avoid detection. It's Puppeteer under the hood with a few customizations which works well on the majority of sites tested so far.

Given the cat-and-mouse game around web scraping you may never cover every website, and that's ok.


Unrelated question. There's a "Made by Lanes" badge. What was made by lanes.io though? The web page?


Why is one page scrape 2 credits? Why not just 1?


I don't feel it is right to describe it as "turns a website into an API"; rather, it "gives scraped data through an API".

"Turn website into an API", for me, evokes the image that I can automate (say) placing an order in Amazon as an API, or paying my bills automatically. It includes scraping, of course, but requires a lot more (mechanize/twill/selenium/phantom/etc power).

There was a company called Orsus that did exactly that. Last I heard about them it was the year 2000.


I like the idea, but I was skeptical as to how well it works, and I noticed that the video on the main page of your website, which scans coinmarketcap, seems to be wrong. It gets 200 cryptocurrency names but only 100 prices, which means only the first result is correct.

I have a similar idea that I'm working on, your site is definitely bookmarked and will try the extension later.


Good catch, uberswe. Was an older video and I flubbed the selection process - here it is working correctly: https://www.kapwing.com/videos/5dbc3e33ee4d0f00136d01e6


Hi, is it the Chrome extension that does the whole work, or is there a separate background task on your side that actually runs those recipes?


Hi, it's both. There's 'local' scraping, where the results of what you selected are ready to download as soon as you click results - no signup or server needed.

And then anything that's saved as a recipe runs in the cloud.


Nice, that looks much better!


Also interesting that this main example is also a violation of coinmarketcap's terms. They have a paid API.


If I use my pen and notebook to write down all those values, am I also in violation of those terms?

If they don't want their data to be scraped, it is up to them to secure it.


The argument you're making here is 'I don't believe in copyright'. Which is fine, but doesn't really negate my point.


It's a moot point. I very much believe in copyright, but you can't just put info in the public domain and yell, "Take a look but don't remember/retain it" in the name of copyright. If I redistribute it or reuse it for commercial purposes without your consent, then maybe there is a case. But if I am just scraping it, i.e. remembering it... Come on now.

Otherwise everyone who gets the lyrics to copyrighted songs, or memorizes them and sings them in the shower, is also in violation of copyright. Which would reduce the whole copyright thing to ridiculousness.


All I said is that it's against their terms of use. I didn't try to make a point about whether it should be or not. If you are curious about it, and whether using pen and paper is allowed, take a look at them.


I think so too. From their terms [1]

> You agree that you will not:

> Copy, modify or create derivative works of the Service or any Content;

> Copy, manipulate or aggregate any Content (including data) for the purpose of making it available to any third party; Trade, sell, rent, loan, lease or license any Content or access to the Service, whether commercially or free of charge;

> Use or introduce to the Service any data mining, crawling, "scraping", robot or similar automated or data gathering or extraction method, or manually access, acquire, monitor or copy any portion of the Service, or download or store Content (unless expressly authorized by CMC).

[1]: https://coinmarketcap.com/terms/


Then use data from a free API without any TOS, which also has more data, like separate bid and ask prices:

http://cmplot.com/api.json


What is it about this service as a business model that prevents it from taking off? I’ve known at least two YC startups that tried to build businesses around this idea.

I think one or both were acquired and immediately shut down, but I’m not 100% sure about that.


I'm the founder of parsehub.

We are doing well and are independently owned.

I think there are 3 things that contribute to this:

1. It is very easy to make a prototype that looks "magical" but very hard to build something that works in real applications. There are an enormous number of quirks that a browser allows, and each site you encounter will use a different set of those quirks. Sites also tend to be unreliable, so whatever you build has to be very resistant to errors.

2. There is a technological wall that every company in this space reaches where it is not yet possible to mass-specialize for different websites. So even if you're able to build a tool that works very well on any individual website, the technology is not there yet to be able to generalize the instructions across websites in the same category. So if a customer wants to scrape 1000 websites, they still have to build custom instructions for each website (5-10x reduction in labor vs scripting) when what they really want/is economically viable for them is to build a single set of instructions that will work for all similar websites (10000x reduction in labor vs scripting). This is something that we're working on for the next version of parsehub, but is still a couple years away from launch.

3. Many of the YC startups you hear about have raised funding from investors and have short term pressures to exit.

The combination of the three makes it very tempting to give up and sell.


#2 is what would transform this from a nice niche tool to something very valuable. In the ecommerce space, tracking competitor pricing is a great example of this type of thing. I can also see use cases for recipes, finance, healthcare, you name it. Those b2b use cases are worth real money.

Just curious, in your experimentation, have you found it necessary to train a new model for each "category"? Or have you found a way to generalize it?


Training a new model for each category is already possible today, but doesn't achieve the goal (mass-specialization).

The problem is that when you pre-train a model, you can only solve for the lowest common denominator of what every customer might want.

In ecommerce, for example, you might pre-train to get price, product name, reviews, and a few other things that are general to all ecommerce. But you won't pre-train it to get the mAh rating of batteries, because that's not common to the vast majority of customers (even within ecommerce). It turns out that most customers need at least a few of these long-tail properties that are different than what almost every other customer wants, even if most of the properties they need are common.

And so the challenge is to dynamically train a model that generalizes to all "battery sites" based on the (very limited) input from a customer making a few clicks on a single "battery site".


I worked on this for a long time -

1. it's possible to make it "easy to switch" by having common building blocks and only changing the "selector" across sites - lots of companies in the space do this

2. it's impossible to do "just DOM" or "just vision/text" if you want to be able to generalize "get the price of the items"

- DOM doesn't represent spatial positioning very well (see: fixed/absolute positioning, IDs and DOM changing without the visuals changing, ...) so you'd need the equivalent of an entire browser rendering engine in your "model" anyway!

- vision/text is messed up by random marketing popups (see: medium, amazon, walmart, ...), it's significantly more computationally expensive to do, and can't currently get >95% accuracy (which makes it useless, scraping needs very close to 100% accuracy in most use cases)
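To make point 1 above concrete, here's a minimal stdlib-only sketch of the "common building blocks, per-site selector" pattern. The site names and class names are invented, and simple class matching stands in for a real CSS selector engine:

```python
from html.parser import HTMLParser

class ClassTextExtractor(HTMLParser):
    """Collect the text of leaf elements carrying a given class
    (assumes matching elements are not nested)."""
    def __init__(self, cls):
        super().__init__()
        self.cls = cls
        self.capturing = False
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.cls in classes:
            self.capturing = True
            self.results.append("")

    def handle_endtag(self, tag):
        self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.results[-1] += data

def extract_prices(site, html):
    # The shared "building block" never changes; only the selector does.
    selectors = {"shop-a.example": "price", "shop-b.example": "product-cost"}
    parser = ClassTextExtractor(selectors[site])
    parser.feed(html)
    return parser.results

html_a = '<div><span class="price">$9.99</span><span class="price">$4.50</span></div>'
print(extract_prices("shop-a.example", html_a))  # ['$9.99', '$4.50']
```

The hard part the parent describes is exactly what this sketch dodges: finding those per-site selectors automatically instead of by hand.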


> So if a customer wants to scrape 1000 websites, they still have to build custom instructions for each website...

Can't this be crowdsourced in some way? Having each individual entity reinvent the same wheel feels like the main problem to me. What if there was a marketplace? The ability to buy / trade / sell? Maybe subscription based in some way?

If I wanted to scrape 100 sites, it might be worth $1 per year per site. Those who put in the time make money. Those who don't have the time would pay.

This isn't a technology issue per se. It's scaling a solution to the final gap the technology can't cover. A different kind of mechanical turk?


Crowdsourcing works in cases where lots of customers are interested in the same set of attributes to extract.

But by definition, customers interested in long-tail attributes (i.e. virtually all of them) don't have others to source those from.


Yes. But there might be some who would not be interested but still do it for minimal pay.

It would also lower the barrier to entry and thus increase the size of the market. Imagine if the first X sites I tried all needed more work. I'd likely quit. But if that didn't happen, I'd more likely continue.

Crowdsourcing isn't The Answer. But it's certainly a better step in the right direction.


Yes, it can! See https://apify.com/marketplace

Disclaimer: I'm a co-founder of Apify :)


Possibly a mix of use cases, maintainability, and economics. We used to scrape economic indicators data at a fintech startup and monetize it - every slight change to a website caused issues with the data feeds. It was a huge nightmare to maintain. Scraping any website is quite generic and doesn't really speak to a specific audience with a specific need. But more importantly, having been in the data and analytics industry for years: data has far lower margins than insights and recommendations. The market is willing to pay a crazy premium (look at how much all the consultants are billed out for) to get insights and recommendations. Data itself isn't inherently valuable to most companies.


Repeatedly being acquired to be immediately shut down sounds like quite a good business model, if your goal is to be paid.

I wonder what other kinds of products and services would be good for that model. In other words, would tend to be acquired for good money in order to stop them.


Acquired by who?


Presumably, a company that wants your product to be shut down.

Potentially apocryphal example: I've heard of a certain FPGA company that bought a startup which produced FPGA compilation tools that could target multiple vendors' devices, in order to stop multi-vendor tools from existing because it made switching vendors too easy.


I would guess:

1. Narrow target

Your market is people who need scraped data to input into some kind of app/program/code, but don't have the resources/skills/time to use scrapy or whatever.

2. Sensitive to configuration

This is also the problem with visual code and ML apps, but even a small issue with the source you are scraping from -- say, a captcha, a login, or some weird format or CSS you did not anticipate -- makes it almost useless, whereas if you were coding up a solution you can (usually, not always) deal with it more easily.

Those are the reasons they shut down.

The reasons why they launch:

1. Many developers have this need

Many developers have built scrapers internally and then used them, so a lot of people have worked on this problem.

What follows from this is that they can productize it, see that other people have the need, imagine the market etc.


I applied to YC with an idea like this and was rejected. 12 times. Maybe it's not the idea. Maybe it's me. Or maybe it's YC.


I don’t know anything about your case, but the general rule is that ideas are worthless; it’s the execution that matters.



Hi, is it possible to make it compatible with firefox?


Sure, in fact I'll do it this weekend.


I'm also interested in this. I no longer use Chrome due to its pervasive surveillance and telemetry.


Maybe a better business model is to offer this as a service to site owners who are not tech savvy. Site owners then have the ability to offer an API to new customers making it a win / win. Site owners can now offer an API (free or paid), and API consumer can rely on getting data in the future.


I just gave this a shot on the ISO website to get a list of country codes[1], but it seems the selection algorithm breaks down when there's no specific classes applied to elements, as every td.v-grid-cell is selected, which is all of them, instead of the values of the alpha2 column for example.

This seems hard to solve entirely programmatically, maybe having a way to be more specific by providing a selector yourself or selecting multiple entries and having the plugin figure it out could add a lot of utility in such cases.

[1] - https://www.iso.org/obp/ui/#search/code/
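For cases like this, column position is often the only distinguishing signal. Here's a rough stdlib-only sketch of what "be more specific" could mean in practice - extracting the nth `td` of each row rather than relying on classes. This is illustrative, not the extension's actual algorithm:

```python
from html.parser import HTMLParser

class ColumnExtractor(HTMLParser):
    """Pull the text of one column (0-indexed) out of an HTML table."""
    def __init__(self, col):
        super().__init__()
        self.col = col
        self.cell = -1        # index of the current cell within its row
        self.grab = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.cell = -1    # new row: reset the cell counter
        elif tag == "td":
            self.cell += 1
            if self.cell == self.col:
                self.grab = True
                self.values.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.grab = False

    def handle_data(self, data):
        if self.grab:
            self.values[-1] += data

html = ("<table>"
        "<tr><td>Belgium</td><td>BE</td><td>BEL</td></tr>"
        "<tr><td>France</td><td>FR</td><td>FRA</td></tr>"
        "</table>")
p = ColumnExtractor(1)   # e.g. an alpha-2 code column
p.feed(html)
print(p.values)          # ['BE', 'FR']
```

Letting users pick "same column as my selection" instead of "same class as my selection" would cover tables like the ISO one.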


I believe this could be a good solution to turn legacy software into an API. The “generated code” should be a reverse proxy, not a scraping lib.

Also, scraping a website to use/copy its data is illegal in my country (Belgium). I’m not sure this tool itself would be.


Nothing can stop it. Lots of Belgian sites are scraped every day across the world.


Is there a reason this doesn't spit out some python or JavaScript code to scrape the same info out?

This just seems to add another dependency to whatever I'm developing. Plus, it sends data through a server I don't control. (I assume)
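For what it's worth, here's a sketch of what "spitting out code" could look like: the tool exports the recipe as plain data, and a generator emits a standalone script skeleton from it, removing the server dependency. Every name and shape here is hypothetical, not Simplescraper's actual format:

```python
import json

# A saved "recipe" exported as plain data. The shape is invented --
# just an illustration of portable, server-free output.
recipe = json.loads("""
{
  "url": "https://news.ycombinator.com/",
  "properties": [
    {"name": "title", "selector": ".storylink"},
    {"name": "score", "selector": ".score"}
  ]
}
""")

def generate_scraper(recipe):
    """Emit the skeleton of a standalone Python scraper for a recipe (sketch)."""
    selectors = {p["name"]: p["selector"] for p in recipe["properties"]}
    lines = [
        "import urllib.request",
        "from html.parser import HTMLParser",
        "",
        f"URL = {recipe['url']!r}",
        f"SELECTORS = {selectors!r}",
        "# ...parsing boilerplate would follow; no third-party server involved",
    ]
    return "\n".join(lines)

print(generate_scraper(recipe))
```

Whether the vendor would want to make itself that easy to cut out is, of course, the business question.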


Did you read the website? It says "Scrape locally or create recipes that run quickly in the cloud."

Also, what use would it be for a website to spit out essentially the same python/js script over and over?


I must have skimmed past that. Whoops. I avoided trying it out because it's not available on Firefox, so I couldn't correct my assumption by testing it. Also, I couldn't easily find a copy of the extension source, so I gave up.

The site/extension basically has to do that each time it scrapes locally (or use a generic parametrised scraper). If you wanted to use it in an API, my impression is that you either run it in Chrome as an extension you need to get from the Chrome store, or tunnel your data through a third-party server. Is that wrong?

Can you scrape data locally without running Chrome/the extension? I can't tell from reading the site, sorry. (If it's actually there, please link an anchor tag to it or something.)


I like this.

Please consider adding the ability to script clicks on elements, e.g. buttons.

I manage a site where we load a subset of articles on initial page load and then have a "Load more" button that executes Javascript to load another batch of articles. Getting a list of articles from our CMS is a bit of a hassle so being able to scrape it easily instead would be ideal.


Hey, right now you can select a Pagination element that the app will use to load the next page / new data.

If the site's publicly accessible and you're able to share, send the details to mike @ simplescraper.io and I'll get this working for you.


Does this work with authenticated pages?


Yes - you're able to save data behind a login using the point and click functionality as it extracts whatever data is loaded in your browser ("local scraping").

And no - if you choose to also create a cloud recipe that runs on the server, the remote browser instance won't be able to access data behind a login.

It's possible but I'd rather not store third-party credentials for the time being.


It doesn't look like it. I got an error trying to scrape my HN upvotes url.


This is super cool. I really enjoyed and missed the kimono workflow. Automating something like this with browserless.io would be really fun (I run that project). Extensions is one of the things we’re looking to support.

Anyways give me an email at joel at browserless dot io if you ever want to chat


Cheers Joel. I have most of your blog posts on Puppeteer bookmarked - super helpful and well written.

For sure, once the app is a notch more tried and tested I'll get in touch. Appreciate it.


Awesome! One question I have after reading the page is - what are the pricing plans concerning credits? (for automated scraping)


Right now it's free and will be until it's stable. Starting price will be about $25 for 4000 scraping credits, 200k API calls and data storage.

This will likely change as I have more stats and feedback on usage and expenses. But the goal is to offer a price point that's fair and low relative to other options.
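Working the stated numbers through as a back-of-the-envelope calculation (assuming the two-credits-per-page rate asked about elsewhere in the thread, and pricing that may well change):

```python
# Rough cost per scraped page under the stated starting plan.
price_usd = 25
credits = 4000
credits_per_page = 2      # assumed rate, per the question elsewhere in the thread

pages = credits // credits_per_page
print(pages)              # 2000 pages per plan
print(price_usd / pages)  # 0.0125 USD per page
```

A little over a cent per page, which does sit well below most hosted-scraping pricing.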


Kimono was cool, nice to see another option. I still have a Kimono t-shirt in a drawer somewhere.


> Kimono t-shirt

Hmm, definite missed merch opportunity there.


How do I use the 'pagination' feature? The help guide doesn't even mention it.


Hey, yes the guide still needs work. Here's what you gotta do:

- Click the pagination icon and then click the pagination element (usually 'Next' or an arrow). The icon will turn green

- Click 'view results' and then choose to save the recipe

- Select the number of pages you'd like to scrape

- Run your recipe and it will scrape those pages


Looks good, could this be integrated into n8n.io to be used to drive a workflow?


Firefox add-on, please.


If you can add an RSS feed response, that would be great.


If you need data from a website that updates on a regular basis there’s a recent Show HN I’ve seen that does exactly this https://news.ycombinator.com/item?id=21398524


OT: or just use Puppeteer - not really hard, free, and you can rule the world.



