
Site scraping is definitely my biggest struggle - it's the primary reason it took me four months to get this seemingly simple app ready to post here. I now have a pretty good system in place. I keep a master list of scraping rules that I can update without needing to release new Fraidycat versions. I also have an update coming that will let me scrape at different stages of the rendering process and scrape external files that the rendered page relies on. (This will be used for TikTok support, for instance.)
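A minimal sketch of how a remotely-updatable rules list like that might work: the rules live in a JSON file the app fetches at runtime, keyed by hostname, so new sites can be supported without a release. The file format, field names, and `rule_for` helper here are all hypothetical - just an illustration of the idea, not Fraidycat's actual format.

```python
import json
from urllib.parse import urlparse

# Hypothetical rules file, fetched from a remote URL at runtime so it can be
# updated without shipping a new extension release. Keys are hostnames;
# values name an extraction strategy and a selector for the post list.
RULES_JSON = """
{
  "twitter.com":  {"strategy": "rendered", "selector": "article"},
  "example.blog": {"strategy": "html",     "selector": ".post h2 a"}
}
"""

def rule_for(url, rules):
    """Look up the scraping rule for a URL's hostname, if any."""
    host = urlparse(url).hostname or ""
    # Walk up to the parent domain so subdomains match too.
    while host:
        if host in rules:
            return rules[host]
        _, _, host = host.partition(".")
    return None

rules = json.loads(RULES_JSON)
```

With this shape, `rule_for("https://mobile.twitter.com/a", rules)` falls back from the subdomain to the `twitter.com` entry, and unknown sites return `None` so the app can fall back to generic feed detection.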

I realize this could be a bit of an arms race, but I don't think it has to be that way. Fraidycat doesn't syndicate the content - it encourages people to visit the actual site. So I believe a platform benefits from integrating well with it. Thanks for checking it out!




In case scraping doesn't work for a certain link, you could have a "limited" update feature: download the site's HTML, compress and hash it, and store it locally; each update cycle, download it again, hash it, and compare to the local copy. If it has changed, simply light it up in the UI. For me, just seeing that there's something new on a site I'm tracking is enough information - I can visit it and check out the new article myself.
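The hash-and-compare idea above can be sketched in a few lines. (One small simplification: compressing before hashing doesn't change whether a change is detected, so this sketch hashes the raw bytes directly. The `has_changed` helper and its store format are illustrative, not any existing API.)

```python
import hashlib

def page_fingerprint(html: str) -> str:
    """Hash the page so a changed hash flags 'something new here'."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def has_changed(html: str, store: dict, url: str) -> bool:
    """Compare the fresh fingerprint to the stored one, then update the store.

    The first fetch of a URL counts as changed, which conveniently lights
    the site up once when it is first followed.
    """
    new = page_fingerprint(html)
    changed = store.get(url) != new
    store[url] = new
    return changed
```

Usage: with `store = {}`, the first call for a URL returns `True`, a repeat fetch of identical HTML returns `False`, and any byte-level difference flips it back to `True`.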

Of course, false positives are a downside - someone fixing a typo shouldn't count as an update. I'm sure the community can come up with settings for the "update sensitivity", where level 0 requires at least a new tag to appear on the page, level 1 requires a change of at least N characters, and level 2 notifies on even a single-character change.
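Those sensitivity levels could be implemented with a plain diff over the two page versions. This is only a sketch of the levels as described in the comment - the thresholds, the tag-counting heuristic for level 0, and the function names are all made up for illustration.

```python
import difflib
import re

def change_magnitude(old: str, new: str) -> int:
    """Count how many characters differ between two page versions."""
    sm = difflib.SequenceMatcher(None, old, new)
    return sum(max(i2 - i1, j2 - j1)
               for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal")

def should_notify(old: str, new: str, level: int, n: int = 50) -> bool:
    """Hypothetical sensitivity levels:
       0 = require a new HTML tag, 1 = at least n changed chars, 2 = any change."""
    if old == new:
        return False
    if level == 2:
        return True
    if level == 1:
        return change_magnitude(old, new) >= n
    # Level 0: only fire if new opening tags appeared (a crude proxy for
    # "a new element, e.g. a new article, was added to the page").
    count_tags = lambda s: len(re.findall(r"<\s*([a-zA-Z][\w-]*)", s))
    return count_tags(new) > count_tags(old)
```

At level 0, a typo fix inside existing text changes no tag count and stays quiet, while a newly appended `<article>` block trips the notification.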

I love this extension already and am willing to help out with PRs :-)



