The struggles of building a feed reader (jackevansevo.github.io)
129 points by Jackevansevo on Oct 8, 2022 | 49 comments



> Including an ETag or Last-Modified header in the body of a request when fetching a feed is a mechanism to tell the server to only return new/modified entries/items (aka: a changeset) since a specific date.

That's not right, is it? The headers are defined at the HTTP level, and with caching layers in between, the endpoint is free to return the current feed - with both new and seen entries. Are there many servers optimising that to a shorter feed?

At the very least, static blogs will not filter the entries - they're serving / not serving the same file, regardless of etag.
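
For reference, the actual mechanism is a conditional GET: the client sends the validators back and the server either replies 304 or returns the full current feed. A minimal sketch in Python (requests library; the feed URL is a placeholder):

    import requests  # third-party: pip install requests

    FEED = "https://example.com/feed.xml"  # placeholder URL
    etag, last_modified = None, None  # persisted between polls in a real reader

    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified

    resp = requests.get(FEED, headers=headers)
    if resp.status_code == 304:
        pass  # unchanged since the last poll; nothing to parse
    else:
        etag = resp.headers.get("ETag")
        last_modified = resp.headers.get("Last-Modified")
        # resp.content is the *full* current feed, not a changeset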


Un-duplicating RSS items is hard. Timestamps can't be trusted. IDs can't be trusted. Some sources will resend the same item with a different timestamp. RSS servers behind a load balancer may return different IDs and timestamps for the same items. I had to compute a hash of each item to reliably remove duplicates.

Here's an un-duplicator in a feed reader I wrote back in 2009.[1] This is used for printing RSS feeds on antique teletype machines. Reliable duplicate removal is essential when printing at 5 characters per second.

[1] https://github.com/John-Nagle/baudotrss/blob/master/messager...
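
The general idea, as a minimal sketch (not the linked code; the field names are illustrative): hash normalized item content and skip anything already seen.

    import hashlib

    seen = set()  # persisted to disk in a real reader

    def is_duplicate(item):
        # Ignore untrustworthy IDs and timestamps; hash the content itself.
        key = (item.get("title", "") + "\n" + item.get("description", "")).strip().lower()
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        if digest in seen:
            return True
        seen.add(digest)
        return False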


True, I realized this when developing the initial version of Zebra that I posted a few days back. However, relying on a SQL database with a unique constraint on the URL turned out to be the easiest and most effective solution.

https://play.google.com/store/apps/details?id=thorio.solutio...
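
A minimal sketch of that unique-URL approach with SQLite (the schema is made up):

    import sqlite3

    db = sqlite3.connect("feeds.db")
    db.execute("CREATE TABLE IF NOT EXISTS items (url TEXT PRIMARY KEY, title TEXT)")

    def store(url, title):
        # The PRIMARY KEY constraint silently rejects already-seen URLs.
        cur = db.execute("INSERT OR IGNORE INTO items (url, title) VALUES (?, ?)",
                         (url, title))
        db.commit()
        return cur.rowcount == 1  # True only for genuinely new items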

Beyond that, you could probably cover the last 2 or so percent using string comparison against title and description, or by puppeteering the website.

Building Zebra as a social feed reader was a great learning experience: for example, a lot of sites circulate their content multiple times in different packages (/tiles), and very few flag paywalled content - still working on recognizing that. Any hints for a good way to distinguish that when inspecting the URLs?


I was retrieving text from news sites, so URLs were not that relevant.

Some news services will re-issue a story with more information, keeping the same title and description. A full text check is necessary. I computed a secure hash of the text and compared that.


There are two extremes I know of here: same/similar title with changing content (we struck gold in SEO, let's keep updating this "10 best foos for baring" page), and changing title with the same content (anyone doing serious A/B testing).


Both true. You could say the internet has way more covers than actual books :). Content is very much repackaged over and over again.

However, I found that URLs don't change as much as titles and slightly edited texts do - it happens, of course, but to go beyond that you would need a similarity hash of the actual content of the page, and even that reaches its limits pretty quickly:

Sometimes a change of title plus a few edits can change the whole narrative of a near-identical article. It looks like even AI currently can't solve that. And interestingly, I've seen that happen even at pretty large newspapers - "whatever clicks"...
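
For what it's worth, a crude fuzzy check along those lines, using Python's difflib as a stand-in for a real similarity hash (e.g. simhash):

    import difflib

    def near_duplicate(a, b, threshold=0.9):
        # Ratio of matching subsequences; slow on long texts, and exactly
        # as fragile as described above once edits shift the narrative.
        return difflib.SequenceMatcher(None, a, b).ratio() >= threshold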

Since Zebra is meant to enable debate and the exchange of perspectives on specific content, I think only URLs give some degree of certainty about pointing to the same content.

Happy to hear other ideas!


> Are there many servers optimising that to a shorter feed?

That's certainly not the official semantics for these headers: they're validators, so that the server can tell the client nothing changed (304). I would assume overriding these for some sort of pagination would also hinder intermediate caches, though I guess that has become less of an issue now that HTTPS is everywhere - but edge caches performing HTTPS termination might still take this information into account?

As far as HTTP caching semantics are concerned, the new version of the resource would replace the old one, and new clients would be served the latest cached version, truncated.

In fact the request headers make that very clear, as they're called, respectively, If-None-Match and If-Modified-Since.

Incidentally there are also If-Match and If-Unmodified-Since headers (for POST, PUT, DELETE), but I don't know if anyone actually uses them in the wild. IIRC they were intended for "transactional" update guarantees: you'd fetch a resource, then PUT to it with If-Match and/or If-Unmodified-Since, and you'd get a 412 (Precondition Failed) if the resource had been modified in the meantime.
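
A minimal sketch of that optimistic-locking dance in Python (requests library; the endpoint is a placeholder):

    import requests  # third-party: pip install requests

    url = "https://example.com/resource"  # placeholder endpoint
    resp = requests.get(url)
    etag = resp.headers["ETag"]

    # Apply the update only if nobody modified the resource since our GET.
    update = requests.put(url, json={"title": "edited"},
                          headers={"If-Match": etag})
    if update.status_code == 412:  # Precondition Failed
        print("conflict: resource changed since we read it")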


Yeah, the whole post is riddled with little technical misunderstandings. It’s nice to see someone working things out in the open though.


Author of the post, could you clarify?


I gave up around paragraph four; the article isn't a technical article, it's a touchy-feely people story about an old, blind man with a very white beard.


All three of your last comments have been about the wrong article.


I'm not sure I quite agree with this assessment, but thanks.


It's a tad confusing. I believe you're totally correct; this is how those headers should behave.

> The headers are defined at the HTTP level, and with caching layers in between, the endpoint is free to return the current feed - with both new and seen entries.

A lot of feeds behave this way, especially if it's just a static blog.

> Are there many servers optimising that to a shorter feed?

I've come across a fair few that behave with the semantics I described.


Feed readers are my learning project; I use them to learn new languages. I've built and rebuilt readers in VBScript, VB.NET, C#, PHP, and Python - PHP and Python have been the easiest since they have good parser libraries. I've also used SQL Server, MySQL, SQLite, and just JSON flat files. I think I've built something like 10 or so variations. In the last few I've expanded to pull not only from RSS but also from Hacker News, Twitter, and an enhanced pull for Reddit feeds. Though I'm not pulling Twitter currently because of some API changes that I haven't bothered to spend time on.

Helpful hint: if you need favicons for your reader, you can use Google.

https://www.google.com/s2/favicons?domain=techmeme.com

The above is a load balancer for the URL below, where the t1 subdomain may change to t[1-9]; the direct URL also allows you to change the image size.

https://t1.gstatic.com/faviconV2?client=SOCIAL&type=FAVICON&...

I use it to grab and store the icons at sizes 16, 32, 48, and 64, with a monthly update ping.
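
The s2 endpoint also takes a sz parameter (as far as I can tell), so grabbing those sizes looks roughly like this in Python (requests library; error handling omitted):

    import requests  # third-party: pip install requests

    def fetch_favicons(domain, sizes=(16, 32, 48, 64)):
        icons = {}
        for sz in sizes:
            url = f"https://www.google.com/s2/favicons?domain={domain}&sz={sz}"
            icons[sz] = requests.get(url).content  # PNG bytes
        return icons

    icons = fetch_favicons("techmeme.com")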

My current iteration is built in Python with a MySQL backend. It's set up in a river-of-news style, with an everything river and one for each feed, and I generate topic bundles also. The feed engine runs every 15 minutes, grabbing 40 feeds at a time, but the static site generator only runs every 6 hours to keep me from spending all my time reading news. Since I pull in Reddit feeds, I've found it's great for feed discovery.


Just discovered this favicon "trick" the other day in RSS-Bridge code:

https://github.com/RSS-Bridge/rss-bridge/blob/5e664d9b2b0cb0...


I ran into a number of finicky issues building siftrss[1] a few years back. One I toiled over quite a bit was the discovery that Feedly, a very popular feed reader, does not support gzip. I haven't checked in recent years, but they may still not.

It's frustrating when you're forced to change the behavior of your "agnostic" application for the sake of a large, commonly-used third party tool in the ecosystem.

[1] https://siftrss.com/


I don't understand how Feedly is the issue here. If the client doesn't say it accepts gzip encoding, why are you sending gzip-encoded content? It would be slightly weird if the Feedly client doesn't ask for gzip, but this is standard HTTP content negotiation.

If an HTTP server is ignoring the Accept-Encoding header and choosing to serve a Content-Encoding that the client can't accept, that is the problem here. If the server and client can't come to an agreement, isn't that the purpose of HTTP 406? But, being able to serve both gzip'd and plain text versions of an XML file doesn't seem that crazy.
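
In other words, the server should branch on the Accept-Encoding request header. A minimal sketch of that negotiation:

    import gzip

    def encode_body(body, accept_encoding):
        # Compress only when the client explicitly advertises gzip support.
        if "gzip" in (accept_encoding or "").lower():
            return gzip.compress(body), {"Content-Encoding": "gzip"}
        return body, {}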


I'm fuzzy on the details as it's been 5+ years since I looked at it, but it wasn't as simple as that. I think it may have been that it worked over HTTP but not HTTPS, and/or they did say they accepted it but it broke under some circumstances.


Where's your own RSS icon then!? ;)

All jokes aside, you just described literally all the points I encountered while developing the built-in feed reader for HeyHomepage.com. Good summary!

One thing I notice a lot of people say - like you - is "forcing users to link through to read the article on the original site (semi defeating the point of subscribing via feedreader)".

I don't really agree, and my own approach focuses explicitly on sending visitors to the original site. I only show the snippet, even when the full content is in the feed. Imagine you did your best on your website and made it nice and shiny - you want people to actually see the site. The original site usually contains more content, like a photo or image, which might also be useful for visitors. Besides, I want webmasters to know I was there by showing up in their visitor statistics (I attach a '&rss_ref=heyhomepage.com' to the end of the link to the original site).

I'm not saying one way is good and the other bad - there are valid reasons for seeing a feed reader more as an aggregator - but I wanted to point out there are valid reasons for doing the opposite as well.


His homepage has a `link rel="alternate"` meta and that's all that matters.

The original link should be preserved, but I disagree with you on appending '&rss_ref=heyhomepage.com' to the links - it is still tracking. Besides, server-side feed aggregators have a valid reason to cache the feed and canonicalize the item URL to avoid dupes.


An RSS icon has an important signaling function, if you ask me. Automatic discovery is good, but why not also have a textual link or icon pointing to your feed!?

The original link to the feed is preserved for my users to click on. My system pointing the user to someone else's website, accompanied by a GET variable containing the system's own URL, is not tracking. The end website can also learn that info from the HTTP referrer. It's a very crude implementation of a webmention, in a sense - because it's not necessarily about the linking website, but about telling someone their RSS feed is in use!


> but why not also have a textual link or icon pointing to your feed!?

Sure. I was only providing my perspective - the perspective of an RSS feed aggregator writer. To me, automatic discovery is more important than an icon.

I am not sure you and I are talking about the same thing. I am against appending a random query string to an otherwise perfectly fine URL, because it adds to the burden of a feed aggregator that wants to de-duplicate links gathered from various sources. Of course you are free to append anything to the URLs on your own website for any purpose.
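
A minimal sketch of the kind of canonicalization an aggregator ends up doing (the tracking-parameter list is illustrative):

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    TRACKING = {"utm_source", "utm_medium", "utm_campaign", "rss_ref"}

    def canonicalize(url):
        parts = urlsplit(url)
        query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                 if k not in TRACKING]
        # Lowercase the host, strip tracking params, drop the fragment.
        return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                           urlencode(query), ""))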


> I am not sure you and I are talking about the same thing. I am against appending a random query string to an otherwise perfectly fine URL, because it adds to the burden of a feed aggregator that wants to de-duplicate links gathered from various sources.

The appended query string is only added to links in my system, not to any public feed. Otherwise, I agree with you.


> I attach a '&rss_ref=heyhomepage.com' to the end of the link to the original site

That is incredibly unsound (on a technical level, not merely in matters of taste).
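
Concretely: blindly prefixing '&' breaks URLs that have no query string yet (those need '?'), and on URLs with a fragment the parameter lands inside the fragment and never reaches the server. A safer version, as a rough sketch, rebuilds the URL instead:

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    def add_ref(url, ref="heyhomepage.com"):
        parts = urlsplit(url)
        query = parse_qsl(parts.query, keep_blank_values=True)
        query.append(("rss_ref", ref))
        # Re-assemble so '?' vs '&' and the fragment are handled correctly.
        return urlunsplit((parts.scheme, parts.netloc, parts.path,
                           urlencode(query), parts.fragment))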


I also attempted to build a feed reader a while back. In the process I built a feed discovery service:

https://discovery.thirdplace.no/?q=jackevansevo.github.io

It's not perfect, but it's better than simple parsing of <link> tags in the HTML.


Does your implementation there parse HTML and look for link tags or is it doing something else as well?

Edit: figured it out from https://discovery.thirdplace.no/about - it looks like it's using link tags but also has a big list of baked-in known patterns, e.g. these:

https://git.sr.ht/~thirdplace/feed-finder/tree/main/item/src...
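
A minimal sketch of that two-step approach, using only the Python standard library (the fallback path list is illustrative):

    from html.parser import HTMLParser

    FEED_TYPES = {"application/rss+xml", "application/atom+xml"}
    COMMON_PATHS = ["/feed", "/feed.xml", "/rss.xml", "/atom.xml", "/index.xml"]

    class FeedLinks(HTMLParser):
        def __init__(self):
            super().__init__()
            self.feeds = []

        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if (tag == "link" and "alternate" in (a.get("rel") or "").lower()
                    and (a.get("type") or "").lower() in FEED_TYPES and a.get("href")):
                self.feeds.append(a["href"])

    def discover(html):
        parser = FeedLinks()
        parser.feed(html)
        # Fall back to probing well-known paths when no <link> tags are found.
        return parser.feeds or COMMON_PATHS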


Nice app with a clear use case!


An annoying thing with RSS readers is that when a website implements some sort of "security feature", the reader might not be able to download any feeds. I had one occurrence where the feed reader was asked to complete a captcha to reach the content; being a "bot", it of course failed. Another time a website was blocking all traffic from abroad, so the reader just got access errors, as its server is located in another country.


This is a good list. I did this at a medium scale once (about 10,000 feeds that needed to be checked once per minute).

My favorite thing he mentioned is that various tags can have different meanings: published, updated, description, content, subtitle. To do this at scale you need some per-feed configuration to specify where each piece of information actually comes from. Does <published> mean published, or does it actually mean updated? Everyone does it differently.
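
Something like this per-feed mapping, as a minimal sketch using the Python feedparser library (the override table and feed URL are made up):

    import feedparser  # third-party: pip install feedparser

    # Hypothetical override: this feed's <published> actually means "updated".
    FIELD_OVERRIDES = {"https://example.com/feed.xml": {"published": "updated"}}

    def published_time(feed_url, entry):
        field = FIELD_OVERRIDES.get(feed_url, {}).get("published", "published")
        return (entry.get(field + "_parsed")
                or entry.get("updated_parsed")
                or entry.get("published_parsed"))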

And the etag thing. Yeah…

One thing he didn’t mention is media. I think the HN crowd really likes RSS because the mostly-text tech blogs they like to read all support it, and it seems to work fine. But a lot of the population likes to read content that has embedded images and videos. Even slideshows sometimes. There are RSS extensions for this, but they suck for all the same reasons.

At my company we ended up abandoning RSS and writing a customizable web scraper instead (ingesting HTML pages). It was actually a lot easier than dealing with RSS.


I haven't used a feed reader in a long time, but I had a brief period when I was obsessed with Fraidycat. Worth a look if you're interested in a different approach to keeping up with people.

https://fraidyc.at/


FYI: another feed reader I built (called Pluto, with SQLite as feed/data storage) - see https://github.com/feedreader - used by OpenStreetMap Blogs, Planet KDE, and others.

PS: For the (ongoing) struggle of trying to "normalize" the RSS and Atom feed formats (or JSON Feed), see the feedparser gem - https://github.com/rubycocos/feedparser


Been there, done that. A lot of feeds - I mean 99%+ - have subtle bugs in the metadata that could easily be fixed, making feed reader writers' lives easier and broadening your readership. There are RSS validators; please make use of them. I have a lint tool for your blog that cross-checks metadata from the feed against metadata from the post:

https://roastidio.us/lint


A long time back I had a go at this too, but reimplementing ttrss's API instead of writing my own frontend: https://github.com/nvtrss/nvtrss

I learnt a lot. My goal was getting something working that the ttrss Android app would connect to, and I reasonably succeeded there, running it for a few years.

I went back to hosting the full ttrss application at some point.


This is great! I recognize a lot of the challenges I ran into (or decided to ignore!) when building the reader for https://havenweb.org . I had a particular chuckle at "#just for sorting", remembering feeds that kept bumping themselves to the top of my reader!


Hey, you're in my OPML list of shared links: https://www.heyhomepage.com/?module=timeline&view=sharedlist


What I am missing is a robust solution for keeping my feeds (blogs, podcasts, etc.) in sync across multiple devices, using a standardised protocol that enables the use of many different clients on any platform.

There have been some attempts at tackling this problem, but none have managed to get it right and become truly universal, as far as I know.


This is a great read and I will be sure to use this when a project I have in mind needs to parse a variety of feeds. So far the default .NET SyndicationFeed class works well though.

Always wished RSS/Atom had a dedicated field for images. Why didn't they? Currently it always seems to involve some inline HTML in a CDATA element. Pretty gross.


There's "enclosure" for RSS. And Atom can have "<link rel='enclosure'>".

How I parse enclosures in my own timeline you can see here: https://www.heyhomepage.com/?module=timeline&post=4 (also with a nice link to the original source)
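
For what it's worth, the Python feedparser library normalizes both forms into entry.enclosures, so reading them looks roughly like this (the feed URL is a placeholder):

    import feedparser  # third-party: pip install feedparser

    d = feedparser.parse("https://example.com/feed.xml")  # placeholder URL
    for entry in d.entries:
        # Populated by RSS <enclosure> and Atom <link rel="enclosure"> alike.
        for enc in entry.get("enclosures", []):
            print(entry.title, enc.get("href"), enc.get("type"))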


I did try that, but none of the RSS apps I tried displayed the image.


That's a shame, because I think images - and maybe even short videos - can make RSS-based timelines much richer in content.


I worked on a feed reader back in 2006. The worst feed-discovery kluge I can recall needing to special-case was that easily the most popular blog at the time (Cute Overload) was a frameset around Blogger. That was typical, though; people's sites are a mess.



Writing my own feed reader was one of my unfinished side projects. Thank you for sharing your struggles.


I'd like to know what issues the author has with ttrss.


> Coördinated Universal Time

The o-umlaut doesn't occur in English, and "Nate Hopper" sounds like an English name.


It's a diaeresis symbol rather than an umlaut, used (infrequently) to show that it's pronounced co-or rather than coor. Quite archaic, unless you're the New York Times, who use it as part of their house style, but not wrong!


I think you're thinking of The New Yorker, not the New York Times.


Yes, you're right - I'm not from the US, but I think it's the classic example...


FWIW, I made https://readerize.com , which doesn't rely on RSS. Freemium is coming soon; kindly bear with me. For now, signup/trial is free without needing a credit card.

If you don't agree with the philosophy, kindly move along, no need to downvote.

I hesitated for a long time too. One day I just decided to keep at it and launch.



