Show HN: Orac – Filter news from social media using AI

jedwhite · on Oct 23, 2018

Developer here. Like a lot of programmers, I struggle with distraction, and I get frustrated with time wasted on addictive social media. I also feel frustrated by the lack of control I have over the algorithms that control what I see on social media, and the constant tracking and privacy invasion.

So I've been working on Orac as a way to get interesting and useful content that matters from social media, without losing focus. It's currently an early preview version. The front end is built as a web app with React and GraphQL, and it uses deep learning to rank quality and predict attributes about stories it finds shared on social media (such as seriousness, objectivity, political stance etc). The back end is AWS Kinesis for stream ingestion, Lamda for running inference, and DynamoDB and ElasticSearch, along with AWS AppSync.

While it's experimental, the content predictions are already pretty interesting, and it has a bunch of pre-built filters as a way to play around with them - such as moods, personas and filter bubbles.

Would love to know what folks think and if this is something useful to people that's worth pursuing!

fuddle · on Oct 23, 2018

Awesome concept! An interesting use of AI. Do you think it would it be possible to recognize a news story as fake news? The dataskeptic.com podcast have been doing a series recently on fake news.

jedwhite · on Oct 23, 2018

Thanks! If you watch the "Show Me Everything" raw feed for a while, it puts a "flag label" on different types of bad content - including clickbait, fake news, hate speech, extremism and state news and propaganda. A lot of those stories get stripped out when the filters are on. But the "crap detection" filter does flag those stories in the raw feed in real-time too. It's still learning but it's already decent at detecting this sort of content. I was thinking of adding a "fake news" view (or "Crap detector view") as it's interesting to see the techniques the people who propagate that stuff use!

nikolay · on Oct 24, 2018

Is this branding related to Blake's 7?

jedwhite · on Oct 24, 2018

Yes it is. It's a tip of the hat to Terry Nation :)

I'm Australian and loved Blake's 7 as a kid.

The initial news sources are fairly US-centric (English speaking highest follows by category with a pretty noisy mix across a wide spectrum). That's because they are intended to be used for testing the algorithms as a start against a very noisy feed with a lot of objectionable content as well as good content. But the idea is that you will add your own social feeds and accounts (and news sources) and it will then skew to similar sources.

nikolay · on Oct 24, 2018

Oh, thank you! Well, most kids liked Blake, but my favorite was Avon. Here's a little compilation with him: https://www.youtube.com/watch?v=SWHLU8fwi80

Jaruzel · on Oct 24, 2018

Probably not. The author seems american; The site has a very US news bias, and only has US politics filtering.

Which is a shame. I hope the author plans to expand it to have a more global viewpoint.

For those that don't know - 'Blake's 7' was a 1970s UK sci-fi TV show that had a portable 'super-computer' called Orac, whose personality was arrogant and aloof. The show was set in a dystopian future where a despised militarised government rules the galaxy. Still worth watching even now to be honest.

jedwhite · on Oct 24, 2018

I'd commented above that the current feed is intended for testing, but that the plan is for you to add your own social feeds and news sources, and that it will then skew similar in finding additional sources.

The "filter bubbles" are also intended as a couple of examples, so initially I've used US ones.

I'm Australian and while I'm based in the US, one of the problems Orac is trying to solve is localisation. Orac is running geo and locale predictions on content, but they aren't very good so far, so I haven't included them in the UI or filtering yet. But a few folks who I've shown have said that it would be useful to filter content that's relevant by location, so that is definitely in the roadmap!

jedwhite · on Oct 24, 2018

Just to clarify: that is a "yes this is intended for a global audience" :)

The preview version is fairly US-centric (language used, initial sources followed, example filter bubbles etc). But that is very much a starting point to show how the concept works in action. So that will change as it develops. The underlying algorithms are built with both global use and local filtering in mind. So stay tuned on that front!

jedwhite · on Oct 24, 2018

Part of Orac's mission is to be very judgmental about content when ranking its quality and importance, so the allusion to being aloof and arrogant is little bit of a joke, but purposeful :)

nikolay · on Oct 24, 2018

Thanks for the summary. Blake's 7 was my favorite Sci-Fi show as a child.

maverick2007 · on Oct 23, 2018

So I've been thinking about this a little bit and I'm not sure if I'm the outlier here but when I really want to focus on something, I'll turn off social media completely. The medium is the distraction in itself, no matter who I follow. With that being said, I think that this tech has a ton of potential.

One use case off the top of my head would be for fantasy football. I'd love to see some sort of mode where I can get high quality news for my chosen few players or teams. I run into an issue right now where there's a ton of noise around "is this player injured or not" and it'd be great to have some sort of AI that could do a better job than I can in filtering what's fake or not.

Plenty of other interesting use cases for this tech, this is just one that jumps out for me!

jedwhite · on Oct 23, 2018

Thank you! Your feedback is similar to a lot of the feedback I've received from friends I've shown it to, so I don't think you're an outlier at all.

I think most people do something similar to you, and block themselves from social media. I know I do. People resort to deleting their accounts, temporarily blocking access using browser extensions, or removing the apps completely. The problem I've found is that there is interesting work-related content that I do want to know about that gets shared through social media (and even more messaging). And I get work messages through facebook and twitter, and then half my day is blown. I think that's worse for people on a maker's schedule. You need the important content (say new deep learning research) but you blink and you've shot half a day after glancing at facebook. The direction I'm heading with Orac is to use the topic modelling and a doc2vec approach to match up clusters of concepts between content, and your todo list. That's still a way off but this is intended to be a step in that direction.

The deep learning models are still in progress, but they're already getting pretty good at ranking content and identifying topics etc. One friend described them as "scary" and I don't think they are quite there but they are improving.

Adding "build your own" filters is definitely on the road map - so something like the fantasy football tracking is a great use of that!

maverick2007 · on Oct 23, 2018

I didn't think about it that way but you're totally right! The reason I don't open up social media right now is because I know there's going to be garbage there that distracts me. If I could open Orac up and it would block anything that it deemed as distracting then I don't see a reason why I wouldn't use it.

I really look forward to seeing what becomes of this!!

jedwhite · on Oct 24, 2018

Hey thank you - yes that's a key driver. You don't want to have to throw the baby out with the bathwater, so to speak :)

There is incredibly useful and inspiring content out there on social media. I just don't want to have to trawl through a cesspool of crap to find the nuggets of usefulness buried in it!

lj3 · on Oct 23, 2018

I look forward to trying this out. I tend to share your opinion that social media is full of junk, but every once in a while you learn something new and interesting. If your app can help separate the wheat from the chaff then I think there will be plenty of interest.

sjclemmy · on Oct 23, 2018

Blake’s Seven? [0]

0: http://blakes7.wikia.com/wiki/Orac

jedwhite · on Oct 23, 2018

Well picked! :)

Another Terry Nation fan I'm guessing.

There's a feature being worked on that is some way down the track, but will make sense of the name. If you'll forgive the teaser, but it's related to having conversational control.

sjclemmy · on Oct 24, 2018

I was a child when it was first broadcast and I loved it. It was the inspiration for many a childhood game. We always used to argue about who was going to be who - everyone wanted to be Blake, no-one wanted to be Servalan.

I think I recall trying to make Orac out of a shoe box and felt tip pens!

jedwhite · on Oct 24, 2018

I've been planning to make an Orac from a perspex case using a Raspberry Pi with some speakers and a camera (and a lot of pretend wiring). But haven't had chance to get further than some rough plans :)

bradknowles · on Oct 24, 2018

So long as you don't create a time loop where we steal the ship from our future selves, I think you're probably okay. ;)

dwd · on Oct 24, 2018

An anti-troll conversational AI with the sarcasm and ad hominem insults of Avon?

jedwhite · on Oct 24, 2018

"Of course I'm properly switched on. Having depressed the activator button what else would you expect?"

I may need to tone down the dialogue a little from the inspiration's level of snark ;)

dwd · on Oct 24, 2018

You might enjoy this obscure gem:

https://m.youtube.com/watch?v=7UtkbHL2tZg

bradknowles · on Oct 23, 2018

That was my first thought.

I'm glad I'm not the only one! ;)

SomeHacker44 · on Oct 24, 2018

I was going to post the same thing! Glad we HN members have some cultural proclivities (and age?) in common.

stuartcw · on Oct 24, 2018

Genius name. I wish I had grabbed that domain.

bradknowles · on Oct 24, 2018

So, one angle I'm interested in is the surfacing of new sources of good information. For example, many popular sites are basically just aggregators of content from other sources, with perhaps a little light commentary on top.

Many years ago, I might have gone to Slashdot for the commentary from the site members, but that devolved into a festering sewage pit a long time ago. But slashdot does still sometimes link to good articles -- you just have to ignore all the commentary on the site.

So, how do you keep discovering good sources of content and feeding them into the system? If I wanted to feed a bunch of sites into the system and let you do the work of filtering them for me, how would I go about that?

Another angle I'm interested in is the deduplication of content, and hoisting the value of the earlier posts over the later ones that are just regurgitating what some other site said? And related to that, how do you surface newer posts that are actually an update to the older article, with new information?

Any thoughts or observations you can share?

jedwhite · on Oct 26, 2018

I think you'll like some of the ideas we're working on.

For this preview release, you can't control the sources. But when it gets released for real, you will be able to create an account an add your own social accounts (twitter, facebook, linkedin, reddit) to monitor, and also add your own news sources (RSS feeds, websites to spider etc).

Based on the sources you add, Orac will then try to find other interesting similar sources for content to include in the feed.

On the back-end, we're basically building a database of media content with quality and attribute predictions designed for searching and filtering based on topic model matching.

There are some interesting ideas we're working on with de-duplication (establishing canonical sources, effectively). But that's super early.

YC0mbi_Dave · on Oct 29, 2018

Great stuff Jed. Have had a play with the latest preview and like it a lot. Also starting to get a better sense of how you could apply Orac in an organisational context and look forward to discussing next time we catch up. Cheers, Dave

YC0mbi_Dave · on Oct 29, 2018

PS: I can see a couple of use cases for it - e.g. teams working on projects or tracking media coverage of specific issues. [Disclaimer - I know Jed]

tixocloud · on Oct 23, 2018

Great concept. Would you be able to share how you would be able to filter based on mood, personas, etc?

I can think of a few commercial use cases where this could be incredibly useful but would likely based on the methodology you’ve used. Guess I’m interested in how do you define quality and what attributes are you predicting.

jedwhite · on Oct 23, 2018

Thank you! Under the hood, there are about 80 different predictions or scoring attributes being run. A lot of those are deep learning models.

A mood or a persona is really an algorithm representing a batch of those. A lot of them you can get pretty close to just by using the Power Filter directly. But there is some secret sauce.

But broadly speaking, you can think of a Hacker persona as being someone interested in Science and Tech and Education as key broad topic areas, wanting more serious content, and wanting content from credible science-based sources. They prioritise useful content, with more in-depth coverage.

An Activist might be someone with a center-left to left stance, interested in social fields, politics and psychology, but also influenced by trends and what is currently generating a lot of "heat".

If you have representations of what good quality content "within that community of personas" looks like, then you can train models and test scoring systems that try to capture them.

Content quality is a big topic. But our working thesis is that there are canonical examples with domains of expertise of what great content looks like - PG's essays, Pulitzer Prize Winning Journalism, award winning research papers. And there are similar examples of what very bad content looks like - hate speech, extremism etc.

So that's been the broad basis for the models that underly the predictions.

[edits for typos]

vinni2 · on Oct 23, 2018

Just wondering if you could provide more details on your model like do you use a CNN or an LSTM? Is there a paper you have written about it? Also how do you get the data for training your models? Also do you do any retraining as new data comes in? Do you also have some groundtruth to measure the performance of your models?

jdh30 · on Nov 4, 2018

I just noticed your post (https://news.ycombinator.com/item?id=18050422) about courses but cannot comment on it because it is too old. I am an employer with strong opinions. If you'd like to hear them please e-mail me here: https://www.idtechex.com/contact/team/dr_jon_harrop.asp

jedwhite · on Oct 26, 2018

Thanks, there is a little more information in my reply to abrichr above. But in short, there are a large number of predictions run on each piece of content, and they each use different approaches - including traditional statistical models and ML, off-the-shelf APIs, as well as custom deep learning models we've developed through experimentation. So there's an evolving mix of models using different approaches (including CNNs, LSTM-like approaches / word2vec and some topic modelling using a doc2vec-style approach). It's still very much experimental though, and some predictions are way better than others!

I'd love to do an arxiv.org paper or mlconf presentation down the track, as I think we have some interesting work here and I have an academic interest in the field, but it's way too early. In essence, we're working this out as we go to "make the thing" as practitioners first based on some unique insights, rather than it being a product derived from research first.

[Edits for typos/clarity]

tixocloud · on Oct 23, 2018

Thanks. Makes sense. Then is definitely commercial opportunity I’m sure :)

darrenwestall · on Oct 23, 2018

How can I get in touch with you please? I’d love to plug this into what I’m doing, there is a lot of synergy.

jedwhite · on Oct 23, 2018

Hey Darren, very happy to chat and find out more about what you're doing. Anyone is welcome to email me about Orac with ideas, feedback etc too - jed [at] orac [dot] ai

abrichr · on Oct 23, 2018

In Power Filter mode, it seems the following knobs are available:

* Overall rating

* Political Stance

* Political Balance

* Seriousness

* Reading Level

* Objective Tone

* Credibility

* Virality and Engagement

* Temperament

How do you measure these objectively?

jedwhite · on Oct 26, 2018

This is a great question, but it's hard to address in a comment. However, I'll have a go (in brief) :)

There are biases in all algorithms because they represent an encoding of either judgement or bias in training data selection.

So our aim here is to control for bias, and take some steps towards giving the user more control, compared to an approach like Facebook's where the algorithm is entirely a black box.

The approach we've taken is, effectively, to try to codify editorial judgement and professional journalistic best practices into the system, and the selection of training data. As well as being a programmer / computer scientist, I'm also a professionally trained journalist and was editor of Australia's leading computer magazine.

The knobs aren't direct representations of model predictions themselves, but weighted summary scores based on algorithms using a number of lower-level predictions (using statistical models, ML/DL models, ensembles). The personas/moods/filter bubbles use more of the individual attribute predictions.

Having said that, in practice it's all a bit of an experimental mish-mash currently, and we have a lot of work still to do. Some predictions are way more effective than others, as you can see browsing through it. And others (like source and author quality and attribute prediction) are learning and improving over time. But we have a lot of iteration and experimentation ahead of us!

In practice, the initial feedback has been that the predictions are surprisingly good. But sometimes they are way off the mark.

21stio · on Oct 23, 2018

Hey Jed, cool stuff! What's your vision for the product?

jedwhite · on Oct 23, 2018

There are a couple of ideas driving it, but the main one is to use AI (deep learning sequence to sequence models especially) to filter and recommend content to help busy people stay focused on their work.

The background is that social media is making us dumber and less productive.

So I want to make something that lets you focus on your work by finding the best content matching what you're working on or interested in, and presenting it in a distraction-free way.

21stio · on Oct 23, 2018

I totally agree.

How do you want to enable the user to use that functionality?

Is it only intended to apply the classification to content that's available throw orac.ai?

What are your thoughts on delivering it as a browser extension and/or api?