WebXray (webxray.org)
207 points by snake117 on Sept 22, 2015 | 100 comments



Seems like a great opportunity to use SQLite, then folks could share around database files. Why the hell don't more people use SQLite for things like this?


The author ('tilbert) is hellbanned. You can see his comments when you turn on showdead in your profile.

Re SQLite, he replied: "@coleifer: there was a sqlite branch, but it doesn't scale well to many-million record sets which is what I have been doing. the design of the software allows drop-in db replacement, it just lacks the code. I can't decide to go back and sqlite or to just make a web front-end."
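For readers wondering what the "share a database file" idea could look like, here is a minimal sketch using Python's built-in sqlite3 module. The table name, the columns, and the helper functions are all hypothetical and are not taken from webXray; the author's point that SQLite did not scale to many-million record sets for him still stands.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create (or open) a SQLite file holding one row per observed request."""
    conn = sqlite3.connect(path)
    # Hypothetical schema, for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS request ("
        " page_url TEXT NOT NULL,"
        " request_url TEXT NOT NULL,"
        " request_domain TEXT NOT NULL)"
    )
    return conn


def add_requests(conn, rows):
    """Insert (page_url, request_url, request_domain) tuples in one transaction."""
    with conn:
        conn.executemany("INSERT INTO request VALUES (?, ?, ?)", rows)


def domains_by_page_count(conn):
    """Count how many distinct pages contacted each request domain."""
    cur = conn.execute(
        "SELECT request_domain, COUNT(DISTINCT page_url) AS pages"
        " FROM request GROUP BY request_domain"
        " ORDER BY pages DESC, request_domain"
    )
    return cur.fetchall()


conn = open_store()
add_requests(conn, [
    ("http://a.example/", "https://t.example.net/p.gif", "example.net"),
    ("http://b.example/", "https://t.example.net/p.gif", "example.net"),
    ("http://b.example/", "https://ads.example.org/x.js", "example.org"),
])
summary = domains_by_page_count(conn)
```

Passing a file path instead of ":memory:" would produce a single file that can be copied around, which is the appeal raised above.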


> The author is hellbanned

Comments by some (not all) new accounts are killed by default when they look like possible spam or troll activity. It's not a great solution because it leads to false positives like this one. On the other hand, doing nothing is worse. It's a hard problem, and it certainly doesn't mean that the user is banned.

We're about to release software to let the community unkill these, which we hope will be a much better solution.


thanks TeMPOraL!


> The core of webXray is a python program which ingests addresses of webpages, passes them to the headless web browser PhantomJS, and parses requests in order to determine those which go to domains which are exogenous to the primary (or first-party) domain of the site.

Naturally this means a different user agent and fingerprint which could ultimately mean the script is fed a different page altogether. The odds of that are probably low but still; someone could have a really shitty website that uses hundreds of trackers but could serve WebXray something completely without them.

I would like to see this type of stuff as web browser extensions. That way we can get the exact, most correct information possible. Also would simplify a semi-convoluted build process that seems to have tripped up a few readers.


Firefox has an extension called Lightbeam which does this. The author of the WebXRay software, tilbert, has already noted in a comment below that his software can be run in batch mode on multiple sites which is an advantage over the extension.

Also, see the TED talk 'Tracking our online trackers' by Gary Kovacs - http://www.ted.com/talks/gary_kovacs_tracking_the_trackers?l....


The major ad blocking extensions could do this. They already know about the requests. It's also possible to change the user-agent (say, use the top 10) and quickly gather the data using a bunch of cloud servers.


also, the main extensions use black-lists, webXray grabs all the requests.


Yes, but they can figure out what on the blacklist was used for a given site. For example, Ghostery has Ghostrank which is pretty similar to this--it sends back to advertisers what stuff was blocked.

https://www.ghostery.com/en/faq/how-does-ghostery-make-money...


this is the biggest issue: depending on the cloud IP block, you can get banned.


yeah, I already spoof the UA string, I've done a bunch of testing to verify it is working and getting me the right code. more likely way to get banned is I'm hammering ad networks from the same IP addr.


hey, the captcha system here is an f'n nightmare - anyway I wrote the software! happy to answer questions will try to do so below

since I am hell banned I can only edit this comment to reply.

as for video, I discuss research here: https://www.youtube.com/watch?v=OqW8erWi1Wo

but if you mean screencast of the software I haven't had time.

---

@coleifer: there was a sqlite branch, but it doesn't scale well to many-million record sets which is what I have been doing. the design of the software allows drop-in db replacement, it just lacks the code. I can't decide to go back and sqlite or to just make a web front-end.

---

@captn3m0 I have an academic paper in revision that is an analysis of the alexa 1M list, I also have other projects in development.

---

@snorrah: this was my first python project, so I went with the newest version. it's made a lot of things very difficult, especially porting to a web version.

---

@linuxlizard: a proxy is a problem when you want to do a lot of concurrent tests, I usually load about 64 pages in tandem to get good speeds on large sets.

---

@radmuzon: webXray runs large batch jobs, so you can get lunch, and come back with all of the pages analyzed. I know it does work on windows, and I apologize for not being able to provide directions...see comments above.

---

@TeMPOraL: thanks!

---

@pearjuice: yeah, I wish there was an easier way to get python3 to talk to mysql, that's the biggest PITA.


At SpeedCurve we've built something similar for tracking and understanding the impact on website performance that third party requests can have. It's a big issue for websites when their user experience can be affected by resources that are not even under their control. We've seen websites where over 90% of the requests on a page are made to third parties.

Here's a dashboard showing third party usage for The Guardian over the last 30 days: https://speedcurve.com/demo/thirdparty/1/1/chrome/1/30/39tfn...

Great to see an open source list of domains linked to organizations. We've built our own list as well and we'll look at contributing them to this project.

(Disclaimer: I'm the founder of SpeedCurve)


Would love help building the org_domain.json list, please get in touch!


I think this would be quite interesting integrated into a web proxy. Surf through the proxy, and it gathers all the nth-party HTTP requests.


a proxy is a problem when you want to do a lot of concurrent tests, I usually load about 64 pages in tandem to get good speeds on large sets.
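As a rough illustration of loading many pages in tandem, here is a minimal sketch with a bounded worker pool. It is not how webXray does it: threads are used only to keep the example short, `load_page` is a hypothetical stand-in for handing a URL to a headless browser, and the default of 64 workers simply mirrors the figure mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor


def load_page(url):
    """Hypothetical stand-in for handing one URL to a headless browser."""
    # A real worker would launch the browser here and return the requests it saw.
    return (url, len(url))


def load_all(urls, workers=64):
    """Process URLs with at most `workers` of them in flight at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order even if work finishes out of order.
        return list(pool.map(load_page, urls))


urls = ["http://example.com/page%d" % i for i in range(200)]
results = load_all(urls)
```

Funnelling the same traffic through a single proxy would add one shared bottleneck in front of all of those workers, which is presumably the concern being raised.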


Here is the domain->org data https://github.com/timlib/webXray/tree/master/webxray/resour.... It is missing a lot of mobile ads and tracking players. Anyone know how we can fix that?


constantly being updated, my hope with open-source is people will help add to it. if you want to help, I'd love that!

---

I see the hell ban, I'm posting all my replies at the bottom in a comment I'm editing - could somebody note this above?


I'll see if I can make a contribution :)

Do you have a technique you're using to match domains to organisations? Sometimes they can be hard to discover.


whois, detective work, crunchbase...also do work in china which is even tougher: http://www.theguardian.com/technology/2015/sep/21/google-is-...


Since I am on Windows, it will take some time for me to set it up. Can someone please explain what extra information I get as compared to the Lightbeam extension (formerly Collusion) in Firefox?


webXray runs large batch jobs, so you can get lunch, and come back with all of the pages analyzed. I know it does work on windows, and I apologize for not being able to provide directions...see comments above.


I am pretty frustrated by build processes of modern day applications. Wanted to give this a quick spin, but looking at the installation instructions all I see is compilers, optimizers, minifiers, interpreters, package managers, package-package managers, dependency systems and then, maybe then, you pray to your configuration-God that everything clicks together and runs on your system.

I was about to ask why isn't there a simple, unified build tool for ANYTHING, but I think that is what got us here in the first place...


It's the technological singularity. Thanks to the various "code academy" initiatives going on around the world, there is a growing middle area--between software developers and users--of scripters, people who plug components together but don't do a lot of greenfield programming.

It used to be that being a scripter was a stepping stone on the way to becoming a developer, mostly because back then scripting could only get you so far. Now, you can apparently make an entire career being a scripter, if said code academies are correct in the promise that they can find you a job with the extremely shallow curriculum I've seen them provide.

This isn't a bad thing, it's pretty amazing that it doesn't take a decade of dedicated study to do so much anymore. It's just that in our current culture we lump them in with developers because they are clearly more than just users. We still expect a set of resources put into a Github repository to have a significant amount of new programming, rather than just being glue code between a few commonly available libraries. But that's more of a problem of our lack of ability to differentiate between large, greenfield projects and small, configuration-oriented projects at-a-glance than it is a problem of programming being "too easy".

Though, if we recast scripters as users instead of developers, then it's a terrible thing. It means that the real software developers of the world have written a bunch of software with a really, really shitty user interface.

Either way, it's not the scripter's fault.


I don't use any extra python libraries, it's all to get python3 to talk to mysql. (also I'm not a skiddie...)


Are you serious? "brew/apt-get these packages", many of which you'll probably have already, is so much better than the way it used to be (you'd have a tarball with a configure file if you were lucky, and if you were very lucky it would work).


I also wish it were easier, the major pain point is python3/mysql.


Actually, all they need to do is create and publish a Dockerfile. Then you could just do: docker run webxray.


I've heard good things but haven't learned docker, I'll look into it. Thanks for the suggestion.


Pleasantly surprised to see something requiring Python 3!


this was my first python project, so I went with the newest version. it's made a lot of things very difficult, especially porting to a web version.


A great project on top of this would be to run this over the Alexa top 20k sites list to a depth of say 5-10 and see the results.


I have an academic paper in revision that is an analysis of the alexa 1M list, I also have other projects in development.


Funny how this uses all platform agnostic software and the Windows install instructions are to buy an ubuntu VPS.


Because then they can link to DigitalOcean with their referral code


So what? It doesn't cost you anything; but they get a couple of pennies for their effort.

I can understand the thought behind ad blocking (it fucks up the experience), but a tangential referral code? Who cares?


Ha, they actually did. Tacky.


see comments below, I don't know windows so can't write directions. I'm poor so just wanted to cover my hosting, but I removed the referral regardless as I didn't realize I would be attacked for it.


I don't get why people complain about that. Keep the code but add a disclaimer if it's so much of an issue for some..?


Why are you hellbanned?


it thought I was spamming b/c replying quickly.


Keep the referral code that's what it's for.


What's tacky about that. DO referral codes do not even pay out cash. The program only provides credit allowing them to continue running their services. A DO referral credit is not a way to make money, just a way to offset costs.


What is wrong with a referral link? Please explain in as much detail as you like.


a screencast / video of an example would be completely helpful.


as for video, I discuss research here: https://www.youtube.com/watch?v=OqW8erWi1Wo but if you mean screencast of the software I haven't had time.


I kind of meant screencast but saw from other comments you were kinda busy. Great project ;)


Very nice, this could be the basis for some cool visualizations!


I have done a few, but it's not my forte. vice included some I made in an article they did about my work a few months back: http://motherboard.vice.com/read/looking-up-symptoms-online-...


tilbert: you're hellbanned, apparently the system thought you were a spammer or troll, you should send an email to HN (hn@ycombinator.com) asking them to un-ban you.


We unkilled that user's comments in this thread and made sure the account wouldn't get caught by the software again. We also detached this comment from https://news.ycombinator.com/item?id=10258192 and marked it off-topic.


i am not tilbert...?


For those curious about this 'tilbert' character - if you switch on 'show dead' it turns out he (allegedly) wrote the software in question.

Think the hellban must just be a glitch - so.. Amongst other things, Tilbert wrote:

"

... my hope with open-source is people will help add to it. if you want to help, I'd love that!

---

I see the hell ban, I'm posting all my replies at the bottom in a comment I'm editing - could somebody note this above?

---

I think you guys/gals have a point [about the referral link to DigitalOcean] the referral link is gone. ;-)

---

@coleifer: there was a sqlite branch, but it doesn't scale well to many-million record sets which is what I have been doing. the design of the software allows drop-in db replacement, it just lacks the code. I can't decide to go back and sqlite or to just make a web front-end.

---

@captn3m0 I have an academic paper in revision that is an analysis of the alexa 1M list, I also have other projects in development.

---

@snorrah: this was my first python project, so I went with the newest version. it's made a lot of things very difficult, especially porting to a web version.

---

@linuxlizard: a proxy is a problem when you want to do a lot of concurrent tests, I usually load about 64 pages in tandem to get good speeds on large sets.

---

@radmuzon: webXray runs large batch jobs, so you can get lunch, and come back with all of the pages analyzed. I know it does work on windows, and I apologize for not being able to provide directions...see comments above

" * There's more. Enable 'ShowDead' in Profile to see rest of his comments.


There's no way to reply to hellbanned users. People have to reply to others, and hope the hellbanned user sees it. If you enable showdead in your profile, you'll see that tilbert replied to you.


We are all tilbert.


I certainly am.


HA welcome back


[deleted]


Most of the rest of us manage to never get banned, ever. And say some pretty non-conventional things.

It's not what you say; it's how it's said?


If you're respectful and polite, you won't get banned. I've said some unpopular things, yet I've never been hellbanned.


I am tilbert!


it seems the gods have forgiven me.


Not gods; software. You got hit by a spam filter. It's fixed now.


in 2015 I'm not sure there is much of a difference between the two. ;-)


tlibert, you are hellbanned. Perhaps a moderator can help with that? I think you triggered spam detection by posting too fast.


tlibert should also note that although they're editing an existing comment, that comment is [dead] and so can only be seen by people who show dead comments (not many). I guess they need a new account or to contact admins.


Very confusing presentation here. For one thing, the author seems to be referring to links/anchors in a web page as "requests"!

A request is the dynamic, transient action which occurs when a client such as a browser initiates a connection to a server and presents a command like GET or POST.

I suggest an opening paragraph along the following lines:

"WebXray" is a sort of web crawler which analyzes a given cluster of pages for their relationships with each other, as well as external pages which they link to.

It provides information about pages which direct a user's browser to various sites for the purposes of tracking, using tricks like hidden images.

(Do I have that approximately right?)


No, he's not talking about links/anchors. He's talking about requests made by the browser caused by scripts which load content from third-party servers. This sounds similar to Mozilla's Lightbeam https://www.mozilla.org/en-US/lightbeam/


it monitors third-party HTTP requests, not links: "webXray is a tool for detecting third-party HTTP requests on large numbers of web pages and matching them to the companies which receive user data...The core of webXray is a python program which ingests addresses of webpages, passes them to the headless web browser PhantomJS, and parses requests in order to determine those which go to domains which are exogenous to the primary (or first-party) domain of the site. This data is then stored in MySQL for later analysis."
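To make the first-party versus third-party distinction concrete, here is a minimal sketch of the comparison step. It is an illustration, not webXray's code: the function names are hypothetical, and "domain" is approximated by the last two labels of the hostname, which is wrong for suffixes such as co.uk (a real tool would need something like the Public Suffix List).

```python
from urllib.parse import urlparse


def registered_domain(url):
    """Naive approximation: keep the last two labels of the hostname."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def third_party_requests(page_url, request_urls):
    """Return the request URLs whose domain differs from the page's domain."""
    first_party = registered_domain(page_url)
    result = []
    for url in request_urls:
        domain = registered_domain(url)
        # Skip URLs with no hostname (e.g. data: URIs); they never leave the page.
        if domain and domain != first_party:
            result.append(url)
    return result


page = "http://www.example.com/article"
requests_seen = [
    "http://www.example.com/style.css",
    "http://static.example.com/app.js",
    "https://tracker.example.net/pixel.gif",
    "https://ads.example.org/banner.js",
    "data:image/gif;base64,R0lGODlhAQABAAAAACw=",
]
third_party = third_party_requests(page, requests_seen)
```

Note that the input is the list of requests the browser actually made while loading the page, not the links found in its HTML.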


So in other words PhantomJS is used as an API to have some pages crawled, and the links emanating from those pages are captured.


sorry, it doesn't monitor links, it captures network-level requests, that's what I analyze.


I think the requests PhantomJS makes while loading and executing the page are recorded and their URLs stored.


So it wouldn't work if PhantomJS informed about this with, say, a callback function?


I use phantomjs because it can run headless and is low resource. the request data is processed in python though.


Not links. Requests. PhantomJS is a scriptable browser, as a second of googling will tell you.


yes! from the api docs:

onResourceRequested

Introduced: PhantomJS 1.2

This callback is invoked when the page requests a resource. The first argument to the callback is the requestData metadata object. The second argument is the networkRequest object itself.

The requestData metadata object contains these properties:

id : the number of the requested resource
method : http method
url : the URL of the requested resource
time : Date object containing the date of the request
headers : list of http headers

The networkRequest object contains these functions:

abort() : aborts the current network request. Aborting the current network request will invoke onResourceError callback.
changeUrl(newUrl) : changes the current URL of the network request. By calling networkRequest.changeUrl(newUrl), we can change the request url to the new url. This is an excellent and only way to provide alternative implementation of a remote resource. (see Example-2)
setHeader(key, value)
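Since the request data is said to be processed in Python, here is a minimal sketch of what that hand-off could look like. It assumes, purely for illustration, that the browser-side script prints each requestData object as one line of JSON; that output format and the function name are assumptions, not something confirmed in this thread.

```python
import json


def parse_request_lines(lines):
    """Parse JSON lines into (method, url) pairs, skipping non-JSON output."""
    seen = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:
            continue  # e.g. console noise mixed into the output
        if isinstance(record, dict) and "url" in record:
            seen.append((record.get("method", "GET"), record["url"]))
    return seen


# Hypothetical captured output, one requestData object per line.
sample_output = [
    '{"id": 1, "method": "GET", "url": "http://example.com/"}',
    "some console noise",
    '{"id": 2, "method": "GET", "url": "https://tracker.example.net/pixel.gif"}',
]
parsed = parse_request_lines(sample_output)
```

Only the `method` and `url` properties listed above are used here; the resulting URLs are what would then be checked against the first-party domain.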


    Windows Specific Instructions

    Get a linux cloud server (which cost fractions of a cent per hour these days).
    Ubuntu is the easiest flavor of Linux to get started with and the directions
    above will serve you well. Seriously, this is your best option. You can do it.
    I'm both confident in your abilities and proud of you for taking this important
    step in life.

This is not a very helpful attitude. The only "UNIX-y" thing I see it's doing is forking for concurrency. I understand that Python's global interpreter lock limitation makes processes more desirable than threads for concurrency, and on UNIX-like systems this isn't a problem because starting new processes is very cheap. But that doesn't mean it wouldn't "work" on Windows, just be a little slow on starting each subprocess.

Or is it more about not wanting to track down how to install software on Windows?

(EDIT: and as others have pointed out, it's kind of cheesy to use the moment to plug your referral code for DigitalOcean)


Everyone's opinions will differ, but for what it's worth I personally don't see the problem with a referral link. Maybe it should have been labelled as such, just for the sake of transparency. But since they are recommending a DigitalOcean Ubuntu droplet then it seems kind of appropriate for that recommendation to include a referral.

It's really no worse than having adverts or Amazon referrals. In fact it's less intrusive than the former.


> I personally don't see the problem with a referral link. Maybe it should have been labelled as such, just for the sake of transparency.

I fully agree. I wouldn't remove it at all, just add a mention that it is your referral link.

This is free software folks. If the coffers aren't held out with donation links, I see nothing wrong with him getting a cut elsewhere.

In fact, the nature of this software is unthanked work and relatively unknown to the general public. It addresses a real issue, and it addresses it for free.

"Cheesy" is criticizing him for the referral link.


How much do people get from these kinda referral links anyways? I feel that if you're only gonna get a few clickthroughs, then it's not worth putting the link there and have people judge you / your intentions.


He admitted it in his response here:

>in response to referral link comment: I host the site on my own dime so I didn't think getting $.25 from DO would make me a bad person...also, I really don't have time to learn windows to make instructions. I'm a grad student and make slightly above minimum wage annually, if I were to be evil and greedy I'd probably not be doing what I do with my life. I'm giving away thousands of hours of work and data I could make money off of doing evil things.

>it's not worth putting the link there and have people judge you / your intentions.

This makes no sense to me. His intentions are clear. Far more clear than the intentions of third party HTTP requests on the big websites you visit. And you didn't even need to hear it from him himself to know that he hosts his site and needs a few shekels to support it. Of course, you're entitled to your own opinion, but I could really care less if people "judge" me for a DO link, and I hope tilbert feels the same way.

I don't understand in general this hate for people who post referral links. Do you not hover over a link before you click it? See all the garbage appended to the query string in the URL? If I don't want to click it, I don't. I don't hate the person for adding a referral link. I don't understand the hate for the people who do. Again, you're entitled to your own opinion but I'm not a fan of this minutiae being hammered in this HN thread. I'll just go to the website on a clean window and search the same product/website/service.

Finally, the third party advertisers are judging him far more harshly for this software, anyways. Their corrupt model is (in a small way) subverted. If he was worried about "judgment" he wouldn't have offered this software in the first place.


Huh? What provoked this strong response? I'm not hating anyone here. I'm just thinking about this from his / my potential perspective where if you've already put so much effort into the project expecting zero returns, why 'taint' your own product? It's like putting ads on a personal blog that's gonna get you a few bucks in a year. Maybe I'm just particular about this kinda stuff and want to keep things 'pure' and 'clean'.


I didn't find the response strong at all.

It was far less strong than the initial attack on his adding a referral link in the first place.

A referral link is (arguably, again, opinions here ;) ) much cleaner and more transparent than having your analytics data being pushed out to a dossier tracking your behavior, emotions, and extrapolating patterns on the two.

You can hover your cursor on a referral link to see what it is. You can even open up a Javascript console and see where the true location of any given link is taking you to, should you click on it.

You can't see what's being done with third party gathering information about you.

And funnily (read: ironically) enough, it's the entire point of this software in the first place. I repeat that I'm confused about the concern over this minutiae considering the fact that when you go look up something online, someone is getting paid in some way, shape, or form. I don't mean to come off strong, and I apologize if you took it in that way.

A referrer gets paid in creation of a consumer whereas an advertising company gets paid as a spy, essentially. Using Amazon as an example, isn't the purpose of the service to sell you something? Instead of selling you as a product?[0]

[0] https://medium.com/@unsetbit/dear-amazon-you-dropped-somethi...


thank you.


No, it's criticizing him for not being forthright about it. I think it's important for people to know whether or not their activity is helping financially support someone. I may specifically want to support the project maintainer, so I will try to remember to use their referral link later if I don't have the time to do it now. Or I may specifically not want to support the maintainer, so I will circumvent their referral link. Either way, my job is made a lot easier if the maintainer discloses that the link is a referral link.


I removed it, I thought it was obvious once you clicked on the link. again, I'm a PhD student, I live at the poverty line and host this project myself.


Honestly, you should have left it up there and told the first few to go <strike>fuck themselves</strike> find something better to do than criticize someone who gave them a FOSS program.

Yeah, things that are free need to make money somehow, hence ads and ref links. It's asinine to think that a person should just give up their own cash to provide a free service. It's the epitome of using someone as a means to an end; and you're the one getting used.


I do a lot of research, advocacy, and editorial work which is very critical of many powerful companies; I have to be sure there is no accidental impression I am "for" or "against" anybody or benefitting financially in any way from my research. that is actually why I changed it.

any sane person can see that releasing my intellectual property for free wasn't an evil scheme to make $.25 from digital ocean. ;-)


I haven't used windows in 15 years and couldn't find anybody to write the directions for me, so I went with snarky over spending time figuring out how to install python3 and mysql on windows. ;-)

in response to referral link comment: I host the site on my own dime so I didn't think getting $.25 from DO would make me a bad person...also, I really don't have time to learn windows to make instructions. I'm a grad student and make slightly above minimum wage annually, if I were to be evil and greedy I'd probably not be doing what I do with my life. I'm giving away thousands of hours of work and data I could make money off of doing evil things.

---

I think you guys/gals have a point, the referral link is gone. ;-)


I think the better solution would have been to just disclose that it was a referral link.


I can't speak for getting a Linux cloud server, but I've seen plenty of recommendations to use a VM when trying to use some software on Windows. Unless you're specifically writing C# or using IIS or SQL Server I don't necessarily believe it's bad advice. In the coming years, with the way C# is changing, Linux/FreeBSD may become the desired platform to run C# code (just like it is for Java.)


>> I've seen plenty of recommendations to use a VM when trying to use some software on Windows...

Is that a sign that developers are starting to not care about Windows?


personally, I feel the real magic of VMs is I can play around, keep notes, royally screw up my system doing something idiotic, trash the machine, spin up a new one, and resume from my notes. I'm old enough to remember the dark days of RHL CDs running on actual computers...


If I was consulting on a flat fee, that's the advice - virtualize to the most compatible OS - I'd typically offer because it is the most likely to deliver the most business value over the long run. On the other hand, if I was charging by the hour, I'd spend my time and the customer's dollars creating a one-off installation that is likely to break and need future $upport when components change.

On the third hand, if the task was trivial, then there would be no reason to complain about a lack of bespoke Windows instructions, since they would almost write themselves. Of course if bespoke Windows instructions are non-trivial, then obviously it's hard to see merit in the complaint.


it is non-trivial for me as I don't use windows, and I already spent a fair amount of blood, sweat, and tears developing the code.


Your instructions are literally just "install this software". Windows is not so different that the instructions can't be just "install all this software". We can install software without apt perfectly fine.


So clone the project: https://github.com/timlib/webxray , add Windows instructions, and send a pull request. In true idiomatic Windows tradition, even command line cooties can be avoided: https://desktop.github.com/


yes, but I don't own a windows license so I can't test it. linux is free so I can.


I agree, hence the complaint lacks merit.


You just wasted a lot of time on criticizing this guy's Windows instructions. Good Job!


I type pretty quickly.



