Seems like a great opportunity to use SQLite, then folks could share around database files. Why the hell don't more people use SQLite for things like this?
The author ('tilbert) is hellbanned. You can see his comments when you turn on showdead in your profile.
Re SQLite, he replied: "@coleifer: there was a sqlite branch, but it doesn't scale well to many-million record sets which is what I have been doing. the design of the software allows drop-in db replacement, it just lacks the code. I can't decide to go back and sqlite or to just make a web front-end."
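For anyone wondering what a "drop-in" SQLite backend might look like, here is a minimal sketch using Python's built-in sqlite3 module. The class name, table name, and columns are all hypothetical and are not taken from webXray's code; it only illustrates keeping storage behind a small interface so the database can be swapped.

```python
import sqlite3


class SQLiteStore:
    """Hypothetical storage backend; the schema here is illustrative only."""

    def __init__(self, path=":memory:"):
        # ":memory:" keeps the example self-contained; pass a filename to
        # get a database file that can be shared around.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS request ("
            "page_url TEXT NOT NULL, "
            "request_url TEXT NOT NULL, "
            "UNIQUE(page_url, request_url))"
        )

    def add_request(self, page_url, request_url):
        # INSERT OR IGNORE skips rows that would violate the UNIQUE constraint.
        with self.conn:
            self.conn.execute(
                "INSERT OR IGNORE INTO request (page_url, request_url) "
                "VALUES (?, ?)",
                (page_url, request_url),
            )

    def requests_for(self, page_url):
        # Return every recorded request URL for one page, sorted by URL.
        rows = self.conn.execute(
            "SELECT request_url FROM request "
            "WHERE page_url = ? ORDER BY request_url",
            (page_url,),
        )
        return [row[0] for row in rows]
```

This says nothing about how well SQLite would hold up on the many-million record sets mentioned above; it only shows the shape of the interface.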
Comments by some (not all) new accounts are killed by default when they look like possible spam or troll activity. It's not a great solution because it leads to false positives like this one. On the other hand, doing nothing is worse. It's a hard problem, and it certainly doesn't mean that the user is banned.
We're about to release software to let the community unkill these, which we hope will be a much better solution.
> The core of webXray is a python program which ingests addresses of webpages, passes them to the headless web browser PhantomJS, and parses requests in order to determine those which go to domains which are exogenous to the primary (or first-party) domain of the site.
Naturally this means a different user agent and fingerprint, which could ultimately mean the script is fed a different page altogether. The odds of that are probably low, but still: someone could have a really shitty website that uses hundreds of trackers but serves webXray something completely without them.
I would like to see this type of stuff as web browser extensions. That way we can get the exact, most correct information possible. Also would simplify a semi-convoluted build process that seems to have tripped up a few readers.
Firefox has an extension called Lightbeam which does this. The author of the webXray software, tilbert, has already noted in a comment below that his software can be run in batch mode on multiple sites, which is an advantage over the extension.
The major ad blocking extensions could do this. They already know about the requests. It's also possible to change the user-agent (say, use the top 10) and quickly gather the data using a bunch of cloud servers.
Yes, but they can figure out what on the blacklist was used for a given site. For example, Ghostery has Ghostrank which is pretty similar to this--it sends back to advertisers what stuff was blocked.
yeah, I already spoof the UA string, I've done a bunch of testing to verify it is working and getting me the right code. more likely way to get banned is I'm hammering ad networks from the same IP addr.
but if you mean screencast of the software I haven't had time.
---
@coleifer: there was a sqlite branch, but it doesn't scale well to many-million record sets which is what I have been doing. the design of the software allows drop-in db replacement, it just lacks the code. I can't decide to go back and sqlite or to just make a web front-end.
---
@captn3m0 I have an academic paper in revision that is an analysis of the alexa 1M list, I also have other projects in development.
---
@snorrah: this was my first python project, so I went with the newest version. it's made a lot of things very difficult, especially porting to a web version.
---
@linuxlizard: a proxy is a problem when you want to do a lot of concurrent tests, I usually load about 64 pages in tandem to get good speeds on large sets.
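A rough sketch of what loading about 64 pages in tandem can look like in Python. This uses a thread pool from the standard library as a stand-in and is not necessarily how webXray does it; `load_pages` and `load_page` are hypothetical names, and `load_page` is whatever function actually drives the browser for one URL.

```python
from concurrent.futures import ThreadPoolExecutor


def load_pages(urls, load_page, max_workers=64):
    """Call load_page(url) for every URL, at most max_workers at a time.

    Returns a dict mapping each URL to whatever load_page returned for it.
    """
    urls = list(urls)  # the URLs are iterated twice below
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        results = pool.map(load_page, urls)
        return dict(zip(urls, results))
```

If `load_page` raises for one URL, the exception surfaces when that result is read, so a real batch job would want to catch errors per page.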
---
@radmuzon: webXray runs large batch jobs, so you can get lunch, and come back with all of the pages analyzed. I know it does work on windows, and I apologize for not being able to provide directions...see comments above.
---
@TeMPOraL: thanks!
---
@pearjuice: yeah, I wish there was an easier way to get python3 to talk to mysql, that's the biggest PITA.
At SpeedCurve we've built something similar for tracking and understanding the impact on website performance that third party requests can have. It's a big issue for websites when their user experience can be affected by resources that are not even under their control. We've seen websites where over 90% of the requests on a page are made to third parties.
Great to see an open source list of domains linked to organizations. We've built our own list as well and we'll look at contributing them to this project.
Since I am on Windows, it will take some time for me to set it up. Can someone please explain what extra information I get as compared to the Lightbeam extension (formerly Collusion) in Firefox?
webXray runs large batch jobs, so you can get lunch, and come back with all of the pages analyzed. I know it does work on windows, and I apologize for not being able to provide directions...see comments above.
I am pretty frustrated by build processes of modern day applications. Wanted to give this a quick spin, but looking at the installation instructions all I see is compilers, optimizers, minifiers, interpreters, package managers, package-package managers, dependency systems and then, maybe then, you pray to your configuration-God that everything clicks together and runs on your system.
I was about to ask why isn't there a simple, unified build tool for ANYTHING, but I think that is what got us here in the first place...
It's the technological singularity. Thanks to the various "code academy" initiatives going on around the world, there is a growing middle area--between software developers and users--of scripters, people who plug components together but don't do a lot of greenfield programming.
It used to be that being a scripter was a stepping stone on the way to becoming a developer, mostly because back then scripting could only get you so far. Now, you can apparently make an entire career being a scripter, if said code academies are correct in the promise that they can find you a job with the extremely shallow curriculum I've seen them provide.
This isn't a bad thing, it's pretty amazing that it doesn't take a decade of dedicated study to do so much anymore. It's just that in our current culture we lump them in with developers because they are clearly more than just users. We still expect a set of resources put into a Github repository to have a significant amount of new programming, rather than just being glue code between a few commonly available libraries. But that's more of a problem of our lack of ability to differentiate between large, greenfield projects and small, configuration-oriented projects at-a-glance than it is a problem of programming being "too easy".
Though, if we recast scripters as users instead of developers, then it's a terrible thing. It means that the real software developers of the world have written a bunch of software with a really, really shitty user interface.
Are you serious? "brew/apt-get these packages", many of which you'll probably have already, is so much better than the way it used to be (you'd have a tarball with a configure file if you were lucky, and if you were very lucky it would work).
see comments below, I don't know windows so can't write directions. I'm poor so just wanted to cover my hosting, but I removed the referral regardless as I didn't realize I would be attacked for it.
What's tacky about that? DO referral codes do not even pay out cash. The program only provides credit allowing them to continue running their services. A DO referral credit is not a way to make money, just a way to offset costs.
tilbert: you're hellbanned, apparently the system thought you were a spammer or troll, you should send an email to HN (hn@ycombinator.com) asking them to un-ban you.
We unkilled that user's comments in this thread and made sure the account wouldn't get caught by the software again. We also detached this comment from https://news.ycombinator.com/item?id=10258192 and marked it off-topic.
For those curious about this 'tilbert' character - if you switch on 'show dead' it turns out he (allegedly) wrote the software in question.
Think the hellban must just be a glitch - so.. Amongst other things, Tilbert wrote:
"
... my hope with open-source is people will help add to it. if you want to help, I'd love that!
---
I see the hell ban, I'm posting all my replies at the bottom in a comment I'm editing - could somebody note this above?
---
I think you guys/gals have a point [about the referral link to DigitalOcean] the referral link is gone. ;-)
---
@coleifer: there was a sqlite branch, but it doesn't scale well to many-million record sets which is what I have been doing. the design of the software allows drop-in db replacement, it just lacks the code. I can't decide to go back and sqlite or to just make a web front-end.
---
@captn3m0 I have an academic paper in revision that is an analysis of the alexa 1M list, I also have other projects in development.
---
@snorrah: this was my first python project, so I went with the newest version. it's made a lot of things very difficult, especially porting to a web version.
---
@linuxlizard: a proxy is a problem when you want to do a lot of concurrent tests, I usually load about 64 pages in tandem to get good speeds on large sets.
---
@radmuzon: webXray runs large batch jobs, so you can get lunch, and come back with all of the pages analyzed. I know it does work on windows, and I apologize for not being able to provide directions...see comments above
"
* There's more. Enable 'ShowDead' in Profile to see rest of his comments.
There's no way to reply to hellbanned users. People have to reply to others, and hope the hellbanned user sees it. If you enable showdead in your profile, you'll see that tilbert replied to you.
tlibert should also note that although they're editing an existing comment, that comment is [dead] and so can only be seen by people who show dead comments (not many). I guess they need a new account or to contact admins.
Very confusing presentation here. For one thing, the author seems to be referring to links/anchors in a web page as "requests"!
A request is the dynamic, transient action which occurs when a client such as a browser initiates a connection to a server and presents a command like GET or POST.
I suggest an opening paragraph along the following lines:
"WebXray" is a sort of web crawler which analyzes a given cluster of pages for their relationships with each other, as well as external pages which they link to.
It provides information about pages which direct a user's browser to various sites for the purposes of tracking, using tricks like hidden images.
No, he's not talking about links/anchors. He's talking about requests made by the browser caused by scripts which load content from third-party servers. This sounds similar to Mozilla's Lightbeam https://www.mozilla.org/en-US/lightbeam/
it monitors third-party HTTP requests, not links: "webXray is a tool for detecting third-party HTTP requests on large numbers of web pages and matching them to the companies which receive user data...The core of webXray is a python program which ingests addresses of webpages, passes them to the headless web browser PhantomJS, and parses requests in order to determine those which go to domains which are exogenous to the primary (or first-party) domain of the site. This data is then stored in MySQL for later analysis."
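To make the "exogenous to the first-party domain" part concrete, here is a minimal sketch of the comparison in Python. It assumes the request URLs have already been captured by the browser, and it uses a naive "last two labels of the hostname" rule; that rule is my simplification, not webXray's actual logic, and it gets suffixes like co.uk wrong (a real tool would need something like the Public Suffix List). The function names are hypothetical.

```python
from urllib.parse import urlsplit


def naive_site(url):
    """Return the last two labels of the URL's hostname, or "" if none.

    Naive on purpose: "www.example.co.uk" becomes "co.uk", which is wrong.
    """
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def third_party_requests(page_url, request_urls):
    """Keep only the requests whose site differs from the page's site."""
    first_party = naive_site(page_url)
    result = []
    for url in request_urls:
        site = naive_site(url)
        # Skip URLs with no hostname (for example data: URIs).
        if site and site != first_party:
            result.append(url)
    return result
```

Note that these are requests the browser actually made while loading the page, not the links/anchors in the HTML.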
This callback is invoked when the page requests a resource. The first argument to the callback is the requestData metadata object. The second argument is the networkRequest object itself.
The requestData metadata object contains these properties:
id : the number of the requested resource
method : http method
url : the URL of the requested resource
time : Date object containing the date of the request
headers : list of http headers
The networkRequest object contains these functions:
abort() : aborts the current network request. Aborting the current network request will invoke onResourceError callback.
changeUrl(newUrl) : changes the current URL of the network request. By calling networkRequest.changeUrl(newUrl), we can change the request url to the new url. This is an excellent and only way to provide alternative implementation of a remote resource. (see Example-2)
setHeader(key, value)
Windows Specific Instructions
Get a linux cloud server (which cost fractions of a cent per hour these days).
Ubuntu is the easiest flavor of Linux to get started with and the directions
above will serve you well. Seriously, this is your best option. You can do it.
I'm both confident in your abilities and proud of you for taking this important
step in life.
This is not a very helpful attitude. The only "UNIX-y" thing I see it's doing is forking for concurrency. I understand that Python's global interpreter lock limitation makes processes more desirable than threads for concurrency, and on UNIX-like systems this isn't a problem because starting new processes is very cheap. But that doesn't mean it wouldn't "work" on Windows, just be a little slow on starting each subprocess.
Or is it more about not wanting to track down how to install software on Windows?
(EDIT: and as others have pointed out, it's kind of cheesy to use the moment to plug your referral code for DigitalOcean)
Everyone's opinions will differ, but for what it's worth I personally don't see the problem with a referral link. Maybe it should have been labelled as such, just for the sake of transparency. But since they are recommending a DigitalOcean Ubuntu droplet then it seems kind of appropriate for that recommendation to include a referral.
It's really no worse than having adverts or Amazon referrals. In fact it's less intrusive than the former.
> I personally don't see the problem with a referral link. Maybe it should have been labelled as such, just for the sake of transparency.
I fully agree. I wouldn't remove it at all, just add a mention that it is your referral link.
This is free software folks. If the coffers aren't held out with donation links, I see nothing wrong with him getting a cut elsewhere.
In fact, the nature of this software is unthanked work and relatively unknown to the general public. It addresses a real issue, and it addresses it for free.
"Cheesy" is criticizing him for the referral link.
How much do people get from these kinda referral links anyways? I feel that if you're only gonna get a few clickthroughs, then it's not worth putting the link there and have people judge you / your intentions.
>in response to referral link comment: I host the site on my own dime so I didn't think getting $.25 from DO would make me a bad person...also, I really don't have time to learn windows to make instructions. I'm a grad student and make slightly above minimum wage annually, if I were to be evil and greedy I'd probably not be doing what I do with my life. I'm giving away thousands of hours of work and data I could make money off of doing evil things.
>it's not worth putting the link there and have people judge you / your intentions.
This makes no sense to me. His intentions are clear. Far more clear than the intentions of third party HTTP requests on the big websites you visit. And you didn't even need to hear it from him himself to know that he hosts his site and needs a few shekels to support it. Of course, you're entitled to your own opinion, but I couldn't care less if people "judge" me for a DO link, and I hope tilbert feels the same way.
I don't understand in general this hate for people who post referral links. Do you not hover over a link before you click it? See all the garbage appended to the query string in the URL? If I don't want to click it, I don't. I don't hate the person for adding a referral link. I don't understand the hate for the people who do. Again, you're entitled to your own opinion but I'm not a fan of this minutiae being hammered in this HN thread. I'll just go to the website on a clean window and search the same product/website/service.
Finally, the third party advertisers are judging him far more harshly for this software, anyways. Their corrupt model is (in a small way) subverted. If he was worried about "judgment" he wouldn't have offered this software in the first place.
Huh? What provoked this strong response? I'm not hating anyone here. I'm just thinking about this from his / my potential perspective where if you've already put so much effort into the project expecting zero returns, why 'taint' your own product? It's like putting ads on a personal blog that are gonna get you a few bucks in a year. Maybe I'm just particular about this kinda stuff and want to keep things 'pure' and 'clean'.
It was far less strong than the initial attack on his adding a referral link in the first place.
A referral link is (arguably, again, opinions here ;) ) much cleaner and more transparent than having your analytics data being pushed out to a dossier tracking your behavior, emotions, and extrapolating patterns on the two.
You can hover your cursor on a referral link to see what it is. You can even open up a Javascript console and see where the true location of any given link is taking you to, should you click on it.
You can't see what's being done with third party gathering information about you.
And funnily (read: ironically) enough, it's the entire point of this software in the first place. I repeat that I'm confused about the concern over this minutiae considering the fact that when you go look up something online, someone is getting paid in some way, shape, or form. I don't mean to come off strong, and I apologize if you took it in that way.
A referrer gets paid in creation of a consumer whereas an advertising company gets paid as a spy, essentially. Using Amazon as an example, isn't the purpose of the service to sell you something? Instead of selling you as a product?[0]
No, it's criticizing him for not being forthright about it. I think it's important for people to know whether or not their activity is helping financially support someone. I may specifically want to support the project maintainer, so I will try to remember to use their referral link later if I don't have the time to do it now. Or I may specifically not want to support the maintainer, so I will circumvent their referral link. Either way, my job is made a lot easier if the maintainer discloses that the link is a referral link.
I removed it, I thought it was obvious once you clicked on the link. again, I'm a PhD student, I live at the poverty line and host this project myself.
Honestly, you should have left it up there and told the first few to go <strike>fuck themselves</strike> find something better to do than criticize someone who gave them a FOSS program.
Yeah, things that are free need to make money somehow, hence ads and ref links. It's asinine to think that a person should just give up their own cash to provide a free service. It's the epitome of using someone as a means to an end; and you're the one getting used.
I do a lot of research, advocacy, and editorial work which is very critical of many powerful companies; I have to be sure there is no accidental impression I am "for" or "against" anybody or benefitting financially in any way from my research. that is actually why I changed it.
any sane person can see that releasing my intellectual property for free wasn't an evil scheme to make $.25 from digital ocean. ;-)
I haven't used windows in 15 years and couldn't find anybody to write the directions for me, so I went with snarky over spending time figuring out how to install python3 and mysql on windows. ;-)
in response to referral link comment: I host the site on my own dime so I didn't think getting $.25 from DO would make me a bad person...also, I really don't have time to learn windows to make instructions. I'm a grad student and make slightly above minimum wage annually, if I were to be evil and greedy I'd probably not be doing what I do with my life. I'm giving away thousands of hours of work and data I could make money off of doing evil things.
---
I think you guys/gals have a point, the referral link is gone. ;-)
I can't speak for getting a Linux cloud server, but I've seen plenty of recommendations to use a VM when trying to use some software on Windows. Unless you're specifically writing C# or using IIS or SQL Server I don't necessarily believe it's bad advice. In the coming years, with the way C# is changing, Linux/FreeBSD may become the desired platform to run C# code (just like it is for Java.)
personally, I feel the real magic of VMs is I can play around, keep notes, royally screw up my system doing something idiotic, trash the machine, spin up a new one, and resume from my notes. I'm old enough to remember the dark days of RHL CDs running on actual computers...
If I was consulting on a flat fee, that's the advice - virtualize to the most compatible OS - I'd typically offer because it is the most likely to deliver the most business value over the long run. On the other hand, if I was charging by the hour, I'd spend my time and the customer's dollars creating a one-off installation that is likely to break and need future $upport when components change.
On the third hand, if the task was trivial, then there would be no reason to complain about a lack of bespoke Windows instructions, since they would almost write themselves. Of course if bespoke Windows instructions are non-trivial, then obviously it's hard to see merit in the complaint.
Your instructions are literally just "install this software". Windows is not so different that the instructions can't be just "install all this software". We can install software without apt perfectly fine.