How about, instead of starting with an insult ("I can't believe you didn't already know this"), you congratulate them on putting together a full working library with pretty, easy-to-grasp examples, and then offer up some research links they could use to further refine and improve their system? It's our job to teach people; you can't expect everyone to suddenly know everything.
To the Datashader team: I apologize for the above comment. Good job in building and launching a tool for others to use, and great choices for examples!
“Dropbox is just rsync on a cron job” and “OP’s project is not completely novel” are unfortunately common tropes on HN...
Like you said: starting with “Great job launching. What are the benefits of using Dropbox over, say, just rsync + cron?” would go a long way towards improving the environment around here.
My problem isn't that it isn't novel or new; it's that it is presented as novel. Visualization is great (and the visualization here looks good), but saying things like 'datashader: turn data into images', when that is literally what rendering is, is a nonsense way to approach a valid topic.
Saying 'visualize big data and billions of points', when the buzzwords are just there to sugarcoat an accumulation buffer, gets into the territory of reinventing the wheel and naming it 'the flattened infinite curvature hypersphere'.
So, before you get too self-righteous, at least realize that it is the delivery and lack of context and precedent, not the actual work, that is the problem.
I suspect marketing Datashader as an "accumulation buffer" wouldn't have the same effect on its target audience (data visualisation developers) as presenting it simply as a way to "Turn data into images".
I'm also curious (as a fledgling graphics programmer) - what leads you to believe that Datashader uses an accumulation buffer internally? I would think that they use some magic to draw all the points in a single draw call using instanced rendering, but I am very naive :)
If you watch the video, one of the creators explains this point, which makes datashader different from, say, Bokeh. (Pipeline explained around 7:20 in the video)
D3, Bokeh, and other web-based visualization tools generally plot individual HTML/SVG primitives in the browser. This approach works great for smaller datasets, but doesn't scale to millions or billions of points.
Datashader aggregates (accumulates) graphical representations of the data into images, then provides a way to get those to the browser and work well with the other libraries. That high-level description leaves out 95% of the critical practical details of visualization, which the creators of datashader handle.
Datashader's approach is a bit different from an accumulation buffer, though similar in principle. It's not 3D rendering, and has no need for z ordering; instead it's essentially 2D histogramming. For points, it simply takes each point, calculates which pixel it would land in, and aggregates per pixel, without ever storing all the data points per pixel. The key benefit over something like NumPy's histogram2d is in how it is implemented and used -- highly optimized, and highly integrated with viz tools so that it lets you interact with your data naturally, as if you had infinite resolution. Try it and see!
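Roughly speaking, the core binning idea fits in a few lines of plain NumPy. This is just a sketch of the concept, not how Datashader is actually implemented (the real thing is Numba- and Dask-optimized and handles far more cases):

```python
import numpy as np

def aggregate_points(x, y, width=300, height=300):
    """Count points per pixel in one pass; the aggregate is O(pixels) in memory, not O(points)."""
    x0, x1 = x.min(), x.max()
    y0, y1 = y.min(), y.max()
    # Which pixel does each point land in?
    xi = np.clip(((x - x0) / (x1 - x0) * width).astype(int), 0, width - 1)
    yi = np.clip(((y - y0) / (y1 - y0) * height).astype(int), 0, height - 1)
    agg = np.zeros((height, width), dtype=np.int64)
    np.add.at(agg, (yi, xi), 1)  # accumulate a count per pixel
    return agg

x, y = np.random.randn(2, 1_000_000)
counts = aggregate_points(x, y)  # 300x300 grid of per-pixel counts
```

Mapping those per-pixel counts (or means, maxes, etc.) to colors happens as a separate step, and that's where most of the care about accuracy goes.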
> For points, it simply takes each point, calculates which pixel it would land in, and aggregates per pixel, without ever storing all the data points per pixel.
That is literally what OpenGL does. If you mean a histogram per pixel in depth, that's literally voxels in perspective space.
If there are usability benefits here, that's great, but everything seems to be centered around there being new rendering techniques here, when not only are they not new, they're completely trivial, with solidified names and formalized math.
It's not merely an accumulation buffer. It's a shader pipeline that allows arbitrary Python code to be executed at each stage of data processing. It's very much like "renderman for data", but with Python (via Numba and Dask for performance).
The pipeline is also built in such a way that it permits front-end JS viewers like Bokeh to drive a very dynamic experience.
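To make "arbitrary Python code at each stage" concrete: as I understand the user-facing API from the docs and the video, the pipeline looks roughly like this (a sketch; exact argument names may differ slightly):

```python
import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

df = pd.DataFrame({'x': np.random.randn(1_000_000),
                   'y': np.random.randn(1_000_000)})

canvas = ds.Canvas(plot_width=400, plot_height=400)
agg = canvas.points(df, 'x', 'y', ds.count())  # aggregate: points -> per-pixel counts
agg = np.log1p(agg)                            # transform: any Python/NumPy code on the grid
img = tf.shade(agg)                            # shade: grid -> colored image
```

Because the aggregate is just an array, a front end like Bokeh can trigger a cheap re-aggregation for each new viewport as you pan and zoom.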
It is novel in the sense that it combines the interactivity of a D3/Bokeh/whatever JS based visualization (typically limited to a few thousand points) with the massive data display capability of offline rendering.
“Turn data into images” is a much better phrase than “rendering.” Anyone who is not intentionally being pedantic for dramatic effect would understand the purpose of the product based on reading that phrase, whereas “Datashader: a renderer” is less clear and could refer to products with an entirely different scope/purpose.
Their product page is well-written and accurate. It sounds like you want them to purposely describe their product as something that is inferior to what it actually is.
If you want one word, Datashader is a rasterizer. It takes data of many types (points, lines, grids, meshes) and creates a regular grid where each grid cell's value is a well defined function of the incoming data. Not sure anyone would be any happier with "rasterizer" than "renderer" or "shader" or any other single word...
I can't see a single reference on the linked page to this being something you couldn't do before. It describes what it does; it doesn't make claims of superiority over other approaches, ...
As a newcomer to the field, I find the parent post far more welcoming than the bevy of trash-ninja, flavor-of-the-week bullshit that makes it seem impossible to catch up.
I only actually got my bearings and self-confidence as a programmer when I realized that most of the people pushing blogs with subscriptions about "cutting-edge" tech were literally snake-oil salesmen and shovel merchants.
And that coding wasn't actually different from anything else I had learned in my life: there were fundamentals I could latch onto and grow upward from. All this nonsense about the field experiencing a revolution that upends all existing knowledge year after year is far more mentally taxing.
You're missing the point of this project. It's not about the feasibility of throwing a billion points at a pile of software, to get an image. I can do that with a simple Python script. It's about doing so to create a meaningful and accurate data visualization, and not just a picture of, say, shiny spheres or a scene from Avatar.
I actually have a background in 3D computer graphics, and it's precisely because of my detailed knowledge of raytracing, rasterization, OpenGL, BMRT, photon maps, computational radiometry, BRDFs, computational geometry, statistical sampling, etc., that when I came to the field of data science, and specifically the problem of visualizing large datasets, I realized the total lack of tooling in this space.
The field of information visualization lags behind general "computer-generated imagery" by decades. When I first presented my ideas around Abstract Rendering (which became Datashader) to my DARPA collaborators, even to famous visualization people like Bill Cleveland or Jeff Heer, it was clear that I was thinking about the problem in an entirely different way. I recall our DARPA PM asking Hanspeter Pfister how he would visualize a million points, and he said, "I wouldn't. I'd subsample, or aggregate the data."
Datashader eats a million points for breakfast.
Since you're clearly a computer graphics guy, the way to think about this problem is not one of naive rendering, but rather one of dynamically generating correct primitives & aesthetics at every image scale, so that the viewer has the most accurate understanding of what's actually in the dataset. So it's not just a particle cloud, nor is it nurbs with a normal & texture map; rather, it's a bunch of abstract values from which a data scientist may want to synthesize any combination of geometry and textures.
I chose the name "datashader" for a very specific and intentional reason: we are dynamically invoking a shader - usually a bunch of Python code for mathematical transformation - at every point, within a sampling volume (typically a square, but it doesn't have to be). One can imagine drawing a map of the rivers of the US, with the shading based on some function of all industrial plants in its watershed. Both the domain of integration and the function to evaluate are dynamic for each point in the view frustum.
> thinking up fancy names for reinventing the wheel
They're not claiming to have reinvented the wheel, they're just explaining what it is.
> 'Turning data into images' isn't exactly a new concept.
No, but doing so on large data accurately (that last word, which you cut off, is important) is not something I know how to achieve easily or faster in a different Python library. I'd like to know if I could.
We renamed it from Abstract Rendering to Datashader for the affordances it offers human cognition.
This is a great paper from Gordon Kindlmann and Carlos Scheidegger that talks about how to gauge the accuracy of a visualization, as part of an effort to come up with an algebraic process for visual design: https://vis.cs.ucdavis.edu/vis2014papers/TVCG/papers/2181_20...
Using their metrics around "confusers" and "hallucinators", Datashader came out as one of the few things that doesn't suffer from such intrinsic limitations.
There are a lot of red flags in the abstract of that paper alone.
> Rendering techniques are currently a major limiter since they tend to be built around central processing with all of the geometric data present.
This is completely untrue: OpenGL and virtually all real-time rendering use z-buffer techniques, which were originally adopted precisely because they don't need all the geometry present at once. These techniques date back to the 70s and were some of the first hidden-surface rendering algorithms.
> This paper presents Abstract Rendering (AR), a technique for eliminating the centralization requirement while preserving some forms of interactivity.
Interactivity might be novel here, so that is what should really be focused on, if anything. I don't think coining a new term and acronym that don't seem to relate to what is happening is going to be a good choice for communicating the techniques.
> AR is based on the observation that pixels are fundamentally bins, and that rendering is essentially a binning process on a lattice of bins.
This observation was made in the early 80s and has been the backbone of renderman renderers for almost 40 years. Renderman calls them 'buckets'.
> This approach enables: (1) rendering on large datasets without requiring large amounts of working memory,
Renderman originally rendered film resolution images with high resolution textures with only 10MB of memory.
> (3) a direct means of distributing the rendering task across processes,
Giving different threads their own buckets is standard for any non-toy renderer. Distributing buckets across multiple computers is part of many toolsets.
> high-performance interaction techniques on large datasets
This is the only part that has a chance of being novel, but the paper only shows basic accumulation of density for adjacency matrices. The visualizations are timed in the multiple seconds yet look extremely simple, and for some reason are rendered 'out-of-core' on a computer with 144GB of memory, even though it seems very unclear that these images couldn't be made with z-buffer rendering in OpenGL.
> This is a great paper from Gordon Kindlmann and Carlos Scheidegger that talks about how to gauge the accuracy of a visualization
It looks like that paper is about the transformations of visualizations for higher dimensional data, not rendering accuracy, so these two things are being conflated even though they are completely separate concepts.
> It looks like that paper is about the transformations of visualizations for higher dimensional data, not rendering accuracy, so these two things are being conflated even though they are completely separate concepts.
Actually, no. The paper may not have been explicitly clear about this, but the ENTIRE point of a "data visualization" system is to transform potentially high-dimensional datasets, with a large number of columns, into meaningful images by a series of steps. You seem to be interpreting this narrowly, and imagining that geometry is already pre-defined in the dataset, so then of course this looks like a fairly trivial 2D accumulator.
That is not the intent, nor is the common use case.
For data visualization, the question of "how do I accurately aggregate or accumulate the 25 to 1 million points in this bucket" is a deep one. There is NO data visualization system that programmatically gives a data scientist or statistician access to this step of the viz pipeline. Most "infoviz" tools gloss over this problem - they do simple Z buffering, or cheesy automatic histograms of color/intensity, etc. These are almost always "wrong" and produce unintended hallucinators.
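To make that concrete: the per-bin function is exactly the thing the data scientist needs to control, and in Datashader it's roughly one argument (a sketch with illustrative column names; exact API details may vary):

```python
import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

df = pd.DataFrame({'x': np.random.randn(1_000_000),
                   'y': np.random.randn(1_000_000),
                   'value': np.random.rand(1_000_000)})
cvs = ds.Canvas(plot_width=600, plot_height=400)

img_count = tf.shade(cvs.points(df, 'x', 'y', ds.count()))        # how crowded is each pixel?
img_mean  = tf.shade(cvs.points(df, 'x', 'y', ds.mean('value')))  # average attribute per pixel
img_max   = tf.shade(cvs.points(df, 'x', 'y', ds.max('value')))   # worst case per pixel
```

Same data, three legitimately different pictures; a Z-buffered scatterplot silently gives you a fourth (last point drawn wins) without ever telling you a choice was made.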
Your first comment - about "not needing all the geometry present" - indicates that you are not understanding the nature of the problem datashader was designed to solve. There is no simple "cull" function for data science; there is no simple "Z" axis on which to sort, smush, blend, etc. At best, your data points can be projected into some kind of Euclidean space on which you can implement a fast spatial subdivision or parallel aggregation algorithm. But once that's done, you're still left holding millions of partitions of billions of points or primitives, each with dozens of attributes.... what then?
I'm not sure why you would coin a term 'Abstract Rendering' and talk about 'out-of-core rendering', then turn around and say that transforming high-dimensional data sets is part of rendering. Rendering is well defined and very established; coming up with transformations and calling them part of rendering is nonsense. You made this mess yourself by trying to stretch the truth.
Ah, the good old "MapReduce is basically functional programming 101" trope, usually resulting from a fundamental misunderstanding of the problem the framework / tool in question solves.
Well, in that particular case, Google has brought this on themselves by naming their sort-of-a-product after a common FP idiom, and then hyping the hell out of it.
It doesn't matter if something has been done before if the new way hooks into new systems. Tools do not exist in a vacuum. All tools are part of a system of tools that should always be considered when evaluating any component piece.
I also was expecting something new, but in their defense they've made a very appealing version of something old. I'm sure there are a lot of people out there who haven't thought about saturation with "large" data sets before.