This article misses one of the coolest things about the Zarr format - that it's flexible enough that it's also becoming widely used in climate science.
In particular the Pangeo project (https://pangeo.io/architecture.html) uses large Zarr stores as a performant format in the cloud which we can analyse in parallel at scale using distributed computing frameworks like dask.
>flexible enough that it's also becoming widely used in climate science.
For scientific instruments, just the opposite took place in the 1990s with chromatography.
They adapted a climate file type for laboratory instrument use.
It was quite a saga, but in a nutshell it was the manufacturers themselves who understood the benefit of a common file type once computers had taken enough of a hold. Most lab analyses didn't need a computer except for the most data-intensive ones, which could never wait and were being handled by mainframes or minis before the PC appeared, and only by those who could afford such costly investments, like oil companies and drug companies, which are some of the most desirable customers for instrumentation.
At the time HP's lab instrument division (spun off decades ago as part of Agilent) had the most aggressive sales approach, but if a potential customer was already entrenched with one of the few other vendors there was no efficient way to carry their past proprietary data forward into a new system.
So the vendors got together, formed the Analytical Instrument Association (AIA) and began work in earnest toward a common file type they could all agree on for data interchange, without having to abandon each of the file types they had developed earlier which remained in everyday use within their proprietary software.
Once this had been accomplished it would be more of a battle of the salespeople with one less (big) technical obstacle to overcome.
They ended up choosing a well-established, extensible file type, netCDF, which was in wide use at NOAA without encumbrance. They tailored it to the task, and after a number of years the new standard was almost complete; major vendors were already supporting it.
Well, by that time HP had already become dominant so they stealthily lost interest for a while before pulling out of AIA completely.
Fortunately, with no further progress in sight, the whole thing was dropped in the lap of ASTM, where it has remained unchanged for over two decades. That is even more fortunate because hindsight has shown that the difference between an ideal standard and a non-ideal one is that an ideal standard never changes, and this is one of the best examples. These are often referred to as CDF files even though that does not stand for Chromatography Data File; they were simply named netCDF by the weather service back in the mainframe days.
So for more than 20 years each PC chromatography data system has had the capability of handling AIA files for either input or output. The underlying proprietary files which each software package utilizes under the hood for real-time data acquisition may have some unique features, but the common netCDF interchange covers all the essential bases needed for post-processing. Anything less is sub-standard.
Only in the last few years has a major software package appeared which, for the first time, does not fully support CDFs, and that is the latest offering from Agilent, of all places.
It can only create optional CDF files from the data acquired by their own instruments; so far there is no way to import a standard CDF file for processing by their software any more.
Just like open access, I think at some point this sort of interoperability just has to be demanded by those actually funding the research. Too many narrow interests with no spare effort to coordinate otherwise.
I'm in a field that would have to change significantly if funding agencies demanded open access. I would love it.
Right now there's basically zero funding for people who work on making our data open. It's kind of a hobby project that you do if you have free time and no one has found a real job for you.
I don't think anyone calls for funding just for making the data open. Funding should be for some scientific goal, with the added requirement that all data and code are made open source/open access once the funded project reaches a milestone (e.g., when publishing a paper).
I think part of the challenge here is that in some fields, proprietary tools and formats are so ingrained that requiring openness could be a massive burden to the point where it would be difficult to put together a reasonable budget that includes real open data access.
I think it's important to acknowledge here that there are degrees of open access. Simply providing data files is relatively easy. Making reproducible workflows as a non-developer can be nearly impossible in some cases.
What is ingrained is the idea among my collaborators that open data is kind of an "us vs them" problem: it's our data, and the public hasn't been demanding access, but if we gave it to them they would just scoop us and steal all the science, or they would do something stupid and force us to disprove it.
It's all quite speculative but the limited data releases so far haven't caused the field to collapse as some doomsayers predicted. Meanwhile I'd just like more open data because it would make collaboration easier.
> Making reproducible workflows as a non-developer can be nearly impossible in some cases.
Why is that the case? Size?
I am very familiar with dealing with old proprietary things that have to be fought tooth and nail to give you a simple text format, but that doesn't seem to be anywhere near the scale of trouble you describe.
Old proprietary things can definitely be part of the problem. A lot of people will just cut and paste various scripts, download various packages, and make random edits until things work. I can't blame them. These things can be hard and for those not from a computer science background, it can be difficult to comprehend how things actually work.
Just getting old software to compile can be a challenge in itself. You could say to just put everything in a Docker container. Then when you have it working, anyone can just use that container. That's not so easy for someone who doesn't even know what Docker is. I'm not saying that to suggest people are incompetent. As developers, we really should do better to make things easy.
It's almost the same in the field of ELN/LIMS. Researchers are required to use one, but there's no wider strategy mandating that the systems be interoperable.
Microscope vendors have gone to incredible efforts to obfuscate their file formats. Some inventing custom compression codecs. I worked for a company that spent enormous amounts of time reverse engineering these protocols.
When you're dealing with a specific subdomain, it's certainly possible you can exploit known properties of data in that domain to achieve better compression than a general purpose algorithm.
That said, it's certainly worthwhile benchmarking (de)compression speed and the compression ratio to see if the difference is actually meaningful enough to be worthwhile.
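For example, a quick-and-dirty benchmark along these lines (Python standard library only; the random uint16 array is a placeholder for your own data, and zlib/lzma just stand in for whatever codecs you're considering) usually settles the question:

```python
import time
import zlib
import lzma
import numpy as np

# Placeholder data: swap in a real image plane from your instrument.
data = np.random.randint(0, 4096, size=(1024, 1024), dtype=np.uint16).tobytes()

for name, codec in [("zlib", zlib), ("lzma", lzma)]:
    t0 = time.perf_counter()
    compressed = codec.compress(data)
    t1 = time.perf_counter()
    codec.decompress(compressed)
    t2 = time.perf_counter()
    print(f"{name}: ratio {len(data) / len(compressed):.2f}, "
          f"compress {t1 - t0:.3f}s, decompress {t2 - t1:.3f}s")
```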
I'd hope imagemagick or some other program can at least get the data out into another format. I did not know about openslide before reading that mail, but I'll try to keep it in mind.
Here's hoping this eventually gets fixed in a similar sort of way the .mkv format "fixed" video.
I am not sure if the video actually says the name of the microscope, though, because he was mad at them for not testing it at high altitude where he lived at the time.
Mochii starts at $48K for the imaging only unit. The spectroscopy-enabled version, which provides full featured x-ray spectroscopy and spectrum imaging, is $65K.
It has an integrated metal coater option available for $5,000, and we offer a variety of optical cartridge exchange programs that can fit your consumables utilization and pricing needs.
I feel like this is just going to lead to another design-by-committee overly complex standard that very few can actually implement, and the desire to reinvent instead of reuse existing standards with some additions is responsible for much of that. A lot of the features they mentioned like scalable compression / "region of interest" were also present in JPEG2000. IMHO "reduce, reuse, recycle" works well as a software design principle too.
Yes, it's really frustrating that Zarr is promoting these exploded file/object trees and not providing a viable standard serialization for an entire image.
It's ironic that they mention FAIR but don't see how much better it would be to have a serialized byte-stream you can cite and share with checksums etc.
Absolutely. I spent an entire summer writing translation layers for a bunch of different microscopes so that they could interface with the detector software of the company I was interning for. It was tedious work!
> Each pixel must be labelled with metadata, such as illumination level, its 3D position, the scale, the sample type and how the sample was prepared.
Each pixel? Why? All of those except the 3D position apply to all the pixels in a given image, and the (2D) position of a pixel can be inferred from its location in the image.
Wait - are there optical microscopes that can create 3D images? I know you can see a 3D image if you peer into a binocular microscope, but AFAIK cameras for those things are always 2D cameras.
I use the Zarr format (for climate science data rather than microscope data), and I think this is just poor wording in the article. In the Zarr specification the metadata is stored separately from the actual chunks of compressed array data. So the metadata applies at the array level, not the pixel level.
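A minimal zarr-python sketch of what that looks like (v2-style API; the attribute names below are made up for illustration, not the official OME-Zarr keys):

```python
import numpy as np
import zarr

# One array, stored as many compressed chunk files on disk.
z = zarr.open("example_image.zarr", mode="w",
              shape=(4, 2048, 2048),        # channel, y, x
              chunks=(1, 512, 512),
              dtype="uint16")
z[:] = np.random.randint(0, 4096, size=z.shape, dtype="uint16")

# Metadata is written once, as JSON, alongside the chunks - not per pixel.
z.attrs["illumination_level"] = 0.8
z.attrs["pixel_size_um"] = [0.325, 0.325]
z.attrs["sample_preparation"] = "FFPE section, H&E stained"
```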
> Wait - are there optical microscopes that can create 3D images?
I think so - they do it by scanning lots of images at different focal lengths to create a 3D section (I think?). There are whole projects just for visualizing the multi-terabyte 3D image files produced - Napari is an open-source image viewer which opens OME-Zarr data.
They don't really "label every pixel" in the sense that I think about it.
Instead, they have a collection of dense arrays representing the image data itself, then have metadata at the per-array, or overall level.
A typical dataset I work with is multidimensional; it starts as:
1) 2D planes of multiple channel image intensities, typically 5K x 5K pixels, each covering just part of an overall field of view. These are like patches when you do panoramas- take 20 partly overlapping shots. Each plane contains multiple channels- that could be "red green and blue" or more complicated spectral distributions.
2) 3D information- the microscope takes photos at various depths, only the "in-focus" (within the volume of view) information. These can be stacked (like depth stacking) or turned into a 3D "volume".
3) Maybe the data was collected over multiple time points, so (1) and (2) repeat every hour. Other parameters- like temperature, etc, could also represent an entire dimension.
4) Every 2D plane has its own key-value metadata, such as "what color channels were used", "what objective was used" (magnification), and lots of other high-dimensional attributes (that's what they mean by "each pixel must be labelled with metadata"; the 3D position is the same for every pixel in a 2D plane).
Generally all of this is modelled as structures, arrays, and structures of arrays/arrays of structures. In the case of OME-zarr, it's modelled as an n-dimensional array with dimensions expressed in a filesystem hierarchy (first dimension typically the outermost directory, innermost dimension usually a flat file containing a block of scalar values in some compressed storage). Then at each level of the directory you have additional .json files which contain attributes at that level of the data.
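Roughly like this, if you squint (a toy zarr-python sketch, v2-style API; this is not the exact OME-Zarr/NGFF layout and all the names are invented):

```python
import zarr

root = zarr.open_group("dataset.zarr", mode="w")
root.attrs["experiment"] = "time-lapse tile scan"        # top-level metadata

pos = root.create_group("position_00")                   # one stage position / patch
pos.attrs["stage_xy_um"] = [1032.5, -250.0]

# t, c, z, y, x array; each chunk becomes one compressed file in the directory.
img = pos.create_dataset("0", shape=(3, 5, 16, 2048, 2048),
                         chunks=(1, 1, 4, 512, 512), dtype="uint16")
img.attrs["channel_names"] = ["DAPI", "GFP", "RFP"]      # group/array-level metadata
```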
Those partly overlapping 2D planes are often assembled into panoramas, which can be a lot more convenient to work with. There are various tools for working with this - I've used JavaScript map-navigation viewers, but napari is a desktop app with full support for sectioned viewing of high-dimensional (7D) data.
OME-zarr is nice because it sort of uses the same underlying tech that the machine learning folks use, and it's ostensibly "optimized for object storage". I still have lots of complaints about the implementation details, but it's important for me not to distract the OME-zarr team from making the standard successful.
Probably not all these things need per-pixel metadata, but anisotropy exists and that means many of the variables you'd think are per-exposure are actually dependent on where in the exposure the individual pixel is. For instance, illumination level isn't uniform across a field of view for all cameras, and may need to be normalized.
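For example, a rough flat-field correction sketch with numpy (the synthetic vignette below just stands in for a measured illumination profile; a real pipeline would use a measured flat and dark frame):

```python
import numpy as np

def flat_field_correct(raw, flat, dark):
    """Divide out the illumination profile measured from a uniform reference."""
    gain = flat.astype(np.float64) - dark
    gain /= gain.mean()                               # unit-mean gain map
    return (raw.astype(np.float64) - dark) / np.clip(gain, 1e-6, None)

# Synthetic example: a vignetted field that falls off towards the edges.
yy, xx = np.mgrid[0:512, 0:512]
vignette = 1.0 - 0.3 * (((yy - 256) ** 2 + (xx - 256) ** 2) / 256 ** 2)
raw = 1000.0 * vignette
corrected = flat_field_correct(raw, flat=2000.0 * vignette, dark=np.zeros_like(raw))
print(corrected.std())                                # ~0 after correction
```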
As far as I know, not technically (although I haven't kept up in the area), but you can definitely sweep through a volumetric sample; there are microscopes that can for example illuminate a thin z-plane of a transparent sample and collect the image or those that can reject out-of-focus (off-z) light for a particular z-plane, then move to another z-plane, etc. and then generate a volume on the software side.
Yes, I think you are talking about light-sheet microscopy. This tries to illuminate a thin layer and then images with a 2D sensor like in a normal digital camera. This is often a monochrome sensor, and extra metadata records what kind of optical filters were in place. A multi-channel image would then have separate planes captured with different optical band-pass filters. You need metadata to know how a set of planes relate to each other, whether shifting through space or changing filters or just measuring the same configuration again at different time points.
There are also confocal "laser-scanning" microscopes which effectively illuminate a single point in space and image with what is effectively a single monochrome pixel sensor. In this case, the raw signal is pretty much a time series like you might imagine with digital audio. You need metadata to tell you how the optics were being shifted around, to interpret each sample as representing a point in space at a particular time and with what filters.
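To make that concrete, here's a hedged numpy sketch (the scan geometry and the bidirectional-scan fix-up are purely illustrative, not any vendor's actual format):

```python
import numpy as np

lines_per_frame, pixels_per_line, n_frames = 512, 512, 40

# Stand-in for the raw detector stream: one intensity sample per dwell point.
samples = np.random.randint(0, 4096, dtype=np.uint16,
                            size=n_frames * lines_per_frame * pixels_per_line)

# The metadata tells you how to fold the time series back into planes.
frames = samples.reshape(n_frames, lines_per_frame, pixels_per_line)

# Bidirectional scanners sweep alternate lines right-to-left; flip them back.
frames[:, 1::2, :] = frames[:, 1::2, ::-1].copy()
```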
Normally, the first round of interpretation is done during the initial file save, packing measurements into a file format that still expects planar, rectangular pixel arrays with a regular grid spacing for the neighboring pixels. You pretend the whole image plane was captured at a time point, but it was really measured sequentially somewhat like a cathode-ray tube displays an image one pixel at a time. A session might produce many such planes, and metadata is still needed to understand whether these planes represent the same plane observed over time and/or parallel planes spanning a volume and/or different channels of the same plane.
On the other end of the spectrum, there are slide scanners which image a much larger stage area by shifting around a 2D camera and taking many overlapping images. So a 2048x2048 sensor might be used to produce a 100,000 x 60,000 pixel image plane. Each sub-image would record its position within the stage area and these do not necessarily line up at precise multiples of image pixels. There might also be gaps where the scanner first detects the overall shape of specimen(s) on the slide and then plans a set of images to cover the specimen while skipping over the background area. This sparsity saves time for imaging as well as storage and transfer times by skipping empty areas.
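A toy version of that tile placement (offsets and tiles below are made up; real stitching needs sub-pixel registration and blending across the overlaps):

```python
import numpy as np

tile_h, tile_w = 2048, 2048
tiles = [np.random.randint(0, 4096, (tile_h, tile_w), dtype=np.uint16)
         for _ in range(3)]
# Stage positions already converted to pixel offsets, with ~10% overlap.
offsets = [(0, 0), (0, 1843), (1843, 0)]              # (y, x) per tile

mosaic = np.zeros((max(y for y, _ in offsets) + tile_h,
                   max(x for _, x in offsets) + tile_w), dtype=np.uint16)
for tile, (y, x) in zip(tiles, offsets):
    mosaic[y:y + tile_h, x:x + tile_w] = tile         # last tile wins in overlaps
```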
Haven't seen the details, but I wonder what's the main difference between this OME-Zarr format and DICOM WSI, which is mostly aimed at the same type of images.
Thinking out loud, this looks like the consolidation of several formats and projects. Bioformats, OME, Zarr...
Back in the day I had the feeling that the OME/Bioformats people were more centered on research, and the DICOM crowd were mostly from the clinical side.
Spent a decade managing microscopy data and tbh I don’t know how important or useful this is. People can share data in whatever format they have and it wouldn’t be hard for me or others to import it one way or another. Not that I have ever felt the need to do such verifications. It’ll take weeks to months to do anything like that.
There's real overhead. For instance, in one spec that comes to mind the vendor starts the (0,0) pixel in the lower right. The spec calls for (0,0) to be the lower left. That leads to an inverted handedness for all images. Not a problem in projection, but as soon as you start in 3D, it creates a real mess. Some software packages try to detect when the camera was this vendor's, and correct for it (silently). Other software packages refuse to correct for it, because they don't have a reliable way to detect it. A pipeline might have 3-4 of these software packages involved, and eventually you have no idea if your handedness is correct and no real way to tell without being at a high enough resolution that you can use biological landmarks of known handedness.
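The fix itself is a one-liner, which is part of why it slips through silently (numpy sketch; figuring out when to apply it is the actual problem):

```python
import numpy as np

def fix_origin_lower_right(plane):
    """Mirror the x axis so (0,0) moves from lower right to lower left."""
    return np.flip(plane, axis=-1)

# Ideally the acquisition metadata records the camera/vendor explicitly so
# downstream tools can tell whether this correction has already been applied.
plane = np.arange(12).reshape(3, 4)
print(fix_origin_lower_right(plane))
```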
Don't get me started on Euler angles.
Caveat being I'm more used to electron microscopy, maybe these things aren't as important with light microscopy because the resolutions are lower?
> People can share data in whatever format they have and it wouldn’t be hard for me or others to import it one way or another.
It's one thing for a highly-skilled user with a decade of experience to be able to import it eventually, another for an unskilled user to just have off-the-shelf tooling do it the same way for everyone automatically. This is more about meta-studies and unlocking new use cases. It sucks that the state of the art for sharing academic microscopy data right now is basically looking at raster images embedded in PDFs, or emailing the authors and then writing a custom Scikit-Image script. Imagine if you had to read a PDF catalog and then email someone to order something off Amazon, or if your favorite CRUD app consisted instead of having an expert read a PDF and email screenshots to you. What if sending those very emails to different recipients required implementing each user's custom IMAP-like mail client? That sounds absurd, but it's kind of the way academic data sharing works now; lots of people are re-inventing the wheel and creating custom file formats.
Consider, for example, the work of Dr. Bik (example at [1]) who identifies cloned sections from microscopy data. Or what if, instead of each researcher having to generate their own images, or get lucky and remember a particular image there was a Getty Images/AP Newsroom platform where you could just filter for your particular subject and imaging parameters and share your data. A collection of proprietary RAW files with randomly-formatted Excel documents for metadata would allow individual researchers to get their work done, but would be pretty worthless in comparison.
I work with microscopy data and we have to convert all the proprietary-but-still-readable-by-some-random-package image data generated by microscopy companies to ome-tiff/ome-zarr for it to be in a manageable format. I think it's great!
I write a slide management platform that doesn't convert images to a single format and it mostly works. From time to time you do get issues though (cause you see millions of slides). Sometimes vendor's own libraries fail to open the slides (cough Phillips cough). Or there is a new version of firmware that sets some flag that wasn't used before.
We support DICOM supp 145 too, but it's no panacea. There are still vendor specific quirks. The "surface" is larger (cause you expect all the metadata to be there in the standard format) so you still sometimes see differences.
i hope a new format, be it N5 or OME-zarr, achieves critical mass. but i'm not holding my breath. NIfTI never really caught on, despite having a very nice spec.
unless scope manufacturers unite behind a format, biologists are going to keep writing bespoke analysis pipelines for the formats their images come in. it's going to take some top down regulation from the NSF/NIH requiring that all scopes support the open format of choice by a certain date to be eligible for purchase with grant money.
Without knowing anything about this, I wonder if standards are trying to work at the wrong level of abstraction. Like worrying about whether to use GIF or PNG to compress map images, when really it would be better to standardize the underlying data like OpenStreetMap. A better goal might be to decide on a single container format for discrete cosine transform (DCT) data in compressed images like how JPEG works.
We probably need a multidimensional lossy compression scheme for 2D images, 3D volumes/video, and 4+D metadata involving multiple spectrums, exposures, etc. I'd probably start with a multidimensional DCT with a way to specify the coefficients/weights of each layer and metadata (optionally lossless).
Then store the data in zipped JSON or BSON, or maybe TAR then ZIP.
Then I'd write an importer and exporter from/to all of the other formats, with a single 0-1 quality argument or maybe an array of qualities for each layer, with a way to specify coefficients or load them from files.
Once a known-good implementation worked, I'd add additional compression schemes that can compress across dimensions, so that a level of zoom only has to save a diff from a previous layer at a slightly different scale or focus (like a 3+D MPEG format, this is probably the standard that we're currently missing, does anyone know of one?).
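Something in this direction, very roughly (a toy whole-array sketch using scipy's n-D DCT; a real codec would be block-based with proper quantization and entropy coding, and none of this corresponds to an existing standard):

```python
import numpy as np
from scipy.fft import dctn, idctn

def lossy_ndct(volume, keep=0.05):
    """Keep only the largest `keep` fraction of n-D DCT coefficients."""
    coeffs = dctn(volume.astype(np.float64), norm="ortho")
    threshold = np.quantile(np.abs(coeffs), 1.0 - keep)
    coeffs[np.abs(coeffs) < threshold] = 0.0          # sparse, so it compresses well
    return idctn(coeffs, norm="ortho")

vol = np.random.rand(64, 64, 64)                      # stand-in for a 3D volume
approx = lossy_ndct(vol, keep=0.05)
print("mean relative error:", np.abs(approx - vol).mean() / vol.mean())
```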
This all seems fairly straightforward though, so I wonder if proprietary software patents are the real blocker...
The OME-zarr creators are pretty experienced in this field and have already done the work you're describing. Let's not create another standard after OME-zarr.
Hey thanks, I tend to agree. Although if I read this right, OME-zarr punted on lossy compression (which is all that really matters) so IMHO it doesn't have much use in the real world yet even though people evidently put a lot of work into it:
i.munro (Ian Munro) August 15, 2023, 8:54am
On the original point about size I suspect your answer is compression.
I have found that a lot of image formats use lossy compression.
I’m not sure about CZI specifically.
AFAIK ome-zarr (sensibly IMO) uses only lossless compression.
Best
Ian
and:
sebi06 (Sebi06) August 15, 2023, 1:42pm
Hi all,
for CZI we offer currently two compression methods:
JPG-XR
ZSTD
So we have lossless and lossy options depending on your needs. All of them are implemented in our C++, .NET or Python APIs and also supported from BioFormats.
CZI is basically JPEG-style compression and is probably missing the multidimensional compressor, which is the secret sauce missing from the various formats. It also mentions licensing and legal terms, so it is maybe lost in the weeds and not open source.
Without more info, it feels like this is still an open problem.
My interest in this is for video games, since so many are multiple GB in size but are also missing this multidimensional compressor. If we had it, I'd guess a 10:1 to 100:1 reduction in size over what we have now, with no perceptual loss in quality.
Please don't use lossy compression for scientific data.
Disk space is cheap relative to people-time and data-collection-time.
"Perceptual" is misleading- we're not just humans looking at jpegs. Many people are doing hard-code quantitative analysis for machine learning. better to downscale the images (2X, 4X, 8X or more) than to use lossy compression if you are truly low on disk budget.
It sounds like there should be a standard for compression of scientific data, probably as a noise ceiling instead of a compression ratio, if the noise is normally distributed (a big if). But if I had to choose, I would side with you and vote to use lossless in order to avoid a whole class of errors that might come from lossy compression. Especially when we can get a 1 TB SSD for under $50.
To play devil's advocate, I think there might be an opportunity for lossy compression of images below the Nyquist rate of the microscope.
I only bring this up because I worked with industrial cameras at a previous job and even the best lenses only got us to maybe 1/2 or 1/4 of the resolution that the sensors were capable of. Microscopes should probably include an effective resolution in each image file's metadata. If they don't, then we have no data on how noisy the image is, so this is all a moot point.
Next, when you make a point like "we can get a 1 TB SSD for under $50", it shows you're not understanding the domain microscopists work in. Their scopes typically cost $100K (with a camera for another $50K, and a stage that costs $10K). The cost of the storage system (or parking the data in S3) for 10 years is usually far less than the capital cost of the scope, or the operational expense of maintaining staff on site.
I believe the effective resolution can already be computed from the captured metadata, which includes the objective's numerical aperture, as well as the frequency used for imaging (many microscopes today use lasers to image with a very specific frequency to cause fluorescence, which itself has a fairly tight output spectrum).
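For example, a back-of-envelope Rayleigh-criterion calculation from exactly that metadata (the wavelength and NA values below are just illustrative):

```python
def rayleigh_resolution_nm(emission_wavelength_nm, numerical_aperture):
    """Diffraction-limited lateral resolution, Rayleigh criterion."""
    return 0.61 * emission_wavelength_nm / numerical_aperture

# e.g. GFP emission (~510 nm) through a 1.4 NA oil-immersion objective:
print(rayleigh_resolution_nm(510, 1.4))   # roughly 220 nm
```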
Thanks those are better terms. I remember doing a test where the camera photographs a ceramic plate with an almost perfectly black diagonal line to calculate the effective resolution, but I can't remember what it's called, maybe the diffraction limit. It's basically a measure of how blurry the image is, where a matched lens and sensor would look like a Bresenham line with no antialiasing.
TBH, I think that the scientific community has huge blind spots, mostly due to its own gatekeeping. Some of the things like this that seem to be a struggle remind me of the design-by-committee descent of web development. The brilliant academic work of the 1990s has been mostly replaced by nondeterministic async soup and build processes of such complexity that we have to be full stack developers just to get anything done. All the fault of private industry hoarding the wealth and avoiding any R&D that might risk its meal ticket. Starving the public of grants, much less reforms that are actively blocked like UBI. Now nobody ever seems to step back and examine problems from first principles anymore. That's all I was trying to do.
Edit: I found the term for calculating the effective resolution of a camera with a high-contrast diagonal line, it's "slanted edge modulation transfer function (MTF)":
> lossy compression (which is all that really matters)
No, not in this case. This might be great for game files or whatever you are familiar with, but lossy compression would basically mean mangling the data after going through all the effort of collecting it (I suspect you have little experience with lab work?).
When doing experiments, everything is documented in full detail. Settings on machines and equipment, preparation steps, version numbers of software used for processing, …
You really don‘t want to lose information on your experiment.
Lossy compression is a bad idea for scientific images. For instance, we often need to understand the statistics of photon detection events in background regions. That’s one of the first things to get tossed.
DICOM technically supports lossy compression, but I think it's regarded as something of a historical mistake. It's pretty much never used and some viewers will plaster big warnings on lossily compressed series and others will outright refuse to load them.
ome-zarr puts metadata in JSON, not XML. There was an older standard that put the metadata in XML. I've already written converters (converted the XML schema to JSON schema); it wasn't really an issue. The raw data is in blocked, compressed numpy arrays in a directory structure on disk.
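The conversion really is mostly mechanical - something like this toy example (the XML here is made up and heavily simplified, not real OME-XML or the NGFF schema):

```python
import json
import xml.etree.ElementTree as ET

xml_meta = """<Image Name="well_A1">
                <Pixels SizeX="2048" SizeY="2048" SizeC="3"
                        PhysicalSizeX="0.325" PhysicalSizeY="0.325"/>
              </Image>"""

root = ET.fromstring(xml_meta)
pixels = root.find("Pixels")
as_json = json.dumps({"name": root.get("Name"), **pixels.attrib}, indent=2)
print(as_json)          # could then be written out as a .zattrs-style JSON file
```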
Why bother? Someone will come up with a model that will just suck it all up and convert it automatically... and by someone I mean some sufficiently motivated LLM.
More and more climate science data is being made publicly available as Zarr in the cloud, often through open data partnerships with cloud providers (e.g. on AWS (https://aws.amazon.com/blogs/publicsector/decrease-geospatia...) or ERA-5 on GCP (https://cloud.google.com/storage/docs/public-datasets/era5)).
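For anyone curious what that looks like in practice, here's a toy local round-trip with xarray + Zarr (standing in for opening one of those public cloud stores; needs xarray, zarr and dask installed, and the variable/coordinate names are invented):

```python
import numpy as np
import xarray as xr

# Write a small synthetic dataset to a local Zarr store...
ds = xr.Dataset(
    {"t2m": (("time", "lat", "lon"), 280 + 5 * np.random.rand(24, 90, 180))},
    coords={"time": np.arange(24),
            "lat": np.linspace(-89, 89, 90),
            "lon": np.linspace(0, 358, 180)},
)
ds.to_zarr("toy_climate.zarr", mode="w")

# ...then open it lazily, exactly as you would a multi-terabyte cloud store.
lazy = xr.open_zarr("toy_climate.zarr")       # dask-backed, nothing loaded yet
print(lazy["t2m"].mean(dim="time").compute())
```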
I personally think that the more that common tooling can be shared between scientific disciplines the better.