DocArray: a data structure for unstructured data

mdaniel · on Oct 5, 2022

That linked document must be written for a very specific target audience, because it uses a lot of words without giving any context for the problem it's solving. The example in their GH repo is a little more "this does what?": https://github.com/jina-ai/docarray#example-1-represent-mult... although even that is a little "don't worry where these seemingly random methods on Document came from", which https://docarray.jina.ai/api/docarray.document/#docarray.doc... seems to fill in a little more.

Then again, I guess folks were similarly confused in the past:

https://news.ycombinator.com/from?site=jina.ai

https://news.ycombinator.com/from?site=github.com/jina-ai

scott_s · on Oct 5, 2022

I felt the same. The most important thing for a new library or language to do in its introduction is to show meaningful examples that solve a problem in the target domain in the canonical way.

vladsanchez · on Oct 6, 2022

If you or others need that much explanation, it's probably not for you. Perhaps read [Multimodal Search](https://en.wikipedia.org/wiki/Multimodal_search) and see whether it clarifies things for you.

jonbaer · on Oct 5, 2022

Thanks for pointing that out. I came across it via https://qdrant.tech/documentation/install/#docarray so I was not sure what other libs were integrating it.

danbrooks · on Oct 5, 2022

Thanks, the GitHub readme is much clearer.

Q6T46nT668w6i3m · on Oct 5, 2022

I’m excited to look at this but the comparisons are misleading! E.g., NPY has existed since 2007 (I believe it was the first NumPy RFC) and is exceedingly popular; JSON can, of course, represent multi-dimensional arrays (you can’t get much simpler or readable than [[1, 2], [3, 4]]); I also don’t understand “Pythonic experience” for JSON since dictionaries are so ubiquitous; I could go on …
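To make the JSON point concrete, here is a minimal sketch using only Python's standard `json` module. The `shape` helper is a hypothetical name introduced for illustration, and it assumes the nested lists are regular (not ragged). It shows nothing about DocArray itself, only that plain JSON round-trips nested arrays.

```python
import json

# The 2x2 nested list from the comment above.
matrix = [[1, 2], [3, 4]]


def shape(nested):
    """Return the dimensions of a regular (non-ragged) nested list.

    Hypothetical helper for illustration; it only inspects the first
    element at each depth, so ragged input is not detected.
    """
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        if not nested:
            break
        nested = nested[0]
    return tuple(dims)


# Serialize to a JSON string and parse it back.
encoded = json.dumps(matrix)
decoded = json.loads(encoded)

# The same works for deeper nesting, e.g. a 2x2x2 array.
cube = [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]
cube_roundtrip = json.loads(json.dumps(cube))
```

What JSON does not carry is the element dtype or a declared shape, so a reader has to infer both from the nesting. That may be the gap the comparison was gesturing at, though the original article does not say so.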

fhaltmayer · on Oct 5, 2022

The description of this is kind of confusing but I think the easiest way to understand it is that it is a data processing pipeline of sorts. Take unstructured data and apply transformation and computation. A similar project to this is Towhee (https://github.com/towhee-io/towhee). This project tries to simplify unstructured data processing and provides pretrained models and pipelines from their hub.
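As a rough illustration of that "pipeline of sorts" reading, here is a toy sketch in plain Python. It does not use the DocArray or Towhee APIs; `Item`, `tokenize`, `embed`, and `run_pipeline` are all hypothetical names, and the "embedding" is a stand-in for whatever a real model would compute.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Item:
    """Hypothetical container: raw text plus whatever later steps attach."""
    text: str
    tokens: List[str] = field(default_factory=list)
    vector: List[float] = field(default_factory=list)


def tokenize(item: Item) -> Item:
    # Transformation step: lowercase and split on whitespace.
    item.tokens = item.text.lower().split()
    return item


def embed(item: Item) -> Item:
    # Computation step: a toy two-number "embedding"
    # (token count, mean token length), standing in for a real model.
    count = len(item.tokens)
    mean_len = sum(len(t) for t in item.tokens) / count if count else 0.0
    item.vector = [float(count), mean_len]
    return item


def run_pipeline(items: List[Item], steps: List[Callable[[Item], Item]]) -> List[Item]:
    # Apply each step, in order, to every item.
    for step in steps:
        items = [step(item) for item in items]
    return items


results = run_pipeline([Item("Hello big World"), Item("")], [tokenize, embed])
```

Each step takes an item and returns it with more fields filled in, so steps can be reordered or swapped without changing the container.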

jbverschoor · on Oct 5, 2022

> What is DocArray?

Still don't know what it is.

nsxwolf · on Oct 5, 2022

It's a data structure for unstructured data.

BiteCode_dev · on Oct 5, 2022

And it's got electrolytes.

jbverschoor · on Oct 5, 2022

But it's also like a library and like protobufs, but it's named after arrays. I don't see how the format works, etc.

In other words, it's everything, which usually means it's really bad at everything, and in reality it's nothing.

So again, what is it? The first screenshot is some Python code, and then it's talking about Hugging Face.

I'd call it 0clarityArray.

nsxwolf · on Oct 5, 2022

I have the same impression.

"If you are a deep learning engineer who works on scalable deep learning services, you should use DocArray: it can be the basic building block of your system."

I mean, wow. The basic building block of your system. The very nucleus of any scalable deep learning service! But, what is it?

hnaccountme · on Oct 6, 2022

There is no such thing as unstructured data. If it were unstructured, it would be noise. What people call unstructured data is what you get when there was little to no foresight in designing the original data structures: new data fields just get piled on later and called unstructured.