Show HN: EVA – AI-Relational Database System (github.com/georgia-tech-db)
237 points by jarulraj on April 30, 2023 | 36 comments
Hi friends,

We are building EVA, an AI-Relational database system with first-class support for deep learning models. Our goal with EVA is to create a platform that supports AI-powered multi-modal database applications operating on structured (tables, feature vectors, etc.) and unstructured data (videos, podcasts, PDFs, etc.) with deep learning models. EVA comes with a wide range of models for analyzing unstructured data, including models for object detection, OCR, text summarization, audio speech recognition, and more.

The key feature of EVA is its AI-centric query optimizer. This optimizer is designed to speed up AI-powered applications using a collection of optimizations inspired by relational database systems. Two of the most important optimizations are:

+ Caching: EVA automatically reuses previous query results (e.g., inference results), eliminating redundant computation and saving you money on inference.

+ Predicate Reordering: EVA optimizes the order in which query predicates are evaluated (e.g., running faster, more selective deep learning models first), leading to faster queries.

Besides saving money spent on inference, EVA also makes it easier to write SQL queries to set up multi-modal AI pipelines. With EVA, you can quickly integrate your AI models into the database system and seamlessly query structured and unstructured data.
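
As a quick illustration (the table name here is a placeholder, and the query pattern mirrors examples shared later in this thread), a single EVA query can run a deep learning model over video frames stored in the system:

  -- Sketch: run the YOLO object detector over the first 20 frames of a video
  SELECT id, YoloV5(data)
  FROM SurveillanceVideos
  WHERE id < 20;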

We are constantly working on improving EVA and would love to hear your feedback!




From an initial reading, it seems like it should be possible to support pure NLP tasks with this, but there weren't examples of these in the documentation, so I'm not sure. Does it support NLP models?

Ex: Could I have a store of articles and run NLP tasks against it?


Great question! Yes, EVA supports NLP pipelines thanks to our recent integration of Hugging Face pipelines last month. Here is an illustrative text classification application:

  -- Text classification application in EVA
  CREATE TABLE IF NOT EXISTS MyCSV (id INTEGER UNIQUE, comment TEXT(30));

  LOAD CSV 'csv_file_path' INTO MyCSV;

  CREATE UDF HFTextClassifier
  TYPE HuggingFace
  'task' 'text-classification';

  SELECT HFTextClassifier(comment) FROM MyCSV;
EVA supports many other NLP pipelines, including summarization and text2text generation.
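
For instance, a summarization pipeline can be set up with the same pattern (the UDF name below is illustrative; the 'summarization' task string follows Hugging Face's pipeline naming):

  -- Sketch: text summarization over the same table
  CREATE UDF HFTextSummarizer
  TYPE HuggingFace
  'task' 'summarization';

  SELECT HFTextSummarizer(comment) FROM MyCSV;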

[2] is an illustrative notebook that presents an HF-based object segmentation pipeline (not NLP-based though). We would love to jointly explore how to best support your NLP pipeline. Please consider opening an issue with more details on your use case.

[1] https://github.com/georgia-tech-db/eva/blob/4fa52f893e7661d4...

[2] https://evadb.readthedocs.io/en/latest/source/tutorials/07-o...


Very cool. Also, love seeing rambling wrecks from Georgia Tech here!

While this is a very cool project, a very obvious demo that people can pick up and use would make it stand out in the current ecosystem of similar tools.


Thanks! Likewise :)

Thanks for the suggestion! I just added links to the demo applications earlier in the README. All applications are Jupyter notebooks that you can open in Google Colab.

* Examining the emotion palette of actors in a movie: https://evadb.readthedocs.io/en/stable/source/tutorials/03-e...

* Analysing traffic flow at an intersection: https://evadb.readthedocs.io/en/stable/source/tutorials/02-o...

* Classifying images based on their content: https://evadb.readthedocs.io/en/stable/source/tutorials/01-m...

* Recognizing license plates: https://github.com/georgia-tech-db/license-plate-recognition

* Analysing toxicity of social media memes: https://github.com/georgia-tech-db/toxicity-classification


I personally wouldn't put the Emotion one first on the GitHub README. That was the only one I opened before clicking the license plate one, where I (a) saw it was a whole other GitHub repo and (b) opened two files to see both doing parsing/loading of models without any SQL, before getting bored and closing the project.

Maybe I'm not the target market, but the 2nd and 3rd examples in your list here, which actually have SQL query examples, were much more interesting and relevant IMO.


Thanks for the helpful suggestion! Just reordered the examples in the README.

Here are the illustrative queries:

  -- Object detection in a surveillance video
  SELECT id, YoloV5(data)
  FROM ObjectDetectionVideos
  WHERE id < 20;

  -- Emotion analysis in movies
  SELECT id, bbox, EmotionDetector(Crop(data, bbox))
  FROM HAPPY JOIN LATERAL UNNEST(FaceDetector(data)) AS Face(bbox, conf)
  WHERE id < 15;


Could you turn this into a psql extension? If this is integrated into an actual database that can be used in production, this may have a future. Otherwise no one will touch this, and it'd be yet another useless and cute experiment from academia.

edit: thank you for clarifying, it looks like this is not a new database engine but rather a cache/query layer.


Thanks for the helpful suggestion! EVA uses a SQL database system (via SQLAlchemy) to manage structured data. It runs on PostgreSQL out of the box; you only need to provide the database connection URL in the EVA configuration file.

Thanks for your candid comment. We take it very seriously. EVA is already being used in production by some collaborators and we would love to support more early adopters :) Please let me know if I can DM you to get more feedback.


Nice.

I've skimmed over the documentation and it wasn't clear. It looked like the database was designed from scratch. If this is caching/syntactic sugar over a mix of DB and inference queries, that's interesting and feels a lot less risky.


Thanks for following up on this.

We designed EVA from scratch for managing unstructured data (e.g., video, audio, images). EVA leverages relational database systems to manage structured data and widely used libraries like FAISS [1] to manage feature embeddings. We aim to leverage decades of experience in relational database systems and reduce risk in production deployment.

[1] https://github.com/facebookresearch/faiss


Do you support weighted similarity search? I.e., when I have several embeddings and need to put a weight factor in front of the cosine similarity when I'm performing a query?

Faiss seems like an excellent choice. How do you get the vectors into it from the database? Or are they stored separately? I’m currently using pgvector and it’s not GPU optimized. But the advantage is that it enjoys the same levels of data protection as the rest of the database.

Actually, are there any vector similarity search query sample? I see the feature extractor, but can’t seem to find any similarity search samples.


Great questions!

EVA does not currently support weighted similarity search. We are working on a notebook to illustrate similarity queries, but EVA already supports queries of this form:

  -- Step 1: Extract objects in Reddit images using the YOLO object detector
  CREATE TABLE reddit_dataset_object (name, data, bboxes)
  AS SELECT name, data, bboxes FROM reddit_dataset
  JOIN LATERAL UNNEST(YoloV5(data)) AS Obj(labels, bboxes, scores);

  -- Step 2: Build index over features extracted using SIFT
  CREATE INDEX reddit_sift_object_index
  ON reddit_dataset_object (SiftFeatureExtractor(Crop(data, bboxes)))
  USING HNSW;

  -- Step 3: Retrieve the top 10 most similar images
  SELECT id FROM reddit_sift_object_index
  ORDER BY Similarity(SiftFeatureExtractor(Open('input_img_path.jpg')),
           SiftFeatureExtractor(data))
  LIMIT 10;
https://github.com/georgia-tech-db/eva/blob/bfd424fd5beb3cec...

EVA directly persists the feature vectors in a FAISS index. It does not use a relational database system for this purpose. FAISS supports retrieving the original vector through ID (required for similarity search).

We would love to jointly explore how to support such weighted similarity search queries. Please consider opening an issue with more details on your use case.


I'm having trouble understanding what this does. Does it let you compose models via a SQL-like syntax?


That’s correct! You can compose multiple models in a single query to set up useful AI pipelines.

Here is an illustrative query that chains together multiple models:

   -- Analyse emotions of faces in a video
   SELECT id, bbox, EmotionDetector(Crop(data, bbox)) 
   FROM MovieVideo JOIN LATERAL UNNEST(FaceDetector(data)) AS Face(bbox, conf)  
   WHERE id < 15;


Global Defence Initiative selected


Welcome back commander


This looks very interesting. I am thinking of testing it out to see its accuracy for text detection and extraction in multiple PDFs. This will sound like an amateur question, but what is the policy on the files used? Do you store them for data training? I am asking as, in the long term, I might use this on some more private files.


We are in the process of supporting a native `LOAD PDF` command. Meanwhile, you could convert each PDF into a series of images and load them using the `LOAD IMAGE` command. You could then run any text extraction user-defined function (e.g., `textract` [1]) over the loaded documents with additional filters based on your constraints (like PDF author or creation date). As EVA is designed for local usage, you can run it on local private files. We would love to jointly explore how to best support your text extraction pipeline. Please consider opening an issue with more details on your use case.
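
As a rough sketch of that interim workflow (the table name, image path, and TextExtractor UDF below are hypothetical placeholders for whatever wrapper you register around textract):

  -- Load a page image converted from a PDF
  LOAD IMAGE 'page_1.png' INTO PdfPages;

  -- Run a text extraction UDF over the loaded pages
  SELECT name, TextExtractor(data) FROM PdfPages;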

[1] https://textract.readthedocs.io/en/stable/python_package.htm...


How are you guarding against prompt injection attacks, e.g. either in the queried data, or in untrusted query parameters?


Honestly, we have not extensively thought about prompt injection attacks -- the equivalent of SQL injection attacks in AI-Relational database systems :)

If you have any thoughts on addressing this, please do share! We will incorporate that in the LLM-based functions in EVA.


I've written a whole series of posts about prompt injection that you might find useful: https://simonwillison.net/series/prompt-injection/


Hey Simon! Thanks for sharing this. I have long admired your work on Datasette :) We will check out your posts for ideas on coping with prompt injection.

I just came across your recent post on the ChatGPT SQL function in SQLite [1]. We just added a ChatGPT-based UDF in EVA [2]. I would love to hear your thoughts on the difference between these two approaches.

Another coincidence is that EVA uses SQLite for managing structured data by default. Can EVA's SQLite database be an interesting use case for Datasette?

[1] https://simonwillison.net/2023/Apr/29/enriching-data/ [2] https://github.com/georgia-tech-db/eva/pull/655


The approaches look pretty similar. My chatgpt() function is pretty much the most basic possible implementation of that pattern - it's just a SQLite custom-SQL function written in Python.

You should absolutely try pointing Datasette at that SQLite database, I imagine it would work really well!


Thanks so much for sharing your thoughts! I also felt that they are pretty similar. But, I am guessing that SQLite (similar to most relational database systems) does not automatically cache the results of functions, do non-trivial cost-based optimization for functions in queries, or reorder function-based predicates based on the estimated cost of running the functions, etc.

Edit: I have shared more details on the function-aware optimization in EVA in this post (in case you are interested) -- https://news.ycombinator.com/item?id=35764355#35773608

Sure, we will try it out and keep you posted :)


You can cache function results yourself in Python if you want to - my implementation also sums up the tokens used by the calls to the functions.

Influencing optimization isn't possible using regular Python-based custom SQL functions though. I think you can influence that stuff in SQLite if you create more complex virtual table functions, but those aren't exposed through the regular Python sqlite3 module yet.


Thanks for the clarifications. Token summation is a cool optimization :)

Query optimizers in SQL database systems typically optimize based on the time to execute the function on a local server. The token summation optimization generalizes time-based optimization of local functions to dollar-based optimization for remote functions.

Execution time-based optimization: Cost(FunctionFoo(input 1)) = 2x Cost(FunctionFoo(input 2))

Dollar-based optimization: Cost(ChatGPT(prompt with 100 tokens)) = 2x Cost(ChatGPT(prompt with 50 tokens))

We are also exploring dollar-based optimization in EVA, and will check out your openai-to-sqlite tool for ideas [1].

[1] https://datasette.io/tools/openai-to-sqlite


Very nice… any plans for supporting self-hosted LLMs like BERT, LLaMA, etc.?


Great question! We will be adding a ChatGPT-based user-defined function this week (https://github.com/georgia-tech-db/eva/pull/655/).

With LLM-based functions, EVA will support more interesting queries like this:

  SELECT ChatGPT(TextSummarizer(SpeechRecognizer(audio)),
         "Is this video related to the Russia-Ukraine war?")
  FROM VIDEO_CLIPS;
Here, EVA sends the audio of each video clip to a speech recognition model on Hugging Face. It then sends the recognized text to a text summarizer model. EVA executes both models on local GPUs. Lastly, EVA sends the text summary to ChatGPT as a part of the prompt. The ChatGPT UDF is executed remotely.
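
In case it helps to see how the local models in that query might be registered, here is a sketch that reuses the Hugging Face UDF syntax shown earlier in the thread (the task strings are Hugging Face pipeline names; treat the exact choices as assumptions):

  -- Sketch: registering the local Hugging Face models used above
  CREATE UDF SpeechRecognizer
  TYPE HuggingFace
  'task' 'automatic-speech-recognition';

  CREATE UDF TextSummarizer
  TYPE HuggingFace
  'task' 'summarization';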

The critical feature of EVA is that the query optimizer factors in the dollar cost of running models for a given AI task (like a question-answering LLM). It picks the appropriate model pipeline with the lowest price that satisfies the user's accuracy requirement.


Great, but personally I am interested in locally runnable LLMs instead of sending data to a cloud service like ChatGPT.


Got it! EVA is designed for the local use case. You can define a Python function that wraps around the LLM model and use it anywhere in the query (we refer to such functions as user-defined functions or UDFs).

This notebook illustrates a UDF that wraps around a custom PyTorch vision model: https://evadb.readthedocs.io/en/stable/source/tutorials/04-c...

These functions can be written quickly (~50 lines of Python code). Here is the built-in Resnet50 UDF in EVA: https://github.com/georgia-tech-db/eva/blob/master/eva/udfs/...

This page describes the steps involved in writing a UDF in EVA: https://evadb.readthedocs.io/en/stable/source/reference/udf....
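
For reference, once the Python file is written, registering it looks roughly like the sketch below; the UDF name, input/output shapes, and implementation path are placeholders to adapt to your model, so please treat this as an approximation rather than the exact syntax:

  -- Sketch: registering a custom PyTorch UDF from a local Python file
  CREATE UDF IF NOT EXISTS MyCustomModel
  INPUT  (frame NDARRAY UINT8(3, ANYDIM, ANYDIM))
  OUTPUT (labels NDARRAY STR(ANYDIM), scores NDARRAY FLOAT32(ANYDIM))
  TYPE  Classification
  IMPL  'my_custom_model.py';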

Please open an issue on the Github repo; we will gladly support your use case :)


Is the benefit here that EVA supports a declarative style of composition over LangChain's (or similar) imperative style?


Great question! Besides improving usability, the key feature of the EVA database system is the query optimizer that seeks to speed up exploratory queries over a given dataset and save money spent on inference.

Two key optimizations in EVA's AI-centric query optimizer are:

- Caching: EVA automatically caches and reuses previous query results (especially model inference results), eliminating redundant computation and reducing query processing time.

- Predicate Reordering: EVA optimizes the order in which the query predicates are evaluated (e.g., runs the faster, more selective model first), leading to faster queries and lower inference costs.

Consider these two exploratory queries on a dataset of dog images:

  -- Query 1: Find all images of black-colored dogs
  SELECT id, bbox FROM dogs 
  JOIN LATERAL UNNEST(YoloV5(data)) AS Obj(label, bbox, score) 
  WHERE Obj.label = 'dog' 
    AND Color(Crop(data, bbox)) = 'black'; 

  -- Query 2: Find all Great Danes that are black-colored
  SELECT id, bbox FROM dogs 
  JOIN LATERAL UNNEST(YoloV5(data)) AS Obj(label, bbox, score) 
  WHERE Obj.label = 'dog' 
    AND DogBreedClassifier(Crop(data, bbox)) = 'great dane' 
    AND Color(Crop(data, bbox)) = 'black';
By reusing the results of the first query and reordering the predicates based on the available cached inference results, EVA runs the second query 10 times faster!

More generally, EVA's query optimizer factors in the dollar cost of running models for a given AI task (like a question-answering LLM). It picks the appropriate model pipeline with the lowest price that satisfies the user's accuracy requirement.

Query optimization with a declarative query language is the crucial difference between EVA and inspiring AI pipeline frameworks like LangChain and TxtAI [1]. We would love to hear the community's thoughts on the pros and cons of these two approaches.

[1] https://github.com/neuml/txtai


Does this rely on local GPU compute on the database server? Or can it integrate with cloud-based or external GPU servers?


Yes, EVA works out of the box on an AWS server with GPU [1].

[1] https://aws.amazon.com/nvidia/


thanks, that's great!

I guess, though, I was curious whether the GPU has to be on the server itself, or whether it can harness a remote GPU. Since the database server is likely to be long-running, it would be expensive to rent GPU-enabled hardware for all the time it is online.


Thanks for the clarification.

EVA currently does not support remote GPUs. Our ongoing integration [1] of the Ray distributed compute framework [2] into EVA will soon allow us to support remote GPUs. We would love to jointly explore how to best support remote GPUs. Please consider opening an issue with more details on your use case.

[1] https://github.com/georgia-tech-db/eva/blob/master/eva/exper...

[2] https://docs.ray.io/en/latest/cluster/vms/getting-started.ht...



