Your Multimodal Data Is Constantly Evolving - How Bad Can It Get?
Navigating the Complex Landscape of Multimodal Data Management
June 12, 2024
Introduction
The data landscape has changed dramatically in the last two decades. Twenty years ago, a data scientist may have only interacted with standard structured databases such as PostgreSQL. But today, as companies race to leverage the growing capabilities of AI models, data scientists and engineers juggle multiple data types at once: text, images, video, and more. Like jumping from two dimensions to three, the shift to multimodal data is simultaneously exciting and challenging. It’s key to understand not only the changing landscape, but also how to get maximum value from multimodal data and choose the right tools.
How does your data evolve?
Multimodal data is inherently complex. But as AI use cases explode, multimodal data practices are rapidly evolving in step. A few elements typically drive this evolution.
Annotations
Annotating multimodal data is key to accurately training models during supervised learning. Annotations already enhance data richness and complexity, but annotation processes can also change over time. In medicine, for instance, breakthrough discoveries could require teams to refine their annotation process, and new medical imaging techniques might demand more granular annotations to keep models up to date.
Embeddings
Embeddings encode different types of data in a shared vector space, making it easy to identify and represent relationships between those vectors. For example, embeddings constitute a core component of user-facing recommendation engines because they enable similarity searches.
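To make the similarity-search idea concrete, here is a minimal sketch in plain Python. The catalog items and their three-dimensional vectors are made up for illustration (real embeddings typically have hundreds of dimensions), and cosine similarity is just one common choice of metric, not necessarily what any particular recommendation engine uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query, catalog, k=2):
    """Return the names of the k catalog items closest to the query vector."""
    ranked = sorted(
        catalog,
        key=lambda name: cosine_similarity(query, catalog[name]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical embeddings, invented for this example.
catalog = {
    "red sofa": [0.9, 0.1, 0.0],
    "crimson armchair": [0.8, 0.2, 0.1],
    "blue lamp": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]

print(most_similar(query, catalog))  # ['red sofa', 'crimson armchair']
```

A brute-force scan like this is fine for a handful of items; at scale, a vector index does the same ranking without comparing the query against every stored vector.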
Embeddings can change over time for a couple of reasons. Changes in the real world can lead to changes in the underlying embedding spaces. For example, social media sentiment data may shift during significant world events.
Changes can also come from within. Variations in a company’s data collection cadence and methodology, updates to an AI model, or the introduction of new modalities can all cause an evolution in the underlying embeddings.
New classifications
As a company’s AI models mature, those models can be used to extract new insights from incoming data. Imagine a company that has trained a model to detect faces. With that newly trained model, the company can create an entirely new classification set of facial emotions, and those classifications become data points in their own right.
Derived data
Companies can combine information from multiple modalities into richer, more useful representations—this is derived data. For example, a company may want to perform a sentiment analysis after collecting product reviews that include text and uploaded user images. By concatenating the two sets of embeddings, the company now has a derived dataset that combines information from both the text and images, which will be useful for training models that need to understand the relationship between the two.
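As a rough sketch of the concatenation step described above, assume each review already has a text embedding and an image embedding. The vectors below are invented, and normalizing each modality before concatenating is an assumption added here so that neither modality dominates by scale; the original example only mentions concatenation.

```python
def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec] if norm else list(vec)

def fuse(text_emb, image_emb):
    """Build a derived embedding by concatenating the normalized inputs."""
    return l2_normalize(text_emb) + l2_normalize(image_emb)

# Hypothetical per-modality embeddings for a single product review.
review = {"text_emb": [3.0, 4.0], "image_emb": [0.0, 2.0, 0.0]}

derived = fuse(review["text_emb"], review["image_emb"])
print(len(derived))  # 5: two text dimensions followed by three image dimensions
```

Note that the derived vector's length depends on every input modality, so adding a new modality later changes the shape of all derived data built this way.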
Derived data can evolve for a number of reasons. In the example above, the company could start enabling users to upload videos in addition to text and images. Sometimes, in order to keep fast machines fully utilized while training large models, companies are forced to create copies of their data in the formats those models require as input, which in turn generates additional provenance and data governance information to track.
Provenance information
As AI models play an increasingly active role in real-life applications, capturing provenance information—the metadata that describes data’s origin and history—is becoming critical.
Say a healthcare provider relies on AI models with MRI image inputs for diagnoses. It’s crucial that the provider can track the source of the data (machine, parameters used), immediate processing steps (noise reduction, motion correction), manual annotations, and transformation history (fusion with other modalities, like CT or PET scans) so that it can quickly address any mistakes from the model.
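One way to picture such a record is as a small structure that travels with each scan. The sketch below is purely illustrative: the class, field names, and values are hypothetical and do not reflect any particular product's data model.

```python
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    """Hypothetical provenance metadata attached to one MRI image."""
    source_machine: str
    acquisition_params: dict
    steps: list = field(default_factory=list)

    def add_step(self, kind, detail):
        """Append one entry (processing, annotation, transformation) in order."""
        self.steps.append({"kind": kind, "detail": detail})
        return self

    def history(self):
        """Return a readable, ordered audit trail."""
        return [f'{s["kind"]}: {s["detail"]}' for s in self.steps]

# Made-up values for illustration only.
scan = ProvenanceRecord("mri-scanner-07", {"field_strength_tesla": 3.0})
scan.add_step("processing", "noise reduction")
scan.add_step("processing", "motion correction")
scan.add_step("annotation", "manual region outline")
scan.add_step("transformation", "fused with CT scan")

for line in scan.history():
    print(line)
```

Keeping the steps ordered matters: when a model makes a mistake, the team can walk back through the trail to find the stage where the data went wrong.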
New use cases
One of the most satisfying aspects of working with AI models is getting them to succeed. Say an e-commerce company builds a recommendation engine to show its customers products with similar colors. Customers start clicking on and buying those related products. Success! Now the growth team wants to evolve the recommendation model to include products made of similar materials, so the engineering team must add further filters on the product metadata.
Because AI models can be applied so broadly, it’s almost guaranteed that a given company’s set of use cases will evolve over time.
Why is it challenging to manage evolving multimodal data?
Schema challenges with relational databases
Traditional relational databases, such as PostgreSQL or MySQL, have served engineers and data scientists well for decades. But when it comes to multimodal data, these databases fall short for one glaring reason: rigid relational schemas do not play well with the complex relationships that exist in multimodal data.
For example, imagine an AI model that helps doctors recommend treatments based on a mixture of doctor, patient, treatment, and CT scan data. While one could theoretically structure four PostgreSQL tables with interlinking foreign keys, every many-to-many mapping between them requires an additional reference table, and each new relationship means another schema change. The fluid, evolving nature of multimodal data is at odds with rigid schema enforcement, so engineers do themselves a favor by choosing systems that allow schemas to flex over time.
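To illustrate the reference-table overhead, here is a minimal sketch covering just two of the four entities. It uses Python's built-in SQLite module instead of PostgreSQL so that it runs without a server; the table names, column names, and rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE treatment (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Reference (junction) table: one row per patient/treatment pairing.
    CREATE TABLE patient_treatment (
        patient_id   INTEGER NOT NULL REFERENCES patient(id),
        treatment_id INTEGER NOT NULL REFERENCES treatment(id),
        PRIMARY KEY (patient_id, treatment_id)
    );
""")
conn.executemany("INSERT INTO patient VALUES (?, ?)",
                 [(1, "patient_a"), (2, "patient_b")])
conn.executemany("INSERT INTO treatment VALUES (?, ?)",
                 [(1, "physiotherapy"), (2, "surgery")])
conn.executemany("INSERT INTO patient_treatment VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1)])

# Answering "who received what" already takes a two-way join.
rows = conn.execute("""
    SELECT p.name, t.name
    FROM patient_treatment pt
    JOIN patient p   ON p.id = pt.patient_id
    JOIN treatment t ON t.id = pt.treatment_id
    ORDER BY p.name, t.name
""").fetchall()
print(rows)
```

Linking doctors and CT scans the same way would add more junction tables and more joins, and every new kind of relationship would require a schema migration.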
Datasets
A key component of training AI models is the specific data used, but managing versions of the same dataset can be tedious and costly. For example, say a recommendation model is trained on a dataset of sofa images. The company begins working with new vendors and gets a fresh dataset of sofa images, so the engineering team trains v2 of the model with the new data but notices that the model’s recommendation quality deteriorates.
In cases like this, it’s important to be able to track the changes made to the model and the actual datasets used. Unfortunately, many data teams revert to manually storing copies of the data. This quickly becomes not only an organizational nightmare (which dataset trained which model?) but also a pricey one, since multiple versions of similar datasets increase storage costs.
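One lightweight alternative to keeping full copies is to record a content fingerprint of the dataset alongside each model version. The sketch below is an assumption-laden illustration, not a description of how any specific tool handles versioning: the file names, placeholder bytes, and model names are all made up.

```python
import hashlib

def dataset_fingerprint(items):
    """Order-independent SHA-256 fingerprint of a {file name: bytes} dataset."""
    digest = hashlib.sha256()
    for name in sorted(items):
        digest.update(name.encode("utf-8"))
        digest.update(hashlib.sha256(items[name]).digest())
    return digest.hexdigest()

# Placeholder bytes stand in for real image files.
sofas_v1 = {"sofa_001.jpg": b"image-bytes-1", "sofa_002.jpg": b"image-bytes-2"}
sofas_v2 = {**sofas_v1, "sofa_003.jpg": b"image-bytes-3"}  # new vendor data

# Record which dataset version trained which model version.
training_log = {
    "recommender-v1": dataset_fingerprint(sofas_v1),
    "recommender-v2": dataset_fingerprint(sofas_v2),
}

for model, fingerprint in training_log.items():
    print(model, fingerprint[:12])
```

A fingerprint answers "which dataset trained which model?" cheaply, but it does not by itself let you recover the old data; that still requires storing the differences between versions somewhere.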
Scalability
As a company’s metadata and data grow, engineers and data scientists are forced to reckon with scalability. Not only must they consider raw storage capacity and cost (vertical scaling) but also how to distribute workload across multiple nodes (horizontal scaling). Without proper planning, companies can quickly find themselves paying too much for multiple databases, sacrificing the performance of their AI workflows, or both. Balancing these concerns with ease-of-setup is a challenge for every team.
Easily connecting processing pipelines with data updates
As a business grows, so does its data ingestion. Think of how an e-commerce company’s product catalog constantly evolves, or how a travel service aggregator collects new reviews from its customers. As existing data pipelines grow and change, it’s crucial to seamlessly connect these datasets with existing data infrastructure, enrich the data with metadata, and fold them into AI model workflows to glean useful insights.
This is easier said than done: today, data and engineering teams find it challenging to update existing data schemas, process visual data, and easily label data to enable their workflows to keep pace with new ingestion.
Challenges with consistent views and transactions across data pieces
It’s easy to underestimate the importance of standardizing the engineering and data teams’ view of multimodal data. If a company uses multiple disconnected databases, not only will these teams (and by extension, the whole company) struggle to build a single view of the data, but it will also be tricky to build consistent read/write transaction processes across these multiple databases.
Unfortunately, this is the reality for too many companies today: because there are few products tailor-made for multimodal data management, teams often opt for several disjointed databases, setting themselves up for an endless Sisyphus-style struggle to maintain a consistent view of their multimodal data.
How do we simplify these challenges with ApertureDB?
Data storage and preprocessing
Out of the gate, ApertureDB supports storage of multimodal data types like documents, images, and videos. The query interface has built-in preprocessing support for image and video data, simplifying downstream processes that rely on these data and helping searches and analyses run faster. This also removes the need for users to create copies of the data to satisfy different downstream format requirements, and it often lowers network traffic, since most such operations downsample the data before it is returned.
Vector database
ApertureDB comes with a vector database, optimized for storing, indexing, and querying high-dimensional vector data. This enables several use cases, like:
- Powering recommendation engines with similarity searches
- Building accurate chatbots with RAG
- Enabling powerful search applications with semantic and multimodal searches
In-memory graph database: the connective tissue
Importantly, ApertureDB comes with an in-memory graph database that stores application metadata as a knowledge graph. By leveraging the flexibility of a graph database, users can seamlessly connect metadata between any user-defined entities as well as their vector representations and original data.
For example, users can connect AI models to the data used to train them, task the model with classifying new data, and attach accuracy values to the new classifications. This enables searches such as “Find images classified by Model X where accuracy is > 0.9.” This also allows users to combine their vector searches with advanced graph filtering before accessing the required data in a suitable format for downstream ML processing.
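The idea of filtering on graph metadata before running a vector search can be sketched with plain Python structures. To be clear, this is not ApertureDB's query language or API; the dictionaries, edge list, and function names below are hypothetical stand-ins meant only to show the order of operations.

```python
import math

# Hypothetical image nodes, each with a toy 2-dimensional embedding.
images = {
    "img1": {"embedding": [1.0, 0.0]},
    "img2": {"embedding": [0.9, 0.1]},
    "img3": {"embedding": [0.0, 1.0]},
}

# Hypothetical "classified" edges: (model, image, edge properties).
classified_by = [
    ("model_x", "img1", {"label": "sofa", "accuracy": 0.95}),
    ("model_x", "img2", {"label": "sofa", "accuracy": 0.80}),
    ("model_x", "img3", {"label": "lamp", "accuracy": 0.97}),
    ("model_y", "img2", {"label": "chair", "accuracy": 0.99}),
]

def images_classified_by(model, min_accuracy):
    """Graph filter: images the model classified with accuracy above a threshold."""
    return {
        image
        for edge_model, image, props in classified_by
        if edge_model == model and props["accuracy"] > min_accuracy
    }

def nearest(query, candidates):
    """Vector search restricted to candidates, closest first (Euclidean distance)."""
    return sorted(candidates, key=lambda name: math.dist(query, images[name]["embedding"]))

confident = images_classified_by("model_x", 0.9)
print(nearest([1.0, 0.0], confident))  # ['img1', 'img3']
```

Here img2 is the second-closest vector overall, but it drops out because Model X's accuracy on it is only 0.80; the metadata filter narrows the candidate set before similarity is considered.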
The graph database also makes it easy to adjust schemas on the fly as AI needs change, and ApertureDB does not require users to declare schemas up front.
Query engine: unifying interface for applications
ApertureDB features a unified API across all the aforementioned data types based on a native JSON-based query language, coordinated by an orchestrator. Not only does this API help standardize a team’s view of its multimodal data, but it also helps ApertureDB users avoid needing to compose queries that deal with multiple systems.
Transaction support across various modalities
ApertureDB implements ACID transactions for queries that span the different data types, offering the relevant database guarantees at the level of these complex objects.
Integrations across ML pipelines
ApertureDB’s Python SDK offers convenient ETL and ML processing wrappers over the JSON query language, and simplifies integrations across the AI toolchain. This makes it easy for engineers to write standardized queries to serve multimodal data to their applications in the required format.
Schema dashboard
ApertureDB offers a dashboard UI that allows users to easily check what objects exist in a dataset, the object properties, and how different objects relate to each other.
This dashboard makes it surprisingly simple for data science, engineering, and analytics teams to manage the complex relationships between multimodal data types.
Conclusion
AI workflows are exploding. Every industry, from e-commerce to logistics to medicine, is racing to uncover new uses for ever-evolving multimodal data. Engineers must keep up with the pace. This is why we built ApertureDB: to give engineers and data scientists a purpose-built tool for multimodal data management, search, and visualization that can replace the hodgepodge of DIY solutions on the market.
If you’re interested in learning more about how ApertureDB works, reach out to us at [email protected]. We have built an industry-leading database for multimodal AI to future-proof data pipelines as multimodal AI methods evolve. Stay informed about our journey by subscribing to our blog.
I want to acknowledge the insights and valuable edits from Ian Yanusko as well as feedback from Ayla Khan (Recursion Pharmaceuticals).
Originally posted at: