The 3 Ps of AI Engineering: Providers, Provenance, and Polyglot Persistence – Building the Backbone of RAG
The 3 Ps—Providers, Provenance, and Polyglot Persistence—form the backbone of scalable RAG systems, transforming prototypes into resilient AI solutions that handle diverse data types and deliver high-quality, context-rich responses.

Retrieval-augmented generation (RAG) systems represent a leap forward in AI’s ability to deliver adaptable, dynamic responses by drawing on many data sources. However, as we push RAG systems from POC (Proof of Concept) to production, challenges such as data integration, synchronization, and scalability emerge, potentially undermining performance and reliability. Addressing these obstacles requires a strong data foundation, structured around three essential pillars: Providers, Provenance, and Polyglot Persistence—collectively known as the 3 Ps.
Each of these principles plays a unique role in transforming RAG systems from promising POCs into resilient, production-grade solutions:
- Providers bring diverse data sources (structured, semi-structured, and unstructured) to fuel AI responses, ensuring relevance and depth as new sources are added.
- Provenance tracks data lineage and transformations, establishing data integrity, trust, and adaptability as requirements grow.
- Polyglot Persistence optimizes data storage across multiple database types, enabling efficient and relevant data retrieval that scales with complexity.

In this blog, we’ll explore each of these pillars, examining the tools that support them, where custom solutions are typically required, and practical tips for building a robust, scalable RAG system. By understanding and leveraging the 3 Ps, engineers can create AI solutions that transition seamlessly from POC to production, delivering high-quality, contextually rich responses at scale.
1. Providers: The Source of Diverse Data
Providers are the entry points for data, supplying the diverse information that fuels an AI system. For RAG systems, where diversity and updates are critical, Providers encompass structured data like databases, semi-structured data like PDFs and PowerPoint presentations, and unstructured data like customer reviews or social media posts. Without a robust network of Providers, an AI system risks delivering limited or outdated responses, reducing its value to users.
Example in Practice
Consider a knowledge management system for a large enterprise aiming to assist employees in finding information quickly. The AI system might draw on:
- Structured Data:
- Databases: Employee records, project timelines, and organizational charts.
- Semi-Structured Data:
- PDFs: Policy documents, compliance guidelines, and training manuals.
- PowerPoint Presentations: Recent sales pitches, strategic plans, and quarterly reviews.
- Unstructured Data:
- Customer Reviews: Feedback from various platforms providing insights into product performance.
- Emails and Chat Logs: Internal communications that can shed light on project statuses or commonly faced issues.
When an employee asks, "What's the current policy on remote work?" the system can:
- Retrieve the latest PDF policy document.
- Reference any recent PowerPoint presentations from HR about policy changes.
- Pull in relevant customer reviews that might have influenced policy adjustments.
By integrating data from these diverse Providers, the AI system offers a comprehensive, accurate, and context-rich response.
Existing Solutions
Many tools can simplify Provider integration, especially for handling different data types:
- ETL Platforms:
- Apache NiFi: For building and managing complex data flows, including semi-structured data.
- Airbyte, Fivetran, and Stitch: For connecting APIs, databases, and third-party data providers.
- Document Processing Tools:
- Unstructured: For extracting text and metadata from various document formats like PDFs and PowerPoint presentations. It excels at processing unstructured and semi-structured data, making it ideal for integrating diverse documents into your RAG system.
- Textract (AWS) or Azure AI Document Intelligence: For extracting structured information from documents.
These tools offer connectors and processors for various data sources, serving as a base layer for data ingestion.
Where Custom Work Is Needed
While existing tools handle standard formats, custom integration may be required for:
- Complex Documents: Parsing specialized or proprietary formats.
- Real-Time Data: Ingesting unstructured data like live customer reviews or social media feeds.
- Data Freshness and Consistency: Ensuring the most recent versions of documents are used, especially when multiple versions exist.
Custom ETL pipelines might be necessary to handle:
- Version Control: Managing different versions of semi-structured documents.
- Metadata Extraction: Pulling out specific fields or annotations from documents.
- Content Categorization: Classifying unstructured data into meaningful categories for retrieval.
Practical Tip
Use ETL tools for standard Providers and build modular ingestion pipelines for specialized or real-time data sources. Utilize document processing libraries like Unstructured to extract content from PDFs and PowerPoint presentations.
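For example, here's a minimal ingestion sketch using Unstructured's auto-partitioner; the file paths are hypothetical, and in practice you would feed the resulting records into your indexing pipeline:

```python
# pip install "unstructured[pdf,pptx]"
from unstructured.partition.auto import partition

def extract_elements(path: str) -> list[dict]:
    """Partition a document (PDF, PPTX, ...) into typed text elements."""
    elements = partition(filename=path)  # file format is auto-detected
    return [
        {
            "text": el.text,
            "category": el.category,          # e.g. Title, NarrativeText, ListItem
            "page": el.metadata.page_number,  # may be None for some formats
            "source": path,                   # keep source attribution for Provenance
        }
        for el in elements
    ]

records = extract_elements("policies/remote_work.pdf")   # hypothetical path
records += extract_elements("hr/policy_update_q3.pptx")  # hypothetical path
```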
Consider leveraging natural language processing (NLP) techniques for unstructured data like customer reviews to extract valuable insights such as sentiment, key themes, and customer intents; a short sketch follows the tool list below.
- Sentiment Analysis: Use NLP models to determine whether a review expresses positive, negative, or neutral sentiment. This helps your AI system understand overall customer satisfaction and prioritize responses to negative feedback.
- Topic Modeling: Apply techniques like Latent Dirichlet Allocation (LDA) or transformer-based models to identify common themes within the reviews. This aids in categorizing feedback and identifying prevalent issues or features that customers frequently mention.
- Entity Recognition: Utilize Named Entity Recognition (NER) to extract specific entities such as product names, services, or specific issues mentioned in the reviews. This enables more precise responses and better routing of information within your system.
Tools and Libraries:
- Sentiment Analysis:
- VADER Sentiment Analyzer: A lexicon and rule-based sentiment analysis tool that is effective on social media and customer review texts.
- TextBlob: A simple library for processing textual data, which includes a built-in sentiment analysis module.
- Hugging Face Transformers: Provides access to pre-trained transformer models like BERT or RoBERTa for more nuanced sentiment analysis.
- Topic Modeling:
- Gensim: An open-source library for unsupervised topic modeling and natural language processing, supporting algorithms like LDA and LSI.
- Entity Recognition:
- spaCy: A powerful library for advanced NLP tasks, including efficient and accurate named entity recognition.
- Flair: An easy-to-use NLP library built on PyTorch, offering state-of-the-art NER models.
- Hugging Face Transformers: Utilize transformer models fine-tuned for NER tasks.
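As a quick illustration of combining these techniques, here's a hedged sketch that scores a review with VADER and extracts entities with spaCy's standard small English pipeline; the review text and output shape are illustrative, and the ±0.05 cutoffs follow the conventional VADER thresholds:

```python
# pip install vaderSentiment spacy
# python -m spacy download en_core_web_sm
import spacy
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

nlp = spacy.load("en_core_web_sm")
sentiment = SentimentIntensityAnalyzer()

def enrich_review(text: str) -> dict:
    scores = sentiment.polarity_scores(text)  # neg/neu/pos plus a compound score in [-1, 1]
    label = ("positive" if scores["compound"] >= 0.05
             else "negative" if scores["compound"] <= -0.05
             else "neutral")
    entities = [(ent.text, ent.label_) for ent in nlp(text).ents]
    return {"text": text, "sentiment": label, "entities": entities}

# Hypothetical review text:
print(enrich_review("The X200 router from Acme kept dropping Wi-Fi after the update."))
```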
By processing unstructured data with these NLP techniques and tools, you enrich your AI system's understanding of customer feedback, enabling it to deliver more accurate and context-aware responses. This not only improves user satisfaction but also provides valuable insights for continuous improvement of your products or services.
Designing your system with a flexible architecture will make it easier to integrate new Providers as they become relevant and maintain the agility needed for real-time responses.
2. Provenance: The Lifeline of Data Integrity
Sourcing data alone isn’t enough—tracking the history and transformations of this data is where Provenance comes into play. Provenance ensures data traceability, documenting every transformation from raw input to output. In AI engineering, particularly for RAG systems handling varied data types, Provenance is critical to ensure data integrity, accountability, and transparency. It enables engineers to trace any decision back to the data sources that informed it, making it easier to adapt when data changes.
Example in Practice
Imagine an AI-driven legal assistant used by a law firm, integrating data from multiple sources:
- Structured Data:
- Case Management Systems: Details about ongoing and past cases.
- Semi-Structured Data:
- Legal Documents in PDFs: Contracts, court filings, and legal briefs.
- PowerPoint Presentations: Summaries of case strategies or legal precedents.
- Unstructured Data:
- Client Emails: Communications that may impact legal strategies.
Provenance allows key information to be tracked:
- Document Versions: Knowing which version of a contract was used in forming a legal argument.
- Source Attribution: Identifying which court filing influenced a recommendation.
- Transformation Steps: Recording how a PDF was parsed and which NLP techniques were applied to extract information.
If a legal recommendation changes due to an updated court ruling, Provenance enables lawyers to trace this change back to the specific document and even the exact passage that caused the shift, ensuring accountability and compliance with legal standards.
Existing Solutions
Several metadata management and data lineage tools support Provenance tracking:
- OpenLineage: Designed for collecting and integrating lineage data across tools, OpenLineage is a metadata standard supported by various platforms like Airflow, dbt, and Spark. It’s ideal for capturing data lineage across complex workflows.
- DataHub: Originally developed by LinkedIn, DataHub is a metadata platform that supports data discovery, observability, and lineage tracking. It provides robust lineage visualizations and can handle metadata for data lakes, warehouses, and streaming systems.
- Amundsen: Originally developed at Lyft, Amundsen focuses on data discovery and metadata, with basic lineage support. Its user-friendly interface and robust search functionality make it a popular choice for data discovery in data engineering workflows.
- Alation: Provides a collaborative data catalog with lineage tracking, metadata management, and powerful search capabilities to help users find and understand data.
- Cloud Provider Solutions:
- AWS Glue Data Catalog: A fully managed service that maintains a metadata repository of data sources, including data lineage features.
- Azure Purview: A unified data governance SaaS solution that provides data discovery, classification, and lineage tracking across on-premises and cloud data sources.
- Google Cloud Data Catalog: A fully managed and scalable metadata management service with search and data lineage capabilities.
These tools offer robust features for documenting data transformations and tracking metadata, providing foundational support for Provenance tracking without the overhead of managing infrastructure.
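If you standardize on OpenLineage, emitting lineage events from a custom ingestion step stays small. Here's a hedged sketch using the OpenLineage Python client, where the backend URL, namespace, and dataset names are all hypothetical placeholders:

```python
# pip install openlineage-python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # hypothetical lineage backend

client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid4())),
    job=Job(namespace="rag-ingestion", name="parse_policy_pdfs"),
    producer="https://example.com/rag-ingestion",  # identifies the emitting pipeline
    inputs=[Dataset(namespace="s3://corp-docs", name="policies/remote_work.pdf")],
    outputs=[Dataset(namespace="vectordb", name="policy_chunks")],
))
```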
Where Custom Work Is Needed
Custom work may be necessary to achieve:
- Fine-grained tracking: Recording transformations at the level of individual documents or even sections within documents.
- Cross-Format Lineage: Tracing data lineage across different data types (e.g., from a PDF to a database entry or from an email to a recommendation in the AI system).
- Regulatory Compliance: Meeting industry-specific traceability standards, such as those required in legal or healthcare fields.
Such customization is especially likely when dealing with:
- Complex Data Transformations: Such as OCR processing of scanned documents or NLP analysis of client emails.
- Data Anonymization Steps: Recording how and when data was anonymized for privacy compliance.
- Integration with Proprietary Systems: Connecting SaaS tools with in-house or legacy systems that are not supported out-of-the-box.
Practical Tip
Start with metadata management tools like Collibra, Alation, or Atlan for broad data lineage tracking. For critical data requiring high accountability, implement custom logging during data transformations, especially for semi-structured or unstructured data. Use unique identifiers for documents and their extracted content to maintain traceability across systems. This balances the ease of these solutions with the precision needed for strict Provenance.
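Here's a minimal sketch of what that custom logging might look like, assuming a simple JSON-lines log and hypothetical document names; production systems would write the same records to a table or event stream:

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone

def provenance_record(source_path: str, version: str, step: str,
                      input_bytes: bytes, output_text: str) -> dict:
    """Describe one transformation so any output can be traced to its input."""
    return {
        "record_id": str(uuid.uuid4()),   # unique ID carried through downstream systems
        "source": source_path,
        "source_version": version,        # e.g. a document revision tag
        "step": step,                     # e.g. "ocr", "pdf_parse", "ner"
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "output_sha256": hashlib.sha256(output_text.encode()).hexdigest(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

# Append each record to a durable log (a local file here for illustration).
with open("provenance.jsonl", "a") as log:
    rec = provenance_record("contracts/msa_v3.pdf", "v3", "pdf_parse",
                            b"%PDF-1.7 ...", "Section 2.1: Term and Termination ...")
    log.write(json.dumps(rec) + "\n")
```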
3. Polyglot Persistence: Building Data Flexibility
Sourcing and tracking data are foundational steps, but storing it in the most effective way for fast retrieval is equally essential. Polyglot Persistence involves using multiple database types to store data based on its structure, retrieval needs, and usage patterns. For complex, diverse data in RAG systems, Polyglot Persistence offers the flexibility to store and retrieve information efficiently, enhancing performance and response relevance.
Reframing Embeddings as Derived Data
One key insight is to treat embeddings not as standalone data points but as derived data—high-dimensional representations created directly from source data like text from PDFs, slides from PowerPoint presentations, or sentences from customer reviews. By viewing embeddings as derived data, you can streamline their management and respond more flexibly to updates in the source data. This approach allows for targeted synchronization when source data changes, rather than requiring constant bulk update synchronizations, ultimately improving responsiveness and reducing maintenance overhead.
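A minimal sketch of this idea: key each embedding by a content hash of its source chunk, and re-embed only when the hash changes. The embed() function and in-memory store are placeholders for your embedding model and vector database:

```python
import hashlib

embedding_store: dict[str, dict] = {}  # chunk_id -> {"hash": ..., "vector": ...}

def embed(text: str) -> list[float]:
    raise NotImplementedError  # call your embedding model or provider here

def sync_chunk(chunk_id: str, text: str) -> None:
    """Re-embed a chunk only when its source content actually changed."""
    digest = hashlib.sha256(text.encode()).hexdigest()
    cached = embedding_store.get(chunk_id)
    if cached and cached["hash"] == digest:
        return  # source unchanged: the derived embedding is still valid
    embedding_store[chunk_id] = {"hash": digest, "vector": embed(text)}
```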
Automatic Synchronization with Database-Integrated Vectorizers
Recent developments include database-integrated vectorizers. By using SaaS solutions that integrate vectorization capabilities, such as Weaviate or Zilliz Cloud, AI engineers can achieve automatic synchronization of embeddings with source data, freeing up resources otherwise spent on synchronization tasks.

Example in Practice
Consider an AI-powered customer support system that needs to provide instant, accurate responses by accessing:
- Structured Data:
- CRM Systems: Customer profiles and interaction histories.
- Semi-Structured Data:
- Knowledge Base Articles in PDFs: Product manuals and troubleshooting guides.
- Training Materials in PowerPoints: Internal staff training slides that might contain useful information.
- Unstructured Data:
- Customer Reviews and Feedback: Insights into common issues and sentiments.
Using Polyglot Persistence, this system stores each data type in its optimal database: relational for structured data, document stores for PDFs and presentations, NoSQL for unstructured feedback, and vector databases for embeddings. Data integration and virtualization tools ensure seamless access and synchronization across all databases.
When a customer asks, "How can I reset my device's firmware?" the system can:
- Retrieve relevant PDF sections from the product manual stored in MongoDB Atlas.
- Reference PowerPoint slides from training materials indexed in Elastic Cloud.
- Use embeddings stored in Pinecone to semantically match the query with both structured and unstructured data, providing a comprehensive answer.
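A hedged sketch of that retrieval flow, assuming hypothetical index and collection names, a caller-supplied embedding function, and vector metadata that links back to manual sections by ID:

```python
# pip install pinecone pymongo
from pinecone import Pinecone
from pymongo import MongoClient

pc = Pinecone(api_key="...")                    # credentials are placeholders
index = pc.Index("support-embeddings")          # hypothetical index name
docs = MongoClient("mongodb+srv://...")["support"]["manual_sections"]

def answer_context(question: str, embed) -> list[dict]:
    """Semantic match in Pinecone, then fetch the full sections from MongoDB."""
    res = index.query(vector=embed(question), top_k=5, include_metadata=True)
    section_ids = [m.metadata["section_id"] for m in res.matches]
    return list(docs.find({"_id": {"$in": section_ids}}))  # assumes string _ids
```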
Where Custom Work Is Needed
While these services support storage, Polyglot Persistence introduces complexity around:
- Data Synchronization: Ensuring updates in semi-structured documents are reflected across all relevant databases.
- Cross-Database Queries: Enabling seamless queries that span structured, semi-structured, and unstructured data.
- Optimization: Custom indexing and caching strategies to improve retrieval times.
Custom work might involve:
- Implementing Data Synchronization Mechanisms: Using SaaS integration tools or custom scripts to keep data consistent across databases.
- Setting Up Data Virtualization Layers: Using a platform to provide unified access and querying capabilities (see the sketch after this list).
- Optimizing Queries and Indexing: Tailoring strategies to your specific data and access patterns for improved performance.
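For the virtualization layer mentioned above, a single federated query can replace much of the bespoke cross-database glue. Here's a sketch using the trino Python client against a Presto/Trino-style engine, where the host and the mongodb/postgresql catalog names are hypothetical configuration:

```python
# pip install trino
import trino

conn = trino.dbapi.connect(
    host="query.example.com",  # hypothetical coordinator
    port=443,
    user="rag-service",
    http_scheme="https",
)
cur = conn.cursor()
cur.execute("""
    SELECT a.title, c.name AS customer
    FROM mongodb.support.articles AS a          -- catalog names are illustrative
    JOIN postgresql.crm.customers AS c
      ON a.product_id = c.last_product_id
    WHERE a.topic = 'firmware reset'
""")
rows = cur.fetchall()
```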
Practical Tip
Leverage SaaS Database Solutions for Simplicity and Scalability:
- Utilize databases like MongoDB Atlas, Elastic Cloud, and Pinecone to handle different data types without the need to manage infrastructure.
- Data integration tools like Fivetran or Stitch can be used to automate data synchronization between systems.
Implement Unified Data Access with SaaS Data Virtualization Platforms:
- Consider a platform like Presto (self-managed or via a managed offering) or Dremio Cloud to create a virtualized data layer. This allows you to query across multiple data sources with a single interface, simplifying data access and reducing the need for complex data movement.
Automate Data Workflows and Synchronization:
- Use workflow orchestration tools like Prefect Cloud or Astronomer (managed Apache Airflow) to schedule and manage data pipelines and synchronization tasks.
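As one hedged example of such a pipeline, here's a minimal Prefect flow; the task bodies are placeholders for the parsing, embedding, and upsert logic described earlier:

```python
# pip install prefect
from prefect import flow, task

@task(retries=2)
def pull_updated_documents() -> list[str]:
    ...  # e.g. list source documents changed since the last successful run

@task
def refresh_stores(doc_ids: list[str]) -> None:
    ...  # re-parse, re-embed, and upsert the changed documents into each store

@flow(log_prints=True)
def sync_rag_data() -> None:
    refresh_stores(pull_updated_documents())

if __name__ == "__main__":
    sync_rag_data()  # in production, run on a schedule via a Prefect deployment
```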
By leveraging SaaS products, you reduce the operational overhead and complexity associated with managing multiple database systems. This approach enables small teams to implement robust Polyglot Persistence strategies, ensuring efficient data storage and retrieval across diverse data types.
Permissions and Regulatory Compliance: Managing Data Access and Privacy
In RAG systems, managing Permissions and Regulatory Compliance is essential for secure, compliant data usage, especially when working with sensitive or regulated data across various formats. Permissions control access levels across datasets, ensuring that only authorized users or systems can access, modify, or retrieve data from Providers. Compliance requirements (such as GDPR, HIPAA, or industry-specific regulations) often demand data segmentation, audit trails, and secure storage.
Best Practices for Permissions and Compliance
- Implement Role-Based Access Control (RBAC):
- Define roles and permissions based on job functions, limiting access to only the necessary data (a sketch combining RBAC filtering with data masking follows this list).
- Example: Only legal team members can access certain confidential PDFs or PowerPoint presentations.
- Solutions:
- Keycloak: An open-source identity and access management tool that provides RBAC, single sign-on (SSO), and integrations with LDAP.
- Okta: Provides identity management with robust RBAC features.
- Auth0: Offers authentication and authorization services with fine-grained access control.
- Microsoft Entra ID: Includes RBAC capabilities for managing access to cloud resources.
- Utilize Data Masking and Anonymization:
- Protect sensitive data by masking identifiable information while still allowing it to be used in RAG responses when needed.
- Example: Anonymize customer reviews to protect personal information.
- Solutions:
- Aircloak: A data anonymization solution that helps mask sensitive data fields while preserving analytic utility.
- Protegrity: Offers data protection solutions, including data masking and tokenization.
- BigID: Provides data discovery and masking for sensitive data across various sources.
- Informatica Cloud Data Masking: Enables dynamic data masking and anonymization in the cloud.
- Automate Auditing with Provenance:
- Track and log every access and data transformation, providing a clear audit trail.
- Benefit: Provenance aids in identifying and addressing compliance risks by maintaining a historical record of data usage.
- Maintain Region-Specific Data Segmentation:
- Ensure data storage complies with regional regulations, such as storing EU customer data within Europe for GDPR compliance.
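To make the first two practices concrete, here's a small sketch that filters retrieved chunks by the caller's roles and masks basic PII before anything reaches the LLM; the role metadata, regex patterns, and chunk shape are illustrative assumptions:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def mask_pii(text: str) -> str:
    """Redact simple PII patterns; real systems would use NER or a masking service."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))

def authorized_context(chunks: list[dict], user_roles: set[str]) -> list[str]:
    """Drop chunks the caller may not see, then mask PII in what remains."""
    allowed = [c for c in chunks if c["allowed_roles"] & user_roles]
    return [mask_pii(c["text"]) for c in allowed]

chunks = [
    {"text": "Escalate to ops@example.com.", "allowed_roles": {"support", "legal"}},
    {"text": "Confidential settlement terms.", "allowed_roles": {"legal"}},
]
print(authorized_context(chunks, user_roles={"support"}))  # only the first chunk, masked
```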
Leverage data governance tools to manage permissions and auditing across different data types and storage solutions effectively:
- Identity and Access Management (IAM): Implement centralized IAM for role-based access control (RBAC) and single sign-on (SSO) across applications and data sources, ensuring consistent access controls and user management.
- Data Governance Platforms: Utilize data governance solutions to manage data catalogs, lineage, and compliance in a unified platform, helping to track data usage, transformations, and ownership.
- Regular Compliance Protocol Updates: Keep compliance protocols up to date with evolving regulations to maintain data protection and audit-readiness. Robust Provenance tracking is essential for providing clear audit trails and compliance transparency.
By utilizing these governance and IAM strategies, you can simplify permissions and compliance management, reduce operational overhead, and maintain focus on building secure, scalable RAG systems.
Bridging the Gap from POC to Production with the 3 Ps
Moving a RAG system from POC to production involves addressing scalability, reliability, and data integrity challenges that aren’t always apparent during the initial development phase. This transition is where the 3 Ps—Providers, Provenance, and Polyglot Persistence—prove their value, offering robust solutions to common obstacles encountered when scaling up.

Here’s how each principle plays a critical role in bridging that gap:
- Providers: Ensuring Data Depth and Relevance
- Challenge in POC to Production: In a POC, integrating only a few data sources can showcase system potential. But in production, maintaining relevance requires a broader set of data sources.
- How Providers Help: A robust network of Providers ensures the system can pull timely, diverse data to handle various scenarios in production, adapting flexibly as new data sources become available.
- Provenance: Creating Trust and Consistency
- Challenge in POC to Production: During a POC, shortcuts and limited data history are often acceptable. Production systems, however, require traceable, reliable data to ensure trust and regulatory compliance.
- How Provenance Helps: Provenance provides data traceability and adaptability. It allows engineers to trace each decision back to its data origin, making it easier to identify issues, meet compliance standards, and maintain data integrity over time.
- Polyglot Persistence: Maintaining Performance and Flexibility
- Challenge in POC to Production: A POC may work with a single database, but production demands often require optimized storage solutions for diverse data types.
- How Polyglot Persistence Helps: Polyglot Persistence enables engineers to store each data type in the optimal database, preventing performance bottlenecks and ensuring flexibility. This approach allows the system to adapt and expand its capabilities smoothly as new data requirements emerge.
The 3 Ps provide a powerful framework for transforming a RAG system from an experimental POC into a reliable, adaptable, and high-performing production system.
Conclusion
For AI engineers, Providers, Provenance, and Polyglot Persistence represent a transformative approach to data management in RAG systems dealing with structured, semi-structured, and unstructured data. By understanding the role of each principle and leveraging a mix of existing tools and custom solutions, engineers can create AI systems that are adaptable, transparent, and efficient.
While many tools can help implement the 3 Ps, custom engineering is often necessary to fully realize these capabilities. Consider how these principles could elevate your own projects, bringing agility, traceability, and optimized retrieval to every interaction. By combining out-of-the-box solutions with tailored enhancements, engineers can build RAG systems that are ready to meet the demands of dynamic, context-rich AI—ultimately delivering more valuable, relevant, and responsive user experiences.