Ingestion Service
Overview
The Ingestion Service (IngestionService) provides comprehensive data ingestion functionality for the Kamiwaza AI Platform. Located in kamiwaza_client/services/ingestion.py, this service handles data ingestion workflows, dataset processing, and document handling with embedding capabilities.
Key Features
- Data Ingestion
- Dataset Catalog Integration
- Document Processing
- Embedding Generation
- Batch Processing Support
Data Ingestion
Available Methods
- ingest(data: Union[str, List[str], Dict[str, Any]], **kwargs) -> IngestionResponse: Ingest data
- ingest_dataset(dataset: Dataset, **kwargs) -> DatasetIngestionResponse: Ingest dataset to catalog
- initialize_embedder(provider: str = "default", **kwargs) -> None: Initialize embedding provider
- process_documents(documents: List[Document], **kwargs) -> ProcessingResponse: Process documents
# Simple data ingestion
response = client.ingestion.ingest(
    data="Sample text data",
    chunk_size=512
)

# Dataset ingestion
response = client.ingestion.ingest_dataset(
    dataset=dataset_obj,
    embedding_config={
        "provider": "huggingface",
        "model": "sentence-transformers/all-mpnet-base-v2"
    }
)

# Initialize embedder
client.ingestion.initialize_embedder(
    provider="huggingface",
    model_name="sentence-transformers/all-mpnet-base-v2"
)

# Process documents
response = client.ingestion.process_documents(
    documents=[
        Document(text="doc1", metadata={"source": "file1"}),
        Document(text="doc2", metadata={"source": "file2"})
    ],
    chunk_size=512,
    overlap=50
)
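The chunk_size and overlap parameters above suggest a sliding-window split of the input text. The service's actual chunking strategy is not documented here, so the following is only a minimal sketch, assuming character-based windows; chunk_text is a hypothetical helper, not part of the client.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> List[str]:
    """Split text into windows of chunk_size characters.

    Consecutive windows share `overlap` characters. This is an
    illustrative assumption; the real service may split on tokens
    or sentences instead.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

A larger overlap preserves more context across chunk boundaries, at the cost of producing more chunks to embed and store.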
Integration with Other Services
The Ingestion Service works in conjunction with:
- Embedding Service
  - For generating embeddings of ingested text
- VectorDB Service
  - For storing processed vectors
- Catalog Service
  - For dataset management
- Retrieval Service
  - For accessing processed documents
Error Handling
The service includes built-in error handling for common scenarios:
try:
    response = client.ingestion.ingest(data)
except EmbeddingError:
    print("Embedding generation failed")
except VectorDBError:
    print("Vector storage failed")
except ProcessingError as e:
    print(f"Document processing failed: {e}")
except APIError as e:
    print(f"Operation failed: {e}")
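Some of these failures may be transient, in which case retrying can help. The sketch below shows one generic way to wrap a call in retries with exponential backoff. The with_retries helper is hypothetical and not part of the client, and which of the service's exceptions are actually safe to retry is an assumption you would need to verify.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    operation: Callable[[], T],
    retry_on: Tuple[Type[Exception], ...],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying on the given exception types.

    The delay doubles after each failed attempt. Exceptions not listed
    in retry_on propagate immediately; the last failure is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Under the assumption that APIError covers transient failures, a call might look like with_retries(lambda: client.ingestion.ingest(data="Sample text data"), retry_on=(APIError,)).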
Best Practices
- Initialize embedder before ingestion
- Use appropriate chunk sizes
- Include relevant metadata
- Process documents in batches
- Monitor ingestion progress
- Handle errors appropriately
- Clean up failed ingestions
- Validate data before ingestion
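Several of these practices (validating input, batching, monitoring progress, and tracking failures for later cleanup) can be combined in a small wrapper. The following is a sketch only: validate_texts, batched, and ingest_in_batches are hypothetical helpers, and ingest_fn stands in for whatever call performs the ingestion.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def validate_texts(texts: Sequence[object]) -> List[str]:
    """Keep non-blank strings; reject anything that is not a string."""
    valid: List[str] = []
    for index, text in enumerate(texts):
        if not isinstance(text, str):
            raise TypeError(f"item {index} is {type(text).__name__}, expected str")
        if text.strip():
            valid.append(text)
    return valid


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def ingest_in_batches(
    texts: Sequence[object],
    ingest_fn: Callable[[List[str]], object],
    batch_size: int = 32,
) -> Tuple[List[int], List[Tuple[int, str]]]:
    """Validate, batch, and ingest; report which batches failed.

    Returns (indices of successful batches, (index, error message) pairs
    for failed batches) so that failed batches can be retried or cleaned up.
    """
    succeeded: List[int] = []
    failed: List[Tuple[int, str]] = []
    for index, batch in enumerate(batched(validate_texts(texts), batch_size)):
        try:
            ingest_fn(batch)
            succeeded.append(index)
        except Exception as exc:  # sketch only: catch the service's specific errors in real code
            failed.append((index, str(exc)))
    return succeeded, failed
```

Since ingest accepts a list of strings, a call could look like ingest_in_batches(texts, lambda batch: client.ingestion.ingest(data=batch)). The returned failure list gives a starting point for retrying or cleaning up partial ingestions.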
Performance Considerations
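- Batch size affects processing speed
- Embedding generation time grows with the number of chunks and the size of the model
- Vector database insertion adds overhead to every write
- Memory usage rises during processing, especially with large batches
- Network bandwidth can become a bottleneck for large datasets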
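A rough back-of-the-envelope estimate can help size batches before a large run. The sketch below assumes character-based chunking with a fixed overlap and embeddings stored as 32-bit floats. Both are assumptions rather than documented behavior, and the helper names are hypothetical. The 768 dimensions used in the example match the output size of sentence-transformers/all-mpnet-base-v2.

```python
def estimate_chunk_count(total_chars: int, chunk_size: int = 512, overlap: int = 50) -> int:
    """Estimate how many sliding-window chunks a text of total_chars yields."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    if total_chars <= 0:
        return 0
    if total_chars <= chunk_size:
        return 1
    step = chunk_size - overlap
    # Ceiling division without floating point.
    return -(-(total_chars - overlap) // step)


def embedding_memory_bytes(num_chunks: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw size of the embedding vectors alone, assuming float32 by default."""
    return num_chunks * dimensions * bytes_per_value


# Example: 100,000 chunks of 768-dimensional float32 vectors.
size = embedding_memory_bytes(100_000, 768)
print(f"{size / 1024 ** 2:.0f} MiB")  # about 293 MiB, excluding metadata and index overhead
```

The figure covers only the raw vectors; document text, metadata, and any index structures in the vector database add to it.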
Data Formats
The service supports various input formats:
- Raw Text
  - Single strings
  - Lists of strings
- Structured Data
  - JSON objects
  - Dictionaries
- Documents
  - Custom Document objects
  - Metadata support
- Datasets
  - Catalog integration
  - Batch processing
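When input may arrive in any of the raw-text or structured shapes listed above, it can be convenient to normalize it to a single form before ingestion. The sketch below is illustrative only: normalize_input is a hypothetical helper, and serializing dictionaries to JSON text is an assumption, not a description of how the service treats them internally.

```python
import json
from typing import Any, Dict, List, Union

IngestInput = Union[str, List[str], Dict[str, Any]]


def normalize_input(data: IngestInput) -> List[str]:
    """Convert a string, list of strings, or dictionary into a list of strings."""
    if isinstance(data, str):
        return [data]
    if isinstance(data, dict):
        # Assumption: structured data is flattened to a JSON string.
        return [json.dumps(data, sort_keys=True)]
    if isinstance(data, list) and all(isinstance(item, str) for item in data):
        return list(data)
    raise TypeError(f"unsupported input type: {type(data).__name__}")
```

Rejecting unsupported types up front keeps validation errors local instead of surfacing them later as processing failures.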