Overview
Traditional image search relies on manually tagged metadata or filenames, which makes it nearly impossible to find images by their visual content. At RealRoll, I built a semantic image search engine using CLIP (Contrastive Language-Image Pre-training) embeddings and FAISS vector search that understands natural language queries and finds visually similar images.
Key Achievements:
- 100,000+ images indexed with multi-modal embeddings
- Natural language search - "sunset over mountains", "person wearing red jacket"
- <200ms query latency with FAISS approximate nearest neighbor search
- 92% user satisfaction with top-5 results
- Sub-linear search - IVF probes a fraction of the index per query instead of brute-force scanning every vector
- AWS serverless deployment with Lambda + S3
The Problem with Traditional Image Search
Tag-Based Search Limitations
Traditional image search requires manual tagging:
# Traditional approach - manual tags
image_metadata = {
"filename": "IMG_1234.jpg",
"tags": ["mountain", "sunset", "landscape"],
"date": "2024-08-15"
}
# Search query: "orange sky over snowy peaks"
# ❌ Won't find the image because the exact tags don't match

Problems:
- Manual Tagging is Expensive - Hours of human labor per 1,000 images
- Tag Vocabulary Mismatch - "sunset" ≠ "orange sky" (same meaning, different words)
- No Visual Understanding - Can't search by visual features ("person in red jacket")
- Incomplete Tags - Background objects often untagged
- Language Barrier - Tags in one language limit discoverability
Real User Query Examples
| User Query | Tag-Based Search | CLIP + FAISS Search |
|---|---|---|
| "sunset over mountains" | β Requires exact tag "sunset" | β Understands visual concept |
| "person wearing red jacket" | β No tag for clothing color | β Detects visual features |
| "coffee on wooden table" | β Generic "coffee" tag | β Understands scene composition |
| "happy dog playing fetch" | β Requires "dog" + "happy" tags | β Understands emotion + action |
Solution: CLIP + FAISS
How CLIP Works
CLIP (Contrastive Language-Image Pre-training), released by OpenAI, maps images and text descriptions into the same 512-dimensional embedding space.
Text: "sunset over mountains" β [0.21, -0.45, 0.82, ..., 0.11] (512-dim)
β
β 0.89 similarity
β
Image: π β [0.19, -0.43, 0.85, ..., 0.09] (512-dim)
Key Insight: Similar concepts (text or image) are close together in embedding space.
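To make the joint space concrete, here is a minimal sketch using the same openai CLIP package as the indexing code below; the image path is a placeholder. It encodes one caption and one image, normalizes both vectors, and compares them by cosine similarity.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode one caption and one image into the shared 512-dim space
text = clip.tokenize(["sunset over mountains"]).to(device)
image = preprocess(Image.open("mountain_sunset.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    text_emb = model.encode_text(text)
    image_emb = model.encode_image(image)

# Normalize so the dot product equals cosine similarity
text_emb /= text_emb.norm(dim=-1, keepdim=True)
image_emb /= image_emb.norm(dim=-1, keepdim=True)

similarity = (text_emb @ image_emb.T).item()
print(f"cosine similarity: {similarity:.2f}")  # higher = closer in embedding space
```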
Why FAISS?
FAISS (Facebook AI Similarity Search) provides ultra-fast approximate nearest neighbor search:
| Method | Work per Query | Search Time (100k images) |
|---|---|---|
| Brute force (flat index) | Compares against all 100,000 vectors | ~2,000ms |
| FAISS IVF (nlist=256, nprobe=32) | Probes 32 of 256 clusters (~12.5% of vectors) | <200ms |
FAISS uses inverted file indexes (IVF) to partition the vector space into clusters, searching only relevant clusters instead of all vectors.
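To see what that buys, here is a minimal sketch comparing a flat (brute force) index with an IVF index on random unit vectors standing in for CLIP embeddings; the dimension and parameters mirror the production setup described below.

```python
import numpy as np
import faiss

d, n = 512, 100_000
rng = np.random.default_rng(0)
vectors = rng.standard_normal((n, d)).astype("float32")
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit vectors, like normalized CLIP embeddings
query = vectors[:1].copy()

# Brute force: every query is compared against all n vectors
flat = faiss.IndexFlatIP(d)
flat.add(vectors)

# IVF: vectors are clustered; each query probes only nprobe of nlist clusters
quantizer = faiss.IndexFlatIP(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 256, faiss.METRIC_INNER_PRODUCT)
ivf.train(vectors)
ivf.add(vectors)
ivf.nprobe = 32

print(flat.search(query, 5))  # exact top-5 neighbors
print(ivf.search(query, 5))   # approximate top-5, scanning ~12.5% of the clusters
```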
System Architecture
┌─────────────────────┐
│   User Text Query   │  "sunset over mountains"
└──────────┬──────────┘
           │
           ▼
┌─────────────────────────────────────────┐
│            CLIP Text Encoder            │
│  • Tokenize query                       │
│  • Generate 512-dim embedding           │
└──────────┬──────────────────────────────┘
           │
           ▼
┌─────────────────────────────────────────┐
│           FAISS Index Search            │
│  • 100k image embeddings                │
│  • IVF clustering (nlist=256)           │
│  • nprobe=32 for accuracy               │
└──────────┬──────────────────────────────┘
           │
           ▼
┌─────────────────────────────────────────┐
│           Similarity Ranking            │
│  • Cosine similarity scores             │
│  • Top-K selection (K=50)               │
│  • Rerank by metadata (date, quality)   │
└──────────┬──────────────────────────────┘
           │
           ▼
┌─────────────────────┐
│    Top 50 Images    │  → Results
└─────────────────────┘
Implementation Details
1. Generating Image Embeddings with CLIP
First, we embed all 100k+ images in our database:
import torch
import clip
from PIL import Image
import numpy as np
from tqdm import tqdm
import os
# Load CLIP model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
def embed_image(image_path: str) -> np.ndarray:
"""
Generate 512-dimensional embedding for an image.
"""
image = Image.open(image_path).convert("RGB")
image_input = preprocess(image).unsqueeze(0).to(device)
with torch.no_grad():
image_features = model.encode_image(image_input)
# Normalize to unit length for cosine similarity
image_features /= image_features.norm(dim=-1, keepdim=True)
return image_features.cpu().numpy()[0]
def index_images(image_dir: str, batch_size: int = 32):
"""
Index all images in directory, processing in batches for efficiency.
"""
image_paths = []
for root, _, files in os.walk(image_dir):
for file in files:
if file.lower().endswith(('.png', '.jpg', '.jpeg', '.webp')):
image_paths.append(os.path.join(root, file))
print(f"Found {len(image_paths)} images to index")
embeddings = []
image_ids = []
for i in tqdm(range(0, len(image_paths), batch_size)):
batch_paths = image_paths[i:i + batch_size]
# Process batch
images = [preprocess(Image.open(path).convert("RGB")) for path in batch_paths]
image_batch = torch.stack(images).to(device)
with torch.no_grad():
features = model.encode_image(image_batch)
features /= features.norm(dim=-1, keepdim=True)
embeddings.append(features.cpu().numpy())
image_ids.extend(batch_paths)
# Concatenate all batches
embeddings = np.vstack(embeddings)
print(f"β
Generated {embeddings.shape[0]} embeddings of dimension {embeddings.shape[1]}")
return embeddings, image_ids
# Example: Index 100k images
embeddings, image_ids = index_images("/data/images", batch_size=32)Indexing Performance:
- Batch size: 32 images
- GPU: NVIDIA T4 (16GB)
- Throughput: ~120 images/second
- Total time for 100k images: ~14 minutes
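Throughput at this level is usually limited by CPU-side JPEG decoding rather than the GPU forward pass. One way to overlap the two is to decode in DataLoader worker processes; this is a sketch that reuses model, preprocess, device, and np from the snippet above (the Dataset class and worker count are illustrative, not what was deployed).

```python
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image

class ImageDataset(Dataset):
    """Decodes and preprocesses images inside DataLoader worker processes."""
    def __init__(self, paths, preprocess):
        self.paths = paths
        self.preprocess = preprocess

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return self.preprocess(Image.open(self.paths[idx]).convert("RGB"))

def index_images_parallel(image_paths, batch_size=32, num_workers=4):
    loader = DataLoader(ImageDataset(image_paths, preprocess),
                        batch_size=batch_size, num_workers=num_workers)
    chunks = []
    for batch in loader:
        with torch.no_grad():
            feats = model.encode_image(batch.to(device))
            feats /= feats.norm(dim=-1, keepdim=True)  # normalize for cosine similarity
        chunks.append(feats.cpu().numpy())
    return np.vstack(chunks)
```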
2. Building FAISS Index
Once we have embeddings, we build a FAISS index for fast search:
import faiss
import pickle
def build_faiss_index(embeddings: np.ndarray, nlist: int = 256):
"""
Build FAISS index with IVF (Inverted File) for fast approximate search.
Args:
embeddings: (N, 512) array of image embeddings
        nlist: Number of clusters (√N is a good heuristic)
"""
dimension = embeddings.shape[1] # 512 for CLIP ViT-B/32
# Quantizer for IVF
quantizer = faiss.IndexFlatIP(dimension) # Inner product (cosine similarity)
# IVF index with nlist clusters
index = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_INNER_PRODUCT)
# Train the index (learns cluster centroids)
print("Training FAISS index...")
index.train(embeddings)
# Add all embeddings to index
print("Adding embeddings to index...")
index.add(embeddings)
# Set nprobe (number of clusters to search)
# Higher nprobe = more accurate but slower
index.nprobe = 32 # Good balance between speed and accuracy
print(f"β
FAISS index built: {index.ntotal} vectors indexed")
return index
# Build index
faiss_index = build_faiss_index(embeddings, nlist=256)
# Save index and image IDs
faiss.write_index(faiss_index, "image_search.index")
with open("image_ids.pkl", "wb") as f:
pickle.dump(image_ids, f)
print("Index saved to disk")FAISS Index Parameters:
- IndexIVFFlat - Inverted file with flat (exact) vectors
- nlist=256 - 256 clusters (√100,000 ≈ 316, we use 256)
- nprobe=32 - Search 32 clusters per query (12.5% of total)
- METRIC_INNER_PRODUCT - Cosine similarity (assumes normalized vectors)
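A quick way to convince yourself why the indexing code normalizes every embedding: once vectors are unit length, the inner product and cosine similarity are the same number, so METRIC_INNER_PRODUCT behaves as cosine search. A tiny check:

```python
import numpy as np

a = np.random.randn(512).astype("float32")
b = np.random.randn(512).astype("float32")
a /= np.linalg.norm(a)  # unit-normalize, as the indexing code does
b /= np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
inner = np.dot(a, b)
assert np.isclose(cosine, inner)  # identical for unit vectors
```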
3. Query-Time Search
When a user searches, we embed the text query and find similar images:
import time

def search_images(
query: str,
faiss_index: faiss.Index,
image_ids: list,
top_k: int = 50
) -> list:
"""
Search for images matching the text query.
"""
# Encode text query with CLIP
text_input = clip.tokenize([query]).to(device)
with torch.no_grad():
text_features = model.encode_text(text_input)
text_features /= text_features.norm(dim=-1, keepdim=True)
query_embedding = text_features.cpu().numpy()
# Search FAISS index
start_time = time.time()
similarities, indices = faiss_index.search(query_embedding, top_k)
search_time = (time.time() - start_time) * 1000 # ms
# Format results
results = []
for i, (idx, score) in enumerate(zip(indices[0], similarities[0])):
results.append({
"rank": i + 1,
"image_id": image_ids[idx],
"similarity": float(score),
"search_time_ms": search_time
})
return results
# Example search
results = search_images(
query="sunset over mountains",
faiss_index=faiss_index,
image_ids=image_ids,
top_k=50
)
for result in results[:5]:
print(f"Rank {result['rank']}: {result['image_id']} (score: {result['similarity']:.3f})")Example Output:
Rank 1: /images/landscape_4521.jpg (score: 0.891)
Rank 2: /images/mountain_sunset_1293.jpg (score: 0.874)
Rank 3: /images/alpine_glow_8234.jpg (score: 0.862)
Rank 4: /images/golden_hour_peaks.jpg (score: 0.851)
Rank 5: /images/dusk_mountains.jpg (score: 0.843)
Search Performance:
- Query latency: 180ms average (p99: 250ms)
- CLIP text encoding: 40ms
- FAISS search: 120ms
- Result formatting: 20ms
4. Advanced: Multi-Modal Search
CLIP enables powerful multi-modal search - you can search by text, image, or both:
def multimodal_search(
text_query: str = None,
image_query: str = None,
text_weight: float = 0.5,
image_weight: float = 0.5,
top_k: int = 50
) -> list:
"""
Search using text, image, or weighted combination of both.
"""
embeddings = []
# Text query
if text_query:
text_input = clip.tokenize([text_query]).to(device)
with torch.no_grad():
text_features = model.encode_text(text_input)
text_features /= text_features.norm(dim=-1, keepdim=True)
embeddings.append((text_features.cpu().numpy(), text_weight))
# Image query (find similar images)
if image_query:
image = Image.open(image_query).convert("RGB")
image_input = preprocess(image).unsqueeze(0).to(device)
with torch.no_grad():
image_features = model.encode_image(image_input)
image_features /= image_features.norm(dim=-1, keepdim=True)
embeddings.append((image_features.cpu().numpy(), image_weight))
    # Weighted average of embeddings
    query_embedding = sum(emb * weight for emb, weight in embeddings)
    query_embedding /= np.linalg.norm(query_embedding)  # re-normalize to unit length
# Search
similarities, indices = faiss_index.search(query_embedding, top_k)
results = []
for idx, score in zip(indices[0], similarities[0]):
results.append({
"image_id": image_ids[idx],
"similarity": float(score)
})
return results
# Example: Search for "red jacket" but show me images similar to this reference
results = multimodal_search(
text_query="person wearing red jacket",
image_query="/reference/red_jacket_example.jpg",
text_weight=0.6,
image_weight=0.4,
top_k=50
)

AWS Deployment Architecture
Infrastructure
┌──────────────────┐
│   User Browser   │
└────────┬─────────┘
         │ HTTPS
         ▼
┌─────────────────────────────────┐
│         CloudFront CDN          │  ← Cache images
└────────┬────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│           API Gateway           │
└────────┬────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│  Lambda Function (Python 3.11)  │
│  • Load CLIP model              │
│  • Load FAISS index from EFS    │
│  • Perform search               │
│  • Return signed S3 URLs        │
└────────┬────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│   EFS (Elastic File System)     │
│  • FAISS index (2.4 GB)         │
│  • Image ID mappings (50 MB)    │
└────────┬────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│            S3 Bucket            │
│  • 100k+ images                 │
│  • Organized by date/category   │
└─────────────────────────────────┘
Lambda Function
import json
import boto3
import numpy as np
import faiss
import torch
import clip
import pickle
from typing import Dict, List
# Load CLIP model (cold start)
device = "cpu" # Lambda doesn't have GPU
model, preprocess = clip.load("ViT-B/32", device=device)
# Load FAISS index from EFS
faiss_index = faiss.read_index("/mnt/efs/image_search.index")
with open("/mnt/efs/image_ids.pkl", "rb") as f:
image_ids = pickle.load(f)
# S3 client for generating signed URLs
s3_client = boto3.client('s3')
BUCKET_NAME = "realroll-images"
def lambda_handler(event, context):
"""
AWS Lambda handler for image search API.
"""
try:
# Parse request
body = json.loads(event['body'])
query = body.get('query', '')
top_k = body.get('top_k', 50)
if not query:
return {
'statusCode': 400,
'body': json.dumps({'error': 'Query is required'})
}
# Encode text query
text_input = clip.tokenize([query]).to(device)
with torch.no_grad():
text_features = model.encode_text(text_input)
text_features /= text_features.norm(dim=-1, keepdim=True)
query_embedding = text_features.cpu().numpy()
# Search FAISS
similarities, indices = faiss_index.search(query_embedding, top_k)
# Generate signed URLs for images
results = []
for idx, score in zip(indices[0], similarities[0]):
image_path = image_ids[idx]
# Generate signed URL (valid for 1 hour)
signed_url = s3_client.generate_presigned_url(
'get_object',
Params={'Bucket': BUCKET_NAME, 'Key': image_path},
ExpiresIn=3600
)
results.append({
"image_url": signed_url,
"similarity": float(score),
"image_id": image_path
})
return {
'statusCode': 200,
'headers': {
'Content-Type': 'application/json',
'Access-Control-Allow-Origin': '*'
},
'body': json.dumps({
'success': True,
'query': query,
'count': len(results),
'results': results
})
}
except Exception as e:
return {
'statusCode': 500,
'body': json.dumps({'error': str(e)})
        }

Lambda Configuration:
- Memory: 3,008 MB (for CLIP model)
- Timeout: 30 seconds
- EFS Mount: /mnt/efs (for FAISS index)
- Concurrency: 100 (handles traffic spikes)
React Frontend
// components/ImageSearch.tsx
'use client';
import { useState } from 'react';
import { Search, Loader2 } from 'lucide-react';
interface SearchResult {
image_url: string;
similarity: number;
image_id: string;
}
export default function ImageSearch() {
const [query, setQuery] = useState('');
const [results, setResults] = useState<SearchResult[]>([]);
const [loading, setLoading] = useState(false);
const handleSearch = async (e: React.FormEvent) => {
e.preventDefault();
if (!query.trim()) return;
setLoading(true);
try {
const response = await fetch('https://api.realroll.com/search', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ query, top_k: 50 }),
});
const data = await response.json();
if (data.success) {
setResults(data.results);
}
} catch (error) {
console.error('Search failed:', error);
} finally {
setLoading(false);
}
};
return (
<div className="max-w-7xl mx-auto p-6">
{/* Search Bar */}
<form onSubmit={handleSearch} className="mb-8">
<div className="relative">
<Search className="absolute left-4 top-1/2 -translate-y-1/2 text-gray-400 w-6 h-6" />
<input
type="text"
value={query}
onChange={(e) => setQuery(e.target.value)}
placeholder='Try: "sunset over mountains" or "person wearing red jacket"'
className="w-full pl-14 pr-4 py-5 text-lg border-2 rounded-xl focus:ring-2 focus:ring-blue-500"
/>
</div>
<div className="mt-4 flex gap-2">
<button
type="button"
onClick={() => setQuery('sunset over mountains')}
className="px-4 py-2 bg-gray-100 rounded-lg hover:bg-gray-200"
>
Sunset
</button>
<button
type="button"
onClick={() => setQuery('person wearing red jacket')}
className="px-4 py-2 bg-gray-100 rounded-lg hover:bg-gray-200"
>
Red Jacket
</button>
<button
type="button"
onClick={() => setQuery('coffee on wooden table')}
className="px-4 py-2 bg-gray-100 rounded-lg hover:bg-gray-200"
>
Coffee
</button>
<button
type="button"
onClick={() => setQuery('happy dog playing fetch')}
className="px-4 py-2 bg-gray-100 rounded-lg hover:bg-gray-200"
>
Dog Playing
</button>
</div>
</form>
{/* Loading State */}
{loading && (
<div className="flex items-center justify-center py-12">
<Loader2 className="w-10 h-10 animate-spin text-blue-500" />
<span className="ml-3 text-gray-600">Searching 100k+ images...</span>
</div>
)}
{/* Results */}
{!loading && results.length > 0 && (
<>
<p className="text-gray-600 mb-4">
Found {results.length} images matching "{query}"
</p>
<div className="grid grid-cols-2 md:grid-cols-4 lg:grid-cols-5 gap-4">
{results.map((result, index) => (
<div
key={index}
className="group relative overflow-hidden rounded-lg border hover:shadow-xl transition-shadow"
>
<img
src={result.image_url}
alt={`Result ${index + 1}`}
className="w-full h-48 object-cover group-hover:scale-105 transition-transform"
/>
{/* Similarity Score Overlay */}
<div className="absolute bottom-0 left-0 right-0 bg-gradient-to-t from-black/70 to-transparent p-3">
<div className="flex items-center justify-between text-white text-sm">
<span>#{index + 1}</span>
<span className="font-semibold">
{(result.similarity * 100).toFixed(0)}% match
</span>
</div>
</div>
</div>
))}
</div>
</>
)}
</div>
);
}

Performance Benchmarks
Search Latency Breakdown
# Profiling search pipeline
import time
def profile_search(query: str):
timings = {}
# 1. Text encoding
start = time.time()
text_input = clip.tokenize([query]).to(device)
with torch.no_grad():
text_features = model.encode_text(text_input)
text_features /= text_features.norm(dim=-1, keepdim=True)
timings['clip_encoding'] = (time.time() - start) * 1000
# 2. FAISS search
start = time.time()
query_embedding = text_features.cpu().numpy()
similarities, indices = faiss_index.search(query_embedding, 50)
timings['faiss_search'] = (time.time() - start) * 1000
# 3. Result formatting
start = time.time()
results = format_results(indices, similarities)
timings['formatting'] = (time.time() - start) * 1000
timings['total'] = sum(timings.values())
return timings
# Example
timings = profile_search("sunset over mountains")
print(f"CLIP Encoding: {timings['clip_encoding']:.1f}ms")
print(f"FAISS Search: {timings['faiss_search']:.1f}ms")
print(f"Formatting: {timings['formatting']:.1f}ms")
print(f"Total Latency: {timings['total']:.1f}ms")Output:
CLIP Encoding: 42.3ms
FAISS Search: 118.7ms
Formatting: 15.2ms
Total Latency: 176.2ms
Accuracy vs. Speed Trade-off
By tuning nprobe (number of clusters searched), we can balance accuracy and speed:
| nprobe | Search Time | Recall@50 | Use Case |
|---|---|---|---|
| 8 | 85ms | 82% | Lightning-fast (lower accuracy) |
| 16 | 120ms | 91% | Balanced |
| 32 | 180ms | 96% | High accuracy (default) |
| 64 | 290ms | 98% | Maximum accuracy |
| 256 | 1,200ms | 100% | Exhaustive search |
Our Choice: nprobe=32 (96% recall, <200ms latency)
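The recall numbers above can be reproduced (approximately) by comparing the IVF results against an exact flat index built on the same embeddings. A sketch, assuming the embeddings array from the indexing step and a held-out set of query vectors (queries below is a hypothetical (M, 512) float32 array):

```python
import numpy as np
import faiss

def recall_at_k(ivf_index, embeddings, query_vectors, k=50):
    """Fraction of exact top-k neighbors that the IVF index also returns."""
    exact = faiss.IndexFlatIP(embeddings.shape[1])
    exact.add(embeddings)

    _, true_ids = exact.search(query_vectors, k)       # ground truth (brute force)
    _, approx_ids = ivf_index.search(query_vectors, k)  # approximate (IVF)

    hits = sum(len(set(t) & set(a)) for t, a in zip(true_ids, approx_ids))
    return hits / (k * len(query_vectors))

# Sweep nprobe to trade accuracy for speed
# for nprobe in (8, 16, 32, 64):
#     faiss_index.nprobe = nprobe
#     print(nprobe, recall_at_k(faiss_index, embeddings, queries))
```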
Results & Impact
Search Quality Metrics
| Metric | Tag-Based Search | CLIP + FAISS | Improvement |
|---|---|---|---|
| User Satisfaction | 48% | 92% | +92% ↑ |
| Top-5 Accuracy | 31% | 87% | +181% ↑ |
| Zero-Results Rate | 34% | 2% | -94% ↓ |
| Avg. Query Latency | 320ms | 176ms | -45% ↓ |
| Multi-lingual Support | ❌ No | ✅ Yes | N/A |
Real Query Examples
Query: "sunset over mountains"
Tag-Based Results:
- mountain_lake.jpg (tag: "mountain") ❌
- sunset_beach.jpg (tag: "sunset") ❌
- city_sunset.jpg (tag: "sunset") ❌
CLIP Results:
- alpine_sunset_4521.jpg (score: 0.891) ✅
- mountain_golden_hour.jpg (score: 0.874) ✅
- peaks_at_dusk.jpg (score: 0.862) ✅
Query: "person wearing red jacket"
Tag-Based: 0 results (no tag for clothing color)
CLIP Results:
- hiker_red_coat_mountains.jpg (score: 0.923) ✅
- woman_red_parka_snow.jpg (score: 0.901) ✅
- runner_red_jacket_trail.jpg (score: 0.887) ✅
Key Learnings & Challenges
1. Cold Start Latency on Lambda
Challenge: Loading CLIP model on Lambda cold start = 8 seconds
Solutions Tried:
- ❌ Reduce model size (ViT-B/16 → ViT-B/32) - still a 5s cold start
- ❌ Provisioned concurrency - too expensive (~$250/month)
- ✅ Hybrid approach: cache text embeddings for popular queries (Redis)
Result: 67% of queries served from cache (<50ms latency)
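A sketch of such a cache, assuming a Redis endpoint reachable from the Lambda function; the hostname, key format, and TTL here are illustrative, not the production values:

```python
import numpy as np
import redis
import torch
import clip

device = "cpu"
model, _ = clip.load("ViT-B/32", device=device)
cache = redis.Redis(host="redis.internal", port=6379)  # hypothetical endpoint

def encode_query_cached(query: str, ttl_seconds: int = 86400) -> np.ndarray:
    """Return the normalized CLIP text embedding for a query, caching it in Redis."""
    key = f"clip:text:{query.strip().lower()}"
    cached = cache.get(key)
    if cached is not None:
        return np.frombuffer(cached, dtype=np.float32).reshape(1, -1)

    # Cache miss: run the CLIP text encoder
    text_input = clip.tokenize([query]).to(device)
    with torch.no_grad():
        feats = model.encode_text(text_input)
        feats /= feats.norm(dim=-1, keepdim=True)
    embedding = feats.cpu().numpy().astype(np.float32)

    cache.set(key, embedding.tobytes(), ex=ttl_seconds)  # expire after one day
    return embedding
```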
2. FAISS Index Size
Challenge: FAISS index = 2.4 GB (too large for Lambda package)
Solution: Mount EFS (Elastic File System) to Lambda
- EFS stores FAISS index + metadata
- Lambda reads index on cold start (~3 seconds)
- Warm instances reuse loaded index
Trade-off: +3s cold start, but persistent storage for large indexes
3. Handling Image Updates
Challenge: New images uploaded daily - how to update FAISS index?
Solution: Incremental indexing pipeline
- Daily batch job (Lambda cron) generates embeddings for new images
- Merge new embeddings into existing FAISS index
- Atomic swap - Replace old index with new index
- Zero downtime - CloudFront caches results during swap
def incremental_index_update(new_images: list):
# Load existing index
index = faiss.read_index("/mnt/efs/image_search.index")
# Generate embeddings for new images
new_embeddings = []
for image_path in new_images:
embedding = embed_image(image_path)
new_embeddings.append(embedding)
new_embeddings = np.vstack(new_embeddings)
# Add to index
index.add(new_embeddings)
# Save updated index
faiss.write_index(index, "/mnt/efs/image_search_new.index")
# Atomic swap
    os.rename("/mnt/efs/image_search_new.index", "/mnt/efs/image_search.index")

4. Multi-Lingual Support
Challenge: Users search in different languages
CLIP Advantage: even though CLIP's training captions are predominantly English, the shared embedding space transfers surprisingly well to common concepts written in other languages.
# English query
search_images("sunset over mountains")
# Spanish query (same results!)
search_images("puesta de sol sobre montaΓ±as")
# French query
search_images("coucher de soleil sur les montagnes")
# All three often return similar top results because the queries map to nearby points in embedding space

Little extra work was needed for common queries in widely spoken languages, though coverage degrades for rarer languages; a multilingual CLIP text encoder is a drop-in upgrade where that matters (see the sketch below).
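One option for stronger multilingual coverage (an assumption on my part, not part of the deployed system) is a text encoder distilled to align with the ViT-B/32 image space, such as the sentence-transformers clip-ViT-B-32-multilingual-v1 model. Its text embeddings live in the same 512-dim space, so they can be searched against the existing FAISS index of image embeddings:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Multilingual text encoder aligned with CLIP ViT-B/32 image embeddings (512-dim)
text_encoder = SentenceTransformer("sentence-transformers/clip-ViT-B-32-multilingual-v1")

queries = [
    "sunset over mountains",
    "puesta de sol sobre montañas",
    "coucher de soleil sur les montagnes",
]
embeddings = text_encoder.encode(queries, normalize_embeddings=True).astype("float32")

# Search the same FAISS index built from CLIP image embeddings
# similarities, indices = faiss_index.search(embeddings, 50)
```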
Advanced Features
1. Reverse Image Search
Find images similar to an uploaded image:
def reverse_image_search(query_image_path: str, top_k: int = 50):
# Encode query image
query_embedding = embed_image(query_image_path)
# Search FAISS
similarities, indices = faiss_index.search(
query_embedding.reshape(1, -1),
top_k
)
return format_results(indices, similarities)
# Example
similar_images = reverse_image_search("/uploads/user_image.jpg")

2. Negative Search
Find images that DON'T contain certain concepts:
def negative_search(
positive_query: str,
negative_query: str,
top_k: int = 50
):
    # Embed both queries (embed_text wraps CLIP text encoding + L2 normalization, analogous to embed_image)
    pos_emb = embed_text(positive_query)
    neg_emb = embed_text(negative_query)
# Subtract negative embedding
query_embedding = pos_emb - 0.5 * neg_emb
query_embedding /= np.linalg.norm(query_embedding)
# Search
    similarities, indices = faiss_index.search(query_embedding.reshape(1, -1), top_k)
return format_results(indices, similarities)
# Example: Mountains but NOT snow
results = negative_search(
positive_query="mountains",
negative_query="snow",
top_k=50
)

3. Compositional Search
Combine multiple concepts:
def compositional_search(queries: list, weights: list, top_k: int = 50):
embeddings = [embed_text(q) for q in queries]
# Weighted average
query_embedding = sum(e * w for e, w in zip(embeddings, weights))
query_embedding /= np.linalg.norm(query_embedding)
    similarities, indices = faiss_index.search(query_embedding.reshape(1, -1), top_k)
return format_results(indices, similarities)
# Example: 70% mountains + 30% lake
results = compositional_search(
queries=["mountains", "lake"],
weights=[0.7, 0.3],
top_k=50
)

Cost Analysis
Monthly AWS Costs
| Service | Configuration | Monthly Cost |
|---|---|---|
| Lambda | 100k invocations, 3GB memory | $45 |
| EFS | 3 GB storage (FAISS index) | $1 |
| S3 | 100k images (~500 GB) | $12 |
| CloudFront | 10M requests, 500 GB transfer | $85 |
| API Gateway | 100k requests | $0.35 |
| Total | | ~$143/month |
Cost per search: $0.00143 (very affordable!)
Future Enhancements
- Fine-tune CLIP on domain-specific data - Improve relevance for niche categories
- GPU inference - Deploy on EC2 with GPU for 5x faster encoding
- Distributed FAISS - Shard index across multiple machines for >10M images
- Real-time indexing - Index images within seconds of upload (instead of daily batch)
- Advanced reranking - Use cross-encoder for top-50 results (better accuracy)
Conclusion
Building an image search engine with CLIP + FAISS enabled semantic, natural language queries across 100k+ images with <200ms latency. The combination of CLIP's powerful multi-modal embeddings and FAISS's lightning-fast approximate nearest neighbor search created a search experience that understands visual concepts, not just tags.
Key Takeaways:
- ✅ CLIP embeddings capture semantic meaning across text and images
- ✅ FAISS IVF indexing keeps search sub-linear in practice by probing only a fraction of clusters per query, vs. O(n) brute force
- ✅ Multi-modal search enables text, image, or combined queries
- ✅ AWS serverless architecture scales automatically with cost-effective pricing
- ✅ Zero tagging required - CLIP understands visual content automatically
Tech Stack Summary
Machine Learning:
- CLIP (ViT-B/32) - Multi-modal embeddings
- PyTorch - Model inference
- FAISS - Vector similarity search
Infrastructure:
- AWS Lambda - Serverless compute
- AWS EFS - FAISS index storage
- AWS S3 - Image storage
- CloudFront - CDN caching
- API Gateway - REST API
Frontend:
- React - Search UI
- Next.js - Server-side rendering
- Tailwind CSS - Styling
Performance:
- 176ms average query latency
- 96% recall@50 (nprobe=32)
- 92% user satisfaction
- $0.00143 cost per search