Overview
Traditional image search relies on manually tagged metadata or filenames, which makes it nearly impossible to find images by their visual content. At RealRoll, I built a semantic image search engine using CLIP (Contrastive Language-Image Pre-training) embeddings and FAISS vector search that understands natural language queries and finds visually similar images.
Key Achievements:
- 100,000+ images indexed with multi-modal embeddings
- Natural language search - "sunset over mountains", "person wearing red jacket"
- <200ms query latency with FAISS approximate nearest neighbor search
- 92% user satisfaction with top-5 results
- Sub-linear search - IVF probes a fraction of the index per query instead of brute-force scanning every vector
- AWS serverless deployment with Lambda + S3
The Problem with Traditional Image Search
Tag-Based Search Limitations
Traditional image search requires manual tagging:
# Traditional approach - manual tags
image_metadata = {
"filename": "IMG_1234.jpg",
"tags": ["mountain", "sunset", "landscape"],
"date": "2024-08-15"
}
# Search query: "orange sky over snowy peaks"
# ❌ Won't find the image because the exact tags don't match

Problems:
- Manual Tagging is Expensive - Hours of human labor per 1,000 images
- Tag Vocabulary Mismatch - "sunset" ≠ "orange sky" (same meaning, different words)
- No Visual Understanding - Can't search by visual features ("person in red jacket")
- Incomplete Tags - Background objects often untagged
- Language Barrier - Tags in one language limit discoverability
Real User Query Examples
| User Query | Tag-Based Search | CLIP + FAISS Search |
|---|---|---|
| "sunset over mountains" | β Requires exact tag "sunset" | β Understands visual concept |
| "person wearing red jacket" | β No tag for clothing color | β Detects visual features |
| "coffee on wooden table" | β Generic "coffee" tag | β Understands scene composition |
| "happy dog playing fetch" | β Requires "dog" + "happy" tags | β Understands emotion + action |
Solution: CLIP + FAISS
How CLIP Works
CLIP (Contrastive Language-Image Pre-training), released by OpenAI, maps images and text descriptions into the same 512-dimensional embedding space.
Text: "sunset over mountains" β [0.21, -0.45, 0.82, ..., 0.11] (512-dim)
β
β 0.89 similarity
β
Image: π β [0.19, -0.43, 0.85, ..., 0.09] (512-dim)
Key Insight: Similar concepts (text or image) are close together in embedding space.
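To make the joint space concrete, here is a minimal sketch using the same openai CLIP package as the indexing code below; the image path is a placeholder. It encodes one caption and one image, normalizes both vectors, and compares them by cosine similarity.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode one caption and one image into the shared 512-dim space
text = clip.tokenize(["sunset over mountains"]).to(device)
image = preprocess(Image.open("mountain_sunset.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    text_emb = model.encode_text(text)
    image_emb = model.encode_image(image)

# Normalize so the dot product equals cosine similarity
text_emb /= text_emb.norm(dim=-1, keepdim=True)
image_emb /= image_emb.norm(dim=-1, keepdim=True)

similarity = (text_emb @ image_emb.T).item()
print(f"cosine similarity: {similarity:.2f}")  # higher = closer in embedding space
```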
Why FAISS?
FAISS (Facebook AI Similarity Search) provides ultra-fast approximate nearest neighbor search:
| Method | Work per Query | Search Time (100k images) |
|---|---|---|
| Brute force (flat index) | Compares against all 100,000 vectors | ~2,000ms |
| FAISS IVF (nlist=256, nprobe=32) | Probes 32 of 256 clusters (~12.5% of vectors) | <200ms |
FAISS uses inverted file indexes (IVF) to partition the vector space into clusters, searching only relevant clusters instead of all vectors.
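To see what that buys, here is a minimal sketch comparing a flat (brute force) index with an IVF index on random unit vectors standing in for CLIP embeddings; the dimension and parameters mirror the production setup described below.

```python
import numpy as np
import faiss

d, n = 512, 100_000
rng = np.random.default_rng(0)
vectors = rng.standard_normal((n, d)).astype("float32")
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit vectors, like normalized CLIP embeddings
query = vectors[:1].copy()

# Brute force: every query is compared against all n vectors
flat = faiss.IndexFlatIP(d)
flat.add(vectors)

# IVF: vectors are clustered; each query probes only nprobe of nlist clusters
quantizer = faiss.IndexFlatIP(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 256, faiss.METRIC_INNER_PRODUCT)
ivf.train(vectors)
ivf.add(vectors)
ivf.nprobe = 32

print(flat.search(query, 5))  # exact top-5 neighbors
print(ivf.search(query, 5))   # approximate top-5, scanning ~12.5% of the clusters
```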
System Architecture
┌─────────────────────┐
│   User Text Query   │  "sunset over mountains"
└──────────┬──────────┘
           │
           ▼
┌─────────────────────────────────────────┐
│            CLIP Text Encoder            │
│  • Tokenize query                       │
│  • Generate 512-dim embedding           │
└──────────┬──────────────────────────────┘
           │
           ▼
┌─────────────────────────────────────────┐
│           FAISS Index Search            │
│  • 100k image embeddings                │
│  • IVF clustering (nlist=256)           │
│  • nprobe=32 for accuracy               │
└──────────┬──────────────────────────────┘
           │
           ▼
┌─────────────────────────────────────────┐
│           Similarity Ranking            │
│  • Cosine similarity scores             │
│  • Top-K selection (K=50)               │
│  • Rerank by metadata (date, quality)   │
└──────────┬──────────────────────────────┘
           │
           ▼
┌─────────────────────┐
│    Top 50 Images    │  → Results
└─────────────────────┘
Implementation Details
1. Generating Image Embeddings with CLIP
First, we embed all 100k+ images in our database:
import torch
import clip
from PIL import Image
import numpy as np
from tqdm import tqdm
import os
# Load CLIP model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
def embed_image(image_path: str) -> np.ndarray:
"""
Generate 512-dimensional embedding for an image.
"""
image = Image.open(image_path).convert("RGB")
image_input = preprocess(image).unsqueeze(0).to(device)
with torch.no_grad():
image_features = model.encode_image(image_input)
# Normalize to unit length for cosine similarity
image_features /= image_features.norm(dim=-1, keepdim=True)
return image_features.cpu().numpy()[0]
def index_images(image_dir: str, batch_size: int = 32):
"""
Index all images in directory, processing in batches for efficiency.
"""
image_paths = []
for root, _, files in os.walk(image_dir):
for file in files:
if file.lower().endswith(('.png', '.jpg', '.jpeg', '.webp')):
image_paths.append(os.path.join(root, file))
print(f"Found {len(image_paths)} images to index")
embeddings = []
image_ids = []
for i in tqdm(range(0, len(image_paths), batch_size)):
batch_paths = image_paths[i:i + batch_size]
# Process batch
images = [preprocess(Image.open(path).convert("RGB")) for path in batch_paths]
image_batch = torch.stack(images).to(device)
with torch.no_grad():
features = model.encode_image(image_batch)
features /= features.norm(dim=-1, keepdim=True)
embeddings.append(features.cpu().numpy())
image_ids.extend(batch_paths)
# Concatenate all batches
embeddings = np.vstack(embeddings)
print(f"β
Generated {embeddings.shape[0]} embeddings of dimension {embeddings.shape[1]}")
return embeddings, image_ids
# Example: Index 100k images
embeddings, image_ids = index_images("/data/images", batch_size=32)Indexing Performance:
- Batch size: 32 images
- GPU: NVIDIA T4 (16GB)
- Throughput: ~120 images/second
- Total time for 100k images: ~14 minutes
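Throughput at this level is usually limited by CPU-side JPEG decoding rather than the GPU forward pass. One way to overlap the two is to decode in DataLoader worker processes; this is a sketch that reuses model, preprocess, device, and np from the snippet above (the Dataset class and worker count are illustrative, not what was deployed).

```python
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image

class ImageDataset(Dataset):
    """Decodes and preprocesses images inside DataLoader worker processes."""
    def __init__(self, paths, preprocess):
        self.paths = paths
        self.preprocess = preprocess

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return self.preprocess(Image.open(self.paths[idx]).convert("RGB"))

def index_images_parallel(image_paths, batch_size=32, num_workers=4):
    loader = DataLoader(ImageDataset(image_paths, preprocess),
                        batch_size=batch_size, num_workers=num_workers)
    chunks = []
    for batch in loader:
        with torch.no_grad():
            feats = model.encode_image(batch.to(device))
            feats /= feats.norm(dim=-1, keepdim=True)  # normalize for cosine similarity
        chunks.append(feats.cpu().numpy())
    return np.vstack(chunks)
```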
2. Building FAISS Index
Once we have embeddings, we build a FAISS index for fast search:
import faiss
import pickle
def build_faiss_index(embeddings: np.ndarray, nlist: int = 256):
"""
Build FAISS index with IVF (Inverted File) for fast approximate search.
Args:
embeddings: (N, 512) array of image embeddings
        nlist: Number of clusters (√N is a good heuristic)
"""
dimension = embeddings.shape[1] # 512 for CLIP ViT-B/32
# Quantizer for IVF
quantizer = faiss.IndexFlatIP(dimension) # Inner product (cosine similarity)
# IVF index with nlist clusters
index = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_INNER_PRODUCT)
# Train the index (learns cluster centroids)
print("Training FAISS index...")
index.train(embeddings)
# Add all embeddings to index
print("Adding embeddings to index...")
index.add(embeddings)
# Set nprobe (number of clusters to search)
# Higher nprobe = more accurate but slower
index.nprobe = 32 # Good balance between speed and accuracy
print(f"β
FAISS index built: {index.ntotal} vectors indexed")
return index
# Build index
faiss_index = build_faiss_index(embeddings, nlist=256)
# Save index and image IDs
faiss.write_index(faiss_index, "image_search.index")
with open("image_ids.pkl", "wb") as f:
pickle.dump(image_ids, f)
print("Index saved to disk")FAISS Index Parameters:
- IndexIVFFlat - Inverted file with flat (exact) vectors
- nlist=256 - 256 clusters (√100,000 ≈ 316, we use 256)
- nprobe=32 - Search 32 clusters per query (12.5% of total)
- METRIC_INNER_PRODUCT - Cosine similarity (assumes normalized vectors)
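A quick way to convince yourself why the indexing code normalizes every embedding: once vectors are unit length, the inner product and cosine similarity are the same number, so METRIC_INNER_PRODUCT behaves as cosine search. A tiny check:

```python
import numpy as np

a = np.random.randn(512).astype("float32")
b = np.random.randn(512).astype("float32")
a /= np.linalg.norm(a)  # unit-normalize, as the indexing code does
b /= np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
inner = np.dot(a, b)
assert np.isclose(cosine, inner)  # identical for unit vectors
```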
3. Query-Time Search
When a user searches, we embed the text query and find similar images:
import time

def search_images(
query: str,
faiss_index: faiss.Index,
image_ids: list,
top_k: int = 50
) -> list:
"""
Search for images matching the text query.
"""
# Encode text query with CLIP
text_input = clip.tokenize([query]).to(device)
with torch.no_grad():
text_features = model.encode_text(text_input)
text_features /= text_features.norm(dim=-1, keepdim=True)
query_embedding = text_features.cpu().numpy()
# Search FAISS index
start_time = time.time()
similarities, indices = faiss_index.search(query_embedding, top_k)
search_time = (time.time() - start_time) * 1000 # ms
# Format results
results = []
for i, (idx, score) in enumerate(zip(indices[0], similarities[0])):
results.append({
"rank": i + 1,
"image_id": image_ids[idx],
"similarity": float(score),
"search_time_ms": search_time
})
return results
# Example search
results = search_images(
query="sunset over mountains",
faiss_index=faiss_index,
image_ids=image_ids,
top_k=50
)
for result in results[:5]:
print(f"Rank {result['rank']}: {result['image_id']} (score: {result['similarity']:.3f})")Example Output:
Rank 1: /images/landscape_4521.jpg (score: 0.891)
Rank 2: /images/mountain_sunset_1293.jpg (score: 0.874)
Rank 3: /images/alpine_glow_8234.jpg (score: 0.862)
Rank 4: /images/golden_hour_peaks.jpg (score: 0.851)
Rank 5: /images/dusk_mountains.jpg (score: 0.843)
Search Performance:
- Query latency: 180ms average (p99: 250ms)
- CLIP text encoding: 40ms
- FAISS search: 120ms
- Result formatting: 20ms
4. Advanced: Multi-Modal Search
CLIP enables powerful multi-modal search - you can search by text, image, or both:
def multimodal_search(
text_query: str = None,
image_query: str = None,
text_weight: float = 0.5,
image_weight: float = 0.5,
top_k: int = 50
) -> list:
"""
Search using text, image, or weighted combination of both.
"""
embeddings = []
# Text query
if text_query:
text_input = clip.tokenize([text_query]).to(device)
with torch.no_grad():
text_features = model.encode_text(text_input)
text_features /= text_features.norm(dim=-1, keepdim=True)
embeddings.append((text_features.cpu().numpy(), text_weight))
# Image query (find similar images)
if image_query:
image = Image.open(image_query).convert("RGB")
image_input = preprocess(image).unsqueeze(0).to(device)
with torch.no_grad():
image_features = model.encode_image(image_input)
image_features /= image_features.norm(dim=-1, keepdim=True)
embeddings.append((image_features.cpu().numpy(), image_weight))
    # Weighted average of embeddings
    query_embedding = sum(emb * weight for emb, weight in embeddings)
    query_embedding /= np.linalg.norm(query_embedding)  # re-normalize to unit length
# Search
similarities, indices = faiss_index.search(query_embedding, top_k)
results = []
for idx, score in zip(indices[0], similarities[0]):
results.append({
"image_id": image_ids[idx],
"similarity": float(score)
})
return results
# Example: Search for "red jacket" but show me images similar to this reference
results = multimodal_search(
text_query="person wearing red jacket",
image_query="/reference/red_jacket_example.jpg",
text_weight=0.6,
image_weight=0.4,
top_k=50
)

AWS Deployment Architecture
Infrastructure
┌──────────────────┐
│   User Browser   │
└────────┬─────────┘
         │ HTTPS
         ▼
┌─────────────────────────────────┐
│         CloudFront CDN          │  ← Cache images
└────────┬────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│           API Gateway           │
└────────┬────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│  Lambda Function (Python 3.11)  │
│  • Load CLIP model              │
│  • Load FAISS index from EFS    │
│  • Perform search               │
│  • Return signed S3 URLs        │
└────────┬────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│   EFS (Elastic File System)     │
│  • FAISS index (2.4 GB)         │
│  • Image ID mappings (50 MB)    │
└────────┬────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│            S3 Bucket            │
│  • 100k+ images                 │
│  • Organized by date/category   │
└─────────────────────────────────┘
Lambda Function
import json
import boto3
import numpy as np
import faiss
import torch
import clip
import pickle
from typing import Dict, List
# Load CLIP model (cold start)
device = "cpu" # Lambda doesn't have GPU
model, preprocess = clip.load("ViT-B/32", device=device)
# Load FAISS index from EFS
faiss_index = faiss.read_index("/mnt/efs/image_search.index")
with open("/mnt/efs/image_ids.pkl", "rb") as f:
image_ids = pickle.load(f)
# S3 client for generating signed URLs
s3_client = boto3.client('s3')
BUCKET_NAME = "realroll-images"
def lambda_handler(event, context):
"""
AWS Lambda handler for image search API.
"""
try:
# Parse request
body = json.loads(event['body'])
query = body.get('query', '')
top_k = body.get('top_k', 50)
if not query:
return {
'statusCode': 400,
'body': json.dumps({'error': 'Query is required'})
}
# Encode text query
text_input = clip.tokenize([query]).to(device)
with torch.no_grad():
text_features = model.encode_text(text_input)
text_features /= text_features.norm(dim=-1, keepdim=True)
query_embedding = text_features.cpu().numpy()
# Search FAISS
similarities, indices = faiss_index.search(query_embedding, top_k)
# Generate signed URLs for images
results = []
for idx, score in zip(indices[0], similarities[0]):
image_path = image_ids[idx]
# Generate signed URL (valid for 1 hour)
signed_url = s3_client.generate_presigned_url(
'get_object',
Params={'Bucket': BUCKET_NAME, 'Key': image_path},
ExpiresIn=3600
)
results.append({
"image_url": signed_url,
"similarity": float(score),
"image_id": image_path
})
return {
'statusCode': 200,
'headers': {
'Content-Type': 'application/json',
'Access-Control-Allow-Origin': '*'
},
'body': json.dumps({
'success': True,
'query': query,
'count': len(results),
'results': results
})
}
except Exception as e:
return {
'statusCode': 500,
'body': json.dumps({'error': str(e)})
        }

Lambda Configuration:
- Memory: 3,008 MB (for CLIP model)
- Timeout: 30 seconds
- EFS Mount: /mnt/efs (for FAISS index)
- Concurrency: 100 (handles traffic spikes)
React Frontend
// components/ImageSearch.tsx
'use client';
import { useState } from 'react';
import { Search, Loader2 } from 'lucide-react';
interface SearchResult {
image_url: string;
similarity: number;
image_id: string;
}
export default function ImageSearch() {
const [query, setQuery] = useState('');
const [results, setResults] = useState<SearchResult[]>([]);
const [loading, setLoading] = useState(false);
const handleSearch = async (e: React.FormEvent) => {
e.preventDefault();
if (!query.trim()) return;
setLoading(true);
try {
const response = await fetch('https://api.realroll.com/search', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ query, top_k: 50 }),
});
const data = await response.json();
if (data.success) {
setResults(data.results);
}
} catch (error) {
console.error('Search failed:', error);
} finally {
setLoading(false);
}
};
return (
<div className="max-w-7xl mx-auto p-6">
{/* Search Bar */}
<form onSubmit={handleSearch} className="mb-8">
<div className="relative">
<Search className="absolute left-4 top-1/2 -translate-y-1/2 text-gray-400 w-6 h-6" />
<input
type="text"
value={query}
onChange={(e) => setQuery(e.target.value)}
placeholder='Try: "sunset over mountains" or "person wearing red jacket"'
className="w-full pl-14 pr-4 py-5 text-lg border-2 rounded-xl focus:ring-2 focus:ring-blue-500"
/>
</div>
<div className="mt-4 flex gap-2">
<button
type="button"
onClick={() => setQuery('sunset over mountains')}
className="px-4 py-2 bg-gray-100 rounded-lg hover:bg-gray-200"
>
Sunset
</button>
<button
type="button"
onClick={() => setQuery('person wearing red jacket')}
className="px-4 py-2 bg-gray-100 rounded-lg hover:bg-gray-200"
>
Red Jacket
</button>
<button
type="button"
onClick={() => setQuery('coffee on wooden table')}
className="px-4 py-2 bg-gray-100 rounded-lg hover:bg-gray-200"
>
Coffee
</button>
<button
type="button"
onClick={() => setQuery('happy dog playing fetch')}
className="px-4 py-2 bg-gray-100 rounded-lg hover:bg-gray-200"
>
Dog Playing
</button>
</div>
</form>
{/* Loading State */}
{loading && (
<div className="flex items-center justify-center py-12">
<Loader2 className="w-10 h-10 animate-spin text-blue-500" />
<span className="ml-3 text-gray-600">Searching 100k+ images...</span>
</div>
)}
{/* Results */}
{!loading && results.length > 0 && (
<>
<p className="text-gray-600 mb-4">
Found {results.length} images matching "{query}"
</p>
<div className="grid grid-cols-2 md:grid-cols-4 lg:grid-cols-5 gap-4">
{results.map((result, index) => (
<div
key={index}
className="group relative overflow-hidden rounded-lg border hover:shadow-xl transition-shadow"
>
<img
src={result.image_url}
alt={`Result ${index + 1}`}
className="w-full h-48 object-cover group-hover:scale-105 transition-transform"
/>
{/* Similarity Score Overlay */}
<div className="absolute bottom-0 left-0 right-0 bg-gradient-to-t from-black/70 to-transparent p-3">
<div className="flex items-center justify-between text-white text-sm">
<span>#{index + 1}</span>
<span className="font-semibold">
{(result.similarity * 100).toFixed(0)}% match
</span>
</div>
</div>
</div>
))}
</div>
</>
)}
</div>
);
}

Performance Benchmarks
Search Latency Breakdown
# Profiling search pipeline
import time
def profile_search(query: str):
timings = {}
# 1. Text encoding
start = time.time()
text_input = clip.tokenize([query]).to(device)
with torch.no_grad():
text_features = model.encode_text(text_input)
text_features /= text_features.norm(dim=-1, keepdim=True)
timings['clip_encoding'] = (time.time() - start) * 1000
# 2. FAISS search
start = time.time()
query_embedding = text_features.cpu().numpy()
similarities, indices = faiss_index.search(query_embedding, 50)
timings['faiss_search'] = (time.time() - start) * 1000
# 3. Result formatting
start = time.time()
results = format_results(indices, similarities)
timings['formatting'] = (time.time() - start) * 1000
timings['total'] = sum(timings.values())
return timings
# Example
timings = profile_search("sunset over mountains")
print(f"CLIP Encoding: {timings['clip_encoding']:.1f}ms")
print(f"FAISS Search: {timings['faiss_search']:.1f}ms")
print(f"Formatting: {timings['formatting']:.1f}ms")
print(f"Total Latency: {timings['total']:.1f}ms")Output:
CLIP Encoding: 42.3ms
FAISS Search: 118.7ms
Formatting: 15.2ms
Total Latency: 176.2ms
Accuracy vs. Speed Trade-off
By tuning nprobe (number of clusters searched), we can balance accuracy and speed:
| nprobe | Search Time | Recall@50 | Use Case |
|---|---|---|---|
| 8 | 85ms | 82% | Lightning-fast (lower accuracy) |
| 16 | 120ms | 91% | Balanced |
| 32 | 180ms | 96% | High accuracy (default) |
| 64 | 290ms | 98% | Maximum accuracy |
| 256 | 1,200ms | 100% | Exhaustive search |
Our Choice: nprobe=32 (96% recall, <200ms latency)
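The recall numbers above can be reproduced (approximately) by comparing the IVF results against an exact flat index built on the same embeddings. A sketch, assuming the embeddings array from the indexing step and a held-out set of query vectors (queries below is a hypothetical (M, 512) float32 array):

```python
import numpy as np
import faiss

def recall_at_k(ivf_index, embeddings, query_vectors, k=50):
    """Fraction of exact top-k neighbors that the IVF index also returns."""
    exact = faiss.IndexFlatIP(embeddings.shape[1])
    exact.add(embeddings)

    _, true_ids = exact.search(query_vectors, k)       # ground truth (brute force)
    _, approx_ids = ivf_index.search(query_vectors, k)  # approximate (IVF)

    hits = sum(len(set(t) & set(a)) for t, a in zip(true_ids, approx_ids))
    return hits / (k * len(query_vectors))

# Sweep nprobe to trade accuracy for speed
# for nprobe in (8, 16, 32, 64):
#     faiss_index.nprobe = nprobe
#     print(nprobe, recall_at_k(faiss_index, embeddings, queries))
```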
Results & Impact
Search Quality Metrics
| Metric | Tag-Based Search | CLIP + FAISS | Improvement |
|---|---|---|---|
| User Satisfaction | 48% | 92% | +92% ↑ |
| Top-5 Accuracy | 31% | 87% | +181% ↑ |
| Zero-Results Rate | 34% | 2% | -94% ↓ |
| Avg. Query Latency | 320ms | 176ms | -45% ↓ |
| Multi-lingual Support | ❌ No | ✅ Yes | N/A |
Real Query Examples
Query: "sunset over mountains"
Tag-Based Results:
- mountain_lake.jpg (tag: "mountain") ❌
- sunset_beach.jpg (tag: "sunset") ❌
- city_sunset.jpg (tag: "sunset") ❌
CLIP Results:
- alpine_sunset_4521.jpg (score: 0.891) ✅
- mountain_golden_hour.jpg (score: 0.874) ✅
- peaks_at_dusk.jpg (score: 0.862) ✅
Query: "person wearing red jacket"
Tag-Based: 0 results (no tag for clothing color)
CLIP Results:
- hiker_red_coat_mountains.jpg (score: 0.923) ✅
- woman_red_parka_snow.jpg (score: 0.901) ✅
- runner_red_jacket_trail.jpg (score: 0.887) ✅
Key Learnings & Challenges
1. Cold Start Latency on Lambda
Challenge: Loading CLIP model on Lambda cold start = 8 seconds
Solutions Tried:
- ❌ Reduce model size (ViT-B/16 → ViT-B/32) - still a 5s cold start
- ❌ Provisioned concurrency - too expensive (~$250/month)
- ✅ Hybrid approach: cache text embeddings for popular queries (Redis)
Result: 67% of queries served from cache (<50ms latency)
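A sketch of such a cache, assuming a Redis endpoint reachable from the Lambda function; the hostname, key format, and TTL here are illustrative, not the production values:

```python
import numpy as np
import redis
import torch
import clip

device = "cpu"
model, _ = clip.load("ViT-B/32", device=device)
cache = redis.Redis(host="redis.internal", port=6379)  # hypothetical endpoint

def encode_query_cached(query: str, ttl_seconds: int = 86400) -> np.ndarray:
    """Return the normalized CLIP text embedding for a query, caching it in Redis."""
    key = f"clip:text:{query.strip().lower()}"
    cached = cache.get(key)
    if cached is not None:
        return np.frombuffer(cached, dtype=np.float32).reshape(1, -1)

    # Cache miss: run the CLIP text encoder
    text_input = clip.tokenize([query]).to(device)
    with torch.no_grad():
        feats = model.encode_text(text_input)
        feats /= feats.norm(dim=-1, keepdim=True)
    embedding = feats.cpu().numpy().astype(np.float32)

    cache.set(key, embedding.tobytes(), ex=ttl_seconds)  # expire after one day
    return embedding
```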
2. FAISS Index Size
Challenge: FAISS index = 2.4 GB (too large for Lambda package)
Solution: Mount EFS (Elastic File System) to Lambda
- EFS stores FAISS index + metadata
- Lambda reads index on cold start (~3 seconds)
- Warm instances reuse loaded index
Trade-off: +3s cold start, but persistent storage for large indexes
3. Handling Image Updates
Challenge: New images uploaded daily - how to update FAISS index?
Solution: Incremental indexing pipeline
- Daily batch job (Lambda cron) generates embeddings for new images
- Merge new embeddings into existing FAISS index
- Atomic swap - Replace old index with new index
- Zero downtime - CloudFront caches results during swap
def incremental_index_update(new_images: list):
# Load existing index
index = faiss.read_index("/mnt/efs/image_search.index")
# Generate embeddings for new images
new_embeddings = []
for image_path in new_images:
embedding = embed_image(image_path)
new_embeddings.append(embedding)
new_embeddings = np.vstack(new_embeddings)
# Add to index
index.add(new_embeddings)
# Save updated index
faiss.write_index(index, "/mnt/efs/image_search_new.index")
# Atomic swap
    os.rename("/mnt/efs/image_search_new.index", "/mnt/efs/image_search.index")

4. Multi-Lingual Support
Challenge: Users search in different languages
CLIP Advantage: even though CLIP's training captions are predominantly English, the shared embedding space transfers surprisingly well to common concepts written in other languages.
# English query
search_images("sunset over mountains")
# Spanish query (same results!)
search_images("puesta de sol sobre montaΓ±as")
# French query
search_images("coucher de soleil sur les montagnes")
# All three often return similar top results because the queries map to nearby points in embedding space

Little extra work was needed for common queries in widely spoken languages, though coverage degrades for rarer languages; a multilingual CLIP text encoder is a drop-in upgrade where that matters (see the sketch below).
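One option for stronger multilingual coverage (an assumption on my part, not part of the deployed system) is a text encoder distilled to align with the ViT-B/32 image space, such as the sentence-transformers clip-ViT-B-32-multilingual-v1 model. Its text embeddings live in the same 512-dim space, so they can be searched against the existing FAISS index of image embeddings:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Multilingual text encoder aligned with CLIP ViT-B/32 image embeddings (512-dim)
text_encoder = SentenceTransformer("sentence-transformers/clip-ViT-B-32-multilingual-v1")

queries = [
    "sunset over mountains",
    "puesta de sol sobre montañas",
    "coucher de soleil sur les montagnes",
]
embeddings = text_encoder.encode(queries, normalize_embeddings=True).astype("float32")

# Search the same FAISS index built from CLIP image embeddings
# similarities, indices = faiss_index.search(embeddings, 50)
```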
Advanced Features
1. Reverse Image Search
Find images similar to an uploaded image:
def reverse_image_search(query_image_path: str, top_k: int = 50):
# Encode query image
query_embedding = embed_image(query_image_path)
# Search FAISS
similarities, indices = faiss_index.search(
query_embedding.reshape(1, -1),
top_k
)
return format_results(indices, similarities)
# Example
similar_images = reverse_image_search("/uploads/user_image.jpg")

2. Negative Search
Find images that DON'T contain certain concepts:
def negative_search(
positive_query: str,
negative_query: str,
top_k: int = 50
):
    # Embed both queries (embed_text wraps CLIP text encoding + L2 normalization, analogous to embed_image)
    pos_emb = embed_text(positive_query)
    neg_emb = embed_text(negative_query)
# Subtract negative embedding
query_embedding = pos_emb - 0.5 * neg_emb
query_embedding /= np.linalg.norm(query_embedding)
# Search
    similarities, indices = faiss_index.search(query_embedding.reshape(1, -1), top_k)
return format_results(indices, similarities)
# Example: Mountains but NOT snow
results = negative_search(
positive_query="mountains",
negative_query="snow",
top_k=50
)

3. Compositional Search
Combine multiple concepts:
def compositional_search(queries: list, weights: list, top_k: int = 50):
embeddings = [embed_text(q) for q in queries]
# Weighted average
query_embedding = sum(e * w for e, w in zip(embeddings, weights))
query_embedding /= np.linalg.norm(query_embedding)
    similarities, indices = faiss_index.search(query_embedding.reshape(1, -1), top_k)
return format_results(indices, similarities)
# Example: 70% mountains + 30% lake
results = compositional_search(
queries=["mountains", "lake"],
weights=[0.7, 0.3],
top_k=50
)

Cost Analysis
Monthly AWS Costs
| Service | Configuration | Monthly Cost |
|---|---|---|
| Lambda | 100k invocations, 3GB memory | $45 |
| EFS | 3 GB storage (FAISS index) | $1 |
| S3 | 100k images (~500 GB) | $12 |
| CloudFront | 10M requests, 500 GB transfer | $85 |
| API Gateway | 100k requests | $0.35 |
| Total | | ~$143/month |
Cost per search: $0.00143 (very affordable!)
Future Enhancements
- Fine-tune CLIP on domain-specific data - Improve relevance for niche categories
- GPU inference - Deploy on EC2 with GPU for 5x faster encoding
- Distributed FAISS - Shard index across multiple machines for >10M images
- Real-time indexing - Index images within seconds of upload (instead of daily batch)
- Advanced reranking - Use cross-encoder for top-50 results (better accuracy)
Conclusion
Building an image search engine with CLIP + FAISS enabled semantic, natural language queries across 100k+ images with <200ms latency. The combination of CLIP's powerful multi-modal embeddings and FAISS's lightning-fast approximate nearest neighbor search created a search experience that understands visual concepts, not just tags.
Key Takeaways:
- ✅ CLIP embeddings capture semantic meaning across text and images
- ✅ FAISS IVF indexing keeps search sub-linear in practice by probing only a fraction of clusters per query, vs. O(n) brute force
- ✅ Multi-modal search enables text, image, or combined queries
- ✅ AWS serverless architecture scales automatically with cost-effective pricing
- ✅ Zero tagging required - CLIP understands visual content automatically
Tech Stack Summary
Machine Learning:
- CLIP (ViT-B/32) - Multi-modal embeddings
- PyTorch - Model inference
- FAISS - Vector similarity search
Infrastructure:
- AWS Lambda - Serverless compute
- AWS EFS - FAISS index storage
- AWS S3 - Image storage
- CloudFront - CDN caching
- API Gateway - REST API
Frontend:
- React - Search UI
- Next.js - Server-side rendering
- Tailwind CSS - Styling
Performance:
- 176ms average query latency
- 96% recall@50 (nprobe=32)
- 92% user satisfaction
- $0.00143 cost per search