Semantic Index (#10329)

This introduces semantic indexing in Zed, based on chunking text from files in the developer's workspace and creating vector embeddings with an embedding model. As part of this, we've created an embeddings provider trait that allows us to work with OpenAI, a local Ollama model, or a Zed-hosted embedding.

The semantic index is built by breaking down text for known (programming) languages into manageable chunks that are smaller than the max token size. Each chunk is then fed to a language model to create a high-dimensional vector, which is normalized to a unit vector so that it can be compared quickly with other vectors using a simple dot product. Alongside the vector, we store the path of the file and the range within the document that the vector was sourced from.

Zed will soon grok contextual similarity across different text snippets, allowing for natural language search beyond keyword matching. This is being put together both for human-driven search and for providing results to Large Language Models so that they can refine how they help developers.

Remaining todo:

* [x] Change `provider` to `model` within the Zed-hosted embeddings database (as it's currently a combo of the provider and the model in one name)

Release Notes:

- N/A

---------

Co-authored-by: Nathan Sobo <nathan@zed.dev>
Co-authored-by: Antonio Scandurra <me@as-cii.com>
Co-authored-by: Conrad Irwin <conrad@zed.dev>
Co-authored-by: Marshall Bowers <elliott.codes@gmail.com>
Co-authored-by: Antonio <antonio@zed.dev>
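To illustrate why the commit message describes normalizing each vector to unit length, here is a minimal sketch of the idea. The function names `normalize` and `dot` are hypothetical and are not taken from this commit; the point is only that once both vectors have unit length, their dot product equals their cosine similarity, so no per-comparison division is needed.

```rust
/// Hypothetical helper: scale `v` in place so that its L2 norm is 1.
/// A zero vector is left unchanged to avoid dividing by zero.
pub fn normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

/// Hypothetical helper: dot product of two equal-length vectors.
/// For unit vectors this is the cosine similarity.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    debug_assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

Under these assumptions, ranking search results reduces to computing `dot(query, chunk)` for each stored chunk and sorting by the result.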
2024-04-12 10:40:59 -07:00 · 49371b44cb
commit 49371b44cb
parent 4b40e83b8b
33 changed files with 2649 additions and 41 deletions
--- a/crates/semantic_index/src/embedding/cloud.rs
+++ b/crates/semantic_index/src/embedding/cloud.rs
@@ -0,0 +1,88 @@
+use crate::{Embedding, EmbeddingProvider, TextToEmbed};
+use anyhow::{anyhow, Context, Result};
+use client::{proto, Client};
+use collections::HashMap;
+use futures::{future::BoxFuture, FutureExt};
+use std::sync::Arc;
+
+pub struct CloudEmbeddingProvider {
+    model: String,
+    client: Arc<Client>,
+}
+
+impl CloudEmbeddingProvider {
+    pub fn new(client: Arc<Client>) -> Self {
+        Self {
+            model: "openai/text-embedding-3-small".into(),
+            client,
+        }
+    }
+}
+
+impl EmbeddingProvider for CloudEmbeddingProvider {
+    fn embed<'a>(&'a self, texts: &'a [TextToEmbed<'a>]) -> BoxFuture<'a, Result<Vec<Embedding>>> {
+        // First, fetch any embeddings that are cached based on the requested texts' digests
+        // Then compute any embeddings that are missing.
+        async move {
+            let cached_embeddings = self.client.request(proto::GetCachedEmbeddings {
+                model: self.model.clone(),
+                digests: texts
+                    .iter()
+                    .map(|to_embed| to_embed.digest.to_vec())
+                    .collect(),
+            });
+            let mut embeddings = cached_embeddings
+                .await
+                .context("failed to fetch cached embeddings via cloud model")?
+                .embeddings
+                .into_iter()
+                .map(|embedding| {
+                    let digest: [u8; 32] = embedding
+                        .digest
+                        .try_into()
+                        .map_err(|_| anyhow!("invalid digest for cached embedding"))?;
+                    Ok((digest, embedding.dimensions))
+                })
+                .collect::<Result<HashMap<_, _>>>()?;
+
+            let compute_embeddings_request = proto::ComputeEmbeddings {
+                model: self.model.clone(),
+                texts: texts
+                    .iter()
+                    .filter_map(|to_embed| {
+                        if embeddings.contains_key(&to_embed.digest) {
+                            None
+                        } else {
+                            Some(to_embed.text.to_string())
+                        }
+                    })
+                    .collect(),
+            };
+            if !compute_embeddings_request.texts.is_empty() {
+                let missing_embeddings = self.client.request(compute_embeddings_request).await?;
+                for embedding in missing_embeddings.embeddings {
+                    let digest: [u8; 32] = embedding
+                        .digest
+                        .try_into()
+                        .map_err(|_| anyhow!("invalid digest for computed embedding"))?;
+                    embeddings.insert(digest, embedding.dimensions);
+                }
+            }
+
+            texts
+                .iter()
+                .map(|to_embed| {
+                    let dimensions = embeddings.remove(&to_embed.digest).with_context(|| {
+                        format!("server did not return an embedding for {:?}", to_embed)
+                    })?;
+                    Ok(Embedding::new(dimensions))
+                })
+                .collect()
+        }
+        .boxed()
+    }
+
+    fn batch_size(&self) -> usize {
+        2048
+    }
+}
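
The cache-then-compute flow in `embed` above can be sketched in isolation. This is a simplified, synchronous illustration, not the code from this commit: `embed_with_cache`, the `Digest` alias, and the `compute` closure are hypothetical stand-ins for the `GetCachedEmbeddings` and `ComputeEmbeddings` round trips, and errors are plain strings rather than `anyhow` errors.

```rust
use std::collections::HashMap;

/// Stand-in for the 32-byte content digest used as the cache key.
type Digest = [u8; 32];

/// Hypothetical sketch: look up cached embeddings by digest, compute only
/// the missing ones, and return the results in input order.
/// `cached` plays the role of the cache response; `compute` plays the role
/// of the request that embeds the texts the cache did not contain.
pub fn embed_with_cache<F>(
    texts: &[(Digest, &str)],
    cached: HashMap<Digest, Vec<f32>>,
    mut compute: F,
) -> Result<Vec<Vec<f32>>, String>
where
    F: FnMut(&[&str]) -> Vec<(Digest, Vec<f32>)>,
{
    let mut embeddings = cached;

    // Collect only the texts whose digest was not found in the cache.
    let missing: Vec<&str> = texts
        .iter()
        .filter(|(digest, _)| !embeddings.contains_key(digest))
        .map(|(_, text)| *text)
        .collect();

    // Skip the second round trip entirely when everything was cached.
    if !missing.is_empty() {
        for (digest, vector) in compute(&missing) {
            embeddings.insert(digest, vector);
        }
    }

    // Re-associate each input with its embedding, preserving input order.
    texts
        .iter()
        .map(|(digest, text)| {
            embeddings
                .get(digest)
                .cloned()
                .ok_or_else(|| format!("no embedding returned for {:?}", text))
        })
        .collect()
}
```

One design choice in this sketch differs from the diff: it clones entries with `get` instead of taking them with `remove`, so two inputs that share a digest both resolve. Whether that case can occur in practice depends on how the caller batches chunks, which this sketch does not assume.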