How to deal with high cardinality categorical variables
You may want to do query analysis to create a filter on a categorical column. One difficulty is that the filter usually requires the EXACT categorical value, so the LLM must generate that value exactly. When only a few values are valid, this is relatively easy to achieve with prompting. When there are many valid values it becomes harder: the values may not fit in the LLM context, or, if they do, there may be too many for the LLM to attend to properly.
In this notebook we take a look at how to approach this.
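To make the problem concrete, here is a minimal sketch of an exact-match filter. The `Book` type, the `catalog` array, the book titles, and the `filterByAuthor` helper are all hypothetical and exist only for illustration; a real system would apply the filter in a database or vector store.

```typescript
// Hypothetical in-memory catalog; titles and the `Book` shape are made up for illustration.
type Book = { title: string; author: string };

const catalog: Book[] = [
  { title: "Visitors from Afar", author: "Jesse Knight" },
  { title: "Gardening Basics", author: "Dale Kessler" },
];

// An exact-match filter, similar in spirit to a filter on a categorical column.
const filterByAuthor = (books: Book[], author: string): Book[] =>
  books.filter((book) => book.author === author);

const exactHits = filterByAuthor(catalog, "Jesse Knight");
const nearMissHits = filterByAuthor(catalog, "Jess Knight");

console.log(exactHits.length); // 1
console.log(nearMissHits.length); // 0
```

A single missing letter is enough for the filter to return nothing, which is why the generated value has to match exactly.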
Setup
Install dependencies
- npm: npm i @langchain/core @langchain/community zod chromadb @faker-js/faker
- yarn: yarn add @langchain/core @langchain/community zod chromadb @faker-js/faker
- pnpm: pnpm add @langchain/core @langchain/community zod chromadb @faker-js/faker
Set environment variables
# Optional, use LangSmith for best-in-class observability
LANGSMITH_API_KEY=your-api-key
LANGCHAIN_TRACING_V2=true
Set up data
We will generate a bunch of fake names:
import { faker } from "@faker-js/faker";
const names = Array.from({ length: 10000 }, () => faker.person.fullName());
Let's look at some of the names:
names[0];
"Dale Kessler"
names[567];
"Mrs. Chelsea Bayer MD"
Query Analysis
We can now set up a baseline query analysis:
import { z } from "zod";
const searchSchema = z.object({
query: z.string(),
author: z.string(),
});
Pick your chat model: OpenAI, Anthropic, FireworksAI, or MistralAI. The setup for each provider follows in that order; use only one of them.
Install dependencies
- npm: npm i @langchain/openai
- yarn: yarn add @langchain/openai
- pnpm: pnpm add @langchain/openai
Add environment variables
OPENAI_API_KEY=your-api-key
Instantiate the model
import { ChatOpenAI } from "@langchain/openai";
const llm = new ChatOpenAI({
model: "gpt-3.5-turbo-0125",
temperature: 0
});
Install dependencies
- npm: npm i @langchain/anthropic
- yarn: yarn add @langchain/anthropic
- pnpm: pnpm add @langchain/anthropic
Add environment variables
ANTHROPIC_API_KEY=your-api-key
Instantiate the model
import { ChatAnthropic } from "@langchain/anthropic";
const llm = new ChatAnthropic({
model: "claude-3-sonnet-20240229",
temperature: 0
});
Install dependencies
- npm: npm i @langchain/community
- yarn: yarn add @langchain/community
- pnpm: pnpm add @langchain/community
Add environment variables
FIREWORKS_API_KEY=your-api-key
Instantiate the model
import { ChatFireworks } from "@langchain/community/chat_models/fireworks";
const llm = new ChatFireworks({
model: "accounts/fireworks/models/firefunction-v1",
temperature: 0
});
Install dependencies
- npm: npm i @langchain/mistralai
- yarn: yarn add @langchain/mistralai
- pnpm: pnpm add @langchain/mistralai
Add environment variables
MISTRAL_API_KEY=your-api-key
Instantiate the model
import { ChatMistralAI } from "@langchain/mistralai";
const llm = new ChatMistralAI({
model: "mistral-large-latest",
temperature: 0
});
import { ChatPromptTemplate } from "@langchain/core/prompts";
import {
RunnablePassthrough,
RunnableSequence,
} from "@langchain/core/runnables";
const system = `Generate a relevant search query for a library system`;
const prompt = ChatPromptTemplate.fromMessages([
["system", system],
["human", "{question}"],
]);
const llmWithTools = llm.withStructuredOutput(searchSchema, {
name: "Search",
});
const queryAnalyzer = RunnableSequence.from([
{
question: new RunnablePassthrough(),
},
prompt,
llmWithTools,
]);
We can see that if we spell the name exactly correctly, it knows how to handle it
await queryAnalyzer.invoke("what are books about aliens by Jesse Knight");
{ query: "books about aliens", author: "Jesse Knight" }
The issue is that the values you want to filter on may NOT be spelled exactly correctly
await queryAnalyzer.invoke("what are books about aliens by jess knight");
{ query: "books about aliens", author: "Jess Knight" }
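Because the filter needs an exact value, it can help to check the model's output against the set of valid values before running the search. The following is a minimal sketch under stated assumptions: `knownAuthors` stands in for the generated `names` array with a few hard-coded entries, and `SearchResult` and `hasValidAuthor` are hypothetical names that are not part of any library.

```typescript
// Stand-in for the `names` array generated earlier; only a few entries are listed here.
const knownAuthors = new Set<string>([
  "Jesse Knight",
  "Dale Kessler",
  "Mrs. Chelsea Bayer MD",
]);

// Mirrors the shape produced by the search schema (query + author).
type SearchResult = { query: string; author: string };

// Returns true only when the author is an exact member of the valid set.
const hasValidAuthor = (result: SearchResult): boolean =>
  knownAuthors.has(result.author);

const exactResult: SearchResult = { query: "books about aliens", author: "Jesse Knight" };
const misspelledResult: SearchResult = { query: "books about aliens", author: "Jess Knight" };

console.log(hasValidAuthor(exactResult)); // true
console.log(hasValidAuthor(misspelledResult)); // false
```

A check like this only detects the mismatch; it does not repair it, which is what the approaches in the following sections try to address.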
Add in all values
One way around this is to add ALL possible values to the prompt. That will generally guide the query in the right direction:
const system = `Generate a relevant search query for a library system using the 'search' tool.
The 'author' you return to the user MUST be one of the following authors:
{authors}
Do NOT hallucinate author name!`;
const basePrompt = ChatPromptTemplate.fromMessages([
["system", system],
["human", "{question}"],
]);
const prompt = await basePrompt.partial({ authors: names.join(", ") });
const queryAnalyzerAll = RunnableSequence.from([
{
question: new RunnablePassthrough(),
},
prompt,
llmWithTools,
]);
However, if the list of categorical values is long enough, the request may fail with an error:
try {
const res = await queryAnalyzerAll.invoke(
"what are books about aliens by jess knight"
);
} catch (e) {
console.error(e);
}
Error: 400 This model's maximum context length is 16385 tokens. However, your messages resulted in 49822 tokens (49792 in the messages, 30 in the functions). Please reduce the length of the messages or functions.
at Function.generate (file:///Users/bracesproul/Library/Caches/deno/npm/registry.npmjs.org/openai/4.28.4/error.mjs:40:20)
at OpenAI.makeStatusError (file:///Users/bracesproul/Library/Caches/deno/npm/registry.npmjs.org/openai/4.28.4/core.mjs:256:25)
at OpenAI.makeRequest (file:///Users/bracesproul/Library/Caches/deno/npm/registry.npmjs.org/openai/4.28.4/core.mjs:299:30)
at eventLoopTick (ext:core/01_core.js:63:7)
at async file:///Users/bracesproul/Library/Caches/deno/npm/registry.npmjs.org/@langchain/openai/0.0.15/dist/chat_models.js:650:29
at async RetryOperation._fn (file:///Users/bracesproul/Library/Caches/deno/npm/registry.npmjs.org/p-retry/4.6.2/index.js:50:12) {
status: 400,
headers: {
"access-control-allow-origin": "*",
"alt-svc": 'h3=":443"; ma=86400',
"cf-cache-status": "DYNAMIC",
"cf-ray": "85f6e713581815d0-SJC",
"content-length": "341",
"content-type": "application/json",
date: "Tue, 05 Mar 2024 03:08:39 GMT",
"openai-organization": "langchain",
"openai-processing-ms": "349",
"openai-version": "2020-10-01",
server: "cloudflare",
"set-cookie": "_cfuvid=NXe7nstRj6UNdFs5F8k49JZF6Tz7EE8dfKwYRpV3AWI-1709608119946-0.0.1.1-604800000; path=/; domain="... 48 more characters,
"strict-transport-security": "max-age=15724800; includeSubDomains",
"x-ratelimit-limit-requests": "10000",
"x-ratelimit-limit-tokens": "2000000",
"x-ratelimit-remaining-requests": "9999",
"x-ratelimit-remaining-tokens": "1958537",
"x-ratelimit-reset-requests": "6ms",
"x-ratelimit-reset-tokens": "1.243s",
"x-request-id": "req_99890749d442033c6145f9a8f1324aea"
},
error: {
message: "This model's maximum context length is 16385 tokens. However, your messages resulted in 49822 tokens"... 101 more characters,
type: "invalid_request_error",
param: "messages",
code: "context_length_exceeded"
},
code: "context_length_exceeded",
param: "messages",
type: "invalid_request_error",
attemptNumber: 1,
retriesLeft: 6
}
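One way to anticipate this failure is to estimate the prompt size before sending it. The sketch below uses a rough rule of thumb of about four characters per token, which is an assumption and not the model's actual tokenizer, so treat the result as an order-of-magnitude check only. The names `estimateTokens`, `fitsInContext`, `reservedTokens`, and the synthetic `sampleValues` are hypothetical.

```typescript
// Rough heuristic (an assumption, not a real tokenizer): about 4 characters per token.
const APPROX_CHARS_PER_TOKEN = 4;

const estimateTokens = (text: string): number =>
  Math.ceil(text.length / APPROX_CHARS_PER_TOKEN);

// Checks whether a comma-separated list of values is likely to fit in a token budget,
// leaving some room (reservedTokens) for the rest of the prompt and the response.
const fitsInContext = (
  values: string[],
  maxTokens: number,
  reservedTokens = 500
): boolean => estimateTokens(values.join(", ")) + reservedTokens <= maxTokens;

// Synthetic stand-in for 10,000 names, each 17 characters long.
const sampleValues = Array.from({ length: 10000 }, (_, i) => `Author Name ${10000 + i}`);

const estimatedListTokens = estimateTokens(sampleValues.join(", "));
console.log(estimatedListTokens);
console.log(fitsInContext(sampleValues, 16385)); // false
```

With a list this long, the estimate is several times larger than the 16385-token limit reported in the error above, so the full list cannot be sent to that model.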
We can try a model with a longer context window, but with so much information in the prompt, it is not guaranteed to pick out the right value reliably:
Pick your chat model: OpenAI, Anthropic, FireworksAI, or MistralAI. The setup for each provider follows in that order; use only one of them.
Install dependencies
- npm: npm i @langchain/openai
- yarn: yarn add @langchain/openai
- pnpm: pnpm add @langchain/openai
Add environment variables
OPENAI_API_KEY=your-api-key
Instantiate the model
import { ChatOpenAI } from "@langchain/openai";
const llmLong = new ChatOpenAI({ model: "gpt-4-turbo-preview" });
Install dependencies
- npm: npm i @langchain/anthropic
- yarn: yarn add @langchain/anthropic
- pnpm: pnpm add @langchain/anthropic
Add environment variables
ANTHROPIC_API_KEY=your-api-key
Instantiate the model
import { ChatAnthropic } from "@langchain/anthropic";
const llmLong = new ChatAnthropic({
model: "claude-3-sonnet-20240229",
temperature: 0
});
Install dependencies
- npm: npm i @langchain/community
- yarn: yarn add @langchain/community
- pnpm: pnpm add @langchain/community
Add environment variables
FIREWORKS_API_KEY=your-api-key
Instantiate the model
import { ChatFireworks } from "@langchain/community/chat_models/fireworks";
const llmLong = new ChatFireworks({
model: "accounts/fireworks/models/firefunction-v1",
temperature: 0
});
Install dependencies
- npm: npm i @langchain/mistralai
- yarn: yarn add @langchain/mistralai
- pnpm: pnpm add @langchain/mistralai
Add environment variables
MISTRAL_API_KEY=your-api-key
Instantiate the model
import { ChatMistralAI } from "@langchain/mistralai";
const llmLong = new ChatMistralAI({
model: "mistral-large-latest",
temperature: 0
});
const structuredLlmLong = llmLong.withStructuredOutput(searchSchema, {
name: "Search",
});
const queryAnalyzerAll = RunnableSequence.from([
{
question: new RunnablePassthrough(),
},
prompt,
structuredLlmLong,
]);
await queryAnalyzerAll.invoke("what are books about aliens by jess knight");
{ query: "aliens", author: "Jess Knight" }
Find all relevant values
Instead, we can create an index over the valid values and query it for the N values most relevant to the question:
import { Chroma } from "@langchain/community/vectorstores/chroma";
import { OpenAIEmbeddings } from "@langchain/openai";
import "chromadb";
const embeddings = new OpenAIEmbeddings({
model: "text-embedding-3-small",
});
const vectorstore = await Chroma.fromTexts(names, {}, embeddings, {
collectionName: "author_names",
});
const selectNames = async (question: string) => {
const _docs = await vectorstore.similaritySearch(question, 10);
const _names = _docs.map((d) => d.pageContent);
return _names.join(", ");
};
const createPrompt = RunnableSequence.from([
{
question: new RunnablePassthrough(),
authors: selectNames,
},
basePrompt,
]);
const queryAnalyzerSelect = createPrompt.pipe(llmWithTools);
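The idea of narrowing the candidates to the N most similar values can also be illustrated without a vector store. The sketch below swaps the embedding search for a much simpler technique, character-trigram overlap (Jaccard similarity), purely so that it runs with no external services. It is not what the code above does, and all names in it (`trigrams`, `trigramSimilarity`, `topNSimilar`, `sampleAuthors`) are hypothetical. Unlike `selectNames`, it compares against the name fragment only, not the full question.

```typescript
// Character trigrams of a lowercased string, padded so word boundaries produce grams too.
const trigrams = (text: string): Set<string> => {
  const padded = `  ${text.toLowerCase()} `;
  const grams = new Set<string>();
  for (let i = 0; i + 3 <= padded.length; i++) {
    grams.add(padded.slice(i, i + 3));
  }
  return grams;
};

// Jaccard similarity between two trigram sets: |A ∩ B| / |A ∪ B|.
const trigramSimilarity = (a: string, b: string): number => {
  const gramsA = trigrams(a);
  const gramsB = trigrams(b);
  let shared = 0;
  gramsA.forEach((gram) => {
    if (gramsB.has(gram)) shared += 1;
  });
  const unionSize = gramsA.size + gramsB.size - shared;
  return unionSize === 0 ? 0 : shared / unionSize;
};

// Returns the `n` candidates most similar to `target`, best match first.
const topNSimilar = (target: string, candidates: string[], n: number): string[] =>
  candidates
    .map((candidate) => ({ candidate, score: trigramSimilarity(target, candidate) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, n)
    .map((entry) => entry.candidate);

const sampleAuthors = ["Dale Kessler", "Jesse Knight", "Jessica Kerluke", "Mrs. Chelsea Bayer MD"];
const closestAuthors = topNSimilar("jess knight", sampleAuthors, 2);
console.log(closestAuthors); // [ "Jesse Knight", "Jessica Kerluke" ]
```

Only the short list of closest candidates then needs to be placed in the prompt, which keeps it well within the context window regardless of how many valid values exist.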
await createPrompt.invoke("what are books by jess knight");
ChatPromptValue {
lc_serializable: true,
lc_kwargs: {
messages: [
SystemMessage {
lc_serializable: true,
lc_kwargs: {
content: "Generate a relevant search query for a library system using the 'search' tool.\n" +
"\n" +
"The 'author' you ret"... 259 more characters,
additional_kwargs: {}
},
lc_namespace: [ "langchain_core", "messages" ],
content: "Generate a relevant search query for a library system using the 'search' tool.\n" +
"\n" +
"The 'author' you ret"... 259 more characters,
name: undefined,
additional_kwargs: {}
},
HumanMessage {
lc_serializable: true,
lc_kwargs: {
content: "what are books by jess knight",
additional_kwargs: {}
},
lc_namespace: [ "langchain_core", "messages" ],
content: "what are books by jess knight",
name: undefined,
additional_kwargs: {}
}
]
},
lc_namespace: [ "langchain_core", "prompt_values" ],
messages: [
SystemMessage {
lc_serializable: true,
lc_kwargs: {
content: "Generate a relevant search query for a library system using the 'search' tool.\n" +
"\n" +
"The 'author' you ret"... 259 more characters,
additional_kwargs: {}
},
lc_namespace: [ "langchain_core", "messages" ],
content: "Generate a relevant search query for a library system using the 'search' tool.\n" +
"\n" +
"The 'author' you ret"... 259 more characters,
name: undefined,
additional_kwargs: {}
},
HumanMessage {
lc_serializable: true,
lc_kwargs: {
content: "what are books by jess knight",
additional_kwargs: {}
},
lc_namespace: [ "langchain_core", "messages" ],
content: "what are books by jess knight",
name: undefined,
additional_kwargs: {}
}
]
}
await queryAnalyzerSelect.invoke("what are books about aliens by jess knight");
{ query: "books about aliens", author: "Jessica Kerluke" }