Concepts
Here are the core concepts for working with the Vertesia Platform.
Large Language Models (LLM)
Generative AI is based on the capability to interpret human language and to reply to questions in human language.
Here are the key points to know about LLMs:
- Large Language Models (LLMs) are capable of interpreting human languages (including programming languages, for instance).
- Human language is found in unstructured (image, video, audio) and structured (document, database record) content.
- Models are pre-trained on very large sets of content (public, private, or hybrid).
- Models can be fine-tuned to further adapt them, a posteriori, to specific knowledge (e.g. enterprise content such as supplier contracts or corporate policies) - but that's not necessarily a good approach.
- Retrieval Augmented Generation (RAG) is a good alternative to fine-tuning for adding knowledge to models (see the sketch after this list).
- Though chatbots were the first well-known application of generative AI, they represent only a small fraction of the full potential of Gen AI.
- LLMs can be seen as processing entities that can be invoked in an automated way, enabling a very wide range of use cases.
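To make the RAG pattern concrete, here is a minimal sketch in JavaScript. The `retrieveTopK` and `complete` functions are placeholders for a vector search and a model completion call; they are illustrative assumptions, not Vertesia APIs.

```js
// Minimal RAG sketch: instead of fine-tuning, fetch relevant knowledge at
// request time and inject it into the prompt.
async function answerWithRag(question, { retrieveTopK, complete }) {
  // 1. Retrieve the documents most relevant to the question
  //    (typically via embedding similarity over a vector index).
  const documents = await retrieveTopK(question, 3);

  // 2. Assemble a prompt that grounds the model in the retrieved context.
  const context = documents.map((doc) => doc.text).join("\n\n");
  const prompt = `Answer using only the context below.\n\nContext:\n${context}\n\nQuestion: ${question}`;

  // 3. The model answers from the injected knowledge, with no retraining.
  return complete(prompt);
}
```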
A bit of jargon
| Term | Description |
| --- | --- |
| Context Window | Imagine you have a really big notebook that you use to take notes while you are watching a movie. The notebook can only hold so many pages at a time, and once it is full, you can only look at what is written on those pages. In the world of large language models (LLMs), the notebook is called the context window. It is the amount of text (or tokens) that the model can “remember” or consider at one time when it is answering a question or having a conversation. Each model has a distinct context window. |
| Embeddings | Embeddings are high-dimensional vectors that represent tokens (words) in a way that captures their semantic meaning and relationships. These vectors are learned during the training of the LLM and are crucial for the model's ability to understand and generate language. |
| Max tokens | The maximum number of tokens (words/characters) for the model to generate in the output. In some models, these output tokens count against the context window length. |
| Similarity | If two tokens (words) are very similar in meaning, like "happy" and "joyful", their vector representations will be close to each other. If the words are very different, like "happy" and "fast", their vectors will be farther apart. This numerical way of arranging words is what we call an embedding. |
| Token | A unit of text that the model processes. Tokens can be words, subwords, characters, or even punctuation marks. The process of breaking down text into these smaller units is known as tokenization. This allows the model to handle and generate text more efficiently by working with manageable pieces of information. |
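To illustrate embeddings and similarity, here is a small JavaScript sketch comparing toy embedding vectors with cosine similarity, a common closeness measure for embeddings. The three-dimensional vectors are made-up values; real embeddings have hundreds or thousands of dimensions.

```js
// Toy embedding vectors: real ones come from an embedding model.
const embeddings = {
  happy:  [0.91, 0.10, 0.33],
  joyful: [0.88, 0.14, 0.30],
  fast:   [0.05, 0.92, 0.41],
};

// Cosine similarity: close to 1 means very similar meaning,
// close to 0 means unrelated.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity(embeddings.happy, embeddings.joyful)); // high (~0.99)
console.log(cosineSimilarity(embeddings.happy, embeddings.fast));   // lower (~0.28)
```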
Prompt Templates
Prompt Templates are the building blocks used to create prompts. Prompt templates are assembled to define the prompt for a task (an Interaction). Two formats are supported:

- **JS Template**: A JavaScript template engine, running in a jailed environment. You can use standard JavaScript string replacement syntax (`${var}`), as well as control blocks (`for`, `if`, `else`, etc.) and array functions (`map`, `reduce`, `filter`, etc.). The template needs to return a string. See the example after this list.
- **Plain Text**: A simple plain text format, with no variable replacement. Useful for application context or safety prompts.
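For illustration, here is what a JS Template body might look like. The variables `task` and `documents` are hypothetical interaction inputs, not predefined names, and the exact template syntax is sketched from the description above.

```js
// A hypothetical JS Template: interaction variables are interpolated with
// ${...}, and array functions like map() build repeated sections. The
// whole expression returns a single string.
`You are assisting with the following task: ${task}.

${documents.map((doc, i) => `Document ${i + 1}: ${doc.title}\n${doc.text}`).join("\n\n")}

Answer using only the documents above.`
```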
Interactions
Interactions define the tasks that the LLM is requested to perform.
An interaction is defined by the following main components:

- **Name**: The name of the interaction.
- **Description**: A description of the interaction.
- **Prompt Segments**: A list of prompt templates to be rendered as part of the final prompt.
- **Schema**: The JSON Schema requested from the generative model for the response. It is also used to validate the response. See the example after this list.
- **Configuration**: The environment and model to execute the interaction on, along with execution parameters.
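For example, an interaction extracting contract metadata might declare a schema like the one below. The fields shown are hypothetical; any valid JSON Schema can be used.

```js
// A hypothetical JSON Schema for an interaction's response: the model is
// asked to produce output matching it, and the response is validated
// against it.
const schema = {
  type: "object",
  properties: {
    supplier: { type: "string", description: "Name of the supplier" },
    start_date: { type: "string", format: "date" },
    total_value: { type: "number", description: "Total contract value in USD" },
  },
  required: ["supplier", "start_date"],
};
```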
Runs
A run is the execution of an interaction: it captures both the request to and the response from the generative model.
Runs have the following statuses:

- `created`: The run has been created but not yet started. This is typically the case when waiting for streaming to start from the client.
- `processing`: The run is currently executing.
- `completed`: The run has completed successfully.
- `failed`: The run has failed. The failure reason is in the `error` field.
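A client typically polls a run until it reaches a terminal status. Here is a minimal sketch of that pattern; `getRun` and the shape of the run object are illustrative assumptions, not the actual client API.

```js
// Hypothetical polling loop over the run statuses above. `getRun` stands
// in for whatever call fetches a run by its id.
async function waitForRun(getRun, runId, intervalMs = 1000) {
  for (;;) {
    const run = await getRun(runId);
    switch (run.status) {
      case "created":     // not started yet (e.g. waiting for streaming)
      case "processing":  // still executing
        await new Promise((resolve) => setTimeout(resolve, intervalMs));
        break;
      case "completed":
        return run;       // success: the response is available on the run
      case "failed":
        throw new Error(`Run failed: ${run.error}`);
      default:
        throw new Error(`Unexpected run status: ${run.status}`);
    }
  }
}
```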
Environments
Environments connect to LLM inference providers, which are the execution platforms running generative models.
We currently support environments for the following inference providers:

- `azure_openai` - Azure OpenAI Service
- `bedrock` - Amazon Bedrock
- `groq` - Groq
- `huggingface_ie` - Hugging Face's Inference Endpoints
- `mistralai` - Mistral AI's La Plateforme
- `openai` - OpenAI
- `replicate` - Replicate
- `togetherai` - TogetherAI
- `vertexai` - Google's Vertex AI
- `watsonx` - IBM's watsonx.ai
In addition to the core inference providers above, we have created virtual providers that assemble models and platforms into a virtual, synthetic LLM and offer several balancing and execution strategies:

- `virtual_lb` - a synthetic environment that allows load balancing and failover between multiple models
- `virtual_mediator` - a synthetic environment that allows multi-head execution and LLM mediation
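As a conceptual illustration of what `virtual_lb` provides, here is a failover pattern sketched in plain JavaScript. This is not the platform's implementation or configuration format, only the underlying idea.

```js
// Conceptual failover sketch: try each model in order until one succeeds.
// `models` is a list of { name, invoke } entries, where `invoke` is a
// placeholder for a call to the underlying provider.
async function invokeWithFailover(models, prompt) {
  const errors = [];
  for (const model of models) {
    try {
      // The first healthy model wins; a real load balancer would also
      // distribute traffic and track provider health.
      return await model.invoke(prompt);
    } catch (err) {
      errors.push(`${model.name}: ${err.message}`);
    }
  }
  throw new Error(`All models failed:\n${errors.join("\n")}`);
}
```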