Cross Language Model (SLM/MLM/LLM) Metadata and Parameters: Part 1: Examining the Problem

Due to the nascency of generative AI and the competitive nature of for-profit companies (sometimes despite the “greater good”), every vendor assumes its generative AI solution is the best and the only one that matters (or should matter). That leaves developers stuck in the middle, facing lock-in, rapidly changing APIs, and a merry-go-round of frustration. A very basic and significant area of weakness is language model metadata and parameters. Let’s examine the basics of each before we design a solution. This is part 1 of a three-part article: in this part we will examine the problem; in part 2 we will analyze the details; and in part 3 we will design a solution.

Language Model Metadata

Metadata is data about data; in the case of Small, Medium and Large Language Models, that means contextual information about the model that affects its usage, application and performance. Unfortunately, most of the current language models have no API to retrieve metadata about their models. Some of the newer models, like Grok, are introducing a simple REST interface to list and retrieve some model information (a minimal sketch of such a listing call follows the figure below). This gap is serious because if we want an "agentic future" (like the one in the figure below) where many LLMs communicate and coordinate with each other, understanding a language model's capabilities is critical to knowing how to interact with it.

Agents Everywhere
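
To illustrate how thin today’s model-listing interfaces are, here is a minimal Python sketch that calls an OpenAI-compatible GET /v1/models endpoint, the kind of simple listing API mentioned above. The base URL, environment variable name, and response fields are placeholder assumptions for illustration, not any particular vendor’s documented contract.

    # Minimal sketch: list available models from an OpenAI-compatible endpoint.
    # The base URL and API-key environment variable are placeholders; substitute
    # whatever your provider documents.
    import os
    import requests

    BASE_URL = "https://api.example-llm-provider.com/v1"  # placeholder
    API_KEY = os.environ["LLM_API_KEY"]                    # placeholder variable name

    resp = requests.get(
        f"{BASE_URL}/models",
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    resp.raise_for_status()

    for model in resp.json().get("data", []):
        # Such listings typically return little more than an id and an owner;
        # none of the richer metadata (token limits, modalities, output formats)
        # discussed below is included.
        print(model.get("id"), model.get("owned_by"))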

Here is a basic set of model metadata that every model should expose (a sketch of such a metadata record in code appears after the list):

  1. Token limits – Specifies the maximum number of tokens (sub-word units) that the model can process as input and produce as output. For example, OpenAI’s “gpt-4o-mini” model supports an input context window of up to 128,000 tokens and output responses containing up to 16,384 tokens in a single request.
  2. Temperature and other parameter values (see the Parameters section below) – Temperature is a model parameter used to adjust the “creativity” of the responses. For example, in the Llama models by Meta, the temperature parameter is a floating-point number (float) that typically ranges from 0.0 to 2.0. The parameter controls the randomness in token sampling during text generation. We will examine additional model parameters (aka settings) in the Parameters section below.
  3. Whether the model is multi-modal, with a list of the input and output modalities supported. Most modern language models can handle more than just text; the currently common modalities are text, images, audio, and video. For example, a model that can take text as input and generate an image (text-to-image), or take an image as input and describe its contents in text (image-to-text), is a multi-modal language model.
  4. Output formats – how the output (the model’s response) is formatted for various types of content. For example, is text formatted using Markdown or just plain text? How are tables formatted? Are formulas output in LaTeX?
  5. Log probabilities – most models allow you to retrieve the log probabilities of the returned tokens (the logarithm of the probability the model assigned to each chosen token). A higher log probability means a higher likelihood for that token. These probabilities may be affected by the temperature parameter, and they can be used to calculate “perplexity,” a quantitative measure of the LLM’s uncertainty about the response (see the short sketch after this list).
  6. Structured output support – the capability of language models to generate output in a predefined, organized format such as JSON, XML, CSV, or another schema. For example, you could specify that the LLM extract information on books and put it in the following JSON format:

      {
          "Title": "Dune",
          "Author": "Frank Herbert",
          "Year": 1965,
          "Genre": "Science Fiction"
      }

  7. Reasoning capability – this one is debatable and a recent phenomenon. Some models take a stepwise or “chain of thought” approach and are better at complex math and logic problems; however, these models typically take longer to deliver a response, so from a UI perspective it is important to know whether your model is using any of these “thinking” methods.
  8. Other features or capabilities – LLMs have a growing list of additional features like assistants, web search, accessing other data sources (like weather), generating and executing code, etc.
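
As a concrete illustration of the perplexity calculation mentioned in the log probabilities item (5) above, here is a short Python sketch. It assumes you already have per-token log probabilities back from a model; the numbers shown are made up for the example.

    # Sketch: compute perplexity from per-token log probabilities.
    # The logprobs list is made-up example data; in practice each value is the
    # log probability the model reported for one generated token.
    import math

    logprobs = [-0.12, -1.35, -0.02, -2.10, -0.45]  # hypothetical values

    # Perplexity is the exponential of the negative mean log probability.
    # Lower perplexity means the model was more confident in its own output.
    perplexity = math.exp(-sum(logprobs) / len(logprobs))
    print(f"perplexity = {perplexity:.2f}")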
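
Pulling the list together, the metadata that every model should expose could be captured in a record along the following lines. This is a sketch of one possible shape, not an existing API; every field name is an assumption made for illustration, and we will return to what such a record should actually contain when we design a solution in part 3.

    # Sketch of a model-metadata record covering the attributes listed above.
    # This is a proposed shape, not any vendor's actual schema; every field
    # name is an assumption made for illustration.
    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class ModelMetadata:
        model_id: str
        max_input_tokens: int                      # input context window
        max_output_tokens: int                     # maximum completion length
        temperature_range: Tuple[float, float] = (0.0, 2.0)
        input_modalities: List[str] = field(default_factory=lambda: ["text"])
        output_modalities: List[str] = field(default_factory=lambda: ["text"])
        output_formats: List[str] = field(default_factory=lambda: ["markdown"])
        supports_logprobs: bool = False
        supports_structured_output: bool = False
        is_reasoning_model: bool = False           # stepwise / "chain of thought" models
        other_capabilities: List[str] = field(default_factory=list)  # web search, code execution, ...

    # Hypothetical instance using the gpt-4o-mini limits cited earlier.
    example = ModelMetadata(
        model_id="gpt-4o-mini",
        max_input_tokens=128_000,
        max_output_tokens=16_384,
    )
    print(example)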

Given the above metadata attributes, two things are immediately evident:

  1. Given the proliferation of new generative AI models and the continual increase in model capabilities, a simple API to retrieve such metadata is sorely needed.
  2. It is a sad testimony to the nascency of the generative AI space (despite the high level of fear mongering about its potential impact) that such an obvious need is largely unmet. While there are a few models (Grok, Gemini, etc.) that have an API to retrieve model information, most do not. Even the models with the ability to retrieve model metadata do not offer a robust set of metadata attributes like the one suggested above. Furthermore, the lack of such an obvious necessity will hurt the dynamic configuration necessary for a robust agent ecosystem.

Language Model Parameters

Here are the common settable model parameters:

  1. Max completion token limit – this is the output side of the token limits attribute discussed above. The output limit is the only “settable” limit because the input limit is a constant constraint of the language model itself and is immutable.
  2. Temperature – same as above.
  3. System instructions (aka a “Persona”) – this is a set of contextual instructions that should govern all prompts for as long as these instructions are in effect (not replaced). These are often called a “Persona” (or role) that the language model can assume. An example of a Persona would be, "You are a virtual fitness coach. Provide personalized workout recommendations and motivational advice tailored to the user's fitness goals and current activity level. Be encouraging, positive, and knowledgeable about exercise techniques and health tips."
  4. Other generation tuning parameters – there are other parameters besides temperature that affect how the neural network selects which token to return. Some of the common ones are top_k, top_p, frequency penalty and presence penalty. The top_k parameter limits the model to sampling from the k most likely tokens, reducing randomness by narrowing the choice set. The top_p parameter implements nucleus sampling by limiting the model to the smallest set of tokens whose cumulative probability reaches p, ensuring focus on high-probability outcomes. The frequency penalty parameter discourages the model from repeating tokens by penalizing token probabilities based on how often they have already occurred in the generated text. The presence penalty parameter encourages the model to introduce new tokens by reducing the probabilities of tokens already appearing in the text, promoting diversity. (A sketch showing these parameters passed on a chat call follows this list.)
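
As a concrete illustration of how these parameters are typically supplied today, here is a sketch of a chat call against an OpenAI-style chat completions endpoint, with the system instructions, completion token limit, temperature, and penalty parameters all passed per request. The field names follow the widely imitated OpenAI-compatible convention; the base URL, key variable, model name, and parameter values are placeholders.

    # Sketch: a chat call with generation parameters passed per request, in the
    # OpenAI-compatible style that most providers imitate. The base URL, key
    # variable, model name, and parameter values are placeholders.
    import os
    import requests

    BASE_URL = "https://api.example-llm-provider.com/v1"  # placeholder
    API_KEY = os.environ["LLM_API_KEY"]                    # placeholder variable name

    payload = {
        "model": "example-model",  # placeholder model name
        "messages": [
            # The system instructions ("persona") travel with every single call.
            {"role": "system", "content": "You are a virtual fitness coach. "
                                          "Be encouraging, positive, and knowledgeable."},
            {"role": "user", "content": "Suggest a 20-minute beginner workout."},
        ],
        "max_tokens": 512,         # max completion token limit
        "temperature": 0.7,        # randomness of token sampling
        "top_p": 0.9,              # nucleus sampling threshold
        "frequency_penalty": 0.5,  # discourage repeating tokens
        "presence_penalty": 0.3,   # encourage introducing new tokens
    }

    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=payload,
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])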

There is a significant flaw in the API for these LLM parameters because most LLMs accept them as optional parameters to the chat call (or streaming chat call). What this means is that these parameters are transactional parameters that can be changed with each and every call; however, that actually violates the separation of concerns for that method. Chat is a single concern where a user sends a prompt (question) and receives a completion (answer). Distinct from that is the global state of the LLM, with attributes like temperature, system instructions, token limits, etc. Setting global state is a separate concern from chatting with the LLM. Conflating the two is poor design. Proof of that is the hole it leaves when you try to query the current state: no such query exists, because you are expected to know what you just set the values to; however, that logic breaks down as soon as you add multiple users.
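
To make the objection concrete, here is a short sketch contrasting today’s per-call parameter passing with a hypothetical client that treats model settings as separate, queryable state. Every class and method name is invented for illustration; no current vendor API looks like this, and the actual design is deferred to part 3.

    # Sketch: contrasting per-call ("transactional") parameters with a
    # hypothetical client that keeps settings as separate, queryable state.
    # All names here are invented for illustration.

    class ConfigurableChatClient:
        """Hypothetical client that separates configuration from chatting."""

        def __init__(self):
            self._settings = {"temperature": 1.0,
                              "max_output_tokens": 1024,
                              "system_instructions": ""}

        def set_settings(self, **settings) -> None:
            # One concern: mutate the model's global state.
            self._settings.update(settings)

        def get_settings(self) -> dict:
            # The query that today's transactional chat APIs cannot answer.
            return dict(self._settings)

        def chat(self, prompt: str) -> str:
            # The other concern: prompt in, completion out, using current state.
            # (A real implementation would call the model here.)
            return f"[completion for {prompt!r} using {self._settings}]"

    client = ConfigurableChatClient()
    client.set_settings(temperature=0.2, system_instructions="You are a fitness coach.")
    print(client.get_settings())  # any user or agent can inspect the shared state
    print(client.chat("Suggest a 20-minute beginner workout."))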

Next Steps

So that leaves us with: “Houston, we have a problem.” The next step will be to analyze the details of each metadata attribute and parameter (including the efficacy of naming), examine various design patterns that match the problem space, and finally, examine design alternatives.

After we have completed a thorough analysis, we will be ready to design a solution!