What is function calling?
Function calling is the capability of a language model to return structured output for knowledge augmentation and execution control. This output takes the shape of function calls (in the programming sense of the word), which standardizes communication between the host system and the model.
When it comes to Open Source and self-hosted models, the pool of available options is still limited, and basing decisions purely on public forum posts can lead to popular choices that don't necessarily bring the best results. After experimenting with different models for the past few weeks, I found two that are very easy to set up and good enough at what they do.
This is not a technical deep-dive on the topic, though technical details and code are provided towards the end of the post. For an in-depth introduction to the topic we recommend external resources.
Agentic workflows with BitAgent-8B
Agentic workflows are processes where the model decides which tools it should invoke, and how it should follow up after each step. For this use case the model needs to be capable of multi-turn dialogue, and BitAgent-8B is one such open source model.
As an 8 billion parameter model in Q4 quantization, it works really well on commodity hardware. With memory requirements coming in under 10 GB, you can expect around 7 tokens/second when running on a laptop-class CPU such as the Ryzen 5 PRO 4650U (ranked 1158th across all CPUs benchmarked at the time of writing this post).
That is not the kind of performance that will impress anyone expecting a real-time chat experience, but it's a great model to run locally for prototyping and demo purposes. For integration work you'd want to run it on a GPU, either locally or with a cost-effective provider, which also makes it possible to run the full model.
Applications, considerations, and limitations
Without going all in on agentic products, this model can be used effectively in domains where information is still processed manually because existing systems can't handle natural language. It works great as an integration tool between external sources (emails, invoices, documents) and internal data management systems.
In my usage of BitAgent-8B, my observation is that it's very good at calling functions consistently, and function hallucination was not a problem.
It was, however, easy to see it go into an endless function-calling loop, or call functions at random when the input text didn't match any function in the system. This is important to keep in mind if you're going to write the agent integration yourself without third-party frameworks, which generally have built-in workarounds for these issues.
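If you do roll your own integration, the workaround can be as small as capping the number of tool-calling turns. Below is a minimal sketch of such a guard, assuming hypothetical sendChat() and dispatchTool() helpers that wrap the HTTP call and the local function dispatch.

<?php
// Hedged sketch of a loop guard for a hand-rolled agent integration.
// sendChat() is assumed to return the first choice of an OpenAI-compatible
// chat completion; dispatchTool() is assumed to run the requested local
// function. Both are hypothetical helpers; the point here is the iteration cap.
const MAX_TOOL_TURNS = 5;

$messages = [['role' => 'user', 'content' => 'Process the attached invoice.']];

for ($turn = 0; $turn < MAX_TOOL_TURNS; $turn++) {
    $reply = sendChat($messages);

    // No tool call requested: the model produced a final answer, stop here.
    if (($reply['finish_reason'] ?? '') !== 'tool_calls') {
        break;
    }

    foreach ($reply['message']['tool_calls'] as $toolCall) {
        // Keep the assistant's tool call request in the history, then append
        // the tool result, linked back via the same tool_call_id.
        $messages[] = $reply['message'];
        $messages[] = [
            'role'         => 'tool',
            'content'      => dispatchTool($toolCall),
            'tool_call_id' => $toolCall['id'],
        ];
    }
}
// Falling out of the loop after MAX_TOOL_TURNS prevents an endless
// function-calling cycle from blocking the whole workflow.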
Another thing to keep in mind is that function definitions, function calls, and user prompts are all part of the information sent to the model. That means you always have to watch context length and balance a response time that grows linearly with the input size.
For practical and effective usage, start with agentic workflows that run asynchronously, and make it easy to adjust prompts and function definitions without redeploying the whole system.
Feature-based natural language interfaces with functionary-small
If you think back to the AI assistants of the 2010s, like Siri, Cortana, Alexa and Google, they were all natural language interfaces on top of limited feature sets. With function calling your product can offer the same user experience to your end-users.
Functionary-small is another 8 billion parameter Open Source model with function calling support. Compared to BitAgent-8B it's not advertised as a multi-turn model. In practice it will handle multi-turn dialogue, though not consistently. It's better used in places where a single function call, or a parallel set of calls, is enough.
Performance is very close across the two models, but functionary-small is better out of the box at stopping when the input text doesn't match any function in the system.
Natural language interfaces like these shine in the places within a product where features tend to be hidden behind multiple levels of navigation and forms with complex interactions.
Consideration of other open-source models
If you'd like to research this space on your own, there are a few pointers I can give you. While function calling is not a new capability (it's been around since 2023), finding clear and up-to-date information is not as easy as it should be.
The first resource you should familiarize yourself with is the Berkeley Function Calling Leaderboard.
As a secondary signal source you can check user reviews on r/LocalLLaMA. I like to be on the lookout for the negative reviews on the subreddit, because they tend to better match my experience.
You can also use popular Open Source models (Mistral 7B, Mistral NeMo, etc.) with custom prompting and few-shot strategies. There are reports of these working well in practice, and some advertised function calling models are fine-tunes of other models. My preference was to choose models that work out of the box instead.
Technical notes and reference source code
As with other parts of the ecosystem, while there are small variations across models, there's convergence on OpenAI's API format, and models that run through Ollama, llama.cpp, etc. are drop-in replacements.
Functions are defined in terms of JSON Schema, but in the examples that follow I'm going to use YAML for readability. And of course, in practice you'd automatically generate the schemas from host language function definitions.
» To experiment locally you can use our reference playground code on GitHub.
Let's take a look at a PHP function definition and the associated schema by which the model will know when to call the function.
/** @param array{name: string} $data */
function predict_age_by_name(array $data): false|string { ... }
type: function
function:
  name: predict_age_by_name
  description: Guess person age from their name.
  parameters:
    type: object
    required:
      - name
    properties:
      name:
        type: string
        description: Name of person we wish to guess the age for
After removing a couple of layers of nesting in the request data, this is an example request we would send to the language model.
model: placeholder-for-the-model-you-use
tools:
  - type: function
    function:
      name: predict_age_by_name
      description: Guess person age from their name.
      parameters:
        type: object
        required:
          - name
        properties:
          name:
            type: string
            description: Name of person we wish to guess the age for
messages:
  - role: user
    content: "My name is John Doe, how old do you think I am?"
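For completeness, here is a hedged sketch of how that request could be sent from PHP to an OpenAI-compatible endpoint. The URL assumes a local Ollama server, so adjust it (and the model name) to whatever you actually run.

<?php
// Hedged sketch: POST the request above to an OpenAI-compatible endpoint.
// The URL assumes a local Ollama server; the model name is a placeholder.
$request = [
    'model' => 'placeholder-for-the-model-you-use',
    'tools' => [[
        'type' => 'function',
        'function' => [
            'name'        => 'predict_age_by_name',
            'description' => 'Guess person age from their name.',
            'parameters'  => [
                'type'       => 'object',
                'required'   => ['name'],
                'properties' => [
                    'name' => [
                        'type'        => 'string',
                        'description' => 'Name of person we wish to guess the age for',
                    ],
                ],
            ],
        ],
    ]],
    'messages' => [
        ['role' => 'user', 'content' => 'My name is John Doe, how old do you think I am?'],
    ],
];

$context = stream_context_create([
    'http' => [
        'method'  => 'POST',
        'header'  => "Content-Type: application/json\r\n",
        'content' => json_encode($request),
    ],
]);

// Decode the chat completion response into an associative array.
$response = json_decode(
    file_get_contents('http://localhost:11434/v1/chat/completions', false, $context),
    true
);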
To which the model would reply that it needs to call our function to proceed further.
choices:
  - finish_reason: tool_calls
    index: 0
    message:
      role: assistant
      content: null
      tool_calls:
        - type: function
          function:
            name: predict_age_by_name
            arguments: '{"name":"John Doe"}'
          id: TUURKjDCMOfbxPAdklzyiz6BYDJeRIMV
This is where our code parses the model response, calls the requested function, and feeds the result back to the model as if it were part of the original conversation. Note that we pass the original prompt, the previous model response (the function call request), and the tool response that references it via the same id value, all in the message list. To keep the output simple, only the last message in the list is shown.
role: tool
content: 'Predicted age: 74'
tool_call_id: TUURKjDCMOfbxPAdklzyiz6BYDJeRIMV
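In PHP, that dispatch-and-feed-back step could look roughly like the sketch below; $request and $response are assumed to come from the earlier request sketch, and predict_age_by_name() is the function defined above.

<?php
// Hedged sketch: dispatch the tool call requested by the model and append
// the result to the message list for the follow-up request.
// $response is the decoded chat completion from the previous sketch.
$assistantMessage = $response['choices'][0]['message'];

foreach ($assistantMessage['tool_calls'] ?? [] as $toolCall) {
    // Arguments arrive as a JSON string matching the schema we declared.
    $arguments = json_decode($toolCall['function']['arguments'], true);

    // In a real integration you would dispatch by $toolCall['function']['name'];
    // here there is only one function to call.
    $result = predict_age_by_name($arguments);

    // Keep the assistant's tool call request in the history, then append the
    // tool result, linked back by the same tool_call_id.
    $request['messages'][] = $assistantMessage;
    $request['messages'][] = [
        'role'         => 'tool',
        'content'      => $result === false ? 'Prediction failed.' : $result,
        'tool_call_id' => $toolCall['id'],
    ];
}

// A second POST with the extended message list yields the final, human-readable answer.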
At this point the model will do what Generative AI does best and generate a sentence describing the result of the function call.