Abacus.AI - Effortlessly Embed Cutting Edge AI In Your Applications.

Open-Source Generative AI

Abacus.AI is committed to open-source AGI and has significantly contributed to open-source AI and LLMs. Our research is open-sourced, reviewed, and published in top AI and ML conferences.

In addition, our open-source contributions to LLMs have led to several other open-source labs adopting some of our techniques and pushing the boundaries of enterprise and SOTA AI.

Here are the key contributions from Abacus.AI to open-source:

Our Benchmarks

LiveBench AI

With LLMs training on web-scale data, test-set contamination is a pervasive concern in LLM evaluation that can quickly render benchmarks obsolete. To mitigate this, many recent benchmarks crowdsource new prompts and evaluations from human or LLM judges; however, these can introduce significant biases, and they are not reliable for hard questions. For example, LLM judges make mistakes up to 40% of the time on challenging math and reasoning tasks.

To resolve this issue, we developed LiveBench, a contamination-limited benchmark that evaluates large language models on a variety of general intelligence capabilities including reasoning, coding, language understanding, data analysis, instruction following, and mathematics. By frequently releasing updated question sets and developing new tasks over time, we ensure that our results remain an accurate assessment of LLM capabilities as new models are released. Questions are constructed from a variety of recent sources such as research papers and news articles so they can be easily refreshed over time. LiveBench also uses only objective ground-truth judgment to ensure unbiased results.
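To make the contrast with judge-based evaluation concrete, here is a minimal sketch of objective ground-truth scoring: each model answer is compared against a stored reference answer, with no human or LLM judge involved. The `normalize` and `score_answers` helpers, the question IDs, and the exact-match rule are hypothetical simplifications for illustration, not LiveBench's actual scoring code.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return " ".join(answer.strip().lower().split())


def score_answers(predictions: dict[str, str], ground_truth: dict[str, str]) -> float:
    """Return the fraction of questions whose prediction matches the ground truth.

    Both arguments map a (hypothetical) question ID to an answer string.
    A question with no prediction counts as incorrect.
    """
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    correct = sum(
        normalize(predictions.get(qid, "")) == normalize(truth)
        for qid, truth in ground_truth.items()
    )
    return correct / len(ground_truth)


if __name__ == "__main__":
    truth = {"q1": "42", "q2": "Paris"}
    preds = {"q1": " 42 ", "q2": "London"}
    print(score_answers(preds, truth))
```

Because the score depends only on stored answers, refreshing the question set means replacing the questions and their reference answers; no judge needs to be recalibrated.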

Our leaderboard at livebench.ai presents results for all major model providers, both open source and proprietary. Since its release, LiveBench has remained one of the most popular LLM benchmarks and has been a featured result in major model release reports.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, Micah Goldblum

ICLR 2025 Spotlight

Our Models

Abacus.AI Dracarys Line

We have two Dracarys fine-tunes: one based on Llama-3.1 70B and another based on Qwen 72B.

These fine-tunes enhance the coding and reasoning abilities of the base LLM. Dracarys is built on a refined formula called the "Dracarys recipe," which applies optimized fine-tuning techniques to large-scale models such as Qwen2-72B, Qwen2.5-72B, and Llama-3.1 70B.

Our research and experiments on dataset construction and fine-tuning methodology significantly boost the coding and reasoning capabilities of these open-source LLMs.

The Dracarys Llama model improves on livebench.ai, most notably in coding, where its score rises from 33.49 to 36.59. Overall, it is a stronger model and ranks higher on the leaderboard than the base model.

Abacus.AI Smaug Line

Smaug is our most recent open-source line of models and has set a new standard for open-source LLMs, topping the HuggingFace OpenLLM leaderboard with an accuracy of 80.48%, nearly 2% better than the next best model. We have introduced several Smaug fine-tunes, with the flagship model being Smaug-72B. This model is a fine-tuned version of Qwen-72B, enhanced through our novel Direct Preference Optimization-Positive (DPOP) method.

Traditional DPO optimizes the relative preference between a preferred and a rejected completion, and it can do so while reducing the likelihood of the preferred completion itself. DPOP introduces a new term in the loss function that penalizes any reduction in the likelihood of the preferred (positive) completions. This innovation addresses a critical shortcoming in LLM fine-tuning and significantly improves model reliability and effectiveness.
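As a rough illustration of this idea, the sketch below computes a per-example DPO loss and a DPOP-style variant that adds a penalty whenever the policy assigns the preferred completion a lower log-likelihood than the reference model does. The function names, the `beta` and `lam` hyperparameters and their default values, and the exact placement of the penalty are assumptions made for illustration; the Smaug paper is the authoritative source for the formulation.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_w: float, ref_w: float, pol_l: float, ref_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    pol_* / ref_* are sequence log-likelihoods under the policy and the
    reference model; *_w is the preferred completion, *_l the rejected one.
    """
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    return -log_sigmoid(beta * margin)


def dpop_loss(pol_w: float, ref_w: float, pol_l: float, ref_l: float,
              beta: float = 0.1, lam: float = 5.0) -> float:
    """DPOP-style loss: the DPO margin minus a penalty on likelihood loss.

    The penalty is positive only when the policy's log-likelihood of the
    preferred completion falls below the reference model's (assumed form).
    """
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    penalty = max(0.0, ref_w - pol_w)
    return -log_sigmoid(beta * (margin - lam * penalty))


if __name__ == "__main__":
    # Both completions became less likely, but the rejected one dropped more:
    # plain DPO is satisfied by the margin, the DPOP-style loss is not.
    args = (-12.0, -10.0, -20.0, -12.0)
    print("DPO :", dpo_loss(*args))
    print("DPOP:", dpop_loss(*args))
```

When the policy keeps the preferred completion at least as likely as the reference model does, the penalty is zero and the two losses coincide; the variants differ only in the failure case the penalty is meant to catch.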

We also applied these techniques to make Smaug-34B and Smaug-Mixtral, both of which are leaders in performance in their classes.

Smaug-72B - The World’s Best Open-Source LLM!

BENCHMARK	GPT-3.5 (PROP)	GEMINI PRO (PROP)	MISTRAL-SMALL (PROP)	MISTRAL-MEDIUM (PROP)	SMAUG-72B (OPEN SOURCE)
MMLU	70.0	71.8	70.6	75.3	77.15
HellaSwag	85.5	84.7	86.7	88.9	89.27
ARC	85.2	unknown	85.8	88.9	76.02
Winogrande	81.6	unknown	81.2	88	85.05
GSM8K	57.1	unknown	58.4	66.7	78.7
TruthfulQA	unknown	unknown	unknown	unknown	76.67

Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, Colin White

Abacus.AI Giraffe

Open-source LLMs have continued to proliferate within the AI landscape and have shown performance comparable to the closed-source LLMs of OpenAI, Google, and others. However, open-source LLMs often come with a limited context length, which limits their utility for creating custom LLMs on your knowledge base. Because the context length constrains the amount of proprietary data you can send in a single API call, a larger context length is crucial for many tasks.

Many methods have been proposed for context-length extrapolation; in our extensive research, we tested each approach thoroughly and proposed two new approaches. One of these approaches is truncation, which has shown promising results.

Along with this research, we released Giraffe. Based on Llama-2, this model is the world’s first open-source LLM capable of handling a 32k context. This capability is vital for various applications, from complex information retrieval to sustained conversational AI and code generation on an existing use case. On an enterprise level, it can function as an AI brain for your business, boosting productivity, improving decision-making, and providing key insights.

Giraffe: Adventures in Expanding Context Lengths in LLMs

Arka Pal, Deep Karkhanis, Manley Roberts, Samuel Dooley, Arvind Sundararajan, Siddartha Naidu

arXiv Preprint

Abacus.AI Professor

TheProfessor showcases the innovative potential of merging different LLMs, leveraging their unique strengths to create composite models that excel across various domains. Developed with mergekit using pre-trained language models, TheProfessor provides broad conversational, reasoning, scientific, medical, and mathematical skills.

It is also helpful in concept development, from conception to implementation, including code and writing/reviewing/revising papers with citations.

Abacus.AI Liberated-Qwen

Open-source LLMs are notorious for not following system prompts, which makes them less suited for real-world usage and more vulnerable to end-user misuse. To fix this critical problem, we introduce Liberated-Qwen1.5-72B, the most performant uncensored model in the world.

Liberated was trained using open-source datasets, including SystemChat, a new dataset we created. (You can read more about this dataset below.) Liberated-Qwen performs best among open-source models on the HumanEval leaderboard. Its MMLU score of 77+ is the highest among open-source models.

While this model is entirely uncensored and liberated, it adheres strongly to system prompts, which allows you to set bounds on its behavior with an appropriate system prompt.

Our Datasets

DPO

We used three datasets to create our Smaug series of models. These datasets are intended for fine-tuning LLMs with the DPOP loss function. Arc_DPO_FewShot was used to test the level of understanding of science at the grade-school level. HellaSwag_DPO_FewShot contains common-sense inference questions that LLMs commonly struggle with. MetaMath_DPO_FewShot was used to measure math and reasoning skills in an LLM and to align models toward precise calculation.

SystemChat

SystemChat is a dataset of 7,000 synthetic conversations generated with Mistral-Medium and Dolphin-2.7-mixtral-8x7b. It was designed to teach models to comply with the system prompt over long multi-turn conversations, even with unusual or mechanical system prompts. Fine-tuning a model on this dataset makes it far more usable and harder to jailbreak. No guardrails or censorship are added to the dataset, so you can implement your own alignment layer.

WikiQA

The WikiQA task is to answer a question based on the information given in a Wikipedia document. We selected large Wikipedia documents and truncated them to produce multiple versions of each document, with sizes ranging from 2,000 to 16,000 tokens. For each document size, we also created multiple versions that place the question and answer text at different locations.

However, a model evaluated on a Wikipedia-based dataset could answer correctly from its pretraining corpus rather than from the context. To combat this, we created an “altered” dataset that consists only of questions with numerical answers, in which the answer in the document is changed to a different number. This ensures that an LLM that recalls the answer from its pretraining corpus gives a wrong answer.
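The altered-answer idea can be sketched as follows, assuming the alteration works by replacing the numeric answer in the document with a different number. A model that recalls the original value from pretraining is then scored as wrong, while a model that reads the context is scored as right. The function name `alter_numeric_answer`, the fixed `offset` rule for choosing the new number, and the sample text are hypothetical; the actual dataset construction may differ.

```python
import re


def alter_numeric_answer(document: str, answer: str, offset: int = 7) -> tuple[str, str]:
    """Replace every whole-number occurrence of `answer` in `document`.

    Returns the altered document and the new answer string. The new answer
    is simply `answer + offset` (an illustrative choice, not the real rule).
    """
    if not answer.isdigit():
        raise ValueError("answer must be a non-negative integer string")
    new_answer = str(int(answer) + offset)
    # Match the answer only when it is not part of a longer run of digits.
    pattern = re.compile(rf"(?<!\d){re.escape(answer)}(?!\d)")
    return pattern.sub(new_answer, document), new_answer


if __name__ == "__main__":
    doc = "The bridge opened in 1932 and was repainted in 1932. Serial 19321."
    altered_doc, altered_answer = alter_numeric_answer(doc, "1932")
    print(altered_doc)
    print(altered_answer)
```

Restricting the dataset to numerical answers makes this kind of substitution straightforward, since a number can be swapped without disturbing the grammar of the surrounding text.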

Other Datasets

We also created MetaMathFewShot, a new few-shot version of the popular MetaMath dataset. This allows the model to understand the concept of few-shot prompting. Our LongChat-Lines dataset was used to evaluate the performance of a model fine-tuned to operate on longer contexts.