While lawmakers in most countries are still debating how to put guardrails around artificial intelligence, the European Union is ahead of the pack, having approved a risk-based framework for regulating AI applications earlier this year.
The law came into force in August, although full details of the pan-EU AI governance regime are still being worked out; codes of practice, for example, are still being drafted. But over the coming months and years, the law's tiered provisions will start to apply to makers of AI applications and models, so the compliance countdown is already ticking.
Evaluating whether and how AI models are meeting their legal obligations is the next challenge. Large language models (LLMs) and other so-called foundational or general-purpose AIs will underpin most AI applications, so it makes sense to focus assessment efforts on this layer of the AI stack.
Step forward LatticeFlow AI, a spinout from public research university ETH Zurich that focuses on AI risk management and compliance.
On Wednesday, it published what it touts as the first technical interpretation of the EU AI Act, meaning it seeks to map regulatory requirements to technical ones, alongside an open-source LLM validation framework that draws on this work, which it calls Compl-AI ('compl-ai'… see what they did there!).
The AI model benchmarking initiative, which they also bill as "the first regulation-oriented LLM benchmarking suite," is the result of a long-running collaboration between the Swiss Federal Institute of Technology and Bulgaria's Institute for Computer Science, Artificial Intelligence and Technology (INSAIT), according to LatticeFlow.
AI model makers can use the Compl-AI site to request an evaluation of their technology's compliance with the requirements of the EU AI Act.
LatticeFlow has also published model evaluations of several mainstream LLMs, such as different versions/sizes of Meta's Llama models and OpenAI's GPT, along with an EU AI Act compliance leaderboard for Big AI.
The latter ranks the performance of models from Anthropic, Google, OpenAI, Meta and Mistral against the law's requirements, on a scale of 0 (no compliance) to 1 (full compliance).
Other evaluations are marked N/A where data is missing or where the model maker does not make the capability available. (NB: At the time of writing, there were also some negative scores recorded, but we were told this was down to a bug in the Hugging Face interface.)
The LatticeFlow framework evaluates LLM responses across 27 benchmarks, including "toxic completions of benign text," "biased responses," "following harmful instructions," "truthfulness," and "common sense reasoning," to name a few of the benchmarking categories it uses for its evaluations. Each model therefore gets a range of scores in each column (or else N/A).
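To make that scoring scheme concrete, here is a minimal, purely illustrative Python sketch (not LatticeFlow's actual code, schema, or benchmark names) of how per-benchmark results in the 0–1 range, with N/A for checks that couldn't be run, might be recorded and summarized for a single model:

```python
# Hypothetical example only: benchmark names and scores are invented for illustration.
from statistics import mean
from typing import Optional

# One model's results: benchmark name -> score in [0, 1], or None where the check
# couldn't be run (rendered as N/A on the leaderboard).
scores: dict[str, Optional[float]] = {
    "toxic_completions_of_benign_text": 0.92,
    "biased_responses": 0.85,
    "following_harmful_instructions": 0.97,
    "truthfulness": 0.61,
    "common_sense_reasoning": 0.58,
    "watermark_reliability": None,  # N/A: capability not exposed by the provider
}

evaluated = {name: s for name, s in scores.items() if s is not None}
print(f"Evaluated {len(evaluated)}/{len(scores)} benchmarks")
print(f"Mean score over evaluated benchmarks: {mean(evaluated.values()):.2f}")
for name, s in scores.items():
    label = "N/A" if s is None else f"{s:.2f}"
    print(f"  {name}: {label}")
```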
AI compliance is a mixed bag
So how did the major LLMs fare? There is no overall model score; performance varied depending on exactly what was being evaluated, but there were some notable highs and lows across the various benchmarks.
For example, every model performed strongly on not following harmful instructions, and all scored relatively well on not producing biased responses, whereas reasoning and general knowledge scores were far more mixed.
Elsewhere, recommendation consistency, which the framework uses as a measure of fairness, was particularly poor for all models: none scored above the halfway mark (and most scored well below it).
Other areas, such as the adequacy of training data and the reliability and robustness of watermarking, appear essentially untested due to the number of results marked N/A.
LatticeFlow notes there are certain areas where model compliance is trickier to evaluate, such as hot-button issues like copyright and privacy, so it's not claiming to have all the answers.
In a paper detailing the work on the framework, the scientists involved in the project highlight how most of the smaller models they evaluated (≤13 billion parameters) "scored poorly on technical robustness and safety."
They also found that "almost all examined models struggle to achieve high levels of diversity, non-discrimination, and fairness."
"We believe these shortcomings are primarily due to model providers disproportionately focusing on improving model capabilities, at the expense of other important aspects highlighted by the EU AI Act's regulatory requirements," they add, suggesting that as compliance deadlines start to bite, LLM makers will be forced to shift their focus onto areas of concern, "leading to a more balanced development of LLMs."
Given that no one yet knows exactly what will be required to comply with the EU AI Act, the LatticeFlow framework is necessarily a work in progress. It is also just one interpretation of how the law's requirements could be translated into technical outputs that can be benchmarked and compared. But it's an interesting start on what will need to be an ongoing effort to probe powerful automation technologies and try to steer their developers toward safer utility.
"The framework is a first step towards a full compliance-centered evaluation of the EU AI Act, but it is designed in a way that it can be easily updated to move in lockstep as the Act gets updated and the various working groups make progress," LatticeFlow CEO Petar Tsankov told TechCrunch. "The EU Commission supports this. We expect the community and industry to keep developing the framework towards a full and comprehensive AI Act assessment platform."
Summarizing the key takeaways so far, Tsankov said it's clear that AI models "have been predominantly optimized for capabilities rather than compliance." He also flagged "notable performance gaps," pointing out that some high-capability models can be on a par with weaker models when it comes to compliance.
Cyberattack resilience (at the model level) and fairness are areas of particular concern, per Tsankov, with many models scoring below 50% on the former.
"While Anthropic and OpenAI have succeeded in aligning their (closed) models to resist jailbreaks and prompt injections, open-source vendors like Mistral have put less emphasis on this," he said.
And given that “most models” perform equally poorly on fairness benchmarks, he suggested this should be a priority for future work.
On the challenges of benchmarking LLM performance in areas such as copyright and privacy, Tsankov explained: "For copyright, the challenge is that current benchmarks only check for copyrighted books. This approach has two major limitations: (i) it does not account for potential copyright violations involving materials other than these specific books, and (ii) it relies on quantifying model memorization, which is notoriously difficult.
“On privacy, the challenge is similar: the benchmark only attempts to determine whether the model has memorized specific personal information.”
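To illustrate what "quantifying model memorization" involves, and why it's such a blunt instrument, here is a minimal, hypothetical sketch of a naive verbatim-continuation probe of the kind the quote describes. This is not LatticeFlow's actual method; `generate_fn` is an assumed stand-in for whatever model interface is available, and the passage and threshold are invented for illustration:

```python
# Naive memorization probe (illustrative only): prompt the model with the start of a
# protected passage and measure how much of the true continuation it reproduces verbatim.
from difflib import SequenceMatcher
from typing import Callable

def memorization_score(passage: str, generate_fn: Callable[[str], str],
                       prefix_chars: int = 200) -> float:
    """Similarity in [0, 1] between the model's continuation and the real continuation."""
    prefix, continuation = passage[:prefix_chars], passage[prefix_chars:]
    model_output = generate_fn(prefix)
    return SequenceMatcher(None, model_output[:len(continuation)], continuation).ratio()

if __name__ == "__main__":
    # Dummy "model" that has memorized nothing, so the score should be low.
    dummy_model = lambda prompt: "an unrelated continuation " * 10
    sample_passage = "This sentence stands in for a protected text in the benchmark corpus. " * 6
    print(f"Memorization score: {memorization_score(sample_passage, dummy_model):.2f}")
```

A probe like this only detects near-verbatim regurgitation of the specific texts it tests, which is exactly the limitation Tsankov describes: it says nothing about material outside that corpus, and paraphrased or partially memorized content can slip under any similarity threshold.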
LatticeFlow wants the broader AI research community to adopt and improve the free and open source framework.
“We invite AI researchers, developers and regulators to join us to advance this evolving project,” Professor Martin Vechev of ETH Zurich and founder and scientific director of INSAIT, who is also involved in the work, said in a statement. “We encourage other research groups and practitioners to contribute by refining AI Law mapping, adding new benchmarks, and expanding this open source framework.
“The methodology can also be extended to assess AI models against future regulatory acts beyond the EU AI Law, making it a valuable tool for organizations working in different jurisdictions.”