Open-source software is widely used by companies, and now they are adopting “open source” AI models as well. Just last month, the Open Source Initiative released a definition of “open-source AI”. This article thus examines what “open-source AI models” are and how “open” they need to be – for example, balancing the need to release training data so that users can understand how a model works against IP and data protection concerns. It also examines what new risks they pose compared to “traditional” open-source software and closed AI models. Finally, we look at what measures are being instituted to mitigate the risks and ensure the models’ safe use, such as new types of licences (known as “Responsible AI Licences”), and regulatory controls in legislation and ongoing monitoring. For example, are open-source AI models subject to lighter regulatory regimes since they are distributed for free, and how is this balanced against the need for transparency on the model’s capabilities and development process?
The topic of “open-source AI models” is of interest because Singapore’s Infocomm Media Development Authority (IMDA) proposed a “shared responsibility” approach in May 2024, where responsibility should be allocated amongst the players in the AI development chain[1] according to the level of control they have, so that they can take the necessary action to protect end-users.[2] The IMDA has indicated that when apportioning responsibility, we “may also need to consider different model types (e.g. closed-source, open-source or open-weights) given the different levels of control that application deployers have for each model type”,[3] and that the details of how these responsibilities will be allocated are still being worked out.[4]
We will thus explore the different types of models, which exist on a spectrum described by the IMDA as follows:[5]
- Closed-source models have all information about them kept private by the developer;
- Open-weights models make available the pre-trained parameters/weights of the model, but not the training code, dataset, methodology, etc.; and
- Open-source models make available the full source code and information required for re-training the model from scratch, including the model architecture code, training methodology and hyperparameters, the original training dataset and documentation.
Part 1: What does it take to be an “open source” AI model? What are the difficulties in considering them “open-source”, and how can these issues be overcome?
In this article, an “AI model” is created when algorithms (a set of steps/instructions to reach an outcome) are applied to datasets to analyse the data, leading to an output that is examined and the algorithm iterated, until the most appropriate model emerges.[6] An AI model is thus what is “learnt” by the algorithm from the data. Algorithms can be expressed in, for example, mathematical language, pseudocode or programming languages.[7]
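To make the distinction concrete, the short sketch below is a hypothetical illustration (not drawn from any of the frameworks cited above): a simple algorithm – fitting a straight line by gradient descent – is applied to a small dataset, and the numbers w and b that emerge are the “model” learnt from that data.

```python
# Hypothetical illustration: an "algorithm" (gradient descent on squared error)
# applied to a "dataset" produces a "model" (the learned parameters w and b).

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]  # (input, expected output)

w, b = 0.0, 0.0               # untrained model parameters ("weights")
learning_rate = 0.01

for _ in range(5000):         # the algorithm: repeat simple update steps
    grad_w = grad_b = 0.0
    for x, y in data:
        error = (w * x + b) - y
        grad_w += 2 * error * x
        grad_b += 2 * error
    w -= learning_rate * grad_w / len(data)
    b -= learning_rate * grad_b / len(data)

print(f"learned model: output = {w:.2f} * input + {b:.2f}")
```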
To understand the concept of “open-source AI models”, we will draw on concepts from open-source software; the differences between the two will come to light in the paragraphs below.
By way of background, open-source software is software where the source code (i.e. the instructions for a computer to execute, written in languages such as Python, C++ and Java) is made available to anyone to inspect, modify and enhance.[8] It is in contrast to “proprietary” or “closed-source” software, where only the person or organisation who created the source code can access and modify it.[9]
For software to be considered open-source, the Open Source Initiative (OSI) defines 10 criteria to be met[10] (the most relevant of which are set out below):
- the program must include source code, and must allow distribution in source code as well as compiled form;
- the licence must allow modifications and derived works, and must allow them to be distributed under the same terms as the licence of the original software;
- the licence must not discriminate against any person or group of persons;
- the licence must not restrict anyone from making use of the program in a specific field of endeavour – e.g. it may not restrict the program from being used in a business, or from being used for genetic research.
The last two of these criteria – no discrimination against persons or groups, and no restriction on fields of endeavour – are most affected when we look at “open-source AI models” and the new licence terms developed for them:
- Issue 1: how much of the model and its associated/underlying information (e.g. the training data) must be shared so that the user can understand how it works?
- Issue 2: what happens if the licence restricts how the AI model can be used, given the growing adoption of “Responsible AI Licences” with use restrictions?
The IMDA has also highlighted that “[t]oday, however, there is a lack of information on the approaches being taken to ensure trustworthy models. Even in cases of ‘open-source’ models, some important information like the methodology and datasets may not be made available.”[11] Therefore, can we still say that an AI model is “open source” if key criteria are not met, as we would then be “watering down” the requirements for open-source models? In exploring these issues, we will also look at the positions taken by recognised institutions like the OSI, as well as regulators in the EU and USA.
Issue 1: How much must be shared to allow meaningful use and modification of the model?
Unlike open-source software, with AI models more than just the source code is needed to understand how the model works. There is no consensus on exactly what other components must be shared for an AI model to be considered open-source, but they generally include the following:[12]
- Training code – the instructions that guide the model training, optimising the model weights to improve the model’s performance on the training tasks (this is distinct from inference code, which implements the model after it is trained and allows it to perform tasks like writing and classifying images);[13]
- Training data – one cannot understand how the model works, or how to modify it, by looking only at the code; one must also know what kind of data it was trained on. This opens up a set of legal issues not found in traditional open-source software: there are limitations to sharing training data, such as IP, privacy and confidentiality concerns, which would be exacerbated if all the training data were shared as-is.
Practically, model developers may also be reluctant to disclose all their training data if doing so would expose the fact that they trained on third parties’ copyrighted material.[14] Even if they had obtained a licence to use that material for training, that licence may not extend to sharing a copy of that material with the wider public.
However, developers do not necessarily need access to all the training data to modify an AI model – what is helpful is detailed information about the dataset (the dataset-building process rather than the dataset itself), such as where the data came from, so that they can build similar datasets to improve on the existing ones.[15] Hence, the issue is how much of the training data must be made available.
- Model weights – the numerical values, optimised during training, that determine how much influence each factor has over an output; they represent the connections the model has learned from the material it was trained on.[16] These are crucial in determining the effectiveness of the model, but are not necessarily covered under traditional open-source licences as they are not source code. (A short sketch of how these components fit together appears after this list.)
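As a rough, hypothetical sketch (not taken from any particular model release), the components above fit together as follows: the training code produces the weights from the training data, while the inference code only loads the weights and applies them. A release that publishes only the saved weights and the inference code is an “open-weights” release; re-training from scratch additionally requires the training code and the training data (or enough information to rebuild an equivalent dataset).

```python
import json

# --- training code: turns the training data into model weights --------------
def train(dataset):
    """Toy 'training': learn an average scaling factor from (x, y) pairs."""
    factor = sum(y / x for x, y in dataset) / len(dataset)
    return {"factor": factor}                  # the "model weights"

# --- inference code: applies the weights, needs no access to the data -------
def predict(weights, x):
    return weights["factor"] * x

training_data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]   # the training data
weights = train(training_data)

# An "open-weights" release might publish only this file plus predict()...
with open("weights.json", "w") as f:
    json.dump(weights, f)

# ...which downstream users can load and run without the training code or data.
with open("weights.json") as f:
    released_weights = json.load(f)
print(predict(released_weights, 10.0))
```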
To this end, the OSI recently released a definition of “open source AI” (in October 2024)[17] to cover these required components. Regardless of how the offering is characterised (as an AI system, model or weights), it must be made available under terms and in a way that grants users the freedom to use, study, modify and share the system (with or without modifications) for any purpose.
The OSI elaborates that in order to exercise these freedoms, the user must have access to the “preferred form” to make modifications to the system or its components, which includes the following elements:
- Training data – there must be “sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system”.[18] This must include a description of all data used for training (including data that cannot be shared due to privacy and copyright issues),[19] disclosing the provenance of the data, its scope and characteristics, how it was obtained and selected, labelling procedures, as well as data processing and filtering methodologies. It must also include a list of where to obtain publicly available training data and data obtainable from third parties;
- Code (both training code and inference code, and including code used for processing and filtering data);
- Model weights.
Nevertheless, what training data must be shared to meet the definition is very much open to interpretation, given that the parallel system need only be “substantially equivalent”, and this would depend on the relative “skill” of the person trying to reproduce the system.[20] Also, even if the developer discloses the third-party data source, a subsequent user of the model may not be granted the same permissions by the third party.[21] It is understandable that sensitive and confidential data cannot simply be shared openly, but where the balance will be struck under this definition remains to be seen in practice. The French data protection regulator has suggested that synthetic data could be published instead, or subsets that are representative of the full dataset (with any personal data removed).[22]
Issue 2: Can you limit what the model can be used for?
The second issue is not inherent in what is released with the model (e.g. how much information is made available, as discussed above), but is a feature of the licence attached to it: the licence limits the freedom to make use of the model in any field of endeavour.
Responsible AI Licences (RAIL)[23] that contain use restrictions are gaining traction.[24] These restrictions not only bind the organisation that uses or builds on the model; the organisation must also incorporate them into its own downstream licence, so that anyone using the modified model is bound as well.
Some of the restricted uses – e.g. not to use the model to “generate or disseminate verifiably false information with the purpose of harming others” – reflect what the law already prohibits, so the overwhelming majority of users would not use the model in that manner even without the licence restriction. However, some RAILs contain unique and specific restrictions – e.g. not to use the model “to provide medical advice and medical results interpretation”[25] – so users must review the licence terms very carefully. Nevertheless, if someone is going to use an AI model for nefarious purposes, the prospect of breaching a licence condition is not going to stop them.
Separately, there are also licences that restrict who can use the model – e.g. if a person’s products or services have more than 700 million monthly active users, the person must request permission to use the model, which the developer can decide not to grant.[26] This needs to be reconciled with the OSI criterion that the licence must not discriminate against any person or group of persons (even if they are competitors).
Taking a pragmatic approach to promoting the safety of AI models regardless of how “open” they are
At the end of the day, what is the practical effect of coming up with a definition for “open source” AI models? If there are no tangible benefits to meeting the OSI definition (or any other emerging definition set to become an industry standard), then companies have no incentive to meet it, as doing so would involve releasing a sizeable amount of training data together with their AI model, and possibly exposing themselves to liability by making it public that they have trained their model on copyrighted materials without seeking permission from the rights holders.[27]
Furthermore, some industry players have expressed the view that there is no single open-source AI definition, as it would depend on the nature and purpose of the AI model.[28] They would support a definition of open source that “most suits what they’re pushing at the moment”.[29]
On the regulatory front, the EU AI Act does not define “open-source” AI, but “targets certain categories of licences”,[30] covering licences for software and data, including models, “that allows them to be openly shared and where users can freely access, use, modify and redistribute them or modified versions thereof”.[31] Unlike the OSI definition, it does not require training data to be shared[32] for a model to be considered open-source. But in the context of the EU AI Act, this makes sense, because there are reduced obligations for certain AI models that are not monetised and fall under a “free and open-source licence”, so the definition must be one without room for ambiguity (although one could still argue that one cannot freely modify a model without access to a certain amount of training data).
The USA has also avoided using the term “open source”, instead referring to “dual-use foundation models with widely available model weights”.[33] Its emphasis is on whether model weights are shared, the capability of the model and the resultant risks posed, rather than on whether any training data must be shared or whether there can be restrictions on downstream uses of the model, as such restrictions would be “meaningless when addressing threats posed by sophisticated threat actors who are already operating outside the law and thus don’t care about licence terms”.[34]
Ultimately, the reality is that any person can open up access to their AI models without meeting the OSI open-source definition, under any licence terms they choose. Hence, searching for a definition may not matter as much as identifying what is shared in the case of each AI model made available, as the degree of sharing poses different risks (which we will address in Part 2).
In the next Parts, the term “open-source” AI models will refer to the spectrum of “open-source” and “open-weights” models, and not solely to models which conform to the OSI definition. An asterisk will be used to mark such usage. A better term may in fact be “unsecured” AI models (proposed by the Centre for International Governance Innovation), which “convey[s] not only the literal choice to not secure the weights of these AI systems but also the threat to security posed by these systems”.[35]
Part 2: What are the risks posed by open-source* AI models?
We will cover the risks from the perspective of both users and developers of the models. We will not cover risks of erroneous, biased or toxic output, deepfakes, or leakage of confidential training data, as these concerns apply regardless of whether the model is open-source* or not.
Persons who use or modify the models released
A key concern is that if users do not have access to the training dataset, they will not know exactly what the model was trained on, which can be an issue when it comes to assessing the quality of the model and its appropriateness for the use case.[36] Nevertheless, model cards have been raised as a solution to this issue (which we will discuss in Part 3).
Also, it is not clear who bears liability if the developer trained the released model on copyrighted material and the user then uses or modifies it: unlike a proprietary model (where the developer can offer an indemnity for IP infringement), open-source* models are released without any warranties or indemnities relating to IP infringement.
There can also be cybersecurity vulnerabilities inherent in the model, although this is not unique to open-source AI models.[37]
Persons who release the open-source* models (i.e. developers)
First, developers are concerned about downstream misuse of their model. When the model weights are released, it becomes easier for people to circumvent safety features (e.g. content filters, blocklists and prompt shields that prohibit certain prompts) in the model.[38] This is because many filters are implemented post-hoc as part of the model’s inference code, “rather than fundamentally changing the behaviour of the model itself”.[39] It can thus be a matter of removing one or two lines of code to make the model perform in undesired ways.[40]
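A minimal, hypothetical sketch of why such post-hoc safeguards are fragile (the blocklist, function names and stand-in model below are all illustrative, not taken from any actual release): the filter is just an ordinary function wrapped around the model in the inference code, so anyone holding the weights and that code locally can call the model directly and skip it.

```python
# Hypothetical sketch: a stand-in "model" and a post-hoc filter layered on top
# of it in the inference code.
def model(prompt):
    return f"(model output for: {prompt})"

BLOCKLIST = ("blocked topic",)          # illustrative list of prohibited phrases

def generate(prompt):
    """Raw inference: runs the model with no safeguards."""
    return model(prompt)

def safe_generate(prompt):
    """The released inference code wraps the model in a filter like this."""
    if any(term in prompt.lower() for term in BLOCKLIST):
        return "Request refused."
    return generate(prompt)

print(safe_generate("tell me about the blocked topic"))  # refused by the wrapper
# A user with the weights and inference code on their own machine can simply
# call generate() instead -- the safeguard is one edited line away.
print(generate("tell me about the blocked topic"))
```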
Second, once the model is released, the developer cannot “take it back” even if the model turns out to have serious flaws, as users will have a copy of the model/weights on their own computers or servers and can keep going back to it. The developer also loses control over how the model is used: if the model were instead a closed one accessed through an API, the user’s access could be restricted remotely by revoking the access key,[41] and the developer could also monitor the kinds of prompts sent to the model and how it is being used.
Part 3: How do we mitigate the risks of open-source* AI models, looking at regulatory approaches to complement RAIL licences and other best practices?
We will first cover how regulators are looking to control the development and use of open-source* AI models. Their position can generally be summed up as: developers still have responsibilities and standards to develop against so that the model remains safe, even if they release the model without charge, and under circumstances where people can freely use and adapt it.
At the outset, it is important to note that not every open-source* AI model will be restricted or regulated. The EU and US are only going to target the most powerful models (in terms of computing power), which pose serious public safety and national security risks. We will explore the positions across Singapore, EU and the US, and then conclude with identifying general takeaways for developers/deployers.
Singapore
The position in Singapore is still nascent and references open-source* models without distinguishing between their capabilities (unlike the EU and US). We earlier discussed the IMDA’s proposed “shared responsibility” approach, where responsibility is allocated based on the level of control over model development. As model developers are “the most knowledgeable about their own models and how they are deployed”, they are expected to lead the ongoing discussions on how responsibilities should be allocated, regardless of whether the model is released open-source, open-weights or closed-source.[42] The IMDA also recommends that people who download open-source or open-weights models should do so “from reputable platforms to minimise the risk of tampered models”.[43]
The Cyber Security Agency of Singapore (CSA) also issued Guidelines on Securing AI Systems in October 2024, recommending that persons downloading AI models should evaluate these open-source models, such as by running code checks or checking against a database of vulnerability information.[44]
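One simple check in this vein (a hypothetical sketch, not a step prescribed by the CSA guidance; the file name and checksum below are placeholders) is to verify that a downloaded model file matches the checksum published by its source before loading it, which guards against tampered or corrupted downloads.

```python
import hashlib

def sha256_of(path, chunk_size=8192):
    """Compute the SHA-256 digest of a file without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Placeholder values: the downloaded weights file and the checksum published
# alongside it by the model's publisher or hosting platform.
downloaded_file = "model-weights.bin"
published_checksum = "..."

if sha256_of(downloaded_file) != published_checksum:
    raise SystemExit("Checksum mismatch: do not load this model file.")
print("Checksum verified.")
```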
USA
The US presently takes a relaxed approach towards open-source* AI models, with little regulation,[45] but the Government will shape future strategy based on real-life issues that surface.
The Executive Order on “Safe, Secure and Trustworthy Development and Use of AI” (October 2023) directed the holding of public consultations to determine the appropriate policy and regulatory approaches for “dual-use foundation models with widely available model weights”.[46] The reference to “dual-use foundation models” means that the EO only targets the most powerful AI models – those containing at least tens of billions of parameters[47] that can perform tasks posing a serious risk to security, national economic security, or national public health or safety.[48]
Following the public consultations, the NTIA recommended[49] that the Government should not restrict the availability of such models for now, such as by prohibiting their distribution or requiring a person to obtain a licence before they can access the model weights. However, it also reserved the Government’s position to restrict certain classes of model weights in the future.[50]
It also recommended that the Government should develop the capacity to evaluate such models for evidence of unacceptable risk, and to respond quickly to such risks. The Government should take an evidence-based approach to uncovering such risks (instead of merely hypothesising), such as by looking at audit reports, issues encountered by model developers, model evaluations and red-teaming results.[51]
EU
The EU AI Act adopts a tiered approach towards open-source* AI models:
- The bulk of open-source* AI models (described in the Act as released under a “free and open-source licence”[52]) – except for general-purpose AI (GPAI) models, discussed further below – are not subject to regulation.[53] Instead, developers are “encouraged to implement widely adopted documentation practices, such as model card and data sheets as a way to accelerate information sharing along the AI value chain”.[54]
Specifically, non-GPAI open-source* AI models will only be regulated if:
- they are high-risk AI systems (where they are subject to the same requirements as closed high-risk counterparts);
- they are banned or prohibited AI systems under Article 5;
- they are AI systems that interact directly with natural persons (e.g. emotion recognition systems) as described in Article 50 – in which case there are transparency obligations (e.g. informing natural persons that they are interacting with an AI system unless it is obvious to a reasonably well-informed person; watermarking of outputs).
- For GPAI models (i.e. AI models trained with a large amount of data, displaying significant generality and capable of performing a wide range of distinct tasks – similar to what the USA calls “foundation models”)[55] – their treatment depends on the following:
- If the model (a) is not monetised (e.g. not offered for a fee, with no charge for technical support); (b) is released under a “free and open-source licence”; and (c) does not “present systemic risks” (characterised by its technical capabilities, where the cumulative amount of computation used for its training is greater than 10^25 FLOPS),[56] the developer only has to provide a summary of the content used for model training and comply with EU copyright law.[57] A GPAI model that does not fulfil criteria (a) or (b) will be required to maintain technical documentation to be provided to the authorities upon request, and to make information about the model available to persons who will integrate it into their own systems.
- However, if the model “presents systemic risks”, then regardless of whether it is monetised and of the type of licence it is released under, it will be subject to the full suite of obligations across Articles 53 to 55, including the need for technical documentation, maintaining cybersecurity protections and reporting any serious incidents to the EU AI Office.
Best practices
From the examples above, we can see that how open-source* AI models will be evaluated and regulated is still a work in progress. However, there are safeguards (from legislation, guidelines and industry practice) that model users and developers can adopt now to ensure that open-source* AI models are developed, released and used safely. Testing and evaluation of the models (both before release, and after modifying them) would be essential regardless of whether the model is open-source*, so these will not be discussed here.
In the case of persons using or building on open-source* models, they should be able to answer the question of whether they know what they are using:
- Select models where the technical details, summaries of training data, intended uses, and performance on evaluations and red-teaming efforts are disclosed, over models where they are not – model cards can help provide this information[58] (a sketch of the kind of information a model card captures appears after this list);
- Download models from reliable sources to minimise the risk of tampered models;[59]
- Read the licence terms carefully to comply with the permitted uses of the model.
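As a rough illustration of the kind of structured disclosure a model card provides (a hypothetical sketch; the field names and values are illustrative and not drawn from any published model card standard), these are the categories a prospective user should expect to find answered before adopting a model.

```python
# Hypothetical model card captured as a plain data structure; every field name
# and value is illustrative only.
model_card = {
    "model_name": "example-llm-7b",
    "developer": "Example Lab",
    "licence": "Example RAIL v1.0 (see use restrictions in Attachment A)",
    "intended_uses": ["summarisation", "drafting assistance"],
    "out_of_scope_uses": ["medical advice", "automated legal decisions"],
    "training_data_summary": "Publicly available web text and licensed corpora; "
                             "see data sheet for provenance.",
    "evaluation_results": {"benchmark_accuracy": 0.78, "toxicity_rate": 0.02},
    "red_teaming": "External red team engaged prior to release; report available.",
    "known_limitations": ["may hallucinate citations", "English-centric"],
}

# A prospective user can quickly flag which disclosures are missing or empty.
missing = [field for field, value in model_card.items() if not value]
print("Fields still undisclosed:", missing or "none")
```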
On the other hand, model developers should consider how much of the model and related information to share:
- Vary the level of access to model weights, for example by releasing them only to “vetted” persons as an alternative to releasing them to everyone;[60]
- Ensure models are tested properly before release to patch vulnerabilities, because once the model and its weights are released the developer cannot “recall” them;[61]
- Provide API access to the model (akin to “dialling in” to the model) instead of releasing it for download, so that the developer can cut off access to the model as needed;[62] and
- Use RAILs to limit downstream use, although effective monitoring and enforcement remains an issue.
The views expressed in this article are the personal views of the author and do not represent the views of Drew & Napier LLC.
Endnotes
1. For example, model developers, application developers (who develop solutions or applications that make use of AI technology), application deployers (who provide the AI solutions to end-users) and cloud service providers (who provide the platform to host the AI application).
2. See the Model Governance Framework for Generative AI (30 May 2024) (“Model Gen-AI Framework”) at page 7.
3. See the Model Gen-AI Framework at page 7.
4. See the Model Gen-AI Framework at footnote 11.
5. As defined in the Model Gen-AI Framework at footnotes 9 and 10.
6. The definitions are based on paragraphs 3.20 and 3.21 of Singapore’s Model AI Governance Framework, as well as the definition from IBM at https://www.ibm.com/topics/ai-model. IBM differentiates that “algorithms are procedures, often described in mathematical language or pseudocode, to be applied to a dataset to achieve a certain function or purpose” and “models are the output of an algorithm that has been applied to a dataset”.
7. https://www.techtarget.com/whatis/definition/algorithm
8. https://opensource.com/resources/what-open-source
9. https://opensource.com/resources/what-open-source
10. See the definition at https://opensource.org/osd (last modified on 16 February 2024).
11. See the Model Gen-AI Framework at page 13.
12. See page 11 of “Open Sourcing Highly Capable Foundation Models”, Centre for the Governance of AI (“CGAI Report”), accessible at https://cdn.governance.ai/Open-Sourcing_Highly_Capable_Foundation_Models_2023_GovAI.pdf, referencing training code, model weights and training data. See also page 3 of CNIL, “Open source practices in artificial intelligence” (July 2024) (“CNIL Report”), accessible at: https://www.cnil.fr/sites/cnil/files/2024-07/in-depth_analysis_open_source_practices_in_artificial_intelligence.pdf, which states: “Openness in AI generally does not refer to the publication of the source code related to the use or development of a model, although this may be part of it, but rather to the publication of the model and the weights, or parameters, that constitute it.”
13. See the definitions on page 11 of the CGAI Report.
14. The Verge, “Open-source AI must reveal its training data, per new OSI definition” (29 October 2024), accessible at: https://www.theverge.com/2024/10/28/24281820/open-source-initiative-definition-artificial-intelligence-meta-llama
15. See the interview with OSI’s Executive Director Stefano Maffulli at TechBrew, “The divide over open-source AI, explained” (18 June 2024), accessible at: https://www.emergingtechbrew.com/stories/2024/06/18/what-is-open-source-ai
16. See page 8 of the NTIA Report on “Dual-Use Foundation Models with Widely Available Model Weights” (July 2024) (“NTIA Report”), which explains AI model weights simply: “An AI model processes input – such as a user prompt – into a corresponding output, and the contents of that output are determined by a series of numerical parameters that make up the model, known as the model’s weights. The values of these weights, and therefore the behaviour of the model, are determined by training the model with numerous examples. The weights represent numerical values that the model has learned during training to achieve an objective specified by the developers.”
17. https://opensource.org/ai/open-source-ai-definition
18. https://opensource.org/ai/open-source-ai-definition
19. The OSI has made it clear that some training data can be excluded due to privacy and copyright issues. Otherwise, if unfettered access to all training data were required, open-source AI would be reduced to a very small niche of AI trained only on open public data – see https://hackmd.io/@opensourceinitiative/osaid-faq
20. See commentary at https://www.juliaferraioli.com/blog/2024/on-open-source-ai/
21. See commentary at https://www.juliaferraioli.com/blog/2024/on-open-source-ai/
22. Page 11 of the CNIL Report.
23. See, for example, the BigScience RAIL Licence v1.0, available at https://huggingface.co/spaces/bigscience/license
24. It is not compulsory to use RAILs for AI models – there are also models licensed under traditional open-source licences like MIT and Apache. See the analysis in McDuff et al., “On the Standardization of Behavioral Use Clauses and Their Adoption for Responsible Licensing of AI”, accessible at https://arxiv.org/pdf/2402.05979.
25. See paragraph (l) in Attachment A to the BigScience RAIL Licence v1.0, in contrast to the responsible AI licence for GRID, available at https://github.com/ScaledFoundations/GRID-playground/blob/main/LICENSE
26. See, for example, https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE
27. Especially since the scope of the fair use and text and data mining exceptions is still pending before the courts.
28. See the comments by Meta spokesperson Faith Eischen, as reported in The Verge, “Open-source AI must reveal its training data, per new OSI definition” (29 October 2024), accessible at: https://www.theverge.com/2024/10/28/24281820/open-source-initiative-definition-artificial-intelligence-meta-llama
29. Interview with GitHub’s Chief Legal Officer, “How big new AI regulatory pushes could affect open source” (17 September 2024), accessible at: https://www.emergingtechbrew.com/stories/2024/09/16/ai-regulations-open-source-development-shelley-mckinley-github. See also the view expressed by MIT Technology Review in “We finally have a definition for open-source AI” (22 August 2024), available at: https://www.technologyreview.com/2024/08/22/1097224/we-finally-have-a-definition-for-open-source-ai/, where they said “(d)escribing models as open source may cause them to be perceived as more trustworthy, even if researchers aren’t able to independently investigate whether they really are open source.”
30. As described by the CNIL (see page 2 of the CNIL Report).
31. See Recital 102 of the EU AI Act.
32. Even in the context of general-purpose AI models, Recital 102 does not reference the extent of training data required – “General-purpose AI models released under free and open-source licences should be considered to ensure high levels of transparency and openness if their parameters, including the weights, the information on the model architecture, and the information on model usage are made publicly available.”
33. See the Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence, issued on 30 October 2023.
34. David Evan Harris, “How to Regulate Unsecured ‘Open-Source’ AI: No Exemptions”, accessible at: https://www.techpolicy.press/how-to-regulate-unsecured-opensource-ai-no-exemptions/
35. David Evan Harris, “Not Open and Shut: How to Regulate Unsecured AI” (6 September 2024), accessible at: https://www.cigionline.org/articles/not-open-and-shut-how-to-regulate-unsecured-ai/
36. “Challenges and limits of an open source approach to Artificial Intelligence” at page 13, accessible at https://www.europarl.europa.eu/RegData/etudes/STUD/2021/662908/IPOL_STU(2021)662908_EN.pdf.
37. Such vulnerabilities are also found in proprietary AI models, as well as in other open-source software.
38. Page 14 of the NTIA Report.
39. Page 12 of the CGAI Report.
40. See the anecdotal example given at page 12 of the CGAI Report.
41. See the FAQs at https://bigscience.huggingface.co/blog/the-bigscience-rail-license
42. Page 8 of the Model Gen-AI Framework.
43. Page 8 of the Model Gen-AI Framework.
44. See paragraph 2.1.9 of the CSA Companion Guide on Securing AI Systems.
45. Nicole Kobie, “Open-source AI just got a major seal of approval from US regulators – but will it push developers in the right direction?” (31 July 2024), accessible at: https://www.itpro.com/software/open-source/open-source-ai-just-got-a-major-seal-of-approval-from-us-regulators-but-will-it-push-developers-in-the-right-direction
46. See section 4.6 of the Executive Order.
47. Parameters are variables the model learns during training and uses to make predictions and decisions – see https://tedai-sanfrancisco.ted.com/glossary/parameters/.
48. See the definition of “dual-use foundation model” in section 3(k) of the Executive Order. Separately, there are also reporting requirements for models trained on a quantity of computing power greater than 10^26 FLOPS, which can include dual-use foundation models (see section 4.2 of the Executive Order).
49. See the “Dual-Use Foundation Models with Widely Available Model Weights” report (NTIA Report) issued in July 2024.
50. Pages 36 and 40 of the NTIA Report.
51. Pages 37 and 40 of the NTIA Report.
52. See Recital 102 of the EU AI Act, which defines a “free and open-source” licence as one that allows the model “to be openly shared and where users can freely access, use, modify and redistribute them or modified versions thereof”. It broadly matches the OSI definition of “open source AI” in terms of freely accessing, using, modifying and redistributing, but unlike the OSI definition it is silent on whether training data has to be shared for the model to be considered “open-source”.
53. See Article 2(12) of the EU AI Act.
54. See Recital 89 of the EU AI Act.
55. As defined in Article 3(63) of the EU AI Act.
56. See Article 51(2) of the EU AI Act.
57. See Article 53(2) of the EU AI Act. Notably, the EU AI Act requires compliance with copyright law but does not specify how to comply, so it remains an open question whether fair use and text and data mining exceptions apply to scraping data from the Internet to train an AI model.
58. Page 9 of the NTIA Report. See also Anokhy Desai, “5 things to know about AI model cards”, accessible at: https://iapp.org/news/a/5-things-to-know-about-ai-model-cards. See also a sample model card at https://ai.google.dev/gemma/docs/model_card_2, which describes how the model performs on evaluations, the risks/limitations of the model and a general overview of the types of data it was trained on.
59. Page 8 of the Model Gen-AI Framework.
60. Page 9 of the NTIA Report, and page 25 of the CGAI Report on gated download access.
61. Page 19 of the CGAI Report.
62. Pages 19 and 32 of the CGAI Report.