Generative AI

An Evaluation of the Current Solutions to Address the Intellectual Property Challenges it Generates

The risks of generative AI – such as copyright infringement issues, privacy issues and hallucinations – are commonly known and cannot be overstated. But the technology and research (and the ingenuity of humans to solve problems) move quickly. This article looks at what is happening now to address the intellectual property risks and uncertainties posed by generative AI – such as the introduction of watermarking, and indemnities offered by generative AI companies to their users for any claims of copyright infringement – and, more importantly, asks whether these measures are enough.

Approximately a year ago, ChatGPT burst onto the scene, introducing millions of people to generative artificial intelligence (AI).1 Anyone with an Internet connection could avail himself or herself of AI text generators or text-to-image generators, and create content more efficiently. However, together with the multitude of benefits generative AI brings, it also comes with a bundle of risks.

The risks of generative AI, and the measures to mitigate them, have been set out in IMDA’s June 2023 discussion paper on “Generative AI: Implications for Trust and Governance”.2 This article will focus on the intellectual property risks and challenges (where the legal position in Singapore and around the world is still in flux), and the methods to mitigate them, covering those in the paper as well as new developments beyond it. It is not a comprehensive statement of every solution, as trying to manage the risks of any technology is a constant work-in-progress, but aims to evaluate the key and commonly-discussed ones.

Issue 1: Copyright Infringement Where a Third-party’s Copyrighted Works are Used to Train Generative AI Models Without Their Consent

This has been the subject of many ongoing lawsuits against companies such as OpenAI, Microsoft, Meta, and Stability AI, and we will unpack the core issues at stake.

Generally, authors commence copyright infringement lawsuits when they discover that their works have been used to train generative AI models (without their permission). But how do the authors know that their works have been used in training? The ongoing lawsuits show that it could be because:3

  1. the generative AI system produces an accurate summary of their work;
  2. their work is listed as part of a dataset which is alleged to have been used in training the model; and/or
  3. the model’s output is similar to or reproduces portions of their existing works.4

However, the AI companies have responded saying that the issue to be litigated should actually be whether the output of the model is substantially similar to their works, and not whether their works are used to train the model.5 They submit, first, that it cannot be the case that every output from ChatGPT is an infringing derivative of any author’s work – for example, a paragraph describing Homer’s Iliad could not possibly have been extracted from an American’s autobiography.6 Second, the works are used to train large language models (LLMs) to recognise patterns in language and “statistically model language”,7 akin to a “student” learning how to write an essay, rather than “copying the training data like a scribe in a monastery”.8

Until we have a definitive court ruling (or legislative amendment), it remains an open question in Singapore and around the world whether defences like fair use or the computational data analysis exemption apply to the use of copyrighted works for training. This article will not repeat what has been ably discussed by other authors,9 and will instead focus on the measures the industry is taking to deal with the uncertainty so that it can carry on with business.

What are the Current Solutions?

#1: Governments try to legislate, to different degrees

Many countries are enacting legislation in relation to generative AI, but the legislation does not fundamentally change the rules of copyright as we know them. The approaches can be sorted into three broad categories:10

  1. Cannot use training data that infringes IP rights
    1. China’s Interim Measures for the Management of Generative Artificial Intelligence Services indicate that providers of generative AI services must carry out the training of the models in accordance with law, and must not infringe the intellectual property rights of third parties (Article 7).11 Further guidance on how to avoid IP infringement12 is given in the consultation draft on Basic Security Requirements for Generative Artificial Intelligence Service (published in October 2023), to the effect that parties should have an IP management strategy, scrutinise all training data carefully, and avoid using materials with a “problem” with their copyright for training. However, no position is taken on the very same questions before the American courts (such as whether the use of copyrighted materials for training is “fair use”).13
  2. Disclose the training datasets
    1. The European Parliament, in its proposed amendments to the EU Artificial Intelligence Act, required providers of generative AI systems to “document and make publicly available a sufficiently detailed summary of the use of training data protected under copyright law”.14 This does not change any copyright laws, but in fact ensures that they can apply (as copyright holders will then be aware of whether their works have been used, and can decide whether to bring a claim if they feel their rights have been violated). More guidance is pending as to how much detail is necessary in the summaries.
    2. Singapore is also advocating for transparency in the training datasets (including the disclosure of copyrighted material in the training data), so that persons are aware of what is input into the model, to address issues of privacy, copyright and bias.15
  3. Non-legislative Code of Practice … but if AI companies and copyright holders cannot agree, will enact legislation to allocate rights
    1. The UK Intellectual Property Office is creating a code of practice on copyright and AI to make licences for data mining more available.16 It will support generative AI companies in accessing copyrighted material to train their models, while ensuring there are safeguards to support rights holders of copyrighted work (e.g. labelling of generated output), such that an AI company that commits to the code of practice can expect to have a reasonable licence offered by a rights holder in return.17 However, the IPO has also indicated that it will introduce legislation if the code of practice is not adopted or agreement is not reached.18

#2: Some companies are developing AI systems trained on entirely “legitimate” data

An example of this is Adobe’s Firefly, an AI-powered text-to-image generator, where all its training data is properly licensed and can be legally used (as opposed to being trained on any kind of content from the Internet):

  1. Stock images which Adobe holds the rights to;
  2. Openly-licensed content;
  3. Content in the public domain where copyright has expired.19

This is a very laudable effort. However, it only goes towards the legitimacy of the training data, and does not guarantee that the output will not contain similar elements to the training data. Therefore, companies are also indemnifying their users against third-party copyright infringement claims on the generated output (which will be discussed in Issue 2).

#3: Payment to content-creators whose works are used to train generative AI, and technological tools they can harness to stop their works from being used without consent

Companies have proposed payments to content creators whose works are used to train generative AI.20 For example, content creators will receive payment based on the number of images they submitted and “the number of licences that those images generated [in a 12-month period]”, with the bonus weighted towards the licences.21 There is limited information available online on how the licences are created/calculated, but it appears that they are related to the number of times the image is “downloaded”22 (presumably to be added to a training dataset).
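
By way of illustration only, a bonus weighted towards licences might be computed as in the sketch below. Adobe has not published its actual formula, so the rates and structure here are hypothetical assumptions, not the real scheme.

```python
# Hypothetical sketch of a contributor bonus weighted towards licences.
# Adobe's real formula is not public; these rates are illustrative only.

def contributor_bonus(images_submitted: int, licences_generated: int,
                      rate_per_image: float = 0.10,
                      rate_per_licence: float = 0.90) -> float:
    """One-off bonus from images submitted and licences generated,
    with the licence component weighted more heavily."""
    return images_submitted * rate_per_image + licences_generated * rate_per_licence

# A contributor with 500 approved images that generated 2,000 licences
# earns far more from the licence component (1,800) than from images (50).
print(contributor_bonus(500, 2000))  # 1850.0
```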

As such, the author wonders whether, in the future, the content creator could receive payment each time the AI system generates output based on their work used in the training. Granted, this may be easier to do for images, because images are tagged with text descriptions when they are used to train text-to-image generators.23 Therefore, what if every time a user’s text prompt contains elements of the description tied to the content creator’s image (especially if the content creator is specifically named in the prompt: “do [image] in [person’s] style”), the content creator obtains a cut?

There are also (self-help) means available for artists to stop their content from being used to train AI without their consent.

The first are technological solutions that can be applied directly to their artworks – e.g. by using tools that change the pixels of an image in a way the human eye cannot detect, such that it will “cloak” the image to prevent AI models from copying its style,24 or “trick” the AI model into thinking the image is something other than what it actually is (e.g. recognising an image of a car as a cow).25 These may be more effective than technological solutions that are detached from the artwork (e.g. HTML tags,26 or simply Terms of Use pages on the sites where the content is hosted). However, the ethics of using such data poisoning tools is being debated, as the use cases vary from deterring companies from using copyrighted works without permission (“use at own peril!!”), to “actively trying to ruin a model”.27
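
The “noai”/“noimageai” directives are real meta tags (see endnote 26); the crawler logic in the sketch below is the author’s illustrative assumption of how a well-behaved scraper might check them before using a page’s content. As noted above, compliance is entirely voluntary.

```python
# Minimal sketch of a crawler honouring DeviantArt-style "noai"/"noimageai"
# meta tags. The tag names are real; the surrounding logic is illustrative.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() == "robots":
            content = (attrs.get("content") or "").lower()
            self.directives.update(d.strip() for d in content.split(","))

def may_train_on(html: str) -> bool:
    parser = RobotsMetaParser()
    parser.feed(html)
    # Compliance is voluntary: a crawler can simply ignore these directives.
    return not {"noai", "noimageai"} & parser.directives

page = '<html><head><meta name="robots" content="noai, noimageai"></head></html>'
print(may_train_on(page))  # False: the artist has opted out
```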

#4: But what can be done for the works that have already been used to train generative AI models?

At present, there are already millions and millions of photos, text, etc. that have been used to train generative AI without the consent of the copyright owners. Whether compensation is due to such persons depends in part on the outcome of the ongoing lawsuits (e.g. was it fair use), but the scale of the problem should not hinder compensation. Some are optimistic that licensing and payments can be done at scale, much as the music scene evolved from illegal file-sharing programs to Spotify and iTunes, where people can legitimately download music.28

But apart from payment, can artists request that their works used to train models without their consent be removed? Outside of whether the AI company is agreeable to doing so (or has been directed by the court or regulators to do so), this is a technical issue. The short answer is that it is not straightforward to “remove” the work that the model has already been trained on. Training data influences model weights, and the closest analogy one can use is that of trying to identify and remove a chicken wing that has been blended with other parts (and other chickens) into a chicken nugget.

Practically, retraining the model from scratch without that work will come with high time, computing power and electricity costs, which companies want to avoid as much as possible. Research into “machine unlearning” to selectively remove data points without full retraining is still ongoing, and it is very much model-dependent.29
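
To make the “chicken nugget” analogy concrete, the toy sketch below (a conceptual illustration, not any production unlearning method) fits a simple model and shows that exactly removing one work’s influence means refitting on everything else – every weight shifts, because each work’s contribution is blended into all of them.

```python
# Conceptual sketch: once a model is trained, an individual work's
# contribution is blended into every weight, so the only exact removal
# is retraining without it.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))          # 1,000 "works" as feature vectors
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=1000)

def train(X, y):
    # Least-squares fit: every training example influences every weight.
    return np.linalg.lstsq(X, y, rcond=None)[0]

w_full = train(X, y)

# "Removing" work #42 exactly means retraining on everything else —
# there is no single weight you can delete that corresponds to that work.
mask = np.ones(len(X), dtype=bool)
mask[42] = False
w_unlearned = train(X[mask], y[mask])

print(np.abs(w_full - w_unlearned).max())  # small but nonzero shift in all weights
```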

Issue 2: Copyright Infringement Where the Generated Output is Substantially Similar to an Existing Work

Even if training data were all legitimately obtained, the generated output is still something that worries users who wish to use it publicly, because they would not know how similar the generated image or text is to thousands of existing images or text they have never seen before. However, if the user inputs a prompt requesting an image of something the user knows is copyrighted, like Mickey Mouse, whether the user is liable for copyright infringement will depend on factors such as the purpose of such creation – e.g. is the new image a satire? Has the original work been visibly transformed or recontextualised?30

Research has also shown that on the text-generation front, LLMs sometimes reproduce/regurgitate their training data,31 increasing the likelihood of inadvertent copyright infringement by users. The same has also been seen in text-to-image generators.32 To be clear, this is not what they are designed to do, and research is ongoing to try to minimise such instances.
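
As a rough illustration of how such regurgitation might be flagged, the sketch below checks whether a generated text shares any long verbatim word sequence with a known corpus. This is an illustrative sketch, not any provider’s actual filter, and the eight-word threshold is an arbitrary choice.

```python
# Minimal sketch: flag possible regurgitation by checking whether a
# generated text shares any long verbatim n-gram with a known corpus.
# The threshold (8 words) is an arbitrary illustrative choice.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shares_verbatim_span(output: str, corpus_docs: list[str], n: int = 8) -> bool:
    out_grams = ngrams(output, n)
    return any(out_grams & ngrams(doc, n) for doc in corpus_docs)

corpus = ["the quick brown fox jumps over the lazy dog every single morning"]
output = "he wrote that the quick brown fox jumps over the lazy dog every day"
print(shares_verbatim_span(output, corpus))  # True: an 8-word span matches
```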

The issue is thus – does liability for copyright infringement even arise, and if yes, who should bear liability? On the part of the user – if he/she had come up with the image or text on his/her own, without using generative AI, and if it happened to be substantially similar to an existing image or text, the defence of independent creation could succeed. However, because the user has used a generative AI program, which can colloquially be said to have “accessed” the existing work in question if it was used in training that AI program, that defence may no longer apply.33

What are the Current Solutions?

#1: Indemnities

Because of the ongoing lawsuits, which can (understandably) make users of generative AI services nervous, companies have recently taken to indemnifying customers who use their products against challenges on copyright grounds – such as if there is a claim that the training data used by the company to create the generative AI model infringes a third-party’s IP right, or if the generated output created by the customer using the AI model infringes a third-party’s IP right.34

However, the indemnities are not blanket ones and will generally require the customer to –

  1. use (i.e. not disable) the guardrails and content filters built into the product;35
  2. not deliberately create or use generated output that the customer knows/should have known is infringing36 – for example, if the customer prompts the generative AI to “directly copy” the work of a particular author;37 and
  3. stop using the generated output if the customer receives notice of an infringement claim from the rights holder.38

There are also some companies that want to review each generated output against their content policies before providing indemnity.39

The indemnity will also not cover additional content (e.g. a picture of Mickey Mouse) added by the customer to the output, outside of the AI-generated content.40

However, it remains to be seen how the indemnity for customers for the AI provider’s use of copyrighted training data to train the generative AI model will play out, for two reasons:

  1. Unless a third-party rights-holder is aware of the datasets used in training the generative AI model, it would be difficult for them to bring a claim that the training of the model infringed their copyright41 (and if they knew, it would be more logical for them to bring a claim against the AI provider instead of its customer). It is likely that this indemnity will only be triggered if the customer’s generated output is materially similar to the third-party rights-holder’s work, where the third-party rights-holder then brings both copyright challenges (over the training data and the output).
  2. How will indemnities operate (if at all) if, instead of using the AI service as-is (i.e. only as trained by the AI provider), the customer decides to further “train”42 or “customise”43 the AI service on the customer’s own data, with the AI provider’s consent? (A minimal sketch of the “customise” route, known as Retrieval-Augmented Generation, follows this list.) As the legitimacy of the materials the customer introduces to the AI model is no longer in the AI provider’s control, if copyright infringement in the use of training data is alleged, would it be necessary (or even possible) to carry out a fact-finding exercise as to “who introduced the copyrighted data – customer or AI provider” – to see if the indemnity can be invoked? To protect themselves, customers would have to keep very clear records of what data they have introduced to the AI model.
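
For a sense of why the “customise” (RAG) route muddies the indemnity analysis, here is a minimal sketch. The `call_llm` parameter is a hypothetical stand-in for any model API, and the keyword-overlap retriever is deliberately naive; the point is only that the customer’s own documents are appended to the prompt at query time, without retraining the model.

```python
# Minimal RAG sketch. `call_llm` is a hypothetical stand-in for any model
# API; the keyword-overlap retriever is deliberately naive. The customer's
# own documents are appended to the prompt at query time — the model itself
# is never retrained.

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    query_words = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(query_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def rag_answer(query: str, documents: list[str], call_llm) -> str:
    context = "\n".join(retrieve(query, documents))
    augmented_prompt = f"Using only this context:\n{context}\n\nAnswer: {query}"
    return call_llm(augmented_prompt)

# If `documents` contains infringing material, it entered via the customer,
# not via the provider's training data — hence the fact-finding problem.
docs = ["internal memo about project x", "press release about product y"]
print(rag_answer("what does the memo say about project x?", docs,
                 call_llm=lambda prompt: prompt[:60] + "..."))  # stub LLM
```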

It is also important to note that the current indemnities are generally for paid services and not free ones,44 and it remains to be seen whether smaller players in the AI scene also have the financial resources to offer such indemnities.

#2: Building guardrails to filter content

Some companies have designed their generative AI systems to decline requests to generate images “in the style of a living artist”.45

#3: Edit the output, put your own spin on it …

This is the conventional wisdom for minimising the risk that, if you use the generated output publicly, it will be substantially similar to a third-party’s existing copyrighted work and attract a copyright infringement claim.

However, this is only useful in relation to copyright infringement. It still leaves open the question of whether there is copyright in the generated output, and who owns it (or whether you only have copyright over the edits you made). This brings us to Issue 3.

Issue 3: Copyright Over the Output Generated by the AI System

Around the world, the position is still very much that a human author is required for copyright protection. However, the debate centres on just how much human input and control over the form/content of the output will be sufficient – whether the AI system is being used as an assistive tool, or is actually the (commissioned) artist.46 Determining if there is any copyright (and if so, who holds it) is an important issue if people wish to monetise their works.

This case is illustrative: Jason Allen had generated an image using Midjourney, a text-to-image AI service.47 He then made adjustments/edits to it in Photoshop. However, when he sought to apply for copyright over the edited (prize-winning48) image (the Work), it was denied (and his subsequent appeal was also denied).

The US Copyright Review Board held that the Work had more than a de minimis amount of content generated by AI, so that AI-generated content had to be disclaimed in the application for registration. It was open to Mr Allen to claim copyright over the modifications he had made on their own,49 but because he was unwilling to let go of the underlying AI-generated material, the entire Work could not be registered.

But does the amount of effort one puts into prompting the AI matter? To try to claim copyright over the AI-generated content, Mr Allen submitted that he “input numerous revisions and text prompts at least 624 times to arrive at the initial version of the image”, which he then edited to remove flaws, create new visual content, increase the resolution and size, etc. However, his effort did not result in the Board granting him copyright over the AI-generated content.

How the AI system in question generates content was key to the Board’s decision. The Board noted that an AI system does not carry out instructions to create a specific expressive result, because it does not understand instructions the way humans do50 (there could be thousands of ways it could translate the prompt into an image, almost in a way that is “random”, so you cannot predict what the generated image will look like) – which is why Mr Allen needed hundreds of iterations of prompts and images before he finally found four that were satisfactory and captured what he wanted. In other words, the style, and other elements of authorship, were determined by the AI system and not by Mr Allen.

What are the Current Solutions?

#1: Watermarking of AI-generated content

“Watermarking” generally means putting a marker on AI-generated content – where it could be visible to the human eye, or only detectable by a computer program (invisible watermarks) – to distinguish it from human-generated content. To be frank, this solution does not directly solve any of the three issues discussed in this article. Nevertheless, it is worth mentioning as many countries51 are advocating it as a means for the user to be aware that they are viewing AI-generated content.
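
As a purely illustrative example of the invisible variety, the sketch below hides a marker in the least significant bit of pixel values – the textbook LSB technique. Real schemes such as Google DeepMind’s SynthID are far more robust (this naive version survives neither compression nor resizing), but it shows the basic idea of a mark the eye cannot see but a detector can recover.

```python
# Naive least-significant-bit (LSB) watermark, for illustration only.
# Production schemes (e.g. SynthID) are far more robust than this.

def embed(pixels: list[int], bits: list[int]) -> list[int]:
    # Overwrite the lowest bit of each pixel value with a watermark bit;
    # each value changes by at most 1, which the eye cannot perceive.
    return [(p & ~1) | b for p, b in zip(pixels, bits)]

def extract(pixels: list[int], n: int) -> list[int]:
    # A detector simply reads the lowest bits back out.
    return [p & 1 for p in pixels[:n]]

signature = [1, 0, 1, 1, 0, 1, 0, 0]           # marker: "AI-generated"
image = [200, 201, 199, 54, 53, 55, 120, 121]  # eight greyscale pixel values

marked = embed(image, signature)
print(marked)                           # [201, 200, 199, 55, 52, 55, 120, 120]
print(extract(marked, 8) == signature)  # True: the mark is recoverable
```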

However, there is no consensus yet as to the technical standards to apply to watermarks, such as:

  1. what visible watermarks must look like (e.g. dimensions, content) – where it is important to strike the right balance between people being able to see it for it to achieve its purpose (and it not being easily cropped out), and aesthetics;52
  2. ensuring that watermarks are not easily tampered with, removed or modified – e.g. that invisible watermarks continue to remain detectable even if the image is modified;53
  3. what data must be included in invisible watermarks (e.g. the service/model that created the content, but not information that could identify the user of the service/model?);54
  4. how invisible watermarks can be detected by companies as a whole, and not only by the few who have the technology to do so;55
  5. what types of AI-generated content they will apply to – text, images, audio?56

There are also limitations to watermarking, as watermarks say nothing about the veracity or desirability of the content. It is thus important to also educate people to think critically about the material they consume. There is also the practical issue of whether watermarking can be rolled out to all companies with generative AI tools all over the world, with a consensus on the technical standards to be applied; otherwise users could simply choose to generate content with providers that do not apply watermarking.

#2: Stating in terms of use/contract who owns the copyright

Some companies57For example, OpenAI and Google. OpenAI: https://help.openai.com/en/articles/5008634-will-openai-claim-copyright-over-what-outputs-i-generate-with-the-api. Google: https://cloud.google.com/terms/service-terms under “Service Terms”, para 17(a) – “As between Customer and Google, Google does not assert any ownership rights in any new intellectual property created in the Generated Output.” have publicly stated that they will not claim copyright over content generated using their generative AI services by the user. OpenAI’s terms of use also indicate that it “assigns to [the user] all its right, title and interest in and to the Output”58https://openai.com/policies/mar-2023-terms at para 3(a) (accessed 9 November 2023). As at 14 November 2023, OpenAI has since updated its terms of use (to be effective on 14 December 2023), inserting an “if any” qualifier: “We hereby assign to you all our right, title and interest, if any, in and to Output” (available at: https://openai.com/policies/terms-of-use). if the user complies with the Terms of Service.

However, whether copyright actually exists in the generated content is still subject to any judicial decisions or positions taken by the copyright authorities.59 If it is held that OpenAI holds copyright in the generated content, having it automatically assigned to the user definitely provides assurance to the user. However, if it is held that there is no copyright in the content generated, then there is no loss to either party.

#3: Legislative amendments?

Will there be a renewed interest in adopting section 9(3) of the UK’s Copyright, Designs and Patents Act 1988, where the author of a computer-generated work is taken to be the person “by whom the arrangements necessary for the creation of the work are undertaken”, which would then make the person who entered the prompt the author?60 It may be worth considering, in order to give protection to works generated by AI (but not works wholly autonomously generated by AI without any human prompting),61 so that people are encouraged to create, instead of an approach where they only have a sliver of copyright over the modifications made and not the underlying image/text.

Conclusion

The solutions to the IP issues posed by generative AI can be found both in technology and in what the law will recognise. How the solutions will play out in practice remains to be seen. However, it is good that society is forging ahead to harness the technology (rather than waiting until all the risks and legal uncertainties are ironed out), tweaking the solutions as the risks arise, because no product or service is 100% risk-free.

The views expressed in this article are the personal views of the author and do not represent the views of Drew & Napier LLC.

Endnotes
1 Generative AI creates new content from the existing material that it is trained on.
2 The discussion paper is accessible at https://aiverifyfoundation.sg/downloads/Discussion_Paper.pdf
3 See, for example, the lawsuit filed in July 2023 in the United States District Court in the Northern District of California by a trio of writers – Sarah Silverman, Christopher Golden and Richard Kadrey – against OpenAI. The plaintiffs’ complaint can be read at https://llmlitigation.com/pdf/03416/silverman-openai-complaint.pdf.
4 See the lawsuit filed against Github, Microsoft and OpenAI in relation to GitHub Copilot, an AI tool that can generate code, reported at https://www.theverge.com/2022/11/8/23446821/microsoft-openai-github-copilot-class-action-lawsuit-ai-copyright-violation-training-data.
5 https://www.reuters.com/legal/litigation/meta-tells-court-ai-software-does-not-violate-author-copyrights-2023-09-19/
6 See OpenAI’s argument, reported at https://www.theregister.com/2023/08/31/openai_class_action_fair_use/. It appears that this argument is gaining traction, because in an ongoing copyright infringement claim against Stability AI by 3 artists alleging that Stability AI used copyrighted artwork without permission to train the AI tools, Federal District Judge William Orrick (in October 2023) expressed doubt over arguments that all AI generated outputs are infringing derivatives of the images they were trained on, saying “even if plaintiffs narrow their allegations to limit them to output images that draw upon training images based upon copyrighted images, I am not convinced that copyright claims based (on) a derivative theory can survive absent ‘substantial similarity’ type allegations.” (reported at https://www.theregister.com/2023/10/31/judge_copyright_stabilityai_deviantart_midjourney/?td=keepreading).
7 As described by Meta’s attorneys in their motion to dismiss the case brought by Sarah Silverman and other authors – “Copyright law does not protect facts or the syntactical, structural, and linguistic information that may have been extracted from books like Plaintiffs’ during training. Use of texts to train LLaMA to statistically model language and generate original expression is transformative by nature and quintessential fair use.”, available at https://fingfx.thomsonreuters.com/gfx/legaldocs/dwpkakjdxpm/META%20OPENAI%20SILVERMAN%20INFRINGEMENT%20metamtd.pdf
8 See the testimony (before the U.S. Senate Committee on the Judiciary Subcommittee on Intellectual Property) of Matthew Sag, Professor of Law in Artificial Intelligence, Machine Learning, and Data Science, Emory University School of Law on 12 June 2023, at page 4. Available at: https://www.judiciary.senate.gov/imo/media/doc/2023-07-12_pm_-_testimony_-_sag.pdf
9 The author recommends reading a very comprehensive article by Professor David Tan in the April 2023 Law Gazette: https://lawgazette.com.sg/feature/the-best-things-in-life-are-not-for-free-copyright-and-generative-ai-learning/
10 Please note that this is only a sampling, and not a comprehensive statement of every country’s activities.
11 A translation is available at https://www.chinalawtranslate.com/en/comparison-chart-of-current-vs-draft-rules-for-generative-ai/
12 As reported in https://www.reuters.com/technology/china-proposes-blacklist-sources-used-train-generative-ai-models-2023-10-12/
13 This is the author’s own assessment after running the draft rules through Google Translate – the draft rules (only available in Chinese) can be accessed at: https://www.tc260.org.cn/upload/2023-10-11/1697008495851003865.pdf
14 See Amendment 399, available at https://www.europarl.europa.eu/doceo/document/TA-9-2023-0236_EN.html
15 See pages 16, 21 and 22 of IMDA’s Discussion Paper on “Generative AI: Implications for Trust and Governance” published on 6 June 2023.
16 https://www.gov.uk/guidance/the-governments-code-of-practice-on-copyright-and-ai
17 See paragraph 5 of HM Government Response to Sir Patrick Vallance’s Pro Innovation Regulation of Technologies Review (published March 2023), available at https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1142798/HMG_response_to_SPV_Digital_Tech_final.pdf
18 See paragraph 6 of HM Government Response to Sir Patrick Vallance’s Pro Innovation Regulation of Technologies Review (published March 2023).
19 https://www.adobe.com/sg/sensei/generative-ai/firefly.html#faqs
20 https://www.straitstimes.com/business/adobe-to-pay-ai-content-creators-in-groundbreaking-move
21 See the Adobe FAQ available at https://helpx.adobe.com/sg/stock/contributor/help/firefly-faq-for-adobe-stock-contributors.html
22 See the Adobe FAQ available at https://helpx.adobe.com/sg/stock/contributor/help/firefly-faq-for-adobe-stock-contributors.html, which describes the bonus compensation to be based on “eligible online images and their downloads”, and subsequent bonuses to be “based on new approved images and downloads”.
23 It is less likely to be able to do this for text, given how books, articles, etc. are used to train LLMs to recognise patterns across words and are not tagged the way images are.
24 https://glaze.cs.uchicago.edu/faq.html
25 https://www.technologyreview.com/2023/10/24/1082247/this-new-tool-could-give-artists-an-edge-over-ai/
26 DeviantArt created a new form of protection, with “noai” (AI cannot use anything on the page) and “noimageai” (AI cannot use any images on the page) directives. These “noai” and “noimageai” meta tags will be placed in the HTML page associated with the art. Web crawlers are able to read the tags and recognise that the person does not want their content used to train AI. However, the web crawler can still choose to ignore it. The above explanation was based on this article which reports on DeviantArt’s new protection: https://techcrunch.com/2022/11/11/deviantart-provides-a-way-for-artists-to-opt-out-of-ai-art-generators/, and this article which very clearly explains how it works: https://www.aimeecozza.com/noai-noimageai-meta-tag-how-to-install/
27 See the interview with Braden Hancock, co-founder of Snorkel AI that develops LLMs, available at https://www.computerworld.com/article/3709609/data-poisoning-anti-ai-theft-tools-emerge-but-are-they-ethical.html. Braden is of the view that “there are unethical uses of (technological defences) – for example, if you’re trying to poison self-driving car data that helps them recognize stop signs and speed limit signs (…) if your goal is more towards ‘don’t scrape me’ and not actively trying to ruin a model, I think that’s where the line is for me.”
28 https://www.theverge.com/23444685/generative-ai-copyright-infringement-legal-fair-use-training-data, interviewing Matthew Butterick, a lawyer representing plaintiffs in lawsuits against companies for scraping data to train AI models.
29 Jie Xu et al., “Machine Unlearning: Solutions and Challenges”, available at https://arxiv.org/pdf/2308.07061.pdf, and Thanveer Shaik et al., “Exploring the Landscape of Machine Unlearning: A Comprehensive Survey and Taxonomy”, available at: https://arxiv.org/pdf/2305.06360.pdf.
30 See paragraphs 77 – 81 of Global Yellow Pages Ltd v Promedia Directories Pte Ltd and another matter [2017] SGCA 28.
31 Jooyoung Lee et al, “Do Language Models Plagiarize?”, available at: https://arxiv.org/pdf/2203.07618.pdf. In particular, note section 5.2 of their paper, which hypothesizes that a model is more likely to memorize and regurgitate content it is trained on if that content appears multiple times/frequently in the training set.
32 https://arstechnica.com/information-technology/2023/02/researchers-extract-training-images-from-stable-diffusion-but-its-difficult/
33 See page 4 of the report on “Generative Artificial Intelligence and Copyright Law” published by the Congressional Research Service, available at https://crsreports.congress.gov/product/pdf/LSB/LSB10922
34 For example, Microsoft and Google offer the indemnity for copyright infringement allegations for both training data as well as the generated output. In contrast, Adobe only offers it for the generated output.
35 See, for example, Microsoft and Google’s terms.
36 https://cloud.google.com/terms/service-terms under “Service Terms”, paragraph 17(j).
37 https://www.theregister.com/2023/10/12/google_ai_protection/
38 https://cloud.google.com/terms/service-terms under “Service Terms”, paragraph 17(j).
39 See https://www.shutterstock.com/blog/ai-indemnity-protection-commercial-use, where the AI-generated images will be approved for indemnification once they pass content review by Shutterstock’s expert content review team, who “check that generated images do not depict copyright, trademark, right of publicity and other potential risks”, and also examine the prompts used to create the image.
40 See https://techcrunch.com/2023/06/26/adobe-indemnity-clause-designed-to-ease-enterprise-fears-about-ai-generated-art.
41 See also https://synthedia.substack.com/p/google-adds-broad-generative-ai-indemnity, where the author is of the view that “it is unclear how a user may become liable for something unpublished but resident in training data”.
42 The customer trains a model (that has already been trained) on its own datasets, in a process called fine-tuning, so that the model’s weights (parameters) are changed and it is more tailored/customised to the task at hand – see https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/fine-tuning?tabs=turbo&pivots=programming-language-studio
43 The customer links the model with a database of the customer’s own content which will be searched based on the prompt entered, where the relevant ‘matching’ content from the database is then ‘retrieved’ and sent together with the prompt to the model for it to generate an output (a process called Retrieval-Augmented Generation (“RAG”)). Technically, for RAG, it does not involve any retraining of the model, because it merely sends an “enhanced prompt” with more context/resources to the model, but the model can still draw on the content to provide the response to the prompt. Learn more about RAG at: https://scale.com/blog/retrieval-augmented-generation-to-enhance-llms
44 E.g. https://cloud.google.com/terms/service-terms under “Service Terms”, paragraph 17(j).
45 See https://openai.com/dall-e-3, under “Creative Control”.
46 The US Copyright Office issued guidance on 16 March 2023 (titled “Copyright Registration Guidance: Works Containing Material Generated by Artificial Intelligence”, accessible at: https://www.federalregister.gov/documents/2023/03/16/2023-05321/copyright-registration-guidance-works-containing-material-generated-by-artificial-intelligence). It takes the position that whether copyright exists depends on whether AI is like a tool (e.g. a pencil) or a “commissioned artist”, stating that “(w)hen an AI technology determines the expressive elements of its output, the generated material is not the product of human authorship”. The US Copyright Office went on to explain that if a human modifies material generated by AI technology “to such a degree that the modifications meet the standard for copyright protection”, the human will have copyright over the human-authored aspects of the work, but not the base AI-generated material.
47 Both images are extracted from the US Copyright Review Board’s decision, page 6, available at: https://www.copyright.gov/rulings-filings/review-board/docs/Theatre-Dopera-Spatial.pdf
48 Mr Allen’s creation won first place in the digital arts category at the Colorado State Fair Fine Arts Competition: https://edition.cnn.com/2022/09/03/tech/ai-art-fair-winner-controversy/index.html
49 Whether Mr Allen’s adjustments in Photoshop were sufficient to be copyrightable on their own was not decided in this case, as the Board lacked enough information to make that determination.
50 The author notes that as a parallel, there are artists who also have their art pieces done by human assistants – see for example, Damien Hirst’s spot paintings (more information available at https://www.myartbroker.com/artist-damien-hirst/articles/damien-hirst-assistants-vs-renaissance-workshops). But applying the principles articulated by the Copyright Review Board, the outcome is different because Mr. Hirst conceptualises them and instructs his assistants, who, being human, understand what he wants them to do.
51 Some examples are: (1) China – released guidelines on tagging content in generative AI services on 25 August 2023; (2) EU – where signatories to the voluntary Code of Practice on Online Disinformation are to label AI-generated content; (3) Singapore – see IMDA’s Paper on Generative AI: Implications for Trust and Governance (published June 2023); (4) US – the voluntary commitments to the White House made by 7 leading AI companies on 21 July 2023.
52 https://deepmind.google/discover/blog/identifying-ai-generated-images-with-synthid/
53 https://deepmind.google/discover/blog/identifying-ai-generated-images-with-synthid/
54 As suggested in https://www.whitehouse.gov/wp-content/uploads/2023/09/Voluntary-AI-Commitments-September-2023.pdf. The reason for this is pithily summed up in this article: “for satirists living under authoritative rule, humorous content challenging their leadership could put them in danger” – https://www.theverge.com/2023/10/31/23940626/artificial-intelligence-ai-digital-watermarks-biden-executive-order
55 https://www.technologyreview.com/2023/08/09/1077516/watermarking-ai-trust-online
56 The White House voluntary commitments relate to audiovisual content only; text-based content is not covered. As this article shows, trying to watermark text is not straightforward! (see https://techcrunch.com/2022/12/10/openais-attempts-to-watermark-ai-text-hit-limits/)
57 For example, OpenAI and Google. OpenAI: https://help.openai.com/en/articles/5008634-will-openai-claim-copyright-over-what-outputs-i-generate-with-the-api. Google: https://cloud.google.com/terms/service-terms under “Service Terms”, para 17(a) – “As between Customer and Google, Google does not assert any ownership rights in any new intellectual property created in the Generated Output.”
58 https://openai.com/policies/mar-2023-terms at para 3(a) (accessed 9 November 2023). As at 14 November 2023, OpenAI has since updated its terms of use (to be effective on 14 December 2023), inserting an “if any” qualifier: “We hereby assign to you all our right, title and interest, if any, in and to Output” (available at: https://openai.com/policies/terms-of-use).
59 See also the commentary on OpenAI’s Terms of Use on page 3 of the US Congressional Research Service’s report on “Generative Artificial Intelligence and Copyright Law”.
60 The Singapore Academy of Law’s Law Reform Committee report (July 2020) on “Rethinking Database Rights and Data Ownership in an AI World” at 2.76 suggested Singapore may consider adopting section 9(3) of the CPDA to prepare for situations where human input into a work is increasingly remote, such that human authorship is lost.
61 As held in the case of Thaler v Perlmutter (United States District Court, District of Columbia), accessible at https://caselaw.findlaw.com/court/us-dis-crt-dis-col/114916944.html
