Companies doubling down on AI shouldn't sleep on copyright concerns, experts warn
Welcome back to LinkedIn News Tech Stack, which brings you news, insights and trends involving the founders, investors and companies on the cutting edge of emerging technology.
First: This week’s edition of VC Wednesdays featured Redpoint’s Logan Bartlett, who is usually on the other end of the hot seat thanks to his popular podcast. Catch the full Q&A here.
Pitch me the interesting investors, founders, ideas and companies powering emerging technologies like AI. Share your feedback and follow me on LinkedIn for other tech updates.
A deep dive into one big theme or news story every week.
Ever since ChatGPT’s debut last November, companies of all sizes have been eager to figure out how to deploy the technology to boost their own businesses. Businesses are projected to invest around $200 billion in AI globally by 2025, according to a recent Goldman Sachs report.
But a slew of recent lawsuits have raised questions around how AI large language models collect and use copyrighted data – leaving some in the lurch about the way forward, say experts.
In a recent post, Mayfield Fund partner Patrick Salyer wrote that enterprise CIOs want models “that are immune to IP and copyright issues,” among other things, and some enterprises have even hit pause until tools that can detect potential copyright infringement are in place.
Open source versus proprietary models
Most enterprises have been building their own AI models either through paid access to proprietary LLMs like OpenAI’s ChatGPT, or by fine-tuning open source LLMs like Meta’s Llama 2 for their own use cases. (A third, more expensive and laborious option is building their own models internally from scratch.)
While proprietary models are easier to use, security remains an ongoing concern, with companies worried about the models having access to their internal data and potentially using it for future training, said Maggie Basta, an investor at Scale Venture Partners.
But as far as copyright issues are concerned, open source solutions – which are sometimes preferred, as they allow users to keep any proprietary data private – aren’t immune either.
In a recent webinar hosted by The Information, Cohere CEO and cofounder Aidan Gomez shared his concerns about open source models like Meta’s Llama 2, saying that large enterprises were exposing themselves to security and privacy risks by building products on top of such AI models.
And in a separate LinkedIn post, AI adviser Vin Vashishta recently warned that companies risk being caught off guard if they integrate open source LLMs into products without evaluating how they were trained.
The underlying issue is that, whether open source or proprietary, outside LLMs have essentially been trained on troves of image and text data scraped from the internet – and remain huge black boxes. Just last week, The Atlantic reported that pirated books were used to train models including Meta’s Llama.
“Both Stability AI, which is open source on the image side, and OpenAI, which is closed, are getting a huge amount of pushback for copyright issues,” said Basta. “I don’t think any of these LLMs have done a great job of saying ‘this is where we got our data from.’”
There’s no existing legal precedent for AI
The problem is that because AI technology is so new and advancing so fast, there’s no legal precedent. Existing copyright provisions haven’t been tested against AI, so the question of who’s at risk – the company employing the model or the model developer – remains unclear, at least in the U.S., said AI law expert Barry Scannell.
(The EU AI Act, which is yet to become law, places the onus on developers, including obligations on providing a summary of what copyrighted data was used to train models. Different countries have different standards, as AI researcher Sebastian Raschka, PhD wrote in this post.)
“It doesn't really matter if it's open source or proprietary – what matters is what data was used to train it and if it was infringing on copyright or not,” Scannell said. “From a copyright infringement perspective, there’s a risk of damages but also injunctions preventing unauthorized use of training data, which could lead to the dismantling of your whole (enterprise) model.”
In the U.S., expect the principle of “fair use” to become a hot-button issue, with some even calling it the “Napster moment” for OpenAI and for generative AI.
Developers have relied on the concept to argue that training AI with copyrighted material should be permitted, as generative AI technology technically produces new work. But that is starting to be challenged, with NPR reporting that The New York Times has updated its terms of service to forbid using its content in training, and is considering legal action against OpenAI for unauthorized use of its articles as training data.
“Whether it’s The New York Times or someone else, there's going to be a major case in the next year on training AI and what constitutes transformative use and fair use,” Scannell said. “And it’s unclear where it will fall.”
In the meantime, companies must evaluate how the models they are employing have been trained in order to avoid being dragged into potential legal issues in the future, and review their indemnity, liability and warranty clauses, experts said.
“There's going to be a huge push to have more compliance and auditing from companies around where the models got their data from,” said Basta.
Despite the uncertainty, most enterprises ought to continue adopting AI models in their businesses, said Forrester analyst Rowan Curran.
“There's so much clear value,” he said.
Here’s where we bring you up-to-speed with the latest advancements from the world of AI.
It’s been another jam-packed week in AI news. Here are the highlights.
Catch up on the tech headlines you may have missed this week and what our members are saying about them on LinkedIn.
Here’s where we keep tabs on key executives on the move and other big pivots in the tech industry. Please send me personnel moves within emerging tech.
Here are other top stories of the week from beyond LinkedIn in the broader world of tech.
Thanks for reading. Please share Tech Stack and forward it around if you like it! And if you have any news tips, find me on InMail.