I will say I do not feel there is an ethical use for "ai"; its very nature is to be unethical. JMHO.
What's so fundamentally and intrinsically unethical about AI use that you can't change it to make it ethical? Every criticism I've heard has been about how it's used (basically copyright infringement) rather than about AI per se. The only thing I can see is that artists etc. might lose their jobs, but while that's a shame and I have sympathy for them, given the number of jobs lost to automation in our lives, it'd be oddly selective to criticise AI for that.
The problem is that in a lot of ways being unethical is baked into how it works. To get a large enough sample to actually work, they have to pool thousands, if not millions, of sources, and each one of those is essentially stealing someone else's work and talent. Sure, you could make it ethical by honouring copyright, but that would mean paying every single creator and getting them to agree, making it both too costly and too time-consuming to be viable, and an awful lot of artists would say no no matter how much you offered them. Throw in as well that most of the people and companies creating AI tools seem to have a really shaky grasp of ethics and don't see anything wrong with stealing other people's work to get their algorithms to work, and you'll never get an ethical AI on the current business model.
He specifically mentions using only the books WotC owns the copyright to in the AI database.
Did he say who will own the "ai"?
I'm not sure how that's relevant? I'm sure the core AI code would be licensed from someone else, but it's the training data that determines what it produces.
I disagree; I see the code as more important than the data, as it is the code that will "decide" what data it uses and how it is used. Just look at the multitude of debates on RAW/RAI we have, and all of us have been "trained" on the same data.
He specifically mentions using only the books WotC owns the copyright to in the AI database.
Would that even work? How many books does WOTC own the copyright to? I was under the impression you would need something on the scale of thousands of books to train a generative AI that's even a little bit competent.
It sounds pretty questionable. WotC doesn't have any AI-building ability in the first place, they'd need to buy it from someone else and then dump their own content into it.
They've got fifty years of D&D artwork, although the early stuff was obviously not as detailed. They also have thirty years of Magic: The Gathering card artwork. Then there will also be the thousands of pieces that were submitted by the artists they hired but were rejected for publication.
And, of course, they also have access to anything in the Public Domain.
The artwork isn't gonna help the "ai" "learn" much about being a DM.
The use of public domain material is precluded by Cocks' statement that they would use only WotC-owned copyrighted materials.
Honestly I don't understand how it can be unethical use if it's referencing millions of different data points, any more than an artist is unethical for studying famous works of a style they want to use or an author is unethical for looking at how other authors write certain scenes. The vast majority of art is built on what came before, this is just the next technological extension of that. Obviously if you exclusively train it on a single artist it'll just imitate them, but once you've got several dozen in the mix can it really be said to be taking enough from any single one to be stealing?
If I steal a bunch of other people's paintings and put them in my woodchipper, do I own the confetti?
I disagree; I see the code as more important than the data, as it is the code that will "decide" what data it uses and how it is used. Just look at the multitude of debates on RAW/RAI we have, and all of us have been "trained" on the same data.
No, the code is much less important than the data. The core of large language models is a database of words, correlated along vast numbers of different axes. While there's code around it, the model's behavior isn't programmed like we're used to.
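To put a loose picture on "correlated along vast numbers of different axes": inside the model, words become vectors, and correlation is just geometry. Here's a toy sketch where the three axes and every number are invented purely for illustration (real models learn thousands of axes from the data, nobody types them in):

```python
# Toy picture of words as vectors: similarity falls out of the numbers,
# not out of any hand-written rule. All values here are made up.
import math

vec = {
    "wizard":   [0.9, 0.1, 0.8],  # pretend axes: magic-ness, melee-ness, bookishness
    "sorcerer": [0.8, 0.2, 0.3],
    "fighter":  [0.1, 0.9, 0.1],
}

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, ~0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

print(cosine(vec["wizard"], vec["sorcerer"]))  # ~0.93: strongly correlated
print(cosine(vec["wizard"], vec["fighter"]))   # ~0.24: weakly correlated
```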
Would that even work? How many books does WOTC own the copyright to? I was under the impression you would need something on the scale of thousands of books to train a generative AI that's even a little bit competent.
I'm pretty sure that every single word written about and for D&D throughout the last 50 years would not be enough to train a modern LLM. A customized model is (I believe) invariably made by taking one of the generic models like GPT-4, trained on whatever they grabbed off the internet, and putting it through a second round of training with the domain-specific data set. Everything in the original model is likely still there, just deemphasized.
Even if WotC did own enough material to fully train an LLM from scratch, doing so is extremely expensive.
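For anyone curious what that "second round of training" looks like in practice, here's a minimal sketch using the open-source Hugging Face libraries. The base model, the corpus path, and the hyperparameters are stand-ins invented for illustration; this is obviously not WotC's actual pipeline:

```python
# Sketch of domain fine-tuning: start from a generic web-trained model,
# then run a second pass of next-token training on domain text.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "gpt2"  # small stand-in for a big generic model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical corpus: imagine everything WotC owns as plain-text files.
corpus = load_dataset("text", data_files={"train": "wotc_corpus/*.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dnd_tuned", num_train_epochs=1),
    train_dataset=tokenized,
    # mlm=False means plain next-token prediction, i.e. causal LM fine-tuning
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # the web-trained weights shift toward the new corpus; they don't vanish
```

The last comment is the point made above: fine-tuning nudges the existing weights, it doesn't replace them.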
If I steal a bunch of other people's paintings and put them in my woodchipper, do I own the confetti?
That's a fascinating and evocative metaphor that is entirely unrelated to what's happening with an AI. A better one would be: "If I go to an art museum and look at the pieces they have on display for inspiration, am I plagiarizing if I borrow from one person's use of colors, and another's use of perspective and so on?"
The artwork isn't gonna help the "ai" "learn" much about being a DM.
The use of public domain material is precluded by Cocks' statement that they would use only WotC-owned copyrighted materials.
For that they have fifty years of published adventures as well as podcasts. Dragon magazine was owned by TSR; then there are the licensed games, such as the SSI Gold Box series. They've even got DM questions and answers on D&D Beyond to pull from.
But yeah, I too totally believe him when he says they're not going to use public domain items.
No, the code is much less important than the data. The core of large language models is a database of words, correlated along vast numbers of different axes. While there's code around it, the model's behavior isn't programmed like we're used to.
Again, I disagree; without the "code" it is just data. How is data going to run a session without the "code"?
It is the code that determines how the data is used; even if it "learns", it is governed by the code, and those fingerprints stay with it throughout the "learning" process. The fact that the "code" is proprietary says a lot.
For that they have fifty years of published adventures as well as podcasts. Dragon magazine was owned by TSR; then there are the licensed games, such as the SSI Gold Box series. They've even got DM questions and answers on D&D Beyond to pull from.
But yeah, I too totally believe him when he says they're not going to use public domain items.
Well, it is all we have to go on at the moment; this entire thread is based on his statements. Fifty years of owned content seems like a lot until you try to teach an "ai" with it. And that is assuming they use all of the rules from all of the editions, which seems like it would be very problematic for the "ai" DM.
That's a fascinating and evocative metaphor that is entirely unrelated to what's happening with an AI. A better one would be: "If I go to an art museum and look at the pieces they have on display for inspiration, am I plagiarizing if I borrow from one person's use of colors, and another's use of perspective and so on?"
No, it's kind of apt.
Despite it being called "training" or "learning", what the generative models do is vastly different from what people do. (As best we understand the latter.)
Because they're probabilistic, when specific sets of words or pixels are strongly correlated and well-represented in the training data, they're very likely to show up in the output. If you give an image generator the phrase "Italian plumber", we all know the way the output's going to trend. If you ask a person to draw you an Italian plumber, you will only get Mario if they choose to draw you Mario.
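Here's a twenty-line toy version of that point, for the curious. The "corpus" is a fake one, deliberately skewed so that "plumber" is followed by "mario" nine times out of ten; the model then reproduces that skew, because the counts are all it has:

```python
# Toy bigram model: output probabilities come straight from training frequencies.
import random
from collections import Counter, defaultdict

corpus = ("italian plumber mario " * 9 + "italian plumber luigi ").split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1          # "training": count what follows what

def sample_next(word):
    counts = follows[word]
    words, weights = zip(*counts.items())
    return random.choices(words, weights)[0]  # probability proportional to frequency

print(sample_next("plumber"))  # "mario" ~90% of the time; "luigi" ~10%
```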
That said, the legal issues around whether LLMs are infringing the copyright of their training data are complex and deeply unresolved. Transformative use is a significant fair-use principle in the US, and there's a solid argument to be made that this stuff is transformative. On the other hand, the fact that it's pretty easy to poke them into spitting out training data nearly verbatim is rather persuasive too.
I think there's a real chance that we're going to see some kind of split, where they're legally legit in principle, but the massive copyright infringement committed in assembling the training data for the current models is not. But that's many years and millions of dollars in legal expenses away. (And will only apply to the US. The EU, China, etc. will have their own rulings. So much fun.)
The ethical and artistic issues are each their own cans of worms as well.
Again, I disagree; without the "code" it is just data. How is data going to run a session without the "code"?
An LLM doesn't function without both data and code, but the relationship between an LLM and data is similar to the relationship between a search engine and data.
The main problem with using only licensed sources is that it's probably not enough data. There are likely on the order of ten million words across all 5e official publications; I suspect most LLMs are trained on many billions.
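The back-of-the-envelope version, with both figures being rough assumptions rather than actual counts:

```python
# Rough scale comparison; both numbers are guesses for illustration only.
words_5e = 10_000_000             # ~10M words across official 5e publications?
words_webscale = 300_000_000_000  # web-scale corpora run to hundreds of billions
print(f"{words_5e / words_webscale:.4%} of a typical training set")
# -> 0.0033%: plenty for a fine-tuning pass, nowhere near enough from scratch
```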
An LLM doesn't function without both data and code, but the relationship between an LLM and data is similar to the relationship between a search engine and data.
Correct, but the data can be used in many ways without the code; it is the code that makes it special.
Well, it is all we have to go on at the moment; this entire thread is based on his statements. Fifty years of owned content seems like a lot until you try to teach an "ai" with it. And that is assuming they use all of the rules from all of the editions, which seems like it would be very problematic for the "ai" DM.
How many years of content does an AI need to learn from? You can generate a dungeon and BBEG by simply rolling on random tables found in the DMG. That doesn't come across as something that would be difficult to teach an AI.
And how does any of this relate to the idea that using an AI is somehow immoral?
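That random-table part really is easy to automate. Here's a sketch with made-up miniature tables (not the DMG's actual entries) standing in for the real ones:

```python
# Random-table adventure seed, in the spirit of the DMG's tables.
# The table contents below are invented examples, not the published entries.
import random

BBEG = ["lich", "beholder", "cult leader", "ancient dragon"]
LOCATION = ["ruined keep", "sunken temple", "abandoned mine", "sewer maze"]
GOAL = ["stop the ritual", "recover the relic", "rescue the heir"]

def roll(table):
    return random.choice(table)  # same as rolling a die sized to the table

print(f"A {roll(BBEG)} in a {roll(LOCATION)}: the party must {roll(GOAL)}.")
```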
That also doesn't come across as something you need an AI for.
It isn't the years, it is the volume of data needed. As to the ethics of "ai", well, I am not going to retype what many others have already written in this thread; it is here for you to read at your leisure.
Correct, but the data can be used in many ways without the code; it is the code that makes it special.
Um... so? The basic way an LLM works is that you feed it a bunch of data and tell it "produce stuff that looks like what we fed you".
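That one-sentence recipe, in toy form: count what follows what in the data ("training"), then sample from those counts ("produce stuff that looks like it"). The corpus is a made-up sentence, just to show the mechanics:

```python
# Minimal "train, then imitate" loop: the output can only echo the data's statistics.
import random
from collections import Counter, defaultdict

text = "the dragon guards the hoard and the party fights the dragon".split()

model = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    model[prev][nxt] += 1            # "training": tally next-word frequencies

word, output = "the", ["the"]
for _ in range(8):                   # "generation": sample from those tallies
    options = model.get(word)
    if not options:
        break
    word = random.choices(list(options), list(options.values()))[0]
    output.append(word)

print(" ".join(output))  # e.g. "the dragon guards the hoard and the party fights"
```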