Does ChatGPT have the books in its training data?

#1 Nov 7, 2023
Tarod_
- Adventurer
- Join Date: 9/19/2023
- Posts: 3,202
I was playing around with ChatGPT a little when I saw some answers pointing to specific pages of the PHB. Is this legal? I guess not. But I suppose WotC is aware of that... or not.

Question: "Which ability should I use when using a disguise kit in D&D 5e?" And the (wrong?) answer:

Another example and then some lies:

Last edited by Tarod_: Nov 7, 2023

#2 Nov 7, 2023
Quar1on
- Master Trickster
- Join Date: 2/9/2020
- Posts: 2,705
I doubt it actually knows what's on those pages. It probably just noticed that they tend to accompany online answers to questions like yours. If a bunch of online answers to "is it possible to cast spells while concentrating" include references to pages 202 and 203, then the AI's answer will as well.
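To make that idea concrete, here is a toy sketch in Python. It is not how ChatGPT works internally; it only illustrates how a page number can get attached to a topic purely through co-occurrence in text, with no access to the book itself. The snippets, the page numbers in them, and the function name `most_associated_page` are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical snippets standing in for online answers.
# The page numbers are placeholders, not verified PHB references.
ANSWERS = [
    "Concentration rules are on page 203 of the PHB.",
    "See PHB page 203 for concentration.",
    "Concentration is also touched on around page 202.",
    "The disguise kit is listed on page 154.",
]


def most_associated_page(topic, answers):
    """Return the page number that most often co-occurs with `topic`, or None."""
    counts = Counter()
    for text in answers:
        # Only count page mentions in snippets that also mention the topic.
        if topic.lower() in text.lower():
            counts.update(re.findall(r"page (\d+)", text))
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(most_associated_page("concentration", ANSWERS))
```

The counter never sees the book. It only repeats whichever page number shows up most often next to the topic, so it would be just as confident if the snippets it was fed were wrong.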

Look at what you've done. You spoiled it. You have nobody to blame but yourself. Go sit and think about your actions.

How I'm posting based on text formatting: Mod Hat Off - Mod Hat Also Off (I'm not a mod)
#3 Nov 8, 2023
jl8e
- Adventurer
- Join Date: 7/23/2020
- Posts: 3,770
Quote from tarodnet >>

I was playing a little bit with ChatGPT, when I saw some answers pointing out to some pages from PHB. Is this legal? I guess not. But I suppose WotC is aware of that... or not.

Nobody outside of the folks at OpenAI knows for sure whether any of the D&D books were used to train it. (And even they probably don't know unless they check.)

Whether or not they were, it has definitely absorbed a lot of text about the D&D rules, so you'll get things that look like answers out of it. As you noticed in the first one, there's a good chance they'll be sketchy, because the system doesn't understand things, or even know what rules are. You'll often get page references that look right, because there's enough text out there associating the disguise kit with the page number.

Even if the books were used, it can't look things up in them like we can. Their text was used in correlating the relationships between words along many axes, and then discarded. (I think they're starting to let the things go out to the web to find information, so that's less true than it used to be, but it's still not the same.)

Given their extremely indiscriminate approach to collecting text to feed the beast, I'd expect the SRD got absorbed, and the books are pretty likely. (Is that legal? That's a fascinating legal question which isn't going to be answered any time soon, and then only at great expense.)

#4 Nov 8, 2023
MidnightPlat
- Cabalist
- Join Date: 4/7/2020
- Posts: 6,541
Before ChatGPT there was an anecdote used to articulate what machine learning was trying to accomplish: a researcher purportedly told his supervisor, "We're not teaching a computer to play chess, we're teaching it to be a Dungeon Master." Given that, I wouldn't be surprised at all if an SRD made it into early language models (and received updates; I mean, computer science and TTRPGs have some significant overlap on the ol' Venn diagram). OpenAI and others have also admitted to "disrupting" copyright in their language model development, claiming that their conceit of what the AI does transcends the electronic-storage boilerplate in most copyright terms. It's a "move fast and break things" approach where they hope to become instrumental enough to avoid any repercussions or oversight ... like pretty much every other tech innovation of the last 20 years.

Last edited by MidnightPlat: Nov 8, 2023

Jander Sunstar is the thinking person's Drizzt, fight me.
#5 Nov 8, 2023
Agile_DM
- Prestidigitator
- Join Date: 3/20/2017
- Posts: 683
I'm certain it read the OGL, SRD, and Basic Rules here, as they are published publicly. By the same logic, I don't think ChatGPT scanned the books here on D&D Beyond, as viewing them requires ownership. If it has the full books in the DB, that's because someone scanned them and posted pirated copies at a public URL.

Overall I think the folks behind ChatGPT are trying to shape it into something that respects copyright after the backlash (rightful backlash no less).

#6 Nov 8, 2023
MidnightPlat
- Cabalist
- Join Date: 4/7/2020
- Posts: 6,541
ChatGPT did not simply develop its language model from open source and the public domain; there's enough litigation citing specific works to demonstrate that. A lot of ChatGPT "training" was done on copyrighted works. OpenAI had this pretense that it wasn't violating copyright any more than a human who remembers something they read is violating copyright, which is a b.s. argument and also further muddies what AI actually does, pushing a fiction that AI creativity is on par with human thinking. I imagine D&D's present and past product lines are only the most often utilized of many game system texts "trained" into ChatGPT by developers.

OpenAI has been demonstrably two-faced on regulation (which would include respect for copyright). It's part and parcel of all the fake math by which tech companies proclaim that the harm they may do to workers in the present is insignificant compared to the "net good" they will provide. What we're witnessing isn't so much an actual altruistic drive as b.s. virtue fig leaves, derived from effective altruism, held in front of various tech upstarts and titans in a fight for dominance, one that will change folks' lives whether or not a regulatory framework is imposed.

#7 Nov 9, 2023
Tarod_
- Adventurer
- Join Date: 9/19/2023
- Posts: 3,202
Quote from Agile_DM >>

... If it has the full books in the DB that's because someone scanned them up and pirated on a public URL ...

That's my guess, too.

Quote from jl8e >>

... As you notice in the first one, there's a good chance they'll be sketchy, because the system doesn't understand things, or even know what rules are...

Totally true.

Thanks for your answers, guys!

Last edited by Tarod_: Nov 9, 2023

#8 Nov 9, 2023
jl8e
- Adventurer
- Join Date: 7/23/2020
- Posts: 3,770
Quote from tarodnet >>

Quote from Agile_DM >>

... If it has the full books in the DB that's because someone scanned them up and pirated on a public URL ...

That's my guess, too.

It's known there's at least one large collection of pirated ebooks in the training data. IIRC, it's the basis of at least one of the lawsuits.

#9 Nov 14, 2023
jdornanatwork1
- Adventurer
- Join Date: 2/22/2021
- Posts: 8
It has access to the internet, and the 5th edition SRD is a website on it. It quotes PHB pages because some of the online resources copy-pasted parts of the PHB when uploading the rules for OGL use.
