The Daily
[0] From The New York Times, I'm Michael Barbaro.
[1] This is the Daily.
[2] Today.
[3] A Times investigation shows how, as the country's biggest technology companies raced to build powerful new artificial intelligence systems, they bent and broke the rules from the start.
[4] My colleague, Cade Metz, on what he uncovered.
[5] It's Tuesday.
[6] April 16th.
[7] Cade, when we think about all the artificial intelligence products released over the past couple of years, including, of course, these chatbots we've talked a lot about on the show, we so frequently talk about their future, their future capabilities, their influence on society, jobs, our lives.
[8] But you recently decided to go back in time to AI's past to its origins, to understand the decisions that were made basically at the birth of this technology.
[9] So why did you decide to do that?
[10] Because if you're thinking about the future of these chatbots, that is defined by their past.
[11] The thing you have to realize is that these chatbots learn their skills by analyzing enormous amounts of digital data.
[12] So what my colleagues and I wanted to do with our investigation was really focus on that effort to gather more data.
[13] We wanted to look at the type of data these companies were collecting, how they were gathering it, and how they were feeding it into their systems.
[14] And when you all undertake this line of reporting, what do you end up finding?
[15] We found that three major players in this race, OpenAI, Google, and Meta, as they were locked into this competition to develop better and better artificial intelligence, they were willing to do almost anything to get their hands on this data, including ignoring, and in some cases violating, corporate rules and wading into a legal gray area as they gathered it.
[16] Basically, cutting corners.
[17] Cutting corners left and right.
[18] Okay, let's start with OpenAI, the flashiest player of all.
[20] The most interesting thing we found is that in late 2021, as OpenAI, the startup in San Francisco that built ChatGPT, as they were pulling together the fundamental technology that would power that chatbot, they ran out of data, essentially.
[21] They had used just about all the respectable English language text on the internet to build this system.
[22] And just let that sink in for a bit.
[23] I mean, I'm trying to let that sink in.
[24] They basically, like Pac-Man in an old game, just consumed almost all the English words on the internet, which is kind of unfathomable.
[25] Wikipedia articles by the thousands, news articles, Reddit threads, digital books by the millions. We're talking about hundreds of billions, even trillions of words.
[26] Wow.
[27] So, by the end of 2021, OpenAI had no more English language text that they could feed into these systems.
[28] But their ambitions are such that they wanted even more.
[29] So here, we should remember that if you're gathering up all the English language text on the internet, a large portion of that is going to be copyrighted.
[30] Right.
[31] So if you're one of these companies gathering data at that scale, you are absolutely gathering copyrighted data as well.
[32] Which suggests that from the very beginning, these companies, a company like OpenAI with ChatGPT, is starting to bend, even break, the rules.
[33] Yes.
[34] They are determined to build this technology.
[35] Thus, they are willing to venture into what is a legal gray area.
[36] So given that, what does OpenAI do once it, as you had said, runs out of English language words to mop up and feed into this system?
[37] So they get together and they say, all right, so what are other options here?
[38] And they say, well, what about all the audio and video on the internet?
[39] We could transcribe all the audio and video, turn it into text, and feed that into our systems.
[40] Interesting.
[41] So a small team at OpenAI, which included its president and co-founder Greg Brockman, built a speech recognition technology called Whisper, which could transcribe audio files into text with high accuracy.
[42] And then they gathered up all sorts of audio files from across the Internet, including audiobooks, podcasts, and most importantly, YouTube videos.
[43] Of which there's a seemingly endless supply, right?
[44] Fair to say maybe tens of millions of videos.
[45] According to my reporting, we're talking about at least a million hours of YouTube videos that were scraped off of that video-sharing site and fed into this speech recognition system in order to produce new text for training OpenAI's chatbot.
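[To make the mechanics concrete: Whisper was later released as open-source software, so the basic transcription step described here can be sketched with the public openai-whisper Python package. This is a minimal illustration of the general technique, not OpenAI's internal pipeline, and the file name is a placeholder.]

```python
# Minimal sketch: turning audio into training text with the open-source
# Whisper model (pip install openai-whisper; requires ffmpeg).
# Illustrative only; "episode.mp3" is a hypothetical placeholder file.
import whisper

model = whisper.load_model("base")        # small general-purpose checkpoint
result = model.transcribe("episode.mp3")  # returns a dict with "text" and timed segments
print(result["text"])                     # plain text, usable in a training corpus
```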
[46] And YouTube's terms of service do not allow a company like OpenAI to do this.
[47] YouTube, which is owned by Google, explicitly says you are not allowed to, in internet parlance, scrape videos en masse from across YouTube and use those videos to build a new application.
[48] That is exactly what OpenAI did.
[49] According to my reporting, employees at the company knew that it broke YouTube's terms of service, but they resolved to do it anyway.
[50] So, Cade, this makes me want to understand what's going on over at Google, which, as we have talked about in the past on the show, is itself thinking about and developing its own artificial intelligence model and product.
[51] Well, as OpenAI scrapes up all these YouTube videos and starts to use them to build their chatbot, according to my reporting, some employees at Google, at the very least, are aware that this is happening.
[52] They are?
[53] Yes.
[54] Now, when we went to the company about this, a Google spokesman said it did not know that OpenAI was scraping YouTube content and said the company takes legal action over this kind of thing when there's a clear reason to do so.
[56] But according to my reporting, at least some Google employees turned a blind eye to OpenAI's activities because Google was also using YouTube content to train its AI.
[57] Wow.
[58] So if they raise a stink about what OpenAI is doing, they end up shining a spotlight on themselves, and they don't want to do that.
[59] I guess I want to understand what Google's relationship is to YouTube, because, of course, Google owns YouTube.
[60] So what is it allowed or not allowed to do when it comes to feeding YouTube data into Google's AI models?
[61] It's an important distinction.
[62] Because Google owns YouTube, it defines what can be done with that data.
[63] And Google argues that it has a right to that data, that its terms of service allow it to use that data.
[64] However, because of that copyright issue, because the copyrights to those videos belong to you and me, lawyers I've spoken to say people could take Google to court and try to determine whether or not those terms of service really allow Google to do this.
[66] There's another legal gray area here where although Google argues that it's okay, others may argue it's not.
[67] Of course, what makes this also interesting is you essentially have one tech company, Google, keeping another tech company's, OpenAI's, dirty little secret about basically stealing from YouTube, because it doesn't want people to know that it, too, is taking from YouTube.
[68] And so these companies are essentially enabling each other as they simultaneously seem to be bending or breaking the rules.
[69] What this shows is that there is this belief, and it has been there for years within these companies, among their researchers, that they have a right to this data, because they're on a larger mission to build a technology that they believe will transform the world.
[70] And if you really want to understand this attitude, you can look at our reporting from inside Meta.
[71] And so what does Meta end up doing, according to your reporting?
[72] Well, like Google and other companies, Meta had to scramble to build artificial intelligence that could compete with OpenAI.
[73] Mark Zuckerberg is calling engineers and executives at all hours, pushing them to acquire this data that is needed to improve the chatbot.
[74] And at one point, my colleagues and I got hold of recordings of these Meta executives and engineers discussing this problem, how they could get their hands on more data, where they should try to find it.
[75] And they explored all sorts of options.
[76] They talked about licensing books one by one at $10 a pop and feeding those into the model.
[77] They even discussed acquiring the book publisher Simon & Schuster and feeding its entire library into their AI model.
[78] But ultimately, they decided all that was just too cumbersome, too time -consuming.
[79] And on the recordings of these meetings, you can hear executives talk about how they were willing to run roughshod over copyright law, ignore the legal concerns, and go ahead and scrape the internet and feed this stuff into their models.
[81] They acknowledged that they might be sued over this, but they talked about how OpenAI had done this before them, that they, Meta, were just following what they saw as a market precedent.
[82] Interesting.
[83] So they go from having conversations like, should we buy a publisher that has tons of copyrighted material, suggesting that they're very conscious of the legal terrain and what's right and what's wrong, and instead say, nah, let's just follow the OpenAI model, that blueprint, and just do what we want to do, do what we think we have a right to do, which is to kind of just gobble up all this material across the internet.
[84] It's a snapshot of that Silicon Valley attitude that we talked about.
[86] Because they believe they are building this transformative technology, because they are in this intensely competitive situation where money and power are at stake, they are willing to go there.
[87] But what that means is that there is, at the birth of this technology, a kind of original sin that can't really be erased?
[88] It can't be erased, and people are beginning to notice, and they are beginning to sue these companies over it.
[89] These companies have to have this copyrighted data to build their systems.
[90] It is fundamental to their creation.
[91] If a lawsuit bars them from using that copyrighted data, that could bring down this technology.
[92] We'll be right back.
[93] So, Cade, walk us through these lawsuits that are being filed against these AI companies based on the decisions they made early on to use technology as they did, and the chances that it could result in these companies not being able to get the data they so desperately say they need.
[94] These suits are coming from a wide range of places.
[95] They're coming from computer programmers who are concerned that their computer programs have been fed into these systems.
[96] They're coming from book authors who have seen their books being used.
[97] They're coming from publishing companies.
[98] They're coming from news corporations, like The New York Times, incidentally, which has filed a lawsuit against OpenAI and Microsoft, news organizations that are concerned over their news articles being used to build these systems.
[99] And here, I think it's important to say as a matter of transparency, Cade, that your reporting is separate from that lawsuit.
[100] That lawsuit was filed by the business side of The New York Times, by people who are not involved in your reporting or in this Daily episode, just to get that out of the way.
[101] Exactly.
[102] I'm assuming that you have spoken to many lawyers about this, and I wonder if there's some insight that you can shed on the basic legal terrain.
[103] I mean, do the companies seem to have a strong case that they have a right to this information, or do companies like the Times, who are suing them, seem to have a pretty strong case that, no, that decision violates their copyrighted materials?
[104] Like so many legal questions, this is incredibly complicated.
[105] It comes down to what's called fair use, which is a part of copyright law that determines whether companies can use copyrighted data to build new things.
[106] And there are many factors that go into this.
[107] There are good arguments on the Open AI side.
[108] There are good arguments on the New York Times side.
[109] Copyright law says that you can't take my work and reproduce it and sell it to someone.
[111] That's not allowed.
[112] But what's called fair use does allow companies and individuals to use copyrighted works in part.
[113] They can take snippets of it.
[114] They can take the copyrighted works and transform them into something new.
[115] That is what OpenAI and others are arguing they're doing.
[116] But there are other things to consider.
[117] Does that transformative work compete with the individuals and companies that supplied the data that own the copyrights?
[118] Interesting.
[119] And here, the suit between the New York Times company and OpenAI is illustrative.
[120] If The New York Times creates articles that are then used to build a chatbot, does that chatbot end up competing with The New York Times?
[122] Do people end up going to that chatbot for their information rather than going to the Times website and actually reading the article?
[123] That is one of the questions that will end up deciding this case and cases like it.
[124] So what would it mean for these AI companies for some or even all of these lawsuits to succeed?
[125] Well, if these tech companies are required to license the copyrighted data that goes into their systems, if they're required to pay for it, that becomes a problem for these companies.
[126] We're talking about digital data, the size of the entire Internet.
[127] Licensing all that copyrighted data is not necessarily feasible.
[128] We quote the venture capital firm Andreessen Horowitz in our story, where one of their lawyers says that it does not work for these companies to license that data.
[129] It's too expensive.
[130] It's on too large a scale.
[131] It would essentially make this technology economically impractical.
[132] Exactly.
[133] So a jury or a judge or a law ruling against OpenAI could fundamentally change the way this technology is built.
[135] The extreme case is these companies are no longer allowed to use copyrighted material in building these chatbots.
[136] And that means they have to start from scratch.
[137] They have to rebuild everything they've built.
[138] So this is something that not only imperils what they have today.
[139] It imperils what they want to build in the future.
[140] And conversely, what happens if the courts rule in favor of these companies and say, you know what, this is fair use, you were fine to have scraped this material and to keep borrowing this material into the future, free of charge?
[141] Well, one significant roadblock drops for these companies, and they can continue to gather up all that extra data, including images and sounds and videos, and build increasingly powerful systems.
[142] But the thing is, even if they can access as much copyrighted material as they want, these companies may still run into a problem.
[143] Pretty soon, they're going to run out of digital data on the internet.
[144] That human -created data they rely on is going to dry up.
[145] They're using up this data faster than humans create it.
[146] One research organization estimates that by 2026, these companies will run out of viable data on the internet.
[147] Wow.
[148] Well, in that case, what would these tech companies do?
[149] I mean, where are they going to go if they've already scraped YouTube, if they've already scraped podcasts, if they've already gobbled up the Internet, and that altogether is not sufficient?
[150] What many people inside these companies, including Sam Altman, the chief executive of OpenAI, will tell you is that they will turn to what's called synthetic data.
[151] And what is that?
[152] That is data generated by an AI model that is then used to build a better AI model.
[153] It's AI helping to build better AI.
[154] That is ultimately the vision they have for the future: that they won't need all this human-generated text.
[155] They'll just have the AI build text that will feed future versions of AI.
[156] So they will feed the AI systems the material that the AI systems themselves create.
[157] But is that really a workable, solid plan?
[158] And is that considered high -quality data?
[159] Is that good enough?
[160] If you do this on a large scale, you quickly run into problems.
[161] As we all know, as we've discussed on this podcast, these systems make mistakes.
[162] They hallucinate.
[163] They make stuff up.
[164] They show biases that they've learned from Internet data.
[165] And if you start using the data generated by the AI to build new AI, those mistakes start to reinforce themselves.
[166] Right.
[167] The systems start to get trapped in these cul-de-sacs, where they end up not getting better, but getting worse.
[168] What you're really saying is these AI machines need the unique perfection of the human creative mind.
[169] Well, as it stands today, that is absolutely the case.
[170] But these companies have grand visions for where this will go.
[171] And they feel, and they're already starting to experiment with this, that if you have an AI system that is sufficiently powerful, if you make a copy of it, if you have two of these AI models, one can produce new data and the other one can judge that data.
[172] It can curate that data as a human would.
[173] It can provide the human judgment, so to speak.
[174] So as one model produces the data, the other one can judge it, discard the bad data and keep the good data.
[175] And that's how they ultimately see these systems creating viable synthetic data.
[176] But that has not happened yet.
[177] And it's unclear whether it will work.
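[As a rough sketch of the generate-and-judge loop described above, here is what that curation step might look like in Python. Everything in it is hypothetical: the generator and judge objects, their method names, and the quality threshold are stand-ins for two copies of a sufficiently powerful model, not any company's actual system.]

```python
# Hypothetical sketch of synthetic-data curation: one model writes, a second
# model plays the human judge, and only high-scoring text is kept.
# The objects, method names, and threshold below are illustrative stand-ins.
def build_synthetic_dataset(generator, judge, prompts, threshold=0.8):
    kept = []
    for prompt in prompts:
        candidate = generator.generate(prompt)   # model 1 produces new text
        score = judge.rate_quality(candidate)    # model 2 scores it, 0.0 to 1.0
        if score >= threshold:                   # keep the good, discard the bad
            kept.append(candidate)
    return kept                                  # text to feed the next model
```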
[178] It feels like the real lesson of your investigation is that if you have to allegedly steal data to feed your AI model and make it economically feasible, then maybe you have a pretty broken model.
[179] And that if you need to create fake data as a result, which, as you just said, kind of undermines AI's goal of mimicking human thinking and language, then maybe you really have a broken model.
[180] And so that makes me wonder if the folks you talk to, the companies that were focused on here, ever asked themselves the question, could we do this differently?
[181] Could we create an AI model that just needs a lot less data?
[182] They have thought about other models for decades.
[183] The thing to realize here is that that is much easier said than done.
[184] We're talking about creating systems that can mimic the human brain.
[185] That is an incredibly ambitious task.
[186] And after struggling with that for decades, these companies have finally stumbled on something that they feel works, that is a path to that incredibly ambitious goal, and they're going to continue to push in that direction.
[188] Yes, they're exploring other options, but those other options aren't working.
[189] What works is more data and more data and more data, and because they see a path there, they're going to continue down that path.
[190] And if there are roadblocks there and they think they can knock them down, they're going to knock them down.
[191] But what if the tech companies never get enough or make enough data to get where they think they want to go, even as they're knocking down walls along the way?
[192] That does seem like a real possibility.
[193] If these companies can't get their hands on more data, then these technologies, as they're built today, stop improving.
[194] We will see their limitations.
[195] We will see how difficult it really is to build a system that can match, let alone surpass, the human brain.
[196] These companies will be forced to look for other options technically, and we will see the limitations of these grandiose visions that they have for the future of artificial intelligence.
[197] Okay.
[198] Thank you very much.
[199] We appreciate that.
[200] Glad to be here.
[201] We'll be right back.
[202] Here's what else you need to know today.
[203] Israeli leaders spent Monday debating whether and how to retaliate against Iran's missile and drone attack over the weekend.
[204] Herzi Halevi, Israel's military chief of staff, declared that the attack will be responded to.
[205] In Washington, a spokesman for the U.S. State Department, Matthew Miller, reiterated American calls for restraint.
[206] Of course, we continue to make clear to everyone that we talk to that we want to see de-escalation, that we don't want to see a wider regional war.
[208] That's something that's been...
[209] But Miller emphasized that a final call about retaliation was up to Israel.
[210] Israel is a sovereign country.
[211] They have to make their own decisions about how best to defend themselves.
[212] And the first criminal trial of a former U .S. president officially got underway on Monday in a Manhattan courtroom.
[213] Donald Trump, on trial for allegedly falsifying documents to cover up a sex scandal involving a porn star, watched as jury selection began.
[215] The initial pool of 96 jurors quickly dwindled.
[216] More than half of them were dismissed after indicating that they did not believe that they could be impartial.
[217] The day ended without a single juror being chosen.
[218] Today's episode was produced by Stella Tan, Michael Simon Johnson, Mooj Zadie, and Rikki Novetsky.
[219] It was edited by Marc Georges and Liz O. Baylen.
[220] Contains original music by Diane Wong, Dan Powell, and Pat McCusker, and was engineered by Chris Wood.
[221] Our theme music is by Jim Brunberg and Ben Landsverk of Wonderly.
[222] That's it for the Daily.
[223] I'm Michael Barbaro.
[224] See you tomorrow.