The Daily
Hi, I'm Jon Gertner.
I'm a contributor to The New York Times Magazine, and I write about science and technology.
This week's Sunday read is a story I wrote for the magazine about Wikipedia.
It's a story that explains how the 22-year-old, wonky online encyclopedia we've all consulted at one point is so central to building artificial intelligence right now.
So over the last few years, computer scientists have been creating what are known as large language models, which are the AI brains that power chatbots like ChatGPT.
And in order to build a large language model, they needed to gather vast knowledge banks of information.
And I mean, it's sort of dizzying how much information we're talking about here.
Some models ingest upwards of a trillion words.
And it all comes from public sources like Wikipedia or Reddit or Google's patent database.
What makes Wikipedia special is not just that it's free and accessible, but also that it's very highly formatted.
It contains just a tremendous amount of factual information that's maintained by a community of about 40,000 active editors in the English-language version alone.
The problem with these new AI chatbots is that their fundamental goal is to converse with a user with a kind of fluency of language, but they're not built to regurgitate data or to really be precise.
So whether you're trying to understand historical topics or political upheavals or pandemics, these bots greatly simplify the world in a way that's maybe not conducive at all to our best interests as human beings.
AI chatbots have even been known to hallucinate and conjure falsehoods from whole cloth.
And another problem is that if they're fed only on their own synthetic data, these systems essentially break down.
So if we were to go to AI instead of Wikipedia to find information, to solve our problems, to answer questions, what would happen in a future where our knowledge is factually unreliable?
As I reported this story, I read a lot of what are called community notes, which are the logs of Wikipedia editor meetings that they transcribe and make public.
And in one recent meeting, editors shared their worries about AI.
What's it going to do to Wikipedia?
I remember reading the notes for this meeting, and one line from an editor popped out at me: We want a future where knowledge is created by humans.
And I thought, well, that's really the essence of it, isn't it?
Can we really choose at this point the future we want?
So here's my article, Wikipedia's Moment of Truth, read by Brian Nishi.
In early 2021, a Wikipedia editor peered into the future and saw what looked like a funnel cloud on the horizon: the rise of GPT-3, a precursor to the new chatbots from OpenAI.
When this editor, a prolific Wikipedian who goes by the handle Barkeep49 on the site, gave the new technology a try, he could see that it was untrustworthy.
The bot would readily mix fictional elements, a false name, a false academic citation, into otherwise factual and coherent answers.
But he had no doubts about its potential.
I think AI's day of writing a high-quality encyclopedia is coming sooner rather than later, he wrote in Death of Wikipedia, an essay that he posted under his handle on Wikipedia itself.
He speculated that a computerized model could, in time, displace his beloved website and its human editors, just as Wikipedia had supplanted the Encyclopedia Britannica, which in 2012 announced it was discontinuing its print publication.
Recently, when I asked this editor (he asked me to withhold his name because Wikipedia editors can be the targets of abuse) if he still worried about his encyclopedia's fate, he told me that the newer versions made him more convinced that ChatGPT was a threat.
It wouldn't surprise me if things are fine for the next three years, he said, of Wikipedia.
And then, all of a sudden, in year four or five, things drop off a cliff.
Wikipedia marked its 22nd anniversary in January.
It remains, in many ways, a throwback to the Internet's utopian early days, when experiments with open collaboration (anyone can write and edit for Wikipedia) had yet to cede the digital terrain to multibillion-dollar corporations and data miners, advertising schemers and social media propagandists.
The goal of Wikipedia, as its co-founder Jimmy Wales described it in 2004, was to create a world in which every single person on the planet is given free access to the sum of all human knowledge.
The following year, Wales also stated, We help the Internet not suck.
Wikipedia now has versions in 334 languages and a total of more than 61 million articles.
It consistently ranks among the world's ten most-visited websites, yet it is alone among that select group, whose usual leaders are Google, YouTube, and Facebook, in eschewing the profit motive.
Wikipedia does not run ads, except when it seeks donations, and its contributors, who make about 345 edits per minute on the site, are not paid.
In seeming to repudiate capitalism's imperatives, its success can seem surprising, even mystifying.
Some Wikipedians remark that their endeavor works in practice, but not in theory.
Wikipedia is no longer an encyclopedia, or at least not only an encyclopedia.
Over the past decade, it has become a kind of factual netting that holds the whole digital world together.
The answers we get from searches on Google and Bing, or from Siri and Alexa (how old is Joe Biden, or what is an ocean submersible), derive in part from Wikipedia's data having been ingested into their knowledge banks.
YouTube has also drawn on Wikipedia to counter misinformation.
The new AI chatbots have typically swallowed Wikipedia's corpus too; embedded deep within their responses to queries is Wikipedia data and Wikipedia text, knowledge that has been compiled over years of painstaking work by human contributors.
While estimates of its influence can vary, Wikipedia is probably the most important single source in the training of AI models.
Without Wikipedia, generative AI wouldn't exist, says Nicholas Vincent, who will be joining the faculty of Simon Fraser University in British Columbia this month, and who has studied how Wikipedia helps support Google searches and other information businesses.
Yet, as bots like ChatGPT become increasingly popular and sophisticated, Vincent and some of his colleagues wonder: what will happen if Wikipedia, outflanked by AI that has cannibalized it, suffers from disuse and dereliction?
In such a future, a death of Wikipedia outcome is perhaps not so far-fetched.
A computer intelligence (it might not need to be as good as Wikipedia, merely good enough) is plugged into the web and seizes the opportunity to summarize source materials and news articles instantly, the way humans now do with argument and deliberation.
On a conference call in March that focused on AI's threats to Wikipedia, as well as the potential benefits, the editors' hopes contended with anxiety.
While some participants seemed confident that generative AI tools would soon help expand Wikipedia's articles and global reach, others worried about whether users would increasingly choose ChatGPT (fast, fluent, seemingly oracular) over a wonky entry from Wikipedia.
A main concern among the editors was how Wikipedians could defend themselves from such a threatening technological interloper, and some worried about whether the digital realm had reached a point where their own organization, especially in its striving for accuracy and truthfulness, was being threatened by a type of intelligence that was both factually unreliable and hard to contain.
One conclusion from the conference call was clear enough: We want a world in which knowledge is created by humans.
But is it already too late for that?
Back in 2017, the Wikimedia Foundation and its community of volunteers began exploring how the encyclopedia and its sister sites, like Wikidata and Wikimedia Commons, with their offerings of free information and images, could evolve by the year 2030.
The plan was to ensure that the foundation, the nonprofit that oversees Wikipedia, could protect and share the world's information in perpetuity.
One outcome of that 2017 effort, which included a year's worth of meetings, was a prediction that Wikimedia would become the essential infrastructure of the ecosystem of free knowledge.
Another conclusion was that trends like online misinformation would soon require far more vigilance.
And a research paper commissioned by the Foundation found that artificial intelligence was improving at a rate that could change the way that knowledge is gathered, assembled, and synthesized.
For that reason, the rollout of ChatGPT did not elicit surprise inside the Wikipedia community, though several editors told me they were shocked by the speed of its adoption, which needed just two months after its release in late 2022 to gain an estimated 100 million users.
Despite its stodgy appearance, Wikipedia is more tech-savvy than casual users might assume.
With a small group of volunteers to oversee millions of articles, it has long been necessary for highly experienced editors, often known as administrators, to use semi-automated software to identify misspellings and catch certain forms of intentional misinformation.
And because of its open-source ethos, the organization has at times incorporated technology made freely available by tech companies or academics, rather than go through a lengthy and expensive development process on its own.
We've had artificial intelligence tools and bots since 2002, and we've had a team dedicated to machine learning since 2017, Selena Deckelmann, Wikimedia's chief technology officer, told me.
They're extremely valuable for semi-automated content review, and especially for translations.
How Wikipedia uses bots and how bots use Wikipedia are extremely different, however.
For years, it has been clear that fledgling AI systems were being trained on the site's articles as part of the process whereby engineers scrape the web to create enormous data sets for that purpose.
In the early days of these models, about a decade ago, Wikipedia represented a large percentage of the scraped data used to train machines.
The encyclopedia was crucial not only because it's free and accessible, but also because it contains a motherlode of facts, and so much of its material is consistently formatted.
In more recent years, as so-called large language models, or LLMs, increased in size and functionality (these are the models that power chatbots like ChatGPT and Google's Bard), they began to take in far larger amounts of information.
In some cases, their meals added up to well over a trillion words.
The sources included not just Wikipedia, but also Google's patent database, government documents, Reddit's Q&A corpus, books from online libraries, and vast numbers of news articles on the web.
But while Wikipedia's contribution in terms of overall volume is shrinking, and even as tech companies have stopped disclosing what data sets go into their AI models, it remains one of the largest single sources for LLMs.
Jesse Dodge, a computer scientist at the Allen Institute for AI in Seattle, told me that Wikipedia might now make up between three and four percent of the scraped data an LLM uses for its training.
Wikipedia going forward will forever be super valuable, Dodge points out, because it's one of the largest well-curated data sets out there.
There is generally a link, he adds, between the quality of data a model trains on and the accuracy and coherence of its responses.
In this light, Wikipedia might be seen as a sheep caught in the jaws of a wolfish technology marketplace.
A free site created in achingly good faith (sharing knowledge is by nature an act of kindness, Wikimedia noted in 2017, on a page devoted to its strategic direction) is being devoured by companies whose objectives, like charging for subscriptions, as OpenAI recently began doing for its latest model, don't jibe with its own.
Yet, the relationships are more complicated than they appear.
Wikipedia's fundamental goal is to spread knowledge as broadly and freely as possible, by whatever means.
About 10 years ago, when site administrators focused on how Google was using Wikipedia, they were in a situation that presaged the advent of AI chatbots.
Google's search engine was able, at the top of its query results, to present Wikipedians' work to users all over the world, giving the encyclopedia far greater reach than before, an apparent virtue.
In 2017, three academic computer scientists, Connor McMahon, Isaac Johnson, and Brent Hecht, conducted an experiment that tested how random users would react if just part of the contributions made to Google's search results by Wikipedia were removed.
The academics perceived an extensive interdependence.
Wikipedia makes Google a significantly better search engine for many queries, and Wikipedia, in turn, gets most of its traffic from Google.
One upshot from the collision with Google and others who repurpose Wikipedia's content was the creation two years ago of Wikimedia Enterprise, a separate business unit that sells access to a series of application programming interfaces that provide accelerated updates to Wikipedia articles.
Depending on whom you ask, the Enterprise unit is either a more formalized way for tech companies to direct the equivalent of large charitable donations to Wikipedia (Google now subscribes, and altogether the unit took in $3.1 million in 2022), or a way for Wikipedia to recoup some of the financial value it creates for the digital world and thus help fund its future operations.
Practically speaking, Wikipedia's openness allows any tech company to access Wikipedia at any time, but the APIs make new Wikipedia entries almost instantly readable.
This speeds up what was already a pretty fast connection.
Andrew Lih, a consultant who works with museums to put data about their collections on Wikipedia, told me he conducted an experiment in 2019 to see how long it would take for a new Wikipedia article about a pioneering balloonist named Vera Simons to show up in Google search results.
He found the elapsed time was about 15 minutes.
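To make that openness concrete, here is a minimal sketch (a reader's illustration, not part of the article's reporting) that pulls an article summary from Wikipedia's free public REST API, the kind of unrestricted access the paid Enterprise feeds merely accelerate. The endpoint is real; the script and its User-Agent string are assumptions made for demonstration.

```python
import json
import urllib.request

def fetch_summary(title: str) -> str:
    """Return the plain-text lead summary of an English Wikipedia article."""
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
    # Wikimedia asks API clients to identify themselves with a User-Agent.
    req = urllib.request.Request(url, headers={"User-Agent": "example-reader/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["extract"]

if __name__ == "__main__":
    # The article on the balloonist from Andrew Lih's indexing experiment.
    print(fetch_summary("Vera_Simons"))
```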
Still, the close relationship between search engines and Wikipedia has raised some existential questions for the latter.
Ask Google, what is the Russo-Ukrainian war, and Wikipedia is credited, with some of its material briefly summarized.
But what if that makes you less likely to visit Wikipedia's article, which runs to some 10,000 words and contains more than 400 footnotes?
From the point of view of some of Wikipedia's editors, reduced traffic will oversimplify our understanding of the world and make it difficult to recruit a new generation of contributors.
It may also translate into fewer donations.
In the 2017 paper, the researchers noted that visits to Wikipedia had indeed begun to decline.
The phenomenon they identified became known as the paradox of reuse: the more Wikipedia's articles were disseminated through other outlets and media, the more imperiled was Wikipedia's own health.
With AI, this reuse problem threatens to become far more pervasive.
Aaron Halfaker, who led the machine-learning research team at the Wikimedia Foundation for several years, and who now works for Microsoft, told me that search-engine summaries at least offer users links and citations and a way to click back to Wikipedia.
The responses from large language models can resemble an information smoothie that goes down easy but contains mysterious ingredients.
The ability to generate an answer has fundamentally shifted, he says, noting that in a ChatGPT answer there is literally no citation and no grounding in the literature as to where that information came from.
He contrasts it with the Google or Bing search engines.
This is different.
This is way more powerful than what we had before.
Almost certainly, that makes AI both more difficult to contend with and potentially more harmful, at least from Wikipedia's perspective.
A computer scientist who works in the AI industry, but is not permitted to speak publicly about his work, told me that these technologies are highly self-destructive, threatening to obliterate the very content that they depend upon for training.
It's just that many people, including some in the tech industry, haven't yet realized the implications.
Wikipedia's most devoted supporters will readily acknowledge that it has plenty of flaws.
The Wikimedia Foundation estimates that its English-language site has about 40,000 active editors, meaning they make at least five edits a month to the encyclopedia.
According to recent data from the Wikimedia Foundation, about 80 percent of that cohort is male, and about 75 percent of those from the United States are white, which has led to some gender and racial gaps in Wikipedia's coverage.
And lingering doubts about reliability remain.
For a popular article that might have thousands of contributors, Wikipedia is literally the most accurate form of information ever created by humans, Amy Bruckman, a professor at the Georgia Institute of Technology, told me.
But Wikipedia's short articles can sometimes be hit or miss; they could be total garbage, says Bruckman, who is the author of the recent book Should You Believe Wikipedia?
An erroneous fact on a rarely visited page may endure for months or years.
And there continues to exist the ever-present threat of vandalism or tampering with an article.
In 2017, for instance, a photo of the Speaker of the House, Paul Ryan, was added to the entry on invertebrates.
As a Wikipedia editor whose first name is Jade put it to me, we have a number of, I would say, almost professional trolls who must dedicate just about as much time to creating spam, creating vandalism, harassing people, as I dedicate to improving Wikipedia.
Several academics told me that, whatever Wikipedia's shortcomings, they view the encyclopedia as a consensus truth, as one of them put it.
It acts as a reality check in a society where facts are increasingly contested.
The truth is less about data points (how old is Joe Biden?) than about complex events like the COVID-19 pandemic, in which facts are constantly evolving, frequently distorted, and furiously debated.
The truthfulness quotient is raised by Wikipedia's transparency.
Most Wikipedia entries include footnotes, links to source materials, and lists of previous edits and editors, and experienced editors are willing to intercede when an article appears incomplete or lacks what Wikipedians call verifiability.
Moreover, Wikipedia's guidelines insist that its editors maintain an NPOV, a neutral point of view, or risk being overruled, or, in the argot of wiki culture, reverted.
And the site has a bent toward self-examination.
You can find long disquisitions on Wikipedia that explore Wikipedia's own reliability.
An entry on how Wikipedia has fallen victim to hoaxes runs to more than 60 printed pages.
As difficult as the pursuit of truth can be for Wikipedians, though, it seems significantly harder for AI chatbots.
ChatGPT has become infamous for generating fictional data points or false citations, known as hallucinations.
Perhaps more insidious is the tendency of bots to oversimplify complex issues, like the origins of the Ukraine-Russia war, for example.
One worry about generative AI at Wikipedia, whose articles on medical diagnoses and treatments are heavily visited, is related to health information.
A summary of the March conference call captures the issue: We're putting people's lives in the hands of this technology.
For example, people might ask this technology for medical advice.
It may be wrong, and people will die.
This apprehension extends not just to chatbots, but also to new search engines connected to AI technologies.
In April, a team of Stanford University scientists evaluated four engines powered by AI, Bing Chat, NeevaAI, Perplexity AI, and YouChat, and found that only about half of the sentences generated by the search engines in response to a query could be fully supported by factual citations.
We believe that these results are concerningly low for systems that may serve as a primary tool for information-seeking users, the researchers concluded, especially given their facade of trustworthiness.
What makes the goal of accuracy so vexing for chatbots is that they operate probabilistically when choosing the next word in a sentence.
They aren't trying to find the light of truth in a murky world.
These models are built to generate text that sounds like what a person would say.
That's the key thing, Jesse Dodge says.
So they're definitely not built to be truthful.
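A toy sketch can show why that matters (the candidate words and probabilities below are invented for illustration; no real model is this simple): the system samples from a distribution over plausible next words, and nothing in that step enforces the true one.

```python
import random

# Invented, illustrative probabilities a model might assign to candidate
# next words after a prompt like "The first balloon crossing of the
# Atlantic was in" -- the numbers are made up for this sketch.
next_word_probs = {"1978": 0.6, "1987": 0.2, "1919": 0.15, "fiction": 0.05}

def sample_next_word(probs: dict[str, float]) -> str:
    """Pick one candidate word at random, weighted by its probability."""
    words, weights = zip(*probs.items())
    return random.choices(words, weights=weights, k=1)[0]

# Most samples return the likeliest word, but a fluent wrong answer
# can always be drawn.
print(sample_next_word(next_word_probs))
```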
I asked Margaret Mitchell, a computer scientist who studied the ethics of AI at Google, whether factuality should have been a more fundamental priority for AI.
Mitchell, who says she was fired from the company after criticizing the direction of its work (Google says she was fired for violating the company's security policies), said that most would find that logical.
This common-sense thing, shouldn't we work on making it factual if we're putting it forward for fact-based applications?
Well, I think for most people who are not in tech, it's like, why is this even a question?
But, Mitchell said, the priorities at the big companies, now in frenzied competition with one another, center on introducing AI products rather than on reliability.
The road ahead will almost certainly lead to improvements.
Mitchell told me that she foresees AI companies making gains in accuracy and reducing biased answers by using better data.
The state of the art until now has just been a laissez-faire data approach, she said.
You just throw everything in, and you're operating with a mindset where the more data you have, the more accurate your system will be, as opposed to the higher quality of data you have, the more accurate your system will be.
Jesse Dodge, for his part, points to an idea known as retrieval, whereby a chatbot will essentially consult a high-quality source on the web to fact-check an answer in real time.
It would even cite precise links, as some AI-powered search engines now do.
Without that retrieval element, Dodge says, I don't think there's a way to solve the hallucination problem.
Otherwise, he says, he doubts that a chatbot answer can gain factual parity with Wikipedia or the Encyclopedia Britannica.
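As a rough illustration of the retrieval idea Dodge describes (a sketch, not any vendor's actual implementation), the snippet below fetches a Wikipedia summary, folds it into the prompt, and appends the citation link; the generate function is a hypothetical stand-in for whatever language model would draft the reply.

```python
import json
import urllib.request

def retrieve(title: str) -> tuple[str, str]:
    """Fetch grounding text and a citation URL from Wikipedia's public summary API."""
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
    req = urllib.request.Request(url, headers={"User-Agent": "example-reader/0.1"})
    with urllib.request.urlopen(req) as resp:
        summary = json.load(resp)["extract"]
    return summary, f"https://en.wikipedia.org/wiki/{title}"

def generate(prompt: str) -> str:
    """Hypothetical stand-in for a real language-model call."""
    return "An answer drafted from the retrieved source text."

def answer_with_retrieval(question: str, title: str) -> str:
    source_text, source_url = retrieve(title)
    prompt = (
        "Answer using only the source below, and say so if it is insufficient.\n"
        f"Source: {source_text}\n"
        f"Question: {question}"
    )
    # The citation link is what distinguishes retrieval from a bare chatbot reply.
    return f"{generate(prompt)}\n[Source: {source_url}]"

print(answer_with_retrieval("What is the paradox of reuse?", "Wikipedia"))
```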
Market competition might help prompt improvement, too.
Owain Evans, a researcher at a nonprofit in Berkeley, California, who studies truthfulness in AI systems, pointed out to me that OpenAI now has several partnerships with businesses, and those firms will care greatly about responses achieving a high level of accuracy.
Google, meanwhile, is developing AI systems to work closely with medical professionals on disease detection and diagnostics.
There's just going to be a very high bar there, he adds, so I think there are incentives for the companies to really improve this.
At least for now, AI companies are focusing on what they call fine-tuning when it comes to factuality.
Sandhini Agarwal and Girish Sastry, researchers at OpenAI