Google Plunders the Web

2024-04-02: Google has changed its operations since the original publication of this post. Now it does not show (and, possibly, does not steal) pages excluded in robots.txt. It continues stealing the same content when re-published on other sites. In any case, robots.txt does not grant any copyright permissions; it is merely a technical file, as explained below.

Google and other Big Tech companies were mostly honest, value-creating enterprises until around 2008. The main factor behind Big Tech’s wealth, and the collapse of honest journalism and civil society, was Google and Microsoft’s plundering of content from millions of websites with impunity. Here, I am focused on text-based content, like news, commentary, scholarly and scientific works—in other words, the works that contain or create human knowledge.

In accordance with Article I, Section 8 of the US Constitution and 17 U.S. Code § 106, authors, editors, and publishers have exclusive rights to their works, including (1) making copies; (2) making derivatives; and (3) distributing the copies. This copyright protection applies fully to web content. Web content is intended for viewing by other individuals; this is an implied license. Google is not an individual, but rather a huge corporation. When Google bots crawl and index websites, they make copies and derivatives of every post, page, and article they find. This copyright violation happens even when Google never shows the content, or a link to it, to any search user.

Copyright on the internet carries bad connotations because of the overreach of the movie and recording industries, but reasonable copyright laws are no less necessary for free speech than the First Amendment. I am not a copyright maximalist. The Internet allows many ways to reuse authored material, including by crawlers and other automatic tools, and copyright law permits transformative and other fair uses. But the wholesale theft by trillion-dollar monopolies, followed by suppression of the same authors and publications from whom they have stolen, must be stopped. Many liberal authors and journalists, at least those who do not pick up crumbs after Google, would agree with that.

Plundering Techniques

For every page of every website that a Google bot can reach, it makes a copy and stores it on a Google server, without a license. Then Google’s software makes a few thousand copies and distributes them around the world. Then it creates multiple derivatives of each page’s contents (including short phrases), inserts them into Google’s index, and distributes them over thousands of servers. Next, links are extracted and added to Google’s matrix of links. Google maintains many billions of such matrices, which are also distributed over many computers around the world. Then Google uses the page to train the search engine’s artificial intelligence. Google uses these derivatives in its other products, such as YouTube and Google Maps. Doing this to all websites, without regard to authors’ rights, allows Google to claim that it has information from all over the web and to hint that websites which are downranked or unavailable in Google search are low quality, deceitful, or “misinformation”.
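
To make the sequence concrete, here is a minimal sketch in Python of the per-page operations just described: storing a copy, building an index derivative, and extracting links for a link matrix. The sample page and every name in it are made up for illustration; this is only the shape of the operations, not anyone’s actual code.

import re
from html.parser import HTMLParser

page_url = "https://example.com/article"      # hypothetical page address
page_html = ('<html><body><p>Copyrighted article text.</p>'
             '<a href="https://example.com/next">next</a></body></html>')

# 1. A verbatim copy of the page is stored.
stored_copies = {page_url: page_html}

# 2. A derivative: an inverted index mapping each word to the pages containing it.
page_text = re.sub(r"<[^>]+>", " ", page_html)
inverted_index = {}
for word in re.findall(r"[a-z]+", page_text.lower()):
    inverted_index.setdefault(word, set()).add(page_url)

# 3. Another derivative: outgoing links, the raw material of a link matrix.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

extractor = LinkExtractor()
extractor.feed(page_html)
link_graph = {page_url: extractor.links}

print(sorted(inverted_index))   # ['article', 'copyrighted', 'next', 'text']
print(link_graph)               # {'https://example.com/article': ['https://example.com/next']}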

Add to this the Google Books project, in which Google has, to date, scanned and inserted into its database tens of millions of printed books with the assistance of state universities. Google does not passively hold these copies; it indexes them and extracts knowledge from them. This is what makes Google so powerful: it plunders the world’s knowledge and then controls access to it (to some extent).

By plundering content and selectively providing access to websites, Google acts as a distributor, taking most of the value from the content it holds. Using this advantage, Google competes against real-world publishers, editors, authors, critics, librarians, educators, and most other knowledge workers, including scientists and medical doctors. Google’s predation has led to the collapse of many knowledge industries, starting with the news and reaching scientific publishing and the medical profession by the beginning of the pandemic. Only servile outlets, funded by governments or picking up crumbs from Google and Facebook, are allowed to thrive.

Field v. Google, 2006

Unfortunately, on many occasions, courts have sided with Google against copyright owners. Most of these decisions cite the 2006 case Field v. Google, heard in the District of Nevada. That decision was based on a grotesquely incorrect determination of the technical facts, taken from the testimony of Google’s expert; the plaintiff was clueless. For example, the robots.txt file is not a license and cannot be interpreted as one. Today, Google ignores even the robots.txt file.

Nevertheless, the idea that Google’s search operations were within the bounds of the fair use affirmative defense was reasonable in 2006. In prior years, most of the web’s content was either academic or promotional, and no one seemed to lose anything from Google’s actions. Most people maintained their subscriptions to printed newspapers and magazines, despite the availability of some online content; 2003 was a record-setting year for the newspaper industry. That changed within a few years, when most knowledge work moved online. Google accumulated more content and began to compete with publishers by providing knowledge panels and promoting Wikipedia in its search results. Soon, it became next to impossible to monetize even high-quality online content. In 2009, when the news industry was dying, it was able to point the finger at Google, but it failed to use copyright to defend itself. The MSM elected to join the Marxist-led Obamanet coalition, putting the power of the state behind Big Tech. Eventually, the New York Times became a fake news outlet. The production of fake news is cheap, and Google throws crumbs (in cash and clicks) to fake news outlets, which keeps them happy.

One of the few people who attempted to resist Google’s power was Rupert Murdoch, the owner of Fox News and the WSJ. In 2009, he attempted to negotiate a deal with Microsoft that would give its Bing exclusive search rights to Fox News and WSJ content. I can only guess what or who interfered with that sound deal.

Downranking search results and denying ad revenue to some of the plundered websites need no special mention.

Here, “plunder” is more accurate than “copyright infringement” because there is a physical element to it. Obamanet (2015) prevented publishers from physically protecting their work by selling access to it directly to users through ISPs. Google was the main power behind Obamanet.

Conclusion

Microsoft takes part in plundering knowledge workers via its search engine Bing, but Google has about 80% of the search market. Yahoo combines results from Google and Bing. Google pays Apple ~$9B annually for installing its search on iPhones and iPads.

Facebook and Twitter have benefitted from the collapse of the traditional authors–editors–publishers–distributors business model on the Internet. Fleeing Google’s plunder, many authors have joined Facebook and Twitter, which make money from their unpaid labor, too.

Researchers, authors, and editors create valuable information. Publishers, taken together, organize it. Google takes their work without permission or payment and collects all the money, while publishers and authors get nothing. I am surprised that publishers who are not living off Google’s crumbs do not challenge Google’s corrupt and illegal behavior in the courts.

Technical Details

Plundering Books

The University of California invited Google to digitize, index, and insert into its database millions of books, in violation of the rights of publishers and authors. More universities joined in. This privilege was granted exclusively to Google. After protracted litigation, the courts allowed this corrupt deal to stand. The University of California is a state body. It is governed by regents, a supermajority of whom are appointed by the state governor; the California governor and lieutenant governor are UC regents ex officio. It seems that the involvement of the state government in the Google Books project makes Google a state actor in regard to all of the books’ content and its derivatives, including Google web search and YouTube.

From Some Fear Google’s Power in Digital Books (NYT, 2009):

““Google will enjoy what can only be called a monopoly, a monopoly of a new kind, not of railroads or steel but of access to information,” Mr. Darnton writes.”

Robert Darnton was the head of the Harvard library system. He was right on the money.

Plundering Science

Most peer-reviewed scientific journals require a subscription. A common courtesy is to allow the authors to upload their articles to personal or university websites in order to share their work with colleagues (not with Google). The scientific journals retain the copyrights. Despite this, Google copies the copyrighted content, makes multiple derivatives, and uses the references to build its internal relationship graphs and to train its AI. Google even advises authors on how to help it steal the content:

“If you’re an individual author, it works best to simply upload your paper to your website, … and add a link to it on your publications page …”

Then Google proceeds to arbitrarily de-platform scientists from YouTube and to downrank them in its search results.

Robots.txt

There is a false belief that a website’s robots.txt file gives search engines a license to use the sections of that website which are Allowed or not Disallowed. This is not true. Each website’s robots.txt file contains only advisory instructions for crawlers. It contains no information about the rights of the content owners (who are frequently not the website owners). It also contains no information about the companies that operate the crawlers, nor anything that can be interpreted as a license. The robots.txt format and protocol were first created in 1994, with the purpose of saving the resources of both the crawler and the web server being crawled.

“In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren’t welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren’t suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).”
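
To see how little robots.txt actually says, consider a minimal sketch using Python’s standard urllib.robotparser. The rules below are a made-up example, not any real site’s file: the file only answers the question “may this agent fetch this path?”, grants no rights, and is enforced by nothing except the crawler’s own code.

from urllib import robotparser

example_rules = """
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(example_rules.splitlines())

# A compliant crawler asks before fetching; an unscrupulous one simply does not ask.
print(rp.can_fetch("ExampleBot", "https://example.com/public/page.html"))   # True
print(rp.can_fetch("ExampleBot", "https://example.com/private/page.html"))  # False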

The protocol used with robots.txt is sometimes called the Robots Exclusion Protocol. Google admits that its bots ignore this protocol and robots.txt when a page is reached through links from external sites. If the excluded page is returned in a search result, Google displays the message “No information is available for this page. Learn why.” Google tells webmasters to insert a ‘noindex’ attribute in the files that the website owner wants Google not to display, but that ‘noindex’ attribute applies to all crawlers, not just Google’s. More from the horse’s mouth: 1, 2. Even if Google respected robots.txt, an author publishing on websites he does not control cannot opt out of Google’s plunder. Further, a website and its authors are subject to “the wrath of Google”: lower visibility across the web and a higher rate of negative information in Google search results.
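
The ‘noindex’ directive mentioned above is simply a marker placed inside the page itself, readable by any crawler that cares to look, which is why it is not a Google-specific opt-out. A minimal sketch (the HTML is a made-up example page):

from html.parser import HTMLParser

class NoindexChecker(HTMLParser):
    """Detects the generic robots 'noindex' meta directive inside a page."""
    def __init__(self):
        super().__init__()
        self.noindex = False
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "meta"
                and (a.get("name") or "").lower() == "robots"
                and "noindex" in (a.get("content") or "").lower()):
            self.noindex = True

checker = NoindexChecker()
checker.feed('<html><head><meta name="robots" content="noindex"></head><body>...</body></html>')
print(checker.noindex)  # True: a compliant crawler should not list this page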

The presence of a copyright symbol, © or (C), anywhere on a page expressly tells Google that someone other than Google holds the copyright to its content. If anyone wanted to give Google a copyright license over any content, there are software-readable license formats, like those from Creative Commons, which specify such license terms.
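
For contrast, an actual machine-readable grant looks nothing like a robots.txt rule. The sketch below, using a made-up page snippet, shows how the Creative Commons convention of linking to the license deed with rel="license" can be read by software; the class and variable names are illustrative.

from html.parser import HTMLParser

class LicenseFinder(HTMLParser):
    """Collects rel="license" links, the Creative Commons way of marking license terms."""
    def __init__(self):
        super().__init__()
        self.licenses = []
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag in ("a", "link") and "license" in (a.get("rel") or "").lower().split():
            self.licenses.append(a.get("href"))

finder = LicenseFinder()
finder.feed('<p>Text of the work. '
            '<a rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/">'
            'CC BY-NC 4.0</a></p>')
print(finder.licenses)  # ['https://creativecommons.org/licenses/by-nc/4.0/']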

Twitter changed its robots.txt file in mid-October 2021, purporting to disallow all bots. This is the earlier version: TWTR-robots-2021-10-09.txt

The following is the YouTube robots.txt (www.YouTube.com/robots.txt), as of October 21, 2021. You can see that it is a joke.

# robots.txt file for YouTube
# Created in the distant future (the year 2000) after
# the robotic uprising of the mid 90's which wiped out all humans.

User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /channel/*/community
Disallow: /comment
Disallow: /get_video
Disallow: /get_video_info
Disallow: /get_midroll_info
Disallow: /live_chat
Disallow: /login
Disallow: /results
Disallow: /signup
Disallow: /t/terms
Disallow: /timedtext_video
Disallow: /user/*/community
Disallow: /verify_age
Disallow: /watch_ajax
Disallow: /watch_fragments_ajax
Disallow: /watch_popup
Disallow: /watch_queue_ajax

Sitemap: https://www.youtube.com/sitemaps/sitemap.xml
Sitemap: https://www.youtube.com/product/sitemap.xml

Originally published on November 1, 2021