As regulators seek ways to curb the company’s power, there is more focus on the vast index — hundreds of billions of web pages — behind its search engine.
OAKLAND, Calif. — In 2000, just two years after it was founded, Google reached a milestone that would lay the foundation for its dominance over the next 20 years: It became the world’s largest search engine, with an index of more than one billion web pages.
The rest of the internet never caught up, and Google’s index just kept on getting bigger. Today, it’s somewhere between 500 billion and 600 billion web pages, according to estimates.
Now, as regulators around the world examine ways to curb Google’s power, including a search monopoly case expected from state attorneys general as early as this week and the antitrust lawsuit the Justice Department filed in October, they are wrestling with a company whose sheer size has allowed it to squash competitors. And those competitors are pointing investigators toward that enormous index, the gravitational center of the company.
“If people are on a search engine with a smaller index, they’re not always going to get the results they want. And then they go to Google and stay at Google,” said Matt Wells, who started Gigablast, a search engine with an index of around five billion web pages, about 20 years ago. “A little guy like me can’t compete.”
Understanding how Google’s search works is a key to figuring out why so many companies find it nearly impossible to compete and, in fact, go out of their way to cater to its needs.
Every search request provides Google with more data to make its search algorithm smarter. Google has performed so many more searches than any other search engine that it has established a huge advantage over rivals in understanding what consumers are looking for. That lead only continues to widen, since Google has a market share of about 90 percent.
Google directs billions of users to locations across the internet, and websites, hungry for that traffic, create a different set of rules for the company. Websites often provide greater and more frequent access to Google’s so-called web crawlers — computers that automatically scour the internet and scan web pages — allowing the company to offer a more extensive and up-to-date index of what is available on the internet.
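At its core, a web crawler does something simple: it fetches a page, collects the links on it, and queues those links to fetch next. The sketch below, using only the Python standard library, shows that link-discovery step; it is an illustration, not Google's code, and the sample page and URLs are invented for the example. A production crawler like Googlebot adds politeness delays, robots.txt checks, deduplication, and enormous parallelism on top of this loop.

```python
# A minimal sketch of the core of a web crawler: extract every link on
# a page so it can be queued for fetching next. Stdlib only; the sample
# HTML and URLs below are hypothetical.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

# In a real crawler the HTML would come from an HTTP fetch; a small
# inline sample keeps the sketch self-contained.
sample_html = '<a href="/about">About</a> <a href="https://example.org/">Out</a>'
parser = LinkExtractor("https://example.com/")
parser.feed(sample_html)
print(parser.links)  # → ['https://example.com/about', 'https://example.org/']
```

Each discovered link becomes a new page to fetch, which is why a site's willingness to answer those fetches determines how complete a search engine's index can be.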
When he was working at the music site Bandcamp, Zack Maril, a software engineer, became concerned about how Google’s dominance had made it so essential to websites.
In 2018, when Google said its crawler, Googlebot, was having trouble with one of Bandcamp’s pages, Mr. Maril made fixing the problem a priority because Google was critical to the site’s traffic. When other crawlers encountered problems, Bandcamp would usually block them.
Mr. Maril continued to research the different ways that websites opened doors for Google and closed them for others. Last year, he sent a 20-page report, “Understanding Google,” to a House antitrust subcommittee and then met with investigators to explain why other companies could not recreate Google’s index.
“It’s largely an unchecked source of power for its monopoly,” said Mr. Maril, 29, who works at another technology company that does not compete directly with Google. He asked that The New York Times not identify his employer since he was not speaking for it.
A report this year by the House subcommittee cited Mr. Maril’s research on Google’s efforts to create a real-time map of the internet and how this had “locked in its dominance.” While the Justice Department is looking to unwind Google’s business deals that put its search engine front and center on billions of smartphones and computers, Mr. Maril is urging the government to intervene and regulate Google’s index. A Google spokeswoman declined to comment.
Websites and search engines are symbiotic. Websites rely on search engines for traffic, while search engines need access to crawl the sites to provide relevant results for users. But every crawler strains a website's resources, adding server and bandwidth costs, and some crawlers are so aggressive that they behave like security threats capable of taking a site down.
Since having their pages crawled costs money, websites have an incentive to let it be done only by search engines that direct enough traffic to them. In the current world of search, that leaves Google and — in some cases — Microsoft’s Bing.
Google and Microsoft are the only search engines that spend hundreds of millions of dollars annually to maintain a real-time map of the English-language internet. That’s in addition to the billions they’ve spent over the years to build out their indexes, according to a report this summer from Britain’s Competition and Markets Authority.
Google has a significant leg up on Microsoft in more than just market share. British competition authorities said Google's index included about 500 billion to 600 billion web pages, compared with 100 billion to 200 billion for Microsoft.
Other large tech companies deploy crawlers for other purposes. Facebook has a crawler for links that appear on its site or services. Amazon says its crawler helps improve its voice-based assistant, Alexa. Apple has its own crawler, Applebot, which has fueled speculation that it might be looking to build its own search engine.
But indexing has always been a challenge for companies without deep pockets.
The privacy-minded search engine DuckDuckGo decided to stop crawling the entire web more than a decade ago and now syndicates results from Microsoft. It still crawls sites like Wikipedia to provide results for answer boxes that appear in its results, but maintaining its own index does not usually make financial sense for the company.
“It costs more money than we can afford,” said Gabriel Weinberg, chief executive of DuckDuckGo. In a written statement for the House antitrust subcommittee last year, the company said that “an aspiring search engine start-up today (and in the foreseeable future) cannot avoid the need” to turn to Microsoft or Google for its search results.
When FindX started to develop an alternative to Google in 2015, the Danish company set out to create its own index and offered a build-your-own algorithm to provide individualized results.
FindX quickly ran into problems. Large website operators, such as Yelp and LinkedIn, did not allow the fledgling search engine to crawl their sites. Because of a bug in its code, FindX's crawling computers were flagged as a security risk and blocked by a group of the internet's largest infrastructure providers. And many of the pages it did manage to collect were spam or malicious.
“If you have to do the indexing, that’s the hardest thing to do,” said Brian Schildt Laursen, one of the founders of FindX, which shut down in 2018.
Mr. Schildt Laursen launched a new search engine last year, Givero, which offered users the option to donate a portion of the company’s revenue to charitable causes. When he started Givero, he syndicated search results from Microsoft.
Most large websites are judicious about who can crawl their pages. In general, Google and Microsoft get more access because they have more users, while smaller search engines have to ask for permission.
“You need the traffic to convince the websites to allow you to copy and crawl, but you also need the content to grow your index and pull up your traffic,” said Marc Al-Hames, a co-chief executive of Cliqz, a German search engine that closed this year after seven years of operation. “It’s a chicken-and-egg problem.”
In Europe, a group called the Open Search Foundation has proposed a plan to create a common internet index that can underpin many European search engines. It’s essential to have a diversity of options for search results, said Stefan Voigt, the group’s chairman and founder, because it is not good for only a handful of companies to determine what links people are shown and not shown.
“We just can’t leave this to one or two companies,” Mr. Voigt said.
When Mr. Maril started researching how sites treated Google’s crawler, he downloaded 17 million so-called robots.txt files — essentially rules of the road posted by nearly every website laying out where crawlers can go — and found many examples where Google had greater access than competitors.
ScienceDirect, a site for peer-reviewed papers, permits only Google’s crawler to have access to links containing PDF documents. Only Google’s computers get access to listings on PBS Kids. On Alibaba.com, the U.S. site of the Chinese e-commerce giant Alibaba, only Google’s crawler is given access to pages that list products.
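Rules like these live in a site's robots.txt file, where access can be granted to one crawler by name and denied to everyone else. The snippet below is a hypothetical robots.txt in that pattern (the paths and the "SmallSearchBot" agent name are invented for illustration), checked with Python's standard-library robots.txt parser.

```python
# A hypothetical robots.txt in the pattern Mr. Maril documented:
# Googlebot may crawl a PDF directory, every other crawler is shut out.
# The paths and agent names are invented for this illustration.
import urllib.robotparser

robots_txt = """\
User-agent: Googlebot
Allow: /papers/pdf/

User-agent: *
Disallow: /papers/pdf/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Googlebot", "/papers/pdf/study.pdf"))       # True
print(rp.can_fetch("SmallSearchBot", "/papers/pdf/study.pdf"))  # False
```

Honoring these rules is voluntary, but reputable crawlers do, which is how a few lines of plain text can lock smaller search engines out of large parts of the web.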
This year, Mr. Maril started an organization, the Knuckleheads’ Club (“because only a knucklehead would take on Google”), and a website to raise awareness about Google’s web-crawling monopoly.
“Google has all this power in society,” Mr. Maril said. “But I think there should be democratic — small d — control of that power.”