How Big Is Google Scholar? 500 million Documents?

For years, the mother ship Google has outperformed other search engines due to its speed, size – 8 billion documents, at last count – and comprehensive coverage. Searchers vote with their keywords and keyboards – Google is still #1.

On the eve of its seventh anniversary last fall, Google curiously dropped its homepage boast of having the largest index of the Web. The idea seems to be that as the Web scales up, crawling unique documents from the dark, deep edges of the Web is what counts, more than any boast about size and numbers. The size of the web indexed by Google is irrelevant, right?

Well, yes and no. From an information specialist’s perspective, the size of a general database is an important metric for searchers. A specialty medical database like PubMed indexes the best content in a tightly controlled 16 million citations; Scirus, with its broader coverage of peer-reviewed scientific journal literature and web content, comes in at 250 million documents.

Windows Live Academic Search beta was announced less than a month ago (with only a sliver of content in computer science, physics and electrical engineering):

“.. as of launch date, [Academic Search] has deep content … – with more than 6 million records from approximately 4300 journals and 2000 conferences.”

Take note, Google Scholar. Size is important – and information retrieval experts like to have these facts made known to us. We estimate Google Scholar weighs in at 0.5 billion documents and webpages, which would make it the largest scholarly index ever built.

500 million (1/2 billion) documents in Google Scholar, searchable in seconds. Isn’t that worth boasting about?

Dean Giustini
UBC Search-Scholar blogger

4 replies on “How Big Is Google Scholar? 500 million Documents?”

Dean,

I don’t like to disagree, but I think size is a poor metric. PubMed has 16 million documents – how does that help a searcher? Would it make much difference if it had 1 million or 1 billion? When you’re looking for a limited number of results (say fewer than 25), all you know is that 99.99999999999% of the database is pointless.

I imagine Google dropped their count as they realised it conveyed little useful information.

Hi Jon,
You are right, of course – it matters little what size any database is, except perhaps that size conveys quantitative information to the searcher.

I also link this issue to Microsoft’s and Scirus’ [Elsevier’s] transparency around coverage, because Google is so coy. Notice that Windows Live Academic Search states its size, coverage and scope up front, even though in beta this is somewhat negative for them due to the small size of its product at this point.

Thank you for your opinion. Will TRIP be more transparent in the future? I hope so.

cheers!

Dean Giustini
UBC Biomedical Branch Librarian
Vancouver General Hospital (700 W. 10th)
move website: http://www.library.ubc.ca/bmb/new_bmb/
604-875-4505

Of course size matters for usability!
A database covering all academic topics on the Internet with, let’s say, only 10,000 records is simply less valuable than a database with 1 billion records, even more so when you don’t know anything about the selection criteria.
You may have a very restricted topic, where 500 records will give complete coverage for a given period. But for a database covering all topics for all (?) times worldwide, SIZE MATTERS.
Yours sincerely,
Dorte Nielsen MLIS
Royal School of Library and Information Science, Denmark

Cool article. Out of interest, how did you estimate that Google Scholar’s index was 500 million documents? According to this report by the STM Association, only 1.5 million articles are published per year:
http://www.stm-assoc.org/2009_10_13_MWC_STM_Report.pdf

It would take about 333 years of publishing at that rate (500 million ÷ 1.5 million per year) to get to an index of 500 million papers. There are presumably some non-papers in the Google Scholar index, but not that many.
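
The division behind this back-of-the-envelope check can be reproduced in a few lines. This is only a minimal sketch using the two figures quoted in this thread (a 500 million document estimate and 1.5 million articles per year); the function name `years_to_reach` is hypothetical, and the constant publishing rate is a simplifying assumption, since output in earlier decades was certainly lower.

```python
def years_to_reach(index_size: float, articles_per_year: float) -> float:
    """Years of publishing at a constant rate needed to reach index_size."""
    if articles_per_year <= 0:
        raise ValueError("articles_per_year must be positive")
    return index_size / articles_per_year


# Figures taken from the post and the comment; both are estimates.
INDEX_SIZE = 500_000_000       # estimated Google Scholar documents
ARTICLES_PER_YEAR = 1_500_000  # articles per year, as quoted from the STM report

years = years_to_reach(INDEX_SIZE, ARTICLES_PER_YEAR)
print(f"{years:.0f} years")  # prints "333 years"
```

At exactly 1.5 million articles per year the quotient comes to about 333 years; a figure in the 350s corresponds to a rate nearer 1.4 million per year. Either way, the gap between the two numbers is what the question turns on.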
