
Tobi Waves



Finding Needles in a TB Haystack.

Monday, February 02, 2004 10:20 // Audi Max, ETH Zurich, Switzerland

A talk by Urs Hölzle, Vice President for Technology.

About Google

Mission: To organize the world's information and make it universally accessible and useful.

An international company: 250% traffic from outside US

The search engine has 4 billion pages in its index.

Profitable since Q1/2001.

23 office locations worldwide.

15k boxes, several TB disk storage

There are over 1000 queries per second on Dec 25th at 2 am.

Engineering offices in the US, Zurich and Bangalore.

About the Web

Static web: 167 TB in 11 giga pages. Dynamic web size: 92 PB. (Estimates)

1 in 4 hosts on the net run a webserver.

Problem: Data, users and hosts all grow exponentially. This means the problem of finding useful information grows exponentially too, which makes for interesting problems.

Google Infrastructure

A highly reliable system based on low-cost commodity hardware. Redundancy has to be built into both the software and the hardware. Monitoring, repairing and maintaining these boxes is a prime problem.

The Google File System (GFS)

Stripe files across many boxes and replicate them on multiple servers.

Components: Master - keeps the directory and plans the file layout. ChunkServers - hold the data. Clients - use the data. (Chunk size is 64 MB. Data is cached on the client once retrieved. SOSP'03 (www.cs.rochester.edu ...) )
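
The striping and replication described above can be sketched roughly as follows. This is only an illustration, not GFS's actual placement logic: the round-robin placement, the default of 3 replicas and the names chunk_index and place_chunks are assumptions made for this example. Only the 64 MB chunk size comes from the talk.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size given in the talk


def chunk_index(byte_offset):
    """Return which chunk of a file holds the given byte offset."""
    return byte_offset // CHUNK_SIZE


def place_chunks(file_size, servers, replicas=3):
    """Assign each chunk of a file to `replicas` distinct servers.

    Round-robin placement and the replica count are assumptions of this
    sketch; a real master would also weigh disk usage and load.
    """
    if replicas > len(servers):
        raise ValueError("need at least as many servers as replicas")
    num_chunks = max(1, -(-file_size // CHUNK_SIZE))  # ceiling division
    layout = {}
    for chunk in range(num_chunks):
        layout[chunk] = [servers[(chunk + r) % len(servers)]
                         for r in range(replicas)]
    return layout
```

A client that wants byte N of a file would first compute chunk_index(N) and then ask the master which servers hold that chunk. With every chunk on several boxes, losing one box does not lose data.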

10+ Clusters of 1000+ boxes.

350 TB Filesystem

How to be a Search Engine

Crawling: A recursive process. Problems: dynamic pages, slow servers, management of the link list, session IDs in the URL, how to prioritize the URLs, being nice to the web servers, detecting duplicates, avoiding traps, actively filling in forms to pull "hidden" content, figuring out when a page needs to be re-crawled.
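
A few of these crawling problems (the link list, session IDs in URLs, duplicate detection, a page limit as a crude guard against traps) can be illustrated with a small sketch. It is not Google's crawler: the session parameter names, the exact-hash duplicate check and the fetch callback are all assumptions chosen to keep the example short and runnable without network access.

```python
import hashlib
from collections import deque
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names treated as session IDs in this sketch.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def normalize(url):
    """Drop session-ID parameters and fragments so equal pages get equal URLs."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path or "/",
                       urlencode(query), ""))


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` must return (text, list_of_links)."""
    frontier = deque(normalize(u) for u in seeds)  # the link list
    seen_urls = set(frontier)
    seen_content = set()
    pages = {}
    while frontier and len(pages) < max_pages:  # page limit guards against traps
        url = frontier.popleft()
        text, links = fetch(url)
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen_content:
            continue  # same content already stored under another URL
        seen_content.add(digest)
        pages[url] = text
        for link in links:
            link = normalize(link)
            if link not in seen_urls:
                seen_urls.add(link)
                frontier.append(link)
    return pages
```

Prioritizing URLs, politeness delays per host and re-crawl scheduling are left out here. Each would replace the plain queue with something smarter.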

Indexing: Words are indexed by document and by position in the document. The index holds one tera (10^12) words.
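
An index keyed by word, then by document, then by position is commonly called a positional inverted index. A minimal sketch, assuming whitespace tokenization and in-memory dictionaries (neither of which is from the talk), might look like this:

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index


def phrase_match(index, first, second):
    """Return the doc IDs in which `second` directly follows `first`."""
    hits = set()
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.add(doc_id)
    return hits
```

Storing positions is what makes phrase queries and word-proximity scoring possible. A plain word-to-document map could not tell adjacent words from distant ones.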

Ranking: A hard problem. The traditional assumptions of search, that documents are long, coherent and of high quality, do not hold for web documents. Google's idea is to define a PageRank that measures the importance of a page. The PageRank of a page is the sum of the contributions from the pages pointing to it, where each page contributes its own PageRank divided by its number of out-links to each of its target pages. In reality it is more complex: Google's real ranking function has about 100 factors, such as font size, color and proximity to other words.
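
The simplified rule above can be computed by repeated iteration. The sketch below implements only that rule. The fixed iteration count and the handling of pages without out-links are assumptions of this example, and the damping factor used in the published PageRank formulation is deliberately left out because the talk notes do not mention it.

```python
def pagerank(links, iterations=100):
    """Iterate the simplified PageRank rule from the talk.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Each page gives rank / out-links to every page it points to.
                share = rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page without out-links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += rank[page] / len(pages)
        rank = new_rank
    return rank
```

For the small graph A -> B, A -> C, B -> C, C -> A the ranks settle at 0.4 for A and C and 0.2 for B. C collects rank from two pages, and A inherits all of C's rank through C's single out-link.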

Serving: Partition the data across different servers and have each solve a sub-problem of every query. A query goes to the Google Web Server, which queries the Index Farm and then accesses the Doc Farm for the actual data. Additional services come from the Ad Server and the Spelling Server. IEEE Micro, 2003 has more on the structure (www.computer.org ...).
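
The partitioning idea can be sketched as a scatter-gather query: every index partition answers for its own documents, and the front end merges the partial answers before fetching the documents. Everything concrete here is an assumption for illustration, including partitioning by document ID modulo the shard count, the precomputed per-word scores and the function names.

```python
import heapq


def shard_for(doc_id, num_shards):
    """Assumed partitioning rule: documents are spread over shards by ID."""
    return doc_id % num_shards


def build_shards(scored_docs, num_shards):
    """Split {doc_id: {word: score}} into per-shard word -> [(score, doc_id)]."""
    shards = [dict() for _ in range(num_shards)]
    for doc_id, words in scored_docs.items():
        shard = shards[shard_for(doc_id, num_shards)]
        for word, score in words.items():
            shard.setdefault(word, []).append((score, doc_id))
    return shards


def search(shards, doc_store, word, k=3):
    """Ask every shard for its local top-k, merge, then fetch the documents."""
    partial = []
    for shard in shards:  # each shard solves its sub-problem of the query
        partial.extend(heapq.nlargest(k, shard.get(word, [])))
    best = heapq.nlargest(k, partial)
    # doc_store stands in for the Doc Farm that holds the actual data.
    return [(doc_id, doc_store[doc_id]) for _score, doc_id in best]
```

Because each shard returns its own top k, the merged top k is the same as it would be over a single large index. The shards are visited in a loop here, whereas separate servers could work on the same query at the same time.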

Advertising: Find the best ad relevant to the query. This is a very important problem, as advertising is the main source of revenue. Only ads that have a chance of being clicked are shown; if an ad's click-through rate is low, it is dropped. Advertisers pay only for ads that are actually clicked.

Google Playground

There is lots of data and computing infrastructure at Google. Google pays people to spend their time figuring out new ways to analyze and present this data: (labs.google.co ...)

 
