Monday, January 26, 2015

The Wayback Machine

Sherman and Mr. Peabody enter the WABAC machine

A giant digital archive of the World Wide Web is collected and stored in the Wayback Machine, which is run by an organization called the Internet Archive. The name of the Wayback Machine actually came from the 1960s Mr. Peabody cartoon on The Rocky and Bullwinkle Show, where the time machine was spelled WABAC.

Anyway, this Web archiving project started before most of us even thought much about the Internet.  In 1996, Brewster Kahle, a digital librarian and advocate of universal access to all information, established a non-profit company called the Internet Archive. The Wayback Machine went to work, archiving every cached website it could find.  

Many people believe that once something is "on the Internet" it's there forever. Not true. Websites are removed. Organizations and companies dissolve, and their data disappears along with them. Information is constantly being removed, changed, edited and overwritten. The Internet Archive allows users anywhere to see archived versions (snapshots, really) of web pages across time, even if the pages no longer exist on the live Web.

How many web pages?
More than four hundred and thirty billion pages. That would be 20,000,000,000,000,000 bytes and counting. Twenty "petabytes" of data.
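For anyone who wants to check the arithmetic, here is a small sketch using only the two figures quoted above. The per-page average is my own rough back-of-the-envelope number, and it assumes (perhaps wrongly) that all twenty petabytes are web pages.

```python
# Figures quoted in the post above
pages = 430_000_000_000                 # "more than four hundred and thirty billion pages"
total_bytes = 20_000_000_000_000_000    # the byte count given above

BYTES_PER_PETABYTE = 10**15             # decimal (SI) petabyte

# Convert the byte count to petabytes
petabytes = total_bytes // BYTES_PER_PETABYTE

# Rough average size of one archived page, ASSUMING every byte is a web page
avg_bytes_per_page = total_bytes / pages

print(f"{petabytes} petabytes")
print(f"roughly {avg_bytes_per_page / 1000:.1f} kB per archived page")
```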


The Internet Archive isn't a secret server farm out on the prairie, but a grand place you can visit like a library. Since 2009, its headquarters have been located in San Francisco in a Greek Revival building that was formerly a Christian Science church. Now towers of computers are working round-the-clock on the top level.

The Wayback Machine is a robot. It crawls the Internet, attempting to make a copy of every website it finds, about every two months. Some sites (like major newspapers) might be copied several times a day, others much less often. It isn't entirely random, because the Wayback Machine is also filled with Web pages chosen by librarians. In fact, anyone can add a webpage. There are other Internet archives out there, but the Wayback Machine is by far the biggest and the best. If it isn't on there, it probably doesn't exist.

Out of professional curiosity, of course I had to see if Feathers and Flowers was captured by the Wayback Machine robot.  (OK, I have a streak of narcissism.)

Searching the Wayback Machine doesn't work like Google, which uses something called "relevancy ranking."  In simple terms, that means the more popular you are, the more popular you become on Google. Subject searching isn't possible on the Internet Archive. There are over 400 billion web pages and not enough librarians in the world to do the cataloging and indexing for a "database" that large!

As far as I know, the only way to access the Internet Archive is by entering the URL of a website, in my case:
http://sue-feathersandflowers.blogspot.com/

And?
And there it was. Feathers and Flowers was saved four times between May 2013 and June 2014.
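For the technically curious, the same lookup can be sketched in a few lines of code. This is only a sketch based on my understanding of the Internet Archive's public address conventions: an archived copy lives at web.archive.org/web/ followed by a date and the original URL, and there is an "availability" address that reports the closest snapshot. The function names are made up for illustration, and the date below is an arbitrary example, not one of the four actual capture dates. Nothing here contacts the Internet; it only builds the addresses.

```python
from urllib.parse import urlencode

# Assumed public address patterns (not confirmed by the article)
WAYBACK_PREFIX = "https://web.archive.org/web"
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def snapshot_url(page_url, timestamp):
    """Build the address of the archived copy closest to `timestamp` (YYYYMMDD)."""
    return f"{WAYBACK_PREFIX}/{timestamp}/{page_url}"


def availability_query(page_url, timestamp):
    """Build the address that asks which snapshot is closest to `timestamp`."""
    query = urlencode({"url": page_url, "timestamp": timestamp})
    return f"{AVAILABILITY_ENDPOINT}?{query}"


blog = "http://sue-feathersandflowers.blogspot.com/"
print(snapshot_url(blog, "20140601"))        # hypothetical date, for illustration only
print(availability_query(blog, "20140601"))
```

Pasting the first printed address into a browser should land on whichever snapshot is nearest that date, if one exists.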

Let's say I decided my blog was the most unfortunate and embarrassing thing ever written. With a few keystrokes I could "remove" it from Google Blogger. The link would be gone, and if you searched for Feathers and Flowers you would get that familiar and annoying "page not found" message.

Now, here's the reason why the Internet Archive is beloved by researchers, litigators, governments, law enforcement, lawyers, museums, librarians, archivists, journalists, and assorted snoopy people all over the world.  Websites can be changed or deleted, but that snapshot captured by the Wayback Machine stays forever in the Internet Archive, at least as long as there is an Internet Archive.

You would think an archivist (even a mostly retired one) would know all about this, but I just learned about the Wayback Machine in the January 26th issue of The New Yorker. The article is an interesting read.
