About this weblog
Here we'll explore the nexus of legal rulings, Capitol Hill policy-making, technical standards development, and technological innovation that creates -- and will recreate -- the networked world as we know it. Among the topics we'll touch on: intellectual property conflicts, technical architecture and innovation, the evolution of copyright, private vs. public interests in Net policy-making, lobbying and the law, and more.

Disclaimer: the opinions expressed in this weblog are those of the authors and not of their respective institutions.


July 12, 2005

Opening Up the Wayback Can of Worms

Posted by Alan Wexelblat

Back in November of last year I noted a court case in which saved Web pages from the Internet Archive (informally, the Wayback Machine) were used as evidence. At the time I expressed surprise at the judge's ready acceptance of the evidence and noted that this is an extremely murky and untried legal area.

Now, if the report in the NJ Star Ledger is correct, we may see some litigation of a few of the issues raised by an archive of this sort and its involvement in copyright and court proceedings.

As best I understand it, what seems to be happening is the operators of the Wayback Machine are themselves being sued. Geist pointed to a Geocities page for a copy of the actual complaint, but it was 404 when I went to look.

What Kevin Coughlin's story says is that Healthcare Advocates is suing Wayback because the operators of Wayback failed to block access to certain archived materials during a 2003 trade secrets dispute. According to the complaint, the opposing counsel at the time obtained pages from the Wayback Machine. One issue is how those pages were obtained - did they come from normal searching or from some kind of "hacking?"

Another issue is the copyrights of the pages - if the pages were copyrighted by Healthcare Advocates, then what was Wayback doing with copies of them in the first place? And why was it serving up copies of material it didn't own the copyrights to? And were opposing counsel engaged in knowingly obtaining by extra-judicial means material they knew was supposed to be protected by IP laws? And does the Internet Archive have responsibility in part due to what it apparently admits were broken "blocking procedures"? (My instinctive guess is that their spider wasn't properly obeying robot exclusion directives.)

Kurt Opsahl, staff attorney for the EFF, is quoted as opining that the doctrine of fair use generally allows the gathering of copyrighted materials as evidence in trade secret cases. If so, the whole thing may get chucked out quickly and no legal precedents will emerge. But I remain convinced that this is the barest tip of a huge legal iceberg that is going to crash into the business of search engines and other 'net archives, soon. Maybe not in this specific case, but the issues I pointed to last year remain completely unresolved, and in the absence of guiding legislation, parties wishing to establish principles have little choice except to litigate their claims and hope for good precedents.

Comments (12) + TrackBacks (0) | Category: Laws and Regulations


1. Donna on July 12, 2005 1:03 PM writes...

William Patry has more...

2. Crosbie Fitch on July 12, 2005 1:42 PM writes...

Maybe the next Web should require all participants to agree that everything they publish on their websites is published under a Creative Commons license?

Otherwise it's petulance of the highest order to publish information and then expect to be able to remove it from the historical record at any time thereafter - like some kind of gentleman's DRM agreement. "You can look, but don't record"

Almost as bad as "You can look, but don't copy"

3. Dr. wex on July 12, 2005 2:01 PM writes...

"it's petulance of the highest order to publish information and then expect to be able to remove it from the historical record at any time thereafter"

This is pretty much precisely what the NYTimes/Boston Globe do now. It also strikes me that this is what the various DRM schemes used by music downloaders propose to do.

4. Crosbie Fitch on July 12, 2005 2:35 PM writes...

What would you do or say if someone who'd commented on Copyfight over the years said the following?

"I have decided to join the dark side of strong IP and therefore require that all my comments (largely containing viewpoints of a libertarian nature) be deleted from your archives. I am the author and copyright owner (irrespective of your site's obscure policy which I did not consent to and have only just become aware of) and will sue for infringement if you do not comply within 28 days"

1) Not a problem - didn't like 'em anyway
2) Sure, but you'll have to ask umpteen other syndicated sites to expunge their records yourself - let alone search engines and archival sites.
3) Blog off, you plonker!

5. Kevin Wimberly on July 12, 2005 2:51 PM writes...

I'm just throwing this out there - it's not exactly on point, but it's sort of interesting in light of your comment about possible copyright infringement by Wayback. Remember the 4 DMCA exceptions that the Copyright Office approved in October of 2003? The third one - "Computer programs and video games distributed in formats that have become obsolete and which require the original media or hardware as a condition of access" - was lobbied for by the Internet Archive. They wanted the exception to broadly apply to "Literary and audiovisual works embodied in software whose access control systems prohibit access to replicas of the works." It seems to me that this would have included web pages and protection measures found on servers. This is a big "if," but IF robots.txt files are access control systems under the DMCA, and IF the CO would have gone with IA's broad request, then the Internet Archive would be permitted to break in and archive any website it wanted to - subject to it meeting the additional conditions that it is an "archivist" as section 108 requires (the comments to this 3rd exemption seem to suggest that it is limited to section 108(c) uses). Since the IA makes the works available to the public, they would not have been able to take advantage of the proposed exception anyway.

Just a thought...

See p. 41 (bottom)

6. Seth Finkelstein on July 12, 2005 3:40 PM writes...

Here's an important comment from Patry's blog post:

[repost, not my words, FvL's comment below]

"Fred von Lohmann said...

I'd like to suggest a different interpretation of their DMCA claim (while acknowledging that the complaint is not clear): that the robot.txt file operates as a TPM as used by the Internet Archive. The standing provisions of the DMCA have been interpreted broadly, so perhaps the plaintiff here is arguing that the Internet Archive has implemented a TPM that controls access to its archived materials. The robot.txt file is intended to block external access to these materials, and was bypassed by the defendants. (I'll admit, this sounds like the Archive's claim to bring, not the plaintiffs', but the DMCA's standing provision has been stretched before.)

I think the claim still fails for the other reason you note. But I don't think the complaint need necessarily be construed as arguing that robot.txt is a TPM generally.
10:03 AM"

7. Daniel Brandt on July 13, 2005 1:57 PM writes...

On August 12, 2004, I sent a fax to Brewster Kahle requesting a permanent block of all pages -- past, present, and future -- on all 12 of my domains. The Wayback Machine (Internet Archive) complied within two days.

The Archive does not use robots.txt in the conventional manner. They have an arrangement with Alexa, which is owned by Amazon but has close relations with Brewster Kahle, to get Alexa's crawl. Alexa uses the crawl to sell packaged sets of the Internet.

The Archive checks for a robots.txt in real time when a request is made. If it finds an exclusion that includes "ia_archiver" it will say so and not show the page. However, it still has those pages and it still keeps them.
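The behaviour described in the paragraph above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Archive's actual code: the class `ArchiveSketch`, the helper `may_display`, and the example URL are hypothetical names, and the only real API used is Python's standard `urllib.robotparser`. The point it shows is that a display-time check hides a page without deleting the stored copy.

```python
from urllib.robotparser import RobotFileParser


def may_display(robots_txt: str, url: str, agent: str = "ia_archiver") -> bool:
    """Return True if the given robots.txt text lets `agent` access `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


class ArchiveSketch:
    """Hypothetical archive that checks robots.txt only when a page is requested."""

    def __init__(self):
        self.snapshots = {}  # url -> stored page body

    def store(self, url: str, body: str) -> None:
        # Assumption: pages are stored regardless of robots.txt at crawl time.
        self.snapshots[url] = body

    def serve(self, url: str, current_robots_txt: str):
        # The exclusion is applied at display time; the snapshot is never deleted.
        if not may_display(current_robots_txt, url):
            return None
        return self.snapshots.get(url)
```

In this sketch, a site whose current robots.txt excludes `ia_archiver` gets `None` back from `serve`, but the entry stays in `snapshots`. If the robots.txt later disappears, the same call returns the old page again.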

I know this because I sold a domain that I had always excluded from ia_archiver to someone who does not use robots.txt. Without a real-time robots.txt exclusion, all the old pages on that domain, which I had assumed were not crawled when I owned the domain, were suddenly available on the Archive.

Now that would have been a strong lawsuit, and that's why the Archive honored my request to block all my domains. However, they still have those pages. I now exclude Alexa's bots entirely in my routing table, just to be safe.

8. Dr. wex on July 13, 2005 2:20 PM writes...

Thanks for the story, Daniel. It's interesting. For the record, Alexa used to be Kahle's company before he sold it to Amazon. I also wonder how effective it is to simply block Alexa's bots - what's to say that the Archive doesn't also buy data from other crawlers?

9. Daniel Brandt on July 13, 2005 2:41 PM writes...

I should have added that when I did some experiments last August, I tried blocking the Archive's real-time fetch of robots.txt to see what would happen. It turned out that after a 20-second timeout, the Archive went ahead and provided the requested page.
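The timeout behaviour reported above amounts to a "fail open" policy: if the robots.txt check cannot be completed, the page is shown anyway. A minimal sketch of the difference between failing open and failing closed, assuming a hypothetical function `serve_with_timeout_policy` and a caller-supplied `fetch_robots` callable that raises `TimeoutError` when the fetch times out (none of these names come from the Archive):

```python
from typing import Callable, Optional
from urllib.robotparser import RobotFileParser


def serve_with_timeout_policy(
    url: str,
    snapshot: str,
    fetch_robots: Callable[[], str],
    fail_open: bool,
    agent: str = "ia_archiver",
) -> Optional[str]:
    """Return the stored snapshot, or None if it should be withheld."""
    try:
        robots_txt = fetch_robots()
    except TimeoutError:
        # Fail open: show the page when the check cannot complete.
        # Fail closed: withhold the page in the same situation.
        return snapshot if fail_open else None

    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return snapshot if parser.can_fetch(agent, url) else None
```

With `fail_open=True`, a site that blocks or stalls the robots.txt request ends up with its archived pages displayed, which matches the experiment described. With `fail_open=False`, the same stall would keep them hidden.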

That's when I registered the domain "" but it's only been parked so far. When I'm done with Google's copyright heist at the University of Michigan library, maybe I'll get more interested in the Archive and put up a page or two under

I really don't like Brewster Kahle's approach. He's less rich and more nonprofit than Google, but he still takes liberties with copyright law that I don't think can be justified.

10. Ned Ulbricht on July 13, 2005 3:41 PM writes...

A text version of the complaint is now up at

11. Jack on July 15, 2005 3:17 AM writes...

They have an arrangement with Alexa, owned by Amazon but which has close relations with Brewster Kahle, to get Alexa's crawl. Alexa uses the crawl to sell packaged sets of the Internet.
The Archive checks for a robots.txt in real time when a request is made. If it finds an exclusion that includes "ia_archiver" it will say so and not show the page. However, it still has those pages and it still keeps them.
