By Author Amit Klein
Last Modified: 2/28/2005
This paper describes several techniques (many of them new) for
exposing file contents using the site search functionality. It is
assumed that a site contains documents which are not
visible/accessible to external users. Such documents are typically
future PR items, or future security advisories, uploaded to the
website beforehand. However, the site is also searchable via an
internal search facility, which does have access to those documents,
and as such, they are indexed by it not via web crawling, but
rather, via direct access to the files (and therein lies the
problem).
Several attack techniques are described, some very simple and quick,
while others require an enormous amount of traffic; not all attacks
are relevant to a particular site, as they depend on the richness of
syntax supported by the site's search engine.
The paper concludes with methods to detect insecure indexing
vulnerability and suggested solutions.
Note that this attack is fundamentally different from exploitation
of external (remote) search engines.
Description of the vulnerability and attacks
Let us assume that a site enables searching its contents using an
internal indexing/search engine. The emphasis here is on internal
engine, unlike sites that forward the search engine query to
external search engines (e.g. Google and Yahoo).
Let us further assume that the search engine supports exact match of
multiple words (e.g. "Hello World"). Preferably it also supports
conjunction search (e.g. Hello AND World), a proximity operator
(e.g. Hello NEAR World), and a wildcard operator (at the word level,
or ideally at the character level).
Let us now define two terms:
An invisible resource (file/page) is a resource (other than the root
document) that is not linked to from within the site, or from
external sites. That is, an invisible resource is a resource which
is unlikely to be indexed by external search engines (e.g. Google
and Yahoo) or likely to be requested by anyone other than an
attacker, since it by definition does not appear as a normal link.
An inaccessible resource (file/page) is a resource which, when
normally requested, is not provided by the web server (the web
server typically responds with a 403/401 HTTP response).
We will assume henceforth that an inaccessible resource is also
invisible (although there are examples to the contrary).
If a site contains an indexable, inaccessible resource (i.e. a
document that is not accessible to external users), then users who
request the document directly typically receive an HTTP 403/401
response status. The document may still be reconstructed, in full or
in part, using the techniques below.
The way some sites create such an inaccessible document is by
restricting web access to certain (or no) users, e.g. using the
.htaccess file (for Apache servers, and many others). However, the
internal search engine still has access to the file (as long as file
permissions are not modified), and as such, it will index it and
make its contents searchable. The failure of file-level indexing (as
opposed to crawling) to observe the web level access control makes
this setup vulnerable to the attacks described below.
Among the local search engines that support file-level indexing are
Swish-e, Perlfect, WebGlimpse and Verity Developer Kit. This is a
very partial list--many other engines support this functionality.
Before the attack techniques are outlined, it should be noted that
the first problem is to find a lead to such a file, i.e. to know
that such a file (or files) exists, and to have some idea of what it
may contain. The more prior information the attacker has, the more
likely the attack is to succeed quickly. In fact, if (in a highly
theoretical case) the attacker wants to verify that an inaccessible
document is identical to a text the attacker has, then this attack
can be realized with very few requests.
In all attacks, it is desired to have an initial search string that
provides narrowed down results (i.e. a short list of matching pages
in the site containing the invisible/inaccessible file). Ideally
this list would contain only one item--the hidden file itself. For
this, prior knowledge of the file contents is an enormous help.
A technique to find invisible/inaccessible resources
One well known technique is to guess a file name from names already
seen. So for example, if one sees a public file by the name of
PR-15-01-2005.txt, one may infer that PR files have the format
PR-DD-MM-YYYY.txt, and thus guess names such as PR-31-01-2005.txt.
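The name-guessing step can be sketched as follows. This is only an
illustration under assumptions: the function name and the 20-day
look-ahead are hypothetical, and the PR-DD-MM-YYYY.txt pattern is the
one inferred in the example above.

```python
from datetime import date, timedelta

def candidate_names(seen: date, days_ahead: int = 30, prefix: str = "PR") -> list[str]:
    """Generate PR-DD-MM-YYYY.txt style names for the days following a seen date."""
    names = []
    for offset in range(1, days_ahead + 1):
        d = seen + timedelta(days=offset)
        names.append(f"{prefix}-{d:%d-%m-%Y}.txt")
    return names

# Start from the publicly visible PR-15-01-2005.txt and look 20 days ahead.
guesses = candidate_names(date(2005, 1, 15), days_ahead=20)
print(guesses[0])
print("PR-31-01-2005.txt" in guesses)
```

Each generated name would then be requested (or searched for) to see
whether such a file exists.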
However, using a search engine, it may be possible to uncover less
predictable file names. One trick is to try to enumerate all the
files accessible to the search engine (which is usually a superset
of all the files accessible to the external user, since, as we noted
above, the search engine may not honor the .htaccess directives). To
accomplish this, one can try various words that are very common and
likely to be found in almost all English texts, such as "a", "the",
"is", or the vendor/product name, or in fact any special word that
should be found in a desired document, e.g. "password". (In fact,
some of the works listed in the references discuss the question of
whether a local search engine can be exploited in exactly this
context, i.e. how to locate invisible resources.)
By comparing this list with a list of links the attacker can obtain
externally (e.g. using a search engine or a spider/crawler), it is
possible to locate the invisible files. Some of the invisible files
may be directly accessible (and therefore, the attacker gains a lot
simply by using this technique alone). Some invisible files may not
be directly accessible--these would be the inaccessible resources.
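The comparison amounts to a set difference followed by a status
check. A minimal sketch, assuming the attacker already holds both
lists; the URLs, the status codes and the find_hidden helper are all
hypothetical and no real site is queried:

```python
def find_hidden(indexed, crawled, status_of):
    """Split indexed-but-unlinked URLs into directly accessible and inaccessible ones."""
    invisible = sorted(set(indexed) - set(crawled))
    accessible = [u for u in invisible if status_of(u) == 200]
    inaccessible = [u for u in invisible if status_of(u) in (401, 403)]
    return accessible, inaccessible

# Hypothetical data: URLs returned by the site search vs. URLs found by a crawler.
indexed = ["/index.html", "/pr/PR-15-01-2005.txt",
           "/pr/PR-31-01-2005.txt", "/old/notes.txt"]
crawled = ["/index.html", "/pr/PR-15-01-2005.txt"]
# Hypothetical status codes that a direct request for each URL would receive.
statuses = {"/pr/PR-31-01-2005.txt": 403, "/old/notes.txt": 200}

accessible, inaccessible = find_hidden(indexed, crawled, statuses.get)
print(accessible)
print(inaccessible)
```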
If prior information about the target document is available, then it
can be used to quickly locate this document. For example, if one
looks for a hidden PR item, and it is well known that PR items
contain the text "For immediate release", then using that string for
the search query would result in a list of PR item files, which is
much shorter than the list of all links in the site.
Another similar technique is to use a search query (perhaps with a
special operator) to look for a word or a pattern in the resource
name (or full path). For example, searching for "txt" may yield a
list of all resources whose names contain "txt", or better yet, if
the search engine supports a syntax such as inurl:txt (we will
hereinafter use the Google/Yahoo syntax to illustrate
search queries, except that we use AND to explicitly refer to
conjunction search), it can be used to limit the search to only path
We now discuss three techniques to reconstruct an inaccessible
resource, given (to simplify the discussion) a basic query that
results in this resource alone.
Technique #1 - when the search engine provides excerpts from the
This case is the simplest. The search engine not only returns the
file name (URL) wherein the text was found, but also some
surrounding text. It is possible to quickly proceed both forward and
backward from the location in the file of the first text match. For
example, let us assume that the initial search was for "Foo". The
search engine returns the match, with the three preceding and
three following words:
... first version of Foo, the world leading ...
The next search would be for "first version of Foo, the world
leading", which yields:
... to release the first version of Foo, the world leading anti-
gravity engine ...
The next search, for the above string, would yield:
... We are happy to release the first version of Foo, the world
leading anti-gravity engine that works on ...
And so forth.
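The loop behind this technique can be sketched against a toy search
function. Everything here is an assumption made for illustration: the
hidden text, the three-word context and the function names are
hypothetical, and a real engine would be queried over HTTP instead.

```python
import re

HIDDEN = ("We are happy to release the first version of Foo, the world "
          "leading anti-gravity engine that works on every known planet.")

def words(text):
    """Split text into words, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9'-]+", text)

def search_with_excerpt(query, context=3):
    """Toy engine: return the match plus `context` words on each side, or None."""
    doc, q = words(HIDDEN), [w.lower() for w in words(query)]
    low_doc = [w.lower() for w in doc]
    for i in range(len(doc) - len(q) + 1):
        if low_doc[i:i + len(q)] == q:
            return " ".join(doc[max(0, i - context): i + len(q) + context])
    return None

def reconstruct(seed):
    """Feed each excerpt back as the next query until it stops growing."""
    known = seed
    while True:
        excerpt = search_with_excerpt(known)
        if excerpt is None or excerpt == known:
            return known
        known = excerpt

print(reconstruct("Foo"))
```

Each request widens the known text by up to three words in each
direction, so the number of requests grows only linearly with the
document length.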
Technique #2 - when only a match is displayed
In this case, the search engine only displays a link to the
resource(s) where a match was found. Naturally, the resource that is
of interest is not accessible. But it is still possible to
reconstruct the file, by painfully going over all possible words
that can syntactically fit in. In this technique, prior knowledge
can save a lot of time, as it can significantly reduce the guess space.
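The word-by-word guessing used by this technique can be sketched as
follows. The hidden text, the short prioritized word list and the
function names are hypothetical stand-ins; the oracle only answers
whether an exact phrase matches, which is all the attacker is assumed
to learn from a list of links.

```python
HIDDEN = ("we are happy to release the first version of foo the world "
          "leading anti-gravity engine")

# Hypothetical prioritized guess list: most common words first.
WORDLIST = ["the", "of", "to", "a", "is", "we", "are", "world", "first",
            "version", "leading", "engine", "release", "happy", "foo",
            "anti-gravity"]

def phrase_matches(phrase):
    """Toy oracle: True if the exact word sequence occurs in the hidden text."""
    return f" {phrase} " in f" {HIDDEN} "

def extend_forward(seed, wordlist):
    """Append one guessed word at a time until no guess matches."""
    known, requests = seed, 0
    while True:
        for w in wordlist:
            requests += 1
            if phrase_matches(f"{known} {w}"):
                known = f"{known} {w}"
                break
        else:
            return known, requests

text, requests = extend_forward("foo", WORDLIST)
print(text)
print(requests)
```

The same loop, prepending instead of appending, recovers the text
that precedes the seed word.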
To follow the above example, the first search word is "Foo", and a
match is found. Then, the attacker tries a prioritized list of
combinations of "Foo X", where X is an English word (there are
hundreds of thousands of words in English, but only a few thousand
are commonly used). An attacker may hit the "Foo the" combination
pretty quickly, since "the" is a very common word, and should be
very high in the list. Then the attacker tries "Foo the X" until
there's a match in "Foo the world", and so forth.
Technique #3 - when less prior knowledge is available
Again, the search engine only displays a link to the resource(s)
where a match was found. If there's very little prior knowledge on
the file, it may be more efficient to proceed along the following
lines. Let us assume first that the search engine supports Boolean
queries (X AND Y). For simplicity, let us assume that there's a word
that limits the hit to the file of interest, e.g. that the word Foo
is unique to the file (if there is no such single word, then a
combination of words would work just as well, e.g. Foo AND Bar).
The attacker first loops through all possible words in English
(including names, surnames and vendor specific terminology), and for
each such word X the attacker requests Foo AND X. This ends up
(after a long while) in the list of words that appear in the
document. Typically, such a document contains 200-600 words
(author's crude estimation), so, assuming 400 words, completing the
document takes up to 400 guesses for each of its 400 word positions,
i.e. on the order of 400 x 400 = 160,000 further requests
(requesting Foo AND "X", then Foo AND "X Y", then Foo AND "X Y Z",
and so forth).
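The two phases can be sketched as follows: first collect the
document's vocabulary with Foo AND X queries, then order it using
phrase queries drawn only from that vocabulary. The documents, the
dictionary and the function names are hypothetical, and this sketch
grows the phrase outward from the unique word, which is one possible
way to fix the order.

```python
# Hypothetical indexed documents; only the first one is the hidden target.
DOCS = {
    "/hidden/advisory.txt": "we release foo the first anti-gravity engine today",
    "/index.html": "welcome to the site of the first vendor",
}
DICTIONARY = ["a", "the", "of", "we", "first", "engine", "today", "release",
              "anti-gravity", "vendor", "welcome", "password"]

def search_and(*terms):
    """Toy engine: paths of documents containing every term (a term may be a phrase)."""
    return [p for p, text in DOCS.items()
            if all(f" {t} " in f" {text} " for t in terms)]

def recover(unique, dictionary):
    # Phase 1: which dictionary words occur in the same document as `unique`?
    vocab = [w for w in dictionary if search_and(unique, w)]
    # Phase 2: grow a phrase around `unique`, guessing only from the vocabulary.
    phrase, grown = unique, True
    while grown:
        grown = False
        for w in vocab:
            if search_and(unique, f"{phrase} {w}"):
                phrase, grown = f"{phrase} {w}", True
                break
            if search_and(unique, f"{w} {phrase}"):
                phrase, grown = f"{w} {phrase}", True
                break
    return vocab, phrase

vocab, phrase = recover("foo", DICTIONARY)
print(vocab)
print(phrase)
```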
In order for this document to be more readable, notes regarding this
section are placed at the end of the document (after the
Detecting insecure indexing
There isn't a very simple or thorough method for detecting this
vulnerability. Several approaches are suggested:
i) Enumerate known search engines. This is a black box approach,
usually employed by CGI scanners. The downside is that if the site
uses a search engine which the scanner does not recognize (or is not
located in the default path), it will not be reported as vulnerable.
ii) Locate the search facility manually, and using the above
technique and the search facility, construct a list of all indexed
files. Compare that to a list of all visible resources (which can be
obtained by crawling the site). If there are indexed files which are
not visible, then the site is vulnerable. This is another black box
approach.
iii) If there's access to the host itself (i.e. white box approach),
then a test can consist of adding a new file to an indexable folder,
with unique content (a unique string, such as
"youneversawmebefore"), and then querying the search engine for this
string (this should be done after the search engine refreshes its
indexing database, either naturally or by force). If the string is
found, then the site is vulnerable (the new file is not visible--
there's no link to it from anywhere, yet it is indexed).
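The white box test in iii) can be sketched as follows. The
file_level_search function is only a stand-in for a file-level
engine (an assumption, not any real product's API); in practice the
query would go to the site's own search facility after its index has
been refreshed.

```python
import tempfile
import uuid
from pathlib import Path

def file_level_search(root: Path, term: str) -> list[str]:
    """Stand-in for a file-level engine: scans files on disk directly."""
    return [p.name for p in root.rglob("*")
            if p.is_file() and term in p.read_text(errors="ignore")]

def canary_test(root: Path, search) -> bool:
    """Plant an unlinked file with a unique string and see whether search finds it."""
    token = "youneversawmebefore" + uuid.uuid4().hex
    canary = root / f"canary-{token[-8:]}.txt"
    canary.write_text(token)
    try:
        return bool(search(root, token))
    finally:
        canary.unlink()  # always clean up the planted file

with tempfile.TemporaryDirectory() as d:
    vulnerable = canary_test(Path(d), file_level_search)
print(vulnerable)
```

A search function that only knows about crawled (linked) pages would
return no hits for the canary, and the test would report False.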
Recommendations for web site developers and owners
If possible, choose crawling style indexing over direct file access
indexing (all above mentioned file-based-enabled search engines also
provide a crawling option). While on that subject, crawling should
be done using a remote proxy if possible, to simulate a remote
client (some applications associate higher privileges to requests
originating from the local machine, hence the crawling may reveal
resources and information intended only for a local user).
A less intrusive solution may be to use access control in order to
restrict the indexing to allowed material. Let us assume that the
web server runs with a higher privilege than the search engine. Now,
the visible files need to be assigned low privilege, so they are
readable by both the web server user and the search engine user. The
invisible (or inaccessible) files are assigned higher privileges, so
they are readable only by the web server. Thus, those files can be
accessed remotely by those that know about them, and possibly
possess the required credentials (for the inaccessible files), yet
they cannot be indexed. If they are later required to become public,
this can be done as usual by adding a link and possibly changing the
.htaccess file, yet the files would still not be indexed. In order
to restore "indexability," the privilege of the files should be
lowered.
Finally, when deploying a file-level search engine, heed the
security recommendations and understand what security features are
supported. Many engines enable restricting the indexed files by type
(file extensions) and location (directories). This should be used
(according to the vendor recommendations) in order to prevent
indexing of script source code. That is, by instructing the search
engine not to index the extensions .cfm, .cfml, .jsp, .java, .aspx,
.asax, .vb, .cs, .asp, .asa, .inc, .pl, .plx, .cgi, .php, .php3,
etc., or by instructing it to index only .htm and .html extensions,
one can make sure that script sources are not indexed; likewise if
the search engine is not allowed to index the /cgi-bin, /scripts,
etc. directories, or is limited to /html, etc.
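The type and location restrictions can be expressed as a simple
allow-list check. This is a sketch of the policy only; real engines
configure it in their own syntax, and the names below are
hypothetical.

```python
from pathlib import PurePosixPath

ALLOWED_EXTENSIONS = {".htm", ".html"}   # index only static HTML
EXCLUDED_DIRS = {"cgi-bin", "scripts"}   # never index script directories

def should_index(path: str) -> bool:
    """Return True only for allowed file types outside excluded directories."""
    p = PurePosixPath(path)
    if any(part in EXCLUDED_DIRS for part in p.parts):
        return False
    return p.suffix.lower() in ALLOWED_EXTENSIONS

for candidate in ["/html/index.html", "/cgi-bin/search.pl",
                  "/scripts/page.html", "/html/login.php"]:
    print(candidate, should_index(candidate))
```

Allow-listing extensions is the safer of the two options mentioned
above, since a deny-list silently misses any script extension that
was not anticipated.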
Recommendations for search engine vendors
File-level search engines should honor the web server access control
for the indexed resource. That is, the search engine should attempt
to identify the web server (or at least request this information in
the configuration phase) and query the web server (or mimic its
logic) regarding access rights to resources about to be indexed.
Only publicly accessible resources should be indexed. Still, this
does not guarantee that invisible resources won't be indexed.
Local search engines that use file-level access may pose a security
hazard (insecure indexing) due to their access to resources which
are not accessible to remote users. By indexing those resources, the
search engine creates a channel through which data may be leaked to
Therefore, crawling style indexing should be preferred over direct
file indexing. If file-level indexing cannot be avoided, more
consideration should be made when deploying a search engine that
facilitates it. In particular those search engines should be
systematically limited to the visible resources (or at the very
least, to accessible resources).
a) While the above attack techniques aim to recover a resource in
its fullness, it is also beneficial (and also much quicker) to
recover parts and pieces of the document. For example, once an
inaccessible document (let's assume it is a security advisory) is
located in the vulnerable site, the attacker may be interested to
know what the advisory is about, so the attacker may try some
phrases and keywords such as "cross site scripting", "buffer
overflow", "sql injection". The attacker may also try to figure out
to which product, module and function the advisory applies, by trying
names of products, modules and functions that are relevant for the
site. Names of people can also be located, and so on.
b) In the above attack techniques, it is assumed that the search
engine can handle a query of arbitrary size (actually, of the size
of the document to be retrieved). This assumption may not always
hold (even for the technical issue of maximum URL/query length in a
GET request, e.g. Microsoft URLScan imposes a limit of 4KB on the
query and Microsoft IIS/6.0 imposes a limit of 16KB on the URL), but
the attack can easily do without this assumption. When the query
text gets long enough, it can be used in a "sliding window" fashion
to completely cover the document. That is, once the known text gets
too long, it is possible to drop the last few words from the query
and thus make room for a guess at the word preceding the known
segment. This process can be repeated until the start of the
document is reached, while keeping the query's size fixed. Likewise,
by repetitively omitting words in the beginning of the segment, the
segment can be moved towards the end of the document.
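The sliding window can be sketched as follows for the forward
direction. The four-word query limit, the hidden text and the
function names are hypothetical; the toy oracle rejects longer
queries to mimic a length restriction.

```python
HIDDEN = ("we are happy to release the first version of foo the world "
          "leading anti-gravity engine")
WORDLIST = ["the", "of", "to", "we", "are", "first", "world", "version",
            "leading", "engine", "release", "happy", "foo", "anti-gravity"]
MAX_QUERY_WORDS = 4  # hypothetical limit imposed by the engine or server

def phrase_matches(phrase):
    """Toy oracle with a length limit: exact phrase match in the hidden text."""
    if len(phrase.split()) > MAX_QUERY_WORDS:
        raise ValueError("query too long")
    return f" {phrase} " in f" {HIDDEN} "

def slide_forward(seed, wordlist, max_words=1000):
    """Extend the text forward while querying only the last few known words."""
    recovered = seed.split()
    while len(recovered) < max_words:       # guard against repeating text
        window = recovered[-(MAX_QUERY_WORDS - 1):]
        for w in wordlist:
            if phrase_matches(" ".join(window + [w])):
                recovered.append(w)
                break
        else:
            break                           # no guess fits: end of document
    return " ".join(recovered)

print(slide_forward("foo", WORDLIST))
```

A short window may match more than one place in the document, so a
wider window (when the engine allows it) gives more reliable results.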
c) It is assumed that the search engine ignores punctuation marks,
non-printable/special characters, meta-information (e.g. HTML tags)
and so forth, so it allows a continuous search through the document
as a sequence of words. If that is not the case, then these objects
should be guessed as well. Furthermore, the search engine itself may
not support all these kinds of data, e.g. due to syntax restrictions
or security considerations. In such cases, the document should be
guessed piecewise, and it is impossible to know the order of the
pieces with the information collected so far. If a wildcard syntax
is supported (e.g. "X * Y", where any word matches the asterisk),
then the "offending" data can be skipped on the fly. If the wildcard
syntax is not supported, yet the NEAR operator (obsolete syntax
supported at one time by Altavista and WebCrawler, not supported by
Google and Yahoo) is, it may be possible to try
various combinations of sentences and to reconstruct the order.
d) If the search engine provides access to its cache (a la Google
and Yahoo), then the above techniques are not needed. Once the
resource path is known, it can be requested directly from the search
engine cache. It is unlikely though that a cache would be used in
local search engines.
e) If the search engine is not properly configured, it may also
index server side scripts (or in general, files that the web server
does not return as-is by definition). In such cases, the attack can
be used for source code disclosure (and it may be possible to locate
scripts by searching language specific keywords, e.g. CFML tags,
JSP keywords, ASPX page elements, VB.NET and C# keywords, ASP page
elements, Perl functions and syntax elements or simply
#!/usr/bin/perl and its variants, PHP keywords and SSI syntax). In
fact, in the very
unlikely case wherein the search engine indexes files outside the
virtual root (in which case one may wonder how the links to these
files are presented by the search engine), then the above techniques
can be used to retrieve the contents of such files.
f) If the search engine supports wildcards at the character level
(e.g. obsolete Altavista syntax), then enumeration can be done
at character level, not at word level, which dramatically reduces
the number of requests needed for the attack. Instead of guessing up
to hundreds of thousands of words, a typical five letter word can be
guessed in at most 26 x 5 = 130 requests (far fewer on average,
especially if English word statistics are used), making the attack
much more feasible.
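Character-level guessing can be sketched as follows. The oracle
below stands in for a query of the form "known-phrase prefix*"; the
hidden text, the function names and the restriction to lowercase
letters are assumptions made for illustration.

```python
from fnmatch import fnmatchcase
import string

HIDDEN = "foo the world leading engine".split()

def next_word_matches(known, pattern):
    """Toy oracle: does the word right after `known` match the wildcard pattern?"""
    k = known.split()
    for i in range(len(HIDDEN) - len(k)):
        if HIDDEN[i:i + len(k)] == k and fnmatchcase(HIDDEN[i + len(k)], pattern):
            return True
    return False

def guess_next_word(known):
    """Recover the next word one letter at a time, at most 26 requests per letter."""
    prefix, requests = "", 0
    while True:
        for ch in string.ascii_lowercase:
            requests += 1
            if next_word_matches(known, prefix + ch + "*"):
                prefix += ch
                break
        else:
            return prefix, requests   # no letter extends the prefix: word complete

word, requests = guess_next_word("foo")
print(word, requests)
```

In this sketch the end of a word is detected by a final round of 26
failed guesses, which comes on top of the 26 requests per letter
counted above.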
g) Legal aspects: the techniques presented use the site's search
function in a way that is not obviously illegal (disclaimer: the
author is not a lawyer). For example, using the technique for
finding invisible files, the attacker can actually generate a list of
URLs (links) to those files. This raises a question of whether it is
thus legal to access the invisible files directly as those links are
generated by the site itself. Another question is whether using
these techniques to retrieve the contents of inaccessible files is
legal.
Moreover, the attacker can embed a link to a search query
(generating a list of invisible files) in the attacker's site (note
that since the attacker does not actually execute the query, the
legal question of whether this is allowed is even less trivial). At a
later time, an external search engine crawls through the attacker's
site, follows the link (i.e. the link generating query) and indexes
the target site's invisible files (a similar idea has been
presented before; see the references). Now those files are
available through the external search
engine to all Internet users.
Note: all URLs verified on February 3rd, 2005.
 "Google Hacking Mini-Guide",
Johnny Long. May 7th, 2004.
 "Google: A Hacker's Best Friend", Paris2K, @ Articles May 30th, 2003.
 "Perfecto's Black Watch Labs Advisory #00-01", February 17th, 2000
 Pen-Test mailing list posting "Website search engine is a hacking tool", Amal Mohammad Al Hajeri, July 19th, 2004.
 Pen-Test mailing list posting "RE: Website search engine is a hacking tool", Amal Mohammad Al Hajeri, July 24th, 2004.
 "Google Help Center - Advanced Search Made Easy".
 "Google Help Center - Advanced Operators".
 "Yahoo! Help - Search Tips".
 "How many words are there in the English language",
 "Number of words in the English language", Johnny Ling, 2001.
 "World Wide Words - How many words?", Michael Quinion, April 1st, 2000.
 "Ritter Library Guide to Search Engine Syntax", November 19th, 2001.
 "The Spider's Apprentice", Linda Barlow.
 "The Google Attack Engine", Thomas C. Green, The Register, November 28th, 2001.
 "SWISH-RUN - Running Swish-e and Command Line Switches"
 "Perlfect Search 3.31 README documentation"
 "Configuring an Archive"
 "Verity's Developer Kit"
 "ColdFusion Tags" http://livedocs.macromedia.com/coldfusion/6.1/htmldocs/tags-pt0.htm
 "JavaServer Pages Syntax Reference" (follow links)
 "Java Language Keywords"
 ".NET Framework General Reference - ASP.NET Syntax" (follow links)
 "Visual Basic Language Specification - 2.3 Keywords"
 "C# Language Specification - C. Grammar" (see section C.1.7 "Keywords")
 "Using Scripting Languages to Write ASP Pages"
 "Visual Basic Scripting Edition - Statements"
"Visual Basic Scripting Edition - Functions"
 "Perl builtin functions"
 "List of Reserved Words"
 "Module mod_include" [Apache's Server Side Include implementation]
 RFC 2616 "Hypertext Transfer Protocol - HTTP/1.1"
 "URLScan Security Tool"
 "Graceless Degradation, Measurement, and Other
Challenges in Security and Privacy" Jon Pincus (Microsoft)
About the author
Amit Klein is a renowned web application security researcher. Mr.
Klein has written many research papers on various web application
technologies--from HTTP to XML, SOAP and web services--and covered
many topics--blind XPath injection, HTTP response splitting,
securing .NET web applications, cross site scripting, cookie
poisoning and more. His works have been published in Dr. Dobb's
Journal, SC Magazine, ISSA journal, and IT Audit journal; have been
presented at SANS and CERT conferences; and are used and referenced
in many academic syllabi.
The current copy of this document can be found here:
Information on the Web Application Security Consortium's Article Guidelines can be found here:
A copy of the license for this document can be found here: