Doc Search is a pro-grade tool for finding PDFs on the web. It's very different from the consumer-grade search engines you're likely familiar with. It doesn't try to replace them for what they're good at, but I think you'll find it helps fill a gap where they're weak.
Indexing the web is costly and time-consuming. I've chosen to focus on PDF documents because they seem to have a high quality-density compared to other document types and because consumer-grade search engines don't seem to be that great at hunting for them anymore. I've also chosen to focus on documents that seem to be in English because it's the language with which my first users and I are most familiar.
Doc Search is a work in progress. I keep a list of release notes so you can see what changes I'm making from time to time. I make these changes in response to feedback from researchers like you, so please don't hesitate to drop me a line if you have a question or if you notice something that could be better.
Aaron D. Parks
The Doc Search interface is arranged to encourage you to refine and expand your query while exploring the documents you discover. The narrower left column is where you'll compose your query and explore matching documents. The wider right column is a document viewer, where you can review matching documents without leaving behind your query and matches. This arrangement makes it efficient to quickly review many documents and refine your query while looking at a document.
Doc Search offers a powerful and expressive query language for you to use in forming and refining your queries.
Enter your query in the Query: card at the top of the Request section in the left column. Click the Search button to submit your query. A small “spinner” will appear while Doc Search executes your query and ranks the matching documents it finds.
Terms are the most common and most familiar part of a query. Doc Search will look for PDF documents on the web which contain the terms you enter. Keep in mind that Doc Search looks only within the document for the terms you enter — it doesn't look at how other people describe the document (for example, it doesn't look at the link text of links that point at the document).
If you're interested in finding documents which contain the term “fish,” you'll often want to also find documents which contain the terms “fishes” and “fishing.” Other times, you may wish to find documents which contain a proper name but not also documents which contain only terms similar to the proper name.
To make this easy, without having to think of all the variations each term might have and bloat your query with them, Doc Search uses a “stemming” routine to reduce terms in your queries and in the documents it indexes to common “stems” using linguistic rules. For example, “fish,” “fishes,” and “fishing” all reduce to the term Zfish (stemmed terms are prefixed with a Z to distinguish them from unstemmed terms).
When composing your query, enter terms you would like stemmed in lower-case. Enter terms you would not like stemmed with their first letter capitalized. Since proper names should be capitalized anyway, this should feel natural for the most common cases. For example, to find a paper about fishing by Professor Farmer (without also finding documents containing only the term “farming”) you might enter the query fishing Farmer.
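The capitalization rule can be pictured with a small sketch. This is only an illustration, not Doc Search's actual code: `toy_stem` is a deliberately crude stand-in for a real linguistic stemmer, and keeping unstemmed terms lower-cased with no prefix is an assumption.

```python
def toy_stem(word: str) -> str:
    """Strip a few common suffixes. A real stemmer applies many more rules."""
    for suffix in ("ing", "es", "er", "s"):
        # Keep at least three letters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize_term(term: str) -> str:
    """Stem lower-case terms (prefixed with Z); leave capitalized terms unstemmed."""
    if term[:1].isupper():
        # Assumption: unstemmed terms are stored lower-cased, without a prefix.
        return term.lower()
    return "Z" + toy_stem(term.lower())


for term in ["fish", "fishes", "fishing", "farming", "Farmer"]:
    print(term, "->", normalize_term(term))
```

With this toy stemmer, the three “fish” variants share one stem, while the capitalized Farmer stays distinct from the stem of “farming.”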
Unless instructed otherwise, Doc Search will find PDF documents on the web which contain any of the terms you search for. This usually works well and allows the ranking algorithm to surface the documents most relevant to your query.
Sometimes, though, you'll find that your query results can be improved by requiring that documents must contain certain terms to be included in the match set. You'll likely also have queries whose results could be improved by excluding documents which contain certain terms from the match set.
When composing your query, prefix terms which must appear in results with a + and prefix terms which must not appear in results with a -.
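As a rough illustration of how the prefixes shape the match set, the sketch below sorts query terms into required, excluded, and optional groups and tests a document against them. The function names are hypothetical, and the rule that optional terms no longer decide membership once a required term is present is an assumption, not something the guide states.

```python
def split_prefixes(query: str):
    """Sort whitespace-separated query terms by their + or - prefix."""
    required, excluded, optional = set(), set(), set()
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.add(token[1:])
        else:
            optional.add(token)
    return required, excluded, optional


def matches(doc_terms: set, query: str) -> bool:
    """Decide whether a document (given as a set of terms) belongs in the match set."""
    required, excluded, optional = split_prefixes(query)
    if excluded & doc_terms:
        return False  # contains a term prefixed with -
    if required:
        # Assumption: with required terms present, optional terms affect ranking only.
        return required <= doc_terms
    return bool(optional & doc_terms)  # default behaviour: any term will do


print(matches({"fish", "river"}, "+fish -market trout"))   # required term present
print(matches({"fish", "market"}, "+fish -market trout"))  # excluded term present
```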
Some words are so common and so unhelpful in information retrieval that Doc Search does not index them and ignores them in queries. These are called “stop-words” and include words like “a,” “by,” “do,” “for,” and so on.
You can use operators like AND, OR, and NOT to express complex queries. Operators are written in all-caps to distinguish them from query terms. For example, if you submit the query cat AND dog, Doc Search will match only documents which contain both terms.
If you don't put any operator between two terms, Doc Search assumes OR. This makes entering typical queries more convenient and less verbose.
If you use both AND and OR between a list of terms, the AND operator takes precedence. In other words, if you submit the query cat OR dog AND bird, Doc Search will match documents which contain the term “cat” as well as documents which contain both of the terms “dog” and “bird.” But it will not match documents which contain only the term “dog” or only the term “bird.”
You can use parentheses in your query to override this precedence. For example, if you submit the query (cat OR dog) AND bird, Doc Search will match only documents which contain the term “bird” and also contain either the term “cat” or the term “dog.”
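The precedence rules can be made concrete with a tiny evaluator. This is a sketch under stated assumptions rather than Doc Search's parser: it handles only AND, OR, implicit OR, and parentheses (NOT, prefixes, and stemming are left out), it expects a well-formed query, and `evaluate` is a hypothetical name.

```python
import re


def evaluate(query: str, doc_terms: set) -> bool:
    """Return True if a document containing doc_terms matches the query."""
    tokens = re.findall(r"[()]|[^\s()]+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def parse_or():
        # Lowest precedence: OR, written explicitly or implied between terms.
        result = parse_and()
        while peek() is not None and peek() != ")":
            if peek() == "OR":
                take()
            right = parse_and()
            result = result or right
        return result

    def parse_and():
        # AND binds more tightly than OR.
        result = parse_primary()
        while peek() == "AND":
            take()
            right = parse_primary()
            result = result and right
        return result

    def parse_primary():
        token = take()
        if token == "(":
            result = parse_or()
            take()  # consume the closing parenthesis
            return result
        return token.lower() in doc_terms

    return parse_or()


print(evaluate("cat OR dog AND bird", {"dog"}))            # dog alone is not enough
print(evaluate("(cat OR dog) AND bird", {"cat", "bird"}))  # bird plus cat
```

Parsing AND in its own, inner function is what gives it precedence: each side of an OR is a complete AND group before the OR is applied.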
The + and - term prefixes we discussed earlier are shortcuts for composing more tedious combinations of operators which will require or exclude the prefixed terms. Combining them with other operators can be a little tricky.
When you submit your query, Doc Search will show a card in the Request section which describes how it has understood your query. Prefixes, automatic stemming, and other shortcuts will be fully expanded in this explanation so you can get a good understanding of what Doc Search thinks you've asked for.
Once you submit your query, Doc Search identifies all the documents it knows about which match your query. These matches are shown as cards in the Matches section of the left column, under the Request section where you entered your query.
Featured prominently in the middle of each match card is the name of the matching document. The document name is a link which, when clicked, will load the document for viewing in the right column. The match card for the document loaded in the viewer will be highlighted to identify it.
Five matches are shown at a time. Below the five match cards is a set of three buttons you can use to page through the results. If you update and re-submit your query, the Matches section will automatically go back to the first page so you'll see the documents which rank highest for your refined query.
Doc Search assigns a relevance score to each match. The relevance of a match depends on how frequently each query term appears in it. Each query term is weighted according to its overall frequency across all the documents Doc Search knows about; Doc Search considers less-common terms to be more “selective” in helping to identify the matches you're likely to be most interested in. Rankings are also adjusted for match length, so that longer documents do not get an unfair advantage.
The first line of each match card shows the match's ordinal (#1, #2, #3, and so on) rank as well as its relevance score (normalized as a percentage of the score of the best match).
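To give a feel for how these ingredients combine, here is a minimal sketch of a frequency-based ranking. The guide does not state the actual formula, so everything here is an assumption for illustration: the logarithmic term weight, the division by document length, and the name `score_documents` are all hypothetical. Only the final step, expressing each score as a percentage of the best match, comes directly from the description above.

```python
import math
from collections import Counter


def score_documents(query_terms, documents):
    """Rank documents (name -> list of terms) and return (name, percent) pairs."""
    n_docs = len(documents)
    doc_freq = Counter()
    for terms in documents.values():
        doc_freq.update(set(terms))  # number of documents containing each term

    raw = {}
    for name, terms in documents.items():
        counts = Counter(terms)
        score = 0.0
        for term in query_terms:
            if counts[term]:
                # Less-common terms get a larger weight (more "selective").
                weight = math.log(1 + n_docs / doc_freq[term])
                # Dividing by length keeps long documents from winning by size.
                score += weight * counts[term] / len(terms)
        if score > 0:
            raw[name] = score

    if not raw:
        return []
    best = max(raw.values())
    ranked = sorted(raw.items(), key=lambda item: item[1], reverse=True)
    return [(name, round(100 * score / best)) for name, score in ranked]


docs = {
    "a.pdf": ["trout", "fish", "fish", "river"],
    "b.pdf": ["fish", "market", "price", "list", "fish", "fish", "fish", "fish"],
    "c.pdf": ["river", "bridge"],
}
for rank, (name, percent) in enumerate(score_documents(["trout", "fish"], docs), 1):
    print(f"#{rank} {name} {percent}%")
```

In this toy collection, a.pdf ranks first because it contains the rarer term “trout,” even though b.pdf mentions “fish” more often.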
When you submit your query, Doc Search will show a card in the Request section which details the frequency (among all documents) and weight (possibly modified by documents you have marked as relevant) each query term will be given. This information is provided to help you understand how your matches will be ranked.
If, after reviewing a match, you find that it is particularly relevant to your query, you can let Doc Search know by clicking the Mark as relevant button on its match card. Doc Search will use this information to refine the weight it gives to each of your query terms.
Marking matches as relevant does not change your query and it does not add documents to or remove documents from the set of matches; it only affects how the set of matching documents is ranked.
Doc Search will also identify terms in documents you mark as relevant which have good “selectivity” (meaning they appear frequently in the marked document but not very frequently across the set of all documents). These terms will be listed on a card in the Request section for your consideration. You may wish to add some of them to your query to expand or refine the match set.
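One way to picture “selectivity” is to score each term in a marked document by how often it appears there, discounted by how many documents contain it overall. The sketch below does that; the scoring formula and the name `suggest_terms` are assumptions for illustration, since the guide does not describe the actual calculation.

```python
import math
from collections import Counter


def suggest_terms(marked_doc, all_docs, query_terms, limit=3):
    """Suggest terms frequent in marked_doc but rare across all_docs."""
    n_docs = len(all_docs)
    doc_freq = Counter()
    for terms in all_docs:
        doc_freq.update(set(terms))  # number of documents containing each term

    counts = Counter(marked_doc)
    scored = {
        # Frequent in the marked document, rare overall -> high score.
        term: count * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
        if term not in query_terms and doc_freq[term]
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scored, key=lambda term: (-scored[term], term))[:limit]


collection = [
    ["fish", "trout", "spawning", "trout", "river"],
    ["fish", "market", "river"],
    ["fish", "river", "bridge"],
]
print(suggest_terms(collection[0], collection, {"fish"}, limit=2))
```

Here “river” scores zero because every document contains it, so the rarer “trout” and “spawning” are suggested instead.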
The second-to-last line of each match card shows the name of the website the document was found on. Along with the name of the document, this may help you identify documents you would like to load and review as you skim through your matches.
Next to the name of the website the document was found on is a link to the original document. If Doc Search has trouble loading the document in the viewer, you can click this link to try to load the document from its original source in a new browser tab. You can also use this link for citations: just right-click and select Copy link to get a link to the original document.
The last line of each match card shows the query terms which Doc Search found in this document. This information may be helpful to you in refining your query. For example, you may notice that matches which do not include one of your query terms are not relevant. You could prefix that term with a + so that only documents which include it will be included in the match set.
Put a link to the User Guide in the Query card. This should make it easier to find when you need it.
Collapse duplicate matches. Over the past eighteen months or so, I've significantly grown the document index. As the index grows over time, some documents are re-indexed and consequently have multiple entries in the index. Collapsing these duplicates in the matches section should reduce clutter and help surface relevant results.
Rename from WIRS to Doc Search. The new name should better describe the service.
Add a cute mascot. He should make the service feel more personable and fun.
Restructure and revise documentation. This should make Doc Search easier to learn and use.
Improve printability of documentation. This should make it easier to print out a hard copy of the Doc Search user guide in case you'd like to have it handy while you use the service.
Add a sponsor message to the top of the right column. This should help get the word out about my other projects and defray the costs of operating Doc Search.
When viewing a document in the document viewer, try to figure out the document's filename and provide that (or “Unnamed document.pdf” if we can't) to the web browser as a suggested filename. Most browsers will use this suggested filename if you ask them to save the document. This should fix the problem of all downloaded or saved documents being called proxy.pdf.
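As an illustration of the idea, the sketch below derives a suggested filename from the last segment of a document's URL and builds a `Content-Disposition` header value from it. It is an assumption that the filename comes from the URL path; the function names are hypothetical, and the example URLs are placeholders.

```python
import posixpath
from urllib.parse import unquote, urlsplit

FALLBACK_NAME = "Unnamed document.pdf"


def suggested_filename(url: str) -> str:
    """Use the last path segment of the URL, or the fallback if there is none."""
    path = unquote(urlsplit(url).path)  # decode percent-encoded characters
    name = posixpath.basename(path)
    return name if name else FALLBACK_NAME


def content_disposition(url: str) -> str:
    """Build a header value asking the browser to display the document inline."""
    # Escape backslashes and quotes for the quoted-string form.
    # Non-ASCII names would also need the filename* parameter (not shown here).
    name = suggested_filename(url).replace("\\", "\\\\").replace('"', '\\"')
    return f'inline; filename="{name}"'


print(content_disposition("https://example.org/papers/fish%20survey.pdf"))
print(content_disposition("https://example.org/"))
```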
Add an Original link to cards in the Relevant documents and Matches sections. This should be handy if the viewer can't load the document or if you need a copy of the original link for a citation.
Correct a defect in how document URLs with percent-encoded characters were encoded and decoded. This defect prevented the affected documents from being displayed in the document viewer.
Highlight all cards in the Relevant documents or Matches sections which refer to the document being displayed in the viewer. This should make it easier to explore matches without wasted effort.
Open the user guide in a new tab. This should be handy for reading the user guide while using Doc Search.
Show a loading message in place of the document viewer while a document is loading. This should reduce ambiguity about what's going on while a document is loading.
Correct mangled verbiage in the performance card at the bottom of the Request section.
Make the document name a clickable link on cards in the Relevant documents section. This should make it easier to review relevant documents.
Return to the first page of the match set when the Search button is pressed. This should make it easier to review the most relevant matches for a refined query.
When viewing a document, have Doc Search perform the request on behalf of the browser so that Doc Search can provide the browser with appropriate headers. Without this, some sites could provide an access-control-allow-origin header which would instruct the browser not to display the document or a content-disposition header which would instruct the browser to download the document rather than display it.
Lock out controls that would change the request while it is being processed by the server. This includes the query text input, search button, mark and un-mark as relevant buttons, and the pagination buttons. This should prevent multiple contradictory requests from being sent to the server and causing confusion.
Scroll to the relevant section when a request is sent to the server. This should make it more efficient to refine queries by reducing manual scrolling in the left column.
Show a status message in the relevant section while a request is being processed. This should reduce ambiguity about what Doc Search is doing while it's working.
Show a “spinner” while a request is being processed. This should reduce ambiguity about whether a button-click was registered.
Display underscores in document filenames as spaces. This makes them a bit more readable, but also allows long filenames to word-wrap more gracefully.
Reorganize the Request section to split the query description, relevance set terms, query terms, and performance information into their own cards. This should make it easier to find what you're looking for at a glance.