A hand-drawn illustration depicts a happy, balding doctor looking toward the viewer through a magnifying glass. The magnifying glass distorts a portion of the doctor's face, making it look bigger.

Doc Search

User Guide

February 9, 2023

Preface

Doc Search is a pro-grade tool for finding PDFs on the web. It's very different from the consumer-grade search engines you're likely familiar with. It doesn't try to replace them for what they're good at, but I think you'll find it helps fill a gap where they're weak.

Indexing the web is costly and time-consuming. I've chosen to focus on PDF documents because they seem to have a high quality-density compared to other document types and because consumer-grade search engines don't seem to be that great at hunting for them anymore. I've also chosen to focus on documents that seem to be in English because it's the language with which my first users and I are most familiar.

Doc Search is a work in progress. I keep a list of release notes so you can see what changes I'm making from time to time. I make these changes in response to feedback from researchers like you, so please don't hesitate to drop me a line if you have a question or if you notice something that could be better.

Aaron D. Parks
Parks Digital LLC
4784 Pine Hill Drive, Potterville, Michigan
support@parksdigital.com

Interface tour

The Doc Search interface is arranged to encourage you to refine and expand your query while exploring the documents you discover. The narrower left column is where you'll compose your query and explore matching documents. The wider right column is a document viewer, where you can review matching documents without leaving behind your query and matches. This arrangement makes it efficient to quickly review many documents and refine your query while looking at a document.

Composing a query

Doc Search offers a powerful and expressive query language for you to use in forming and refining your queries. Enter your query in the Query: card at the top of the Request section in the left column. Click the Search button to submit your query. A small “spinner” will appear while Doc Search executes your query and ranks the matching documents it finds.

Query terms

Terms are the most common and most familiar part of a query. Doc Search will look for PDF documents on the web which contain the terms you enter. Keep in mind that Doc Search looks only within the document for the terms you enter — it doesn't look at how other people describe the document (for example, it doesn't look at the link text of links that point at the document).

Term stemming

If you're interested in finding documents which contain the term “fish,” you'll often want to also find documents which contain the terms “fishes” and “fishing.” Other times, you may wish to find documents which contain a proper name without also finding documents which contain only terms similar to that name.

To make this easy — without having to think of all the variations each term might have and bloat your query with them — Doc Search uses a “stemming” routine to reduce terms in your queries and in the documents it indexes to common “stems” using linguistic rules. For example, “fish,” “fishes,” and “fishing” all reduce to the term Zfish (stemmed terms are prefixed with a Z to distinguish them from unstemmed terms).

When composing your query, enter terms you would like stemmed in lower-case. Enter terms you would not like stemmed with their first letter capitalized. Since proper names should be capitalized anyway, this should feel natural for the most common cases. For example, to find a paper about fishing by Professor Farmer (without also finding documents containing only the term “farming”) you might enter the query fishing Farmer.
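The capitalization rule can be sketched in a few lines of Python. This is only an illustration: toy_stem is a hypothetical suffix-stripper standing in for the real stemming routine, which this guide does not specify, and storing unstemmed terms in lower-case is an assumption.

```python
# Toy suffix rules for illustration only. A real stemmer applies many more
# linguistic rules than this.
SUFFIXES = ("ing", "es", "er", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize_term(term: str) -> str:
    """Stem lower-case terms (Z prefix); leave capitalized terms unstemmed."""
    if term[:1].isupper():
        # Assumption: unstemmed terms are compared in lower-case.
        return term.lower()
    return "Z" + toy_stem(term)


print([normalize_term(t) for t in "fish fishes fishing".split()])
print([normalize_term(t) for t in "fishing Farmer".split()])
```

With these toy rules, the lower-case terms “farming” and “farmer” both reduce to Zfarm, while the capitalized “Farmer” stays unstemmed, which is why the query fishing Farmer does not pick up documents that mention only “farming.”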

Required and excluded terms

Unless instructed otherwise, Doc Search will find PDF documents on the web which contain any of the terms you search for. This usually works well and allows the ranking algorithm to surface the documents most relevant to your query.

Sometimes, though, you'll find that your query results can be improved by requiring that documents contain certain terms to be included in the match set. You'll likely also have queries whose results could be improved by excluding documents which contain certain terms from the match set.

When composing your query, prefix terms which must appear in results with a + and prefix terms which must not appear in results with a -. For example, the query +cat dog -bird matches only documents which contain the term “cat” and do not contain the term “bird.”
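One way to picture how the prefixes shape the match set is the sketch below. The function names are hypothetical, terms are compared literally (no stemming), and the handling of unprefixed terms alongside required ones (they are treated as optional) is an assumption rather than a description of how Doc Search is built.

```python
def parse_prefixes(query: str):
    """Split a query into optional, required (+) and excluded (-) term sets."""
    optional, required, excluded = set(), set(), set()
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.add(token[1:])
        else:
            optional.add(token)
    return optional, required, excluded


def matches(doc_terms: set, query: str) -> bool:
    """Decide whether a document (as a set of terms) belongs in the match set."""
    optional, required, excluded = parse_prefixes(query)
    if excluded & doc_terms:
        return False  # any excluded term rules the document out
    if required:
        return required <= doc_terms  # every required term must be present
    return bool(optional & doc_terms)  # otherwise any one term is enough


print(matches({"cat", "bird"}, "+bird cat"))  # required term is present
print(matches({"cat", "bird"}, "cat -bird"))  # excluded term is present
```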

Stop-words

Some words are so common and so unhelpful in information retrieval that Doc Search does not index them and ignores them in queries. These are called “stop-words” and include words like “a,” “by,” “do,” “for,” and so on.
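As a small illustration, dropping stop-words from a query might look like the following. The STOP_WORDS set contains only the four examples named above; the full list Doc Search uses is not given in this guide.

```python
# Only the examples named in this guide; the real list is longer.
STOP_WORDS = {"a", "by", "do", "for"}


def drop_stop_words(query: str) -> list[str]:
    """Return the query terms that are not stop-words."""
    return [term for term in query.split() if term.lower() not in STOP_WORDS]


print(drop_stop_words("a guide for fishing by boat"))
```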

Query operators

You can use operators like AND, OR, and NOT to express complex queries. Operators are written in all-caps to distinguish them from query terms. For example, if you submit the query cat AND dog, Doc Search will match only documents which contain both terms. If you don't put any operator between two terms, Doc Search assumes OR. This makes entering typical queries more convenient and less verbose.

If you use both AND and OR between a list of terms, the AND operator takes precedence. In other words, if you submit the query cat OR dog AND bird, Doc Search will match documents which contain the term “cat” as well as documents which contain the terms “dog” and “bird.” But it will not match documents which contain only the term “dog” or only the term “bird.”

You can use parentheses in your query to override this precedence. For example, if you submit the query (cat OR dog) AND bird, Doc Search will match only documents which contain the term “bird” and also contain either the term “cat” or the term “dog.”
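The precedence rules can be checked with a small evaluator. The sketch below is a hypothetical recursive-descent parser, not Doc Search's own: it handles only terms, AND, OR (including the implied OR between adjacent terms), and parentheses, compares terms literally, and assumes the query is well-formed.

```python
import re


def evaluate(query: str, doc_terms: set[str]) -> bool:
    """Evaluate a query of terms, AND, OR and parentheses against one document."""
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def primary() -> bool:
        token = take()
        if token == "(":
            value = or_expr()
            take()  # consume the closing ")"
            return value
        return token in doc_terms

    def and_expr() -> bool:
        # AND binds tighter than OR, so it is parsed at the lower level.
        value = primary()
        while peek() == "AND":
            take()
            right = primary()
            value = value and right
        return value

    def or_expr() -> bool:
        value = and_expr()
        while peek() not in (None, ")"):
            if peek() == "OR":
                take()  # an explicit OR; otherwise OR is implied
            right = and_expr()
            value = value or right
        return value

    return or_expr()


for doc in ({"cat"}, {"dog"}, {"dog", "bird"}):
    print(sorted(doc), evaluate("cat OR dog AND bird", doc),
          evaluate("(cat OR dog) AND bird", doc))
```

A document containing only “cat” satisfies the first query but not the second, which is the difference the parentheses make.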

The + and - term prefixes we discussed earlier are shortcuts for composing more tedious combinations of operators which will require or exclude the prefixed terms. Combining them with other operators can be a little tricky. When you submit your query, Doc Search will show a card in the Request section which describes how it has understood your query. Prefixes, automatic stemming, and other shortcuts will be fully expanded in this explanation so you can get a good understanding of what Doc Search thinks you've asked for.

Exploring matches

Once you submit your query, Doc Search identifies all the documents it knows about which match your query. These matches are shown as cards in the Matches section of the left column, under the Request section where you entered your query.

Featured prominently in the middle of each match card is the name of the matching document. The document name is a link which, when clicked, will load the document for viewing in the right column. The match card for the document loaded in the viewer will be highlighted to identify it.

Five matches are shown at a time. Below the five match cards is a set of three buttons you can use to page through the results. If you update and re-submit your query, the Matches section will automatically go back to the first page so you'll see the documents which rank highest for your refined query.

Ranking

Doc Search assigns a relevance score to each match. The relevance of a match depends on how frequently each query term appears in it. Each query term is weighted according to its overall frequency across all the documents Doc Search knows about; Doc Search considers less-common terms to be more “selective” in helping to identify the matches you're likely to be most interested in. Rankings are also adjusted for match length, so that longer documents do not get an unfair advantage.
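To make these three ingredients concrete (term frequency within the match, a weight that favors rarer terms, and an adjustment for length), here is a minimal sketch. The formula is a simple stand-in chosen for illustration; this guide does not state the exact weighting Doc Search uses, and the document names and contents are made up.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count how many documents contain each term."""
    freq = Counter()
    for tokens in docs.values():
        freq.update(set(tokens))
    return freq


def relevance(query_terms, tokens, doc_freq, n_docs):
    """Sum, over query terms, a rarity weight times length-normalized frequency."""
    counts = Counter(tokens)
    total = 0.0
    for term in query_terms:
        df = doc_freq.get(term, 0)
        if df == 0 or counts[term] == 0:
            continue
        weight = math.log(1 + n_docs / df)  # rarer terms weigh more
        total += weight * counts[term] / len(tokens)  # adjust for length
    return total


docs = {
    "a.pdf": ["trout", "fishing", "trout"],
    "b.pdf": ["fishing", "boat", "fishing", "boat", "lake", "lake"],
    "c.pdf": ["lake", "boat"],
}
doc_freq = document_frequencies(docs)
scores = {
    name: relevance(["trout", "fishing"], tokens, doc_freq, len(docs))
    for name, tokens in docs.items()
}
print(sorted(scores, key=scores.get, reverse=True))
```

In this toy collection “trout” appears in only one document, so it carries more weight than “fishing,” and the short document that mentions it twice ranks first.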

The first line of each match card shows the match's ordinal (#1, #2, #3, and so on) rank as well as its relevance score (normalized as a percentage of the score of the best match).
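The normalization is simple arithmetic: each score is divided by the best score. A sketch follows, with made-up document names and scores; rounding to a whole percent is an assumption.

```python
def as_percentages(scores: dict[str, float]) -> list[tuple[int, str, int]]:
    """Return (rank, name, percent of best score) tuples, best match first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best = ranked[0][1] if ranked else 0.0
    return [
        (rank, name, round(100 * value / best) if best > 0 else 0)
        for rank, (name, value) in enumerate(ranked, start=1)
    ]


for rank, name, percent in as_percentages({"a.pdf": 4.0, "b.pdf": 1.0, "c.pdf": 2.0}):
    print(f"#{rank} {name} {percent}%")
```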

When you submit your query, Doc Search will show a card in the Request section which details the frequency (among all documents) and weight (possibly modified by documents you have marked as relevant) each query term will be given. This information is provided to help you understand how your matches will be ranked.

Marking relevant documents

If, after reviewing a match, you find that it is particularly relevant to your query, you can let Doc Search know by clicking the Mark as relevant button on its match card. Doc Search will use this information to refine the weight it gives to each of your query terms. Marking matches as relevant does not change your query and it does not add documents to or remove documents from the set of matches; it only affects how the set of matching documents is ranked.

Doc Search will also identify terms in documents you mark as relevant which have good “selectivity” (meaning they appear frequently in the marked document but not very frequently across the set of all documents). These terms will be listed on a card in the Request section for your consideration. You may wish to add some of them to your query to expand or refine the match set.
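The idea of selectivity can be sketched as follows: favor terms that are frequent in the marked document and rare across the collection. The scoring formula, the function name, and the sample counts are all illustrative assumptions, not Doc Search's actual method or data.

```python
import math
from collections import Counter


def suggest_terms(marked_doc, doc_freq, n_docs, query_terms, limit=3):
    """Rank a marked document's terms by frequency in it times rarity overall."""
    counts = Counter(marked_doc)
    scored = []
    for term, tf in counts.items():
        if term in query_terms:
            continue  # already part of the query
        df = doc_freq.get(term, 1)
        scored.append((tf * math.log(1 + n_docs / df), term))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:limit]]


marked = ["trout", "trout", "trout", "stream", "stream", "fishing", "water"]
doc_freq = {"trout": 2, "stream": 5, "fishing": 40, "water": 90}  # made-up counts
print(suggest_terms(marked, doc_freq, n_docs=100, query_terms={"fishing"}, limit=2))
```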

More about matches

The second-to-last line of each match card shows the name of the website the document was found on. Along with the name of the document, this may help you identify documents you would like to load and review as you skim through your matches.

Next to the name of the website the document was found on is a link to the original document. If Doc Search has trouble loading the document in the viewer, you can click this link to try to load the document from its original source in a new browser tab. You can also use this link for citations: just right-click and select Copy link to get a link to the original document.

The last line of each match card shows the query terms which Doc Search found in this document. This information may be helpful to you in refining your query. For example, you may notice that matches which do not include one of your query terms are not relevant. You could prefix that term with a + so that only documents which include it will be included in the match set.

Release notes

February 9, 2023

Put a link to the User Guide in the Query card. This should make it easier to find when you need it.

Collapse duplicate matches. Over the past eighteen months or so, I've significantly grown the document index. As the index grows over time, some documents are re-indexed and consequently have multiple entries in the index. Collapsing these duplicates in the matches section should reduce clutter and help surface relevant results.

Rename from WIRS to Doc Search. The new name should better describe the service.

Add a cute mascot. He should make the service feel more personable and fun.

Restructure and revise documentation. This should make Doc Search easier to learn and use.

Improve printability of documentation. This should make it easier to print out a hard copy of the Doc Search user guide in case you'd like to have it handy while you use the service.

Add a sponsor message to the top of the right column. This should help get the word out about my other projects and defray the costs of operating Doc Search.

October 28, 2021

When viewing a document in the document viewer, try to figure out the document's filename and provide that (or “Unnamed document.pdf” if we can't) to the web browser as a suggested filename. Most browsers will use this suggested filename if you ask them to save the document. This should fix the problem of all downloaded or saved documents being called proxy.pdf.
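One plausible way to work out a suggested filename is to take the last segment of the document's URL path and fall back to the default when there isn't one. The sketch below assumes that approach; the release note does not say how Doc Search actually determines the name, and the URLs are examples.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def suggested_filename(url: str) -> str:
    """Use the last path segment of the URL, or a default if there is none."""
    path = unquote(urlsplit(url).path)  # decode percent-encoded characters
    name = posixpath.basename(path)
    return name if name else "Unnamed document.pdf"


print(suggested_filename("https://example.org/papers/trout%20survey.pdf"))
print(suggested_filename("https://example.org/download/"))
```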

Add an Original link to cards in the Relevant documents and Matches sections. This should be handy if the viewer can't load the document or if you need a copy of the original link for a citation.

Correct a defect in how document URLs with percent-encoded characters were encoded and decoded. This defect prevented the affected documents from being displayed in the document viewer.

October 27, 2021

Highlight all cards in the Relevant documents or Matches sections which refer to the document being displayed in the viewer. This should make it easier to explore matches without wasted effort.

Open the user guide in a new tab. This should be handy for reading the user guide while using Doc Search.

Show a loading message in place of the document viewer while a document is loading. This should reduce ambiguity about what's going on while a document is loading.

October 25, 2021

Correct mangled verbiage in the performance card at the bottom of the Request section.

Make the document name a clickable link on cards in the Relevant documents section. This should make it easier to review relevant documents.

Return to the first page of the match set when the Search button is pressed. This should make it easier to review the most relevant matches for a refined query.

When viewing a document, have Doc Search perform the request on behalf of the browser so that Doc Search can provide the browser with appropriate headers. Without this, some sites could provide an access-control-allow-origin header which would instruct the browser not to display the document or a content-disposition header which would instruct the browser to download the document rather than display it.

October 20, 2021

Lock out controls that would change the request while it is being processed by the server. This includes the query text input, search button, mark and un-mark as relevant buttons, and the pagination buttons. This should prevent multiple contradictory requests from being sent to the server and causing confusion.

Scroll to the relevant section when a request is sent to the server. This should make it more efficient to refine queries by reducing manual scrolling in the left column.

Show a status message in the relevant section while a request is being processed. This should reduce ambiguity about what Doc Search is doing while it's working.

Show a “spinner” while a request is being processed. This should reduce ambiguity about whether a button-click was registered.

October 8, 2021

Display underscores in document filenames as spaces. This makes them a bit more readable and also allows long filenames to word-wrap more gracefully.

Reorganize the Request section to split the query description, relevance set terms, query terms, and performance information into their own cards. This should make it easier to find what you're looking for at a glance.