Saturday, July 7, 2007

Google File Search

Web pages are useful, but if you've ever wanted to find a specific file on the web, you've probably noticed it's not very easy. Fortunately, search engines like Google can be used for this tricky task.

Sometimes people create a web site and put some files in a directory, but forget to add an index file. The result is an unprotected directory that lists all of its files and subdirectories when accessed directly from a browser. If someone links to the directory or submits it to Google, it becomes available to anyone who performs a search.

Because these directory listings are built using similar templates (depending on the web server), you can add to your query the most distinctive traits:

* The title starts with "index of" -> add to the Google query: intitle:"index of"

* They typically contain these words: "parent directory", name, "last modified", size, description -> you can add to your query "parent directory", for example

* Since most sites use Apache servers, you could also add Apache, which appears in the footer of listings generated by Apache web servers


For example, to find a listing that contains the Firefox 2.0 RC1 source, you could use a query like:
intitle:"index of" firefox 2.0 rc1 source
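If you build these queries often, the traits listed above are easy to assemble in a few lines of code. Here's a minimal Python sketch; the function name and the apache_only flag are made up for illustration, and only the operators come from the list above.

```python
def directory_listing_query(keywords, apache_only=False):
    """Build a Google query that targets open directory listings."""
    # Traits shared by most auto-generated listings.
    parts = ['intitle:"index of"', '"parent directory"']
    if apache_only:
        # "Apache" appears in the footer of Apache-generated listings.
        parts.append("Apache")
    parts.append(keywords)
    return " ".join(parts)


print(directory_listing_query("firefox 2.0 rc1 source"))
# prints: intitle:"index of" "parent directory" firefox 2.0 rc1 source
```

Adding more traits narrows the results, so it's probably best to start with the title alone and add the others only if you get too much noise.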

Of course, you could use this idea to find any kind of file, from a PDF e-book to an MP3 podcast or song. Some of these files are shared in violation of copyright law, so you must use your judgment before downloading them.

But finding files using this technique is too complicated, you'll say. First you have to enter a very complicated query, then visit all these strange-looking web pages and perform a new search within each page to actually find the file. And there are so many dead links and disingenuous webmasters who try to trick you with fake pages.

Some people with too much time on their hands built web apps that make it easy to search for files using Google. Briefli builds the query internally, loads the first results from Google and displays the links to the files on the same page. Moreover, the files that actually match your query are highlighted. To play the MP3s inline, you could add the del.icio.us bookmarklet to your browser; for Office files and PDFs, use Docufarm.
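The "find the file inside the listing" step is simple enough to sketch yourself. The Python snippet below is not how Briefli works (its internals aren't public); it's just one way to pull the links out of a listing page and keep the ones that match your keywords. The sample HTML is a made-up, simplified Apache-style listing, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class ListingLinks(HTMLParser):
    """Collect the href of every <a> tag in a directory listing page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def matching_files(listing_html, keywords):
    """Return file links whose name contains every keyword (case-insensitive)."""
    parser = ListingLinks()
    parser.feed(listing_html)
    # Skip subdirectories (trailing slash) and column-sorting links ("?C=...").
    files = [h for h in parser.links
             if not h.endswith("/") and not h.startswith("?")]
    words = [w.lower() for w in keywords.split()]
    return [f for f in files if all(w in f.lower() for w in words)]


# Invented sample listing, for illustration only.
SAMPLE = """
<html><head><title>Index of /pub</title></head><body>
<a href="?C=N;O=D">Name</a>
<a href="/">Parent Directory</a>
<a href="firefox-2.0rc1-source.tar.bz2">firefox-2.0rc1-source.tar.bz2</a>
<a href="notes.txt">notes.txt</a>
</body></html>
"""

print(matching_files(SAMPLE, "firefox source"))
# prints: ['firefox-2.0rc1-source.tar.bz2']
```

Real listings vary from server to server, so the filtering rules here would likely need adjusting for anything other than this simple layout.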



A site optimized for finding and playing MP3 files is mp3Salad. It lets you play all the MP3 files from a directory using a simple Flash player and even export the entire listing as a playlist.

The avalanche of file hosting sites brought a new way to search for files: restrict the search results to one or more of these sites. Some examples of popular file hosting sites are esnips.com and megaupload.com. This custom search engine lets you restrict the search to 127 file hosting sites.
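If you only care about a handful of sites, you can get a similar effect without a custom search engine by using Google's site: operator, which isn't covered above but works in a regular query. A rough Python sketch, with a hypothetical function name and the two hosts mentioned earlier:

```python
def site_restricted_query(keywords, sites):
    """Limit a query to the given hosts by OR-ing site: operators."""
    restriction = " OR ".join(f"site:{s}" for s in sites)
    return f"{keywords} ({restriction})"


print(site_restricted_query("podcast mp3", ["esnips.com", "megaupload.com"]))
# prints: podcast mp3 (site:esnips.com OR site:megaupload.com)
```

This gets unwieldy well before you reach 127 sites, which is where a custom search engine makes more sense.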

And then there are BitTorrent sites. Because there are so many of them, this custom search engine is useful for searching across the most popular ones.

Google actually indexes the contents of some of these files, mostly Office documents, PDF files and text files. You can restrict a Google search to a file type by using the filetype: operator in your query (for example, bash linux filetype:pdf restricts the search for [bash linux] to PDF files). This way you can search inside these files and not only in a listing of filenames.
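To turn a filetype: query into a link you can bookmark or share, the query just needs to be URL-encoded. A small Python sketch, assuming the usual https://www.google.com/search?q= address; the function name is made up.

```python
from urllib.parse import urlencode


def filetype_search_url(keywords, extension):
    """Return a Google search URL restricted to one file type."""
    query = f"{keywords} filetype:{extension}"
    # urlencode escapes spaces as "+" and the colon as "%3A".
    return "https://www.google.com/search?" + urlencode({"q": query})


print(filetype_search_url("bash linux", "pdf"))
# prints: https://www.google.com/search?q=bash+linux+filetype%3Apdf
```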

For files residing on your hard disk, a desktop search engine like Google Desktop (Windows/Mac/Linux), Windows Vista's search or Mac's Spotlight is a great option and should be used before searching on the web.

Maybe one day Google will come up with a nice file search engine that indexes unprotected directories, FTP servers, file hosting sites and torrent sites. But the legal challenges probably outweigh the advantages of such a search engine (Yahoo has a music search engine, but only for China).
