A team of boffins is developing a search engine that can find data on the world wide web that search bots cannot see.
The engine, dubbed Brown Dog, searches the web for uncurated data and makes it accessible to scientists.
Kenton McHenry, who along with Jong Lee leads the Image and Spatial Data Analysis division at the National Center for Supercomputing Applications (NCSA), said that the information age has made it easy for anyone to create and share vast amounts of digital data, including unstructured collections of images, video and audio as well as documents and spreadsheets.
But the ability to search and use the contents of digital data has become exponentially more difficult, because digital data is often trapped in outdated, difficult-to-read file formats and because metadata (the critical data about the data, such as when, how and by whom it was produced) is nonexistent.
McHenry and his team at NCSA have been given a $10 million, five-year award from the National Science Foundation (NSF) to manage and make sense of the vast amounts of digital scientific data currently trapped in outdated file formats.
So far they have come up with a Data Access Proxy (DAP), which transforms unreadable files into readable ones by linking together a series of computing and translational operations behind the scenes.
Much like an internet gateway, the Data Access Proxy would be configured once in a user's machine settings and then forgotten. Data requests over HTTP would first be examined by the proxy to determine whether the native file format is readable on the client device. If not, the DAP would be called in the background to convert the file into the best format the client machine can read.
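To make the idea concrete, the sketch below shows, in Python, the kind of per-request decision such a proxy would have to make. The file formats, the conversion table and the function name are hypothetical stand-ins, not part of Brown Dog's actual implementation.

```python
# Minimal sketch of a format-conversion proxy's decision logic.
# All formats, the conversion table and the function name here are
# hypothetical; the real DAP chains conversion tools behind the scenes.

CLIENT_READABLE = {"pdf", "png", "csv", "txt"}   # formats this client can open

# Hypothetical mapping: unreadable source format -> best readable target.
CONVERSIONS = {
    "xcf": "png",   # layered image -> flat image
    "sav": "csv",   # stats-package table -> plain table
    "dwg": "pdf",   # CAD drawing -> printable document
}

def proxy_fetch(filename: str) -> str:
    """Return the name of the file the client should actually receive."""
    extension = filename.rsplit(".", 1)[-1].lower()
    if extension in CLIENT_READABLE:
        return filename                       # already readable: pass through
    target = CONVERSIONS.get(extension)
    if target is None:
        return filename                       # no known conversion: serve as-is
    # The real service would run the conversion in the background here.
    return filename.rsplit(".", 1)[0] + "." + target

print(proxy_fetch("survey-data.sav"))   # -> survey-data.csv
print(proxy_fetch("report.pdf"))        # -> report.pdf
```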
The second tool, the Data Tilling Service (DTS), lets individuals search collections of data, for instance by using an existing file as an example to discover other, similar files.
Once the machine and browser settings are configured, a search field will be appended to the browser, where the user can drop in example files. Doing so triggers the DTS to search a given site for files whose contents are similar to the example the user provided.
While browsing an online image collection, a user could drop an image of three people into the search field, and the DTS would return images in the collection that also contain three people. If the DTS encounters a file format it is unable to parse, it will use the Data Access Proxy to make the file accessible.
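A rough sense of how such query-by-example search might work is sketched below in Python. The feature extractor and the tiny pre-computed collection are hypothetical; the real DTS relies on NCSA's own extraction tools and, as noted above, falls back to the DAP for formats it cannot parse.

```python
# Minimal sketch of query-by-example search over a small collection.
# The "feature" here is simply the number of people detected in an image;
# the extractor and its canned results are hypothetical stand-ins.

FAKE_EXTRACTED_FEATURES = {
    "beach.jpg":   {"people": 3},
    "meeting.jpg": {"people": 8},
    "hikers.jpg":  {"people": 3},
}

def extract_features(path: str) -> dict:
    """Stand-in for a real extractor, e.g. a face or person detector."""
    return FAKE_EXTRACTED_FEATURES.get(path, {})

def find_similar(example: str, collection: list[str]) -> list[str]:
    """Return files whose extracted features match the example's."""
    target = extract_features(example)
    return [f for f in collection
            if f != example and extract_features(f) == target]

print(find_similar("beach.jpg", list(FAKE_EXTRACTED_FEATURES)))
# -> ['hikers.jpg']
```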
The Data Tilling Service will also perform general indexing of the data and extract and append metadata to files to give users a sense of the type of data they are encountering.
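That indexing step might produce something like the sidecar record sketched below; again, the field names are hypothetical, and the real service would extract much richer metadata.

```python
# Minimal sketch of bundling extracted metadata into a record that travels
# with the file, so a user can see what a file holds before opening it.
import json

def metadata_record(path: str, extracted: dict) -> str:
    """Return a JSON record pairing a file with its extracted metadata."""
    return json.dumps({"file": path, "metadata": extracted}, indent=2)

print(metadata_record("beach.jpg", {"type": "image", "people": 3}))
```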
McHenry said the two services are like the Domain Name System (DNS): just as DNS translates human-readable web addresses into the numerical addresses machines use, Brown Dog translates inaccessible, uncurated data into usable information.
According to IDC, a research firm, up to 90 percent of big data is “dark,” meaning the contents of such files cannot be easily accessed.
Brown Dog is not only useful for searching the Deep Web; it could one day be used to help individuals manage their ever-growing collections of photos, videos and unstructured or uncurated data on the Web.