Sketch of an Open Source Data Repository

Feb 02, 2005

One of the largest obstacles to creating effective data-mining software is the scarcity of good data sets. If good data sets were easily available on the web, I think it's likely that more people would come up with innovative data-mining applications. Some data sets that I would like to see:

Stock Data: I've been looking for a good repository of stock data (on and off) for quite a while, and I've yet to find one.
Del.icio.us Data: How cool are the possibilities with a big del.icio.us data dump? Has anyone created one? You could graph cliques of users, find similar tags, find tag misspellings, or do a hundred other interesting things with the data.
Audioscrobbler Data: Audioscrobbler publishes a data dump (available here); it relates users of the Audioscrobbler system to the songs that they listen to.
Scientific Data: Often, experiments generate more data than the researchers who run the experiment can handle. Opening scientific data sets could provide some interesting results due to the "many eyes" syndrome.
Game Data: There are hundreds of data sets of IRC poker games, internet chess games, internet go games, and other games just waiting to be gathered in one central location.
Literary Data: Project Gutenberg makes over 6000 books available in text on the web, but they are not in a data-mining-friendly format. If the texts were tagged appropriately, it would be trivial for me to graph comma frequency rates over the last 100 years to see if they changed, or maybe even do something interesting.
Internet Data: Different collections of websites - corporate websites, academic websites, random collections of websites.
Open Source Code: Koders has shown that searching open source code can be valuable; certainly interesting statistics could be derived from a large collection of open source code.
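
To make the comma-frequency idea from the Literary Data item concrete, here is a minimal sketch of what the analysis might look like if each text came tagged with a publication year. The `comma_rate` function and the two-sentence `samples` corpus are made up for illustration; a real corpus would need proper tokenization and far more text per year.

```python
def comma_rate(text: str) -> float:
    """Return the number of commas per 1000 whitespace-separated words."""
    words = text.split()
    if not words:
        return 0.0
    return 1000 * text.count(",") / len(words)


# Hypothetical tagged corpus: (publication year, plain text) pairs.
samples = [
    (1905, "It was, as he said, a long, cold winter."),
    (2005, "It was a long cold winter, he said."),
]

# Map each year to its comma rate, ready to hand to a plotting tool.
rates_by_year = {year: comma_rate(text) for year, text in samples}

for year, rate in sorted(rates_by_year.items()):
    print(year, round(rate, 1))
```

The analysis itself is a few lines; the hard part is the year tag, which is exactly the metadata a repository would have to supply.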

Problems

I can imagine several practical problems with such a data source. First is copyright; ensuring that the site had the right to distribute so much data would be a daunting task. Perhaps the responsibility for the copyright could be held by the submitter?

The format of the data would also be an interesting problem. To make the data worthwhile, it would likely have to be constrained to some known subset of well-documented data formats. It would require a fairly large effort to convert existing data to an acceptable format, and verify that the data is in the correct format.
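
As a rough sketch of the verification step, suppose the repository accepted CSV files and each data set declared its columns up front. The `validate_csv` function and the `STOCK_SCHEMA` layout below are assumptions of mine, not an existing standard; the point is only that a declared format makes checking a submission mechanical.

```python
import csv
import io


def validate_csv(stream, schema):
    """Check a CSV stream against schema, a list of (column name, converter) pairs.

    Returns a list of error strings; an empty list means the file passed.
    """
    expected = [name for name, _ in schema]
    reader = csv.reader(stream)
    header = next(reader, None)
    if header != expected:
        return [f"header {header} does not match expected {expected}"]

    errors = []
    # Row 1 is the header, so data rows are numbered from 2.
    for row_no, row in enumerate(reader, start=2):
        if len(row) != len(schema):
            errors.append(f"row {row_no}: expected {len(schema)} fields, got {len(row)}")
            continue
        for (name, convert), value in zip(schema, row):
            try:
                convert(value)
            except ValueError:
                errors.append(f"row {row_no}: bad value {value!r} for {name}")
    return errors


# Hypothetical schema for a file of daily closing prices.
STOCK_SCHEMA = [("symbol", str), ("date", str), ("close", float)]

submission = io.StringIO("symbol,date,close\nABC,2005-02-02,ten\n")
print(validate_csv(submission, STOCK_SCHEMA))
```

A check like this catches only structural problems. Whether the numbers mean what the submitter says they mean is a separate question.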

Documenting the meaning of the data contained in the files would also take a fairly large effort. If the repository contained a lot of data, but nobody knew what it meant, it would be worthless.

Finally, the site's success would be a part of its problem. Transferring large data sets over the internet would create a hell of a bandwidth bill. Finding a way to deal with this bill - through donations, advertising, or perhaps a fee for use of the site - would be crucial to its success.

Conclusions

The value of such a website is, in my mind, undeniable. If you could build a community of users around it, I believe that novel applications of data-mining techniques would inevitably arise. While administrative problems would be significant, the success of such large open sites as SourceForge and Wikipedia leads me to believe that it's a feasible project.