This project is solving the Data Treasure Hunting challenge.
Description
NASA has a lot of data, but it's hard to find out what each dataset is about and how it connects to other data. To solve the first problem, we built a tagging system that extracts natural keywords from titles and descriptions. We ran this not only across NASA data, but also across datasets from all government departments. We covered 26,000 datasets in total, and the process is trivial to re-run over any new datasets that are added, which makes the tags easy to keep up to date. The results are stored in a web-accessible database, which makes the enriched data easy to use for any app.
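As a rough illustration of the idea, keyword tagging from a title and description can be sketched in a few lines of Python. This is not our actual tagger: the function name `extract_keywords`, the tiny stopword list, and the phrase-splitting heuristic (treat runs of non-stopwords as candidate phrases, then rank by frequency) are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); a real run would use a fuller one.
STOPWORDS = {
    "a", "an", "and", "the", "of", "for", "in", "on", "to",
    "from", "with", "by", "is", "are", "this", "that",
}

def extract_keywords(title, description, top_n=5):
    """Return up to top_n candidate keyword phrases from a title and description.

    Hypothetical helper: phrases are runs of consecutive non-stopwords,
    ranked by how often each phrase occurs in the text.
    """
    text = f"{title}. {description}".lower()
    counts = Counter()
    # Split at punctuation so that a phrase never spans a sentence boundary.
    for chunk in re.split(r"[^a-z0-9\s-]+", text):
        phrase = []
        for word in chunk.split():
            if word in STOPWORDS:
                # A stopword ends the current candidate phrase.
                if phrase:
                    counts[" ".join(phrase)] += 1
                phrase = []
            else:
                phrase.append(word)
        if phrase:
            counts[" ".join(phrase)] += 1
    return [keyword for keyword, _ in counts.most_common(top_n)]

keywords = extract_keywords(
    "Pulsar timing data",
    "Timing observations of pulsars and neutron stars",
)
print(keywords)
```

Because the function only needs a title and a description, it can be re-run over newly added datasets without touching the ones already tagged.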
We included other government data as NASA doesn't live in a void. There are more datasets tagged 'space' in non-NASA datasets than in NASA datasets. The cross-over is particularly strong with the National Science Foundation and the Department of Energy, both significant funding bodies for space-based research. Now, for the first time, it's easy to see how concepts are being discussed across different agencies.
Once we had a uniform process for extracting core concepts from the documents, we needed to provide users with an easy way to query that data. Searching for a keyword doesn't solve the problem, because you usually need to know the keyword beforehand. Instead, we allow a fuzzy search on the extracted concepts, and then surface not only the most common keyword, but also the constellation of related concepts. You may be interested in Space Science, but you may not know that you can search directly for 'Pulsars' or 'Neutron Stars' to narrow the universe of results.
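The fuzzy search and the "constellation" of related concepts can be sketched as follows. This is a minimal illustration under stated assumptions, not the code behind the live site: it uses Python's standard `difflib.get_close_matches` for the fuzzy match, treats two keywords as related when they tag the same dataset, and the names `build_cooccurrence`, `fuzzy_related`, and the sample data are hypothetical.

```python
from collections import Counter
from difflib import get_close_matches
from itertools import permutations

def build_cooccurrence(tagged_datasets):
    """Map each keyword to a Counter of the keywords it shares datasets with."""
    related = {}
    for tags in tagged_datasets.values():
        # Every ordered pair of distinct tags on one dataset counts as one co-occurrence.
        for a, b in permutations(set(tags), 2):
            related.setdefault(a, Counter())[b] += 1
    return related

def fuzzy_related(query, related, n_matches=3, n_related=5, cutoff=0.6):
    """Fuzzy-match the query against known keywords, then list each match's neighbours."""
    matches = get_close_matches(query.lower(), list(related), n=n_matches, cutoff=cutoff)
    return {
        keyword: [tag for tag, _ in related[keyword].most_common(n_related)]
        for keyword in matches
    }

# Made-up sample data (dataset id -> extracted keywords), for illustration only.
sample_datasets = {
    "ds1": ["space science", "pulsars", "neutron stars"],
    "ds2": ["space science", "pulsars"],
    "ds3": ["space science", "solar wind"],
}

related = build_cooccurrence(sample_datasets)
print(fuzzy_related("pulsar", related))
```

The user does not need to type the exact keyword: a near miss such as "pulsar" still resolves to the stored keyword, and the co-occurrence counts supply the neighbouring concepts that a graph view could draw as connected nodes.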
The visual, graph-based search makes it easy to search datasets, understand the connected concepts, and access the data directly from the source.
Live site: http://spacetag.space
License: Apache License 2.0 (Apache-2.0)
Source Code/Project URL: https://github.com/jonroberts/nasaMining
Data.gov - http://www.data.gov/
NASA Data Home - https://data.nasa.gov/
Socrata APIs - http://dev.socrata.com/