There's no unified classification of NASA data. We addressed this by applying one uniform process to extract natural keyphrases from more than 26,000 datasets. It's also hard to search by keyword if you don't already know what the keywords are or how they relate. We addressed this by building a visual search over our keywords, which lets users explore concepts, see related concepts, and drill down directly into the data. Since the hackathon, the project has been picked up for an internal NASA big data workshop.

This project is solving the Data Treasure Hunting challenge.


NASA has a lot of data, but it's hard to find out what each dataset is about and how it connects to other data. To solve the first problem, we built a tagging system that extracts natural keywords from dataset titles and descriptions. We ran this not only across NASA data, but across datasets from all government departments. We covered 26,000 datasets in total, and the process is trivial to re-run over any new datasets that are added, which makes the tags easy to keep up to date. The results are stored in a web-accessible database, which makes the enriched data easy to use from any app.
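As a rough illustration of what extracting keywords from titles and descriptions can look like, here is a minimal sketch. It is not the project's actual tagging code: it assumes a simple stopword-delimited phrase splitter, and the stopword list and the function names `candidate_phrases` and `extract_keyphrases` are hypothetical and exist only for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch);
# a real run would use a much fuller list.
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "by", "for", "from",
    "in", "is", "of", "on", "or", "the", "this", "to", "with",
}


def candidate_phrases(text):
    """Split text into runs of consecutive non-stopword tokens."""
    phrases, current = [], []
    # Punctuation also ends a phrase, so break the text into fragments first.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        for token in fragment.split():
            if token in STOPWORDS:
                if current:
                    phrases.append(" ".join(current))
                    current = []
            else:
                current.append(token)
        # A fragment boundary closes any phrase still being built.
        if current:
            phrases.append(" ".join(current))
            current = []
    return phrases


def extract_keyphrases(title, description, top_n=5):
    """Return the most frequent candidate phrases from a title and description."""
    counts = Counter(candidate_phrases(title + ". " + description))
    return [phrase for phrase, _ in counts.most_common(top_n)]
```

Because the function only needs a title and a description, it could be re-run over newly added datasets without touching the ones already tagged.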

We included other government data because NASA doesn't live in a void. More non-NASA datasets are tagged 'space' than NASA datasets. The cross-over is particularly strong with the National Science Foundation and the Department of Energy, both significant funding bodies for space-based research. Now, for the first time, it's easy to see how concepts are being discussed across different agencies.

Once we had a uniform process for extracting core concepts from the documents, we needed to give users an easy way to query that data. Searching for a keyword doesn't solve the problem, because you usually need to know the keyword beforehand. Instead, we allow a fuzzy search on the extracted concepts, and then surface not only the most common keyword, but the constellation of related concepts. You may be interested in Space Science, but you may not know that you can search directly for 'Pulsars' or 'Neutron Stars' to narrow the universe of results.
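The idea of a fuzzy search that also surfaces related concepts can be sketched as follows. This is an illustrative sketch only, not the site's implementation: it assumes fuzzy matching via the Python standard library's `difflib.get_close_matches` and treats two concepts as related when they are tagged on the same dataset. The `DATASET_TAGS` sample and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from difflib import get_close_matches
from itertools import permutations

# Hypothetical sample data (an assumption): dataset id -> extracted keyphrases.
DATASET_TAGS = {
    "ds1": ["space science", "pulsars", "neutron stars"],
    "ds2": ["space science", "neutron stars"],
    "ds3": ["space science", "solar wind"],
}


def build_cooccurrence(dataset_tags):
    """Count how often each pair of concepts is tagged on the same dataset."""
    related = defaultdict(Counter)
    for tags in dataset_tags.values():
        for a, b in permutations(set(tags), 2):
            related[a][b] += 1
    return related


def search(query, related, n_matches=3, n_related=5):
    """Fuzzy-match the query against known concepts, then list related concepts."""
    matches = get_close_matches(query.lower(), list(related), n=n_matches, cutoff=0.6)
    return {
        match: [concept for concept, _ in related[match].most_common(n_related)]
        for match in matches
    }


if __name__ == "__main__":
    related = build_cooccurrence(DATASET_TAGS)
    # A slightly misspelled query still resolves to a known concept.
    print(search("neutron star", related))
```

In a graph view, each matched concept would be a node and each co-occurrence count an edge weight, which is one plausible way to draw the constellation of related concepts.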

The visual, graph-based search makes it easy to search datasets, understand the connected concepts, and access the data directly from the source.

Live site: http://spacetag.space

Video: https://www.youtube.com/watch?v=SeVsKrD-hZ0

Project Information

License: Apache License 2.0 (Apache-2.0)

Source Code/Project URL: https://github.com/jonroberts/nasaMining


Data.gov - http://www.data.gov/
NASA Data Home - https://data.nasa.gov/
Socrata APIs - http://dev.socrata.com/


  • Irena Chaushevska
  • Tim Winkler
  • Matthew Lipson