Metatron aims to lead us out of the metadata wilderness using natural language processing and various data mining technologies.

This project is solving the Data Treasure Hunting challenge.


Metatron: Metadata improvements for public data sets

Metatron is a search portal for public data sets, aimed at people who don't know in advance which data set they want. It comprises:

  • A data search front end with a catalog of general topics

  • A cloud-based back end comprising:

    • Robots to crawl data websites

    • Mirrors of online data sets, or even a standalone data set repository service

    • NLP engine for discovering metadata relationships

    • A suggestion engine based on user search history

    • Metadata as a service to propagate canonical tags
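As a rough illustration of what one crawling robot might do with a fetched page, the sketch below pulls out links that look like downloadable data sets. It is only a sketch under stated assumptions: the names DataLinkParser, find_data_set_urls and DATA_EXTENSIONS are hypothetical, and the idea of recognizing data sets by file extension is an assumption, not something the project specifies. It works on an HTML string and does no network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: a link counts as a data set if its path ends in one of these.
DATA_EXTENSIONS = (".csv", ".json", ".xml", ".xls", ".xlsx", ".zip")


class DataLinkParser(HTMLParser):
    """Collects <a href> targets that look like downloadable data sets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page they were found on.
        url = urljoin(self.base_url, href)
        # Compare only the path, so query strings do not hide the extension.
        if urlparse(url).path.lower().endswith(DATA_EXTENSIONS):
            self.links.append(url)


def find_data_set_urls(html, base_url):
    """Return candidate data set URLs found in one HTML page, in page order."""
    parser = DataLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

A real robot would also need to fetch pages, respect robots.txt, and avoid revisiting URLs; those parts are left out here.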



  • Continuously crawl sites for publicly available data set URLs to add to our database of data sets

  • Crawl sites for references to data sets (by name or URL) and associate them with topics using NLP

  • Allow users to add URLs to the database

  • Mirror datasets and analyze their metadata

  • Suggest data sets based on keyword searches

  • Identify synonymous metadata tags, and:

    • choose a canonical tag based on usage frequency

    • associate the canonical tag with search keywords based on past searches, user-submitted schemas, and NLP (possibly as separate tiers)
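
The step of choosing a canonical tag by usage frequency can be sketched as follows. This is a minimal sketch, not the project's implementation: it assumes the synonym groups have already been identified (for example by the NLP engine) and are passed in as input, and the function name choose_canonical_tags and the alphabetical tie-break are hypothetical choices made for illustration.

```python
from collections import Counter


def choose_canonical_tags(synonym_groups, tag_usage):
    """Map every tag to the most frequently used tag in its synonym group.

    synonym_groups: iterable of collections of tags considered synonymous
                    (assumed to come from an earlier analysis step).
    tag_usage:      iterable of tags as observed in data set metadata,
                    one entry per occurrence.
    """
    usage = Counter(tag_usage)
    canonical = {}
    for group in synonym_groups:
        # Highest usage wins; ties are broken alphabetically so the
        # result is deterministic (an assumption, not a project rule).
        best = min(group, key=lambda tag: (-usage[tag], tag))
        for tag in group:
            canonical[tag] = best
    return canonical
```

For example, given the group {"zip", "zipcode", "postal_code"} and observed usage in which "zipcode" appears most often, every tag in the group maps to "zipcode".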

Project Information

License: GNU General Public License version 3.0 (GPL-3.0)

Source Code/Project URL:



  • Mitch Lewis
  • Henry Poon