metatron

Metatron aims to lead us out of the metadata wilderness using natural language processing and various data mining technologies.

This project addresses the Data Treasure Hunting challenge.

Description

Metatron: Metadata improvements for public data sets

Metatron is a public data set search portal for people who don't know in advance which data set they want. It comprises:

  • A data search front end with a catalog of general topics

  • A cloud-based back end comprising:

    • Robots to crawl data websites

    • Mirrors of online data sets, or even a standalone data set repository service

    • An NLP engine for discovering metadata relationships

    • A suggestion engine based on user search history

    • Metadata as a service to propagate canonical tags

Goals:

  • Continuously crawl sites for publicly available data set URLs to add to our database of data sets

  • Crawl sites for references to data sets (by name or URL) and associate them with topics using NLP

  • Allow users to add URLs to the database

  • Mirror data sets and analyze their metadata

  • Suggest data sets based on keyword searches

  • Identify synonymous metadata tags, and:

    • choose a canonical tag based on usage frequency

    • associate the canonical tag with search keywords based on past searches, user-submitted schemas, and NLP (possibly as separate tiers)
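
The canonical-tag goal above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function name choose_canonical_tags, the sample tags, and the alphabetical tie-break are all assumptions, and the synonym groups are taken as given (in practice they would come from the NLP engine).

```python
from collections import Counter


def choose_canonical_tags(synonym_groups, tag_usage):
    """Map every tag to the most frequently used tag in its synonym group.

    synonym_groups: iterable of sets of tags already judged synonymous (assumed input).
    tag_usage: iterable of tags as observed across mirrored data sets (assumed input).
    """
    counts = Counter(tag_usage)  # unseen tags count as 0
    canonical = {}
    for group in synonym_groups:
        # Highest usage wins; ties break alphabetically so the result is deterministic.
        best = min(group, key=lambda tag: (-counts[tag], tag))
        for tag in group:
            canonical[tag] = best
    return canonical


# Hypothetical sample data, for illustration only.
synonym_groups = [{"lat", "latitude", "Latitude_deg"}, {"temp", "temperature"}]
tag_usage = ["latitude", "lat", "latitude", "temp", "temperature", "temp", "Latitude_deg"]

canonical = choose_canonical_tags(synonym_groups, tag_usage)
print(canonical["lat"])          # latitude
print(canonical["temperature"])  # temp
```

A lookup table like this could then back the metadata service: any tag seen in a mirrored data set or a search keyword is translated to its canonical form before matching.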


Project Information


License: GNU General Public License version 3.0 (GPL-3.0)


Source Code/Project URL: https://github.com/oneirochrone/metatron


Team

  • Mitch Lewis
  • Henry Poon

