metatron
Metatron aims to lead us out of the metadata wilderness using natural language processing and various data mining technologies. This project is solving the Data Treasure Hunting challenge.
Description
Metatron: Metadata improvements for public data sets
Metatron is a search portal for public data sets, aimed at people who don't know in advance which data set they might want. It comprises:
- A data search front end with a catalog of general topics
- A cloud-based back end comprising:
  - Robots to crawl data websites
  - Mirrors of online data sets, or even a standalone data set repository service
  - NLP engine for discovering metadata relationships
  - A suggestion engine based on user search history
  - Metadata as a service to propagate canonical tags
  - .............
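The crawling robots listed above could start from something like the following minimal sketch. It only parses HTML that has already been fetched and does no network access. The extension list, the class name `DatasetLinkParser`, and the function `find_dataset_urls` are illustrative assumptions; the project does not say how it recognizes a data set link.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: a link counts as a data set if its path ends in one of these
# extensions. The project does not specify its actual detection rule.
DATASET_EXTENSIONS = (".csv", ".json", ".xml", ".xls", ".xlsx", ".zip")


class DatasetLinkParser(HTMLParser):
    """Collects absolute URLs of <a> links that look like data set files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.dataset_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Compare only the path, so query strings do not hide the extension.
        path = urlparse(absolute).path.lower()
        if path.endswith(DATASET_EXTENSIONS) and absolute not in self.dataset_urls:
            self.dataset_urls.append(absolute)


def find_dataset_urls(html, base_url):
    """Return candidate data set URLs found in one HTML page, in page order."""
    parser = DatasetLinkParser(base_url)
    parser.feed(html)
    return parser.dataset_urls
```

A real crawler would also need fetching, politeness rules (robots.txt, rate limits), and a persistent store for the URLs it finds; those are left out here.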
Goals:
- Continuously crawl sites for publicly available data set URLs to add to our database of data sets
- Crawl sites for references to data sets (by name or URL) and associate them with topics using NLP
- Allow users to add URLs to the database
- Mirror data sets and analyze their metadata
- Suggest data sets based on keyword searches
- Identify synonymous metadata tags, and:
  - choose a canonical tag based on usage frequency
  - associate the canonical tag with search keywords based on past searches, user-submitted schema, and NLP (possibly separate tiers)
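The last two goals, picking a canonical tag by usage frequency and suggesting data sets from keywords, can be sketched as follows. This assumes the synonym groups have already been identified (for example by the NLP engine) and that tag usage is a simple count. The function names, the alphabetical tie-break, and the sample tags are hypothetical and not taken from the project.

```python
from collections import Counter


def choose_canonical_tags(tag_usage, synonym_groups):
    """Map every tag to the most frequently used tag in its synonym group.

    Ties are broken alphabetically so the result is deterministic
    (an assumption; the project does not define a tie-break rule).
    """
    canonical = {}
    for group in synonym_groups:
        best = min(group, key=lambda tag: (-tag_usage.get(tag, 0), tag))
        for tag in group:
            canonical[tag] = best
    return canonical


def suggest_datasets(keywords, datasets, canonical):
    """Rank data sets by how many search keywords match their canonical tags."""
    wanted = {canonical.get(k.lower(), k.lower()) for k in keywords}
    scores = {}
    for name, tags in datasets.items():
        normalized = {canonical.get(t.lower(), t.lower()) for t in tags}
        overlap = len(wanted & normalized)
        if overlap:
            scores[name] = overlap
    # Best match first; names break ties.
    return sorted(scores, key=lambda name: (-scores[name], name))


# Hypothetical sample data for illustration only.
tag_usage = Counter(
    ["rainfall", "precipitation", "precipitation", "precip",
     "temp", "temperature", "temperature"]
)
synonym_groups = [{"rainfall", "precipitation", "precip"}, {"temp", "temperature"}]
canonical = choose_canonical_tags(tag_usage, synonym_groups)

datasets = {
    "noaa-daily": ["precip", "temp"],
    "city-budget": ["finance"],
    "rain-gauges": ["rainfall"],
}
print(suggest_datasets(["Rainfall", "temperature"], datasets, canonical))
# prints ['noaa-daily', 'rain-gauges']
```

Normalizing both the search keywords and the stored tags to the same canonical form is what lets a search for "Rainfall" find a data set tagged "precip". Weighting by past searches or user-submitted schema, as the goals mention, would need additional signals not modeled here.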
Project Information
License: GNU General Public License version 3.0 (GPL-3.0)
Source Code/Project URL: https://github.com/oneirochrone/metatron