keyword-distilleryRanks and relates keywords to datasets.
This project is solving the Data Treasure Hunting challenge. Description
NASA Space Apps Challenge 2015 (Data Treasure Hunting)
Generates a ranked list of keywords. Each keyword can generate a ranked list of databases which relate to it, ranked by the strength of the relationship.
- Generate a list of keywords you would like to map, save this file as
- Download a
- Run distill.py:
python distill.py weigh
python distill.py filter
- After the filtering is finished (will take some time) a
filtered_keyword_relationship_map.jsonwill be created.
filtered_keyword_relationship_map.json contains a list of keywords and any datasets related to it. The datasets are ranked in order of
relation_score. This value, calculated by multiplying keyword weight and relation_weight is a reasonable approximation of the strength of a keyword-to-dataset relation.
Existing keywords were aggregated from the provided toolkit. These keywords were then given scaled weights by inversing the number of search results returned when searched for on a web browser.
Public data hosted on data.hawaii.gov was then crawled. Each dataset listed was downloaded and scanned for matching keywords. Keyword frequency was calculated and stored in the relationship map. Freqency was calculated by dividing the total number of keyword instances by the length of the document.
Using inverse search results to calculate word "potency" necessitates filtering of highly weighted, but nonsense, keywords (because the work is done in lowercase, it might not be immediately obvious that the keyword is an acronym). This filtering is done at the end of the mapping, as it is unlikely that nonsense words will show up repeatedly in other datasets. In some cases a human generated keyword list might be the best choice when using the tool. Based on preference or use-case keywords can be generated and assigned automatically to a dataset by setting a threshold for the
relation_score and tagging the dataset when it is exceeded.
The tool also can be improved with how it calculates
relation_weight. Searching for word roots instead of exact matches (or synonyms) and throwing out
relation_weights that are well outside a normal distribution. Calculating relationship strength based only on keyword frequency unfairly biases the relationship strength towards datasets with lots of words, namely, lots of the same words. There are likely instances where a keyword was used as a variable or fieldname and was over-ranked. This could suppress or hide datasets that are otherwise obviously related.
License: GNU General Public License version 2.0 (GPL-2.0)
Source Code/Project URL: https://github.com/kris-s/keyword-distillery