spaceshield

Only a few thousand asteroids currently have known properties. Our aims are to develop categorization models using different machine learning techniques, to predict asteroids' characteristics, and to search for new, as yet undiscovered orbit types. During the SpaceApps event, we will show an example of what our models can do; if NASA is pleased with the results, we will make the models available for public use through a simple, convenient web interface.

This project is solving the Neuromorphic Studies of Asteroid Imagery challenge.
Description
"To be glad of life, because it gives you the chance to love and to work and to play and to look up at the stars", Henry Van Dyke

Introduction

Our world is immensely fragile. One of the greatest existential threats (apart from the ones we generate ourselves) comes from space. NEOs (near-Earth objects) such as asteroids and comets pose the same danger to our planet as they did a billion years ago. Recent events around the world (e.g. the Chelyabinsk meteorite) have shown the scale of this danger. This makes the study of these threats an essential task for the world's astronomical and aerospace community.

All impacts confirmed so far are shown in pic. 1.

Pic. 1. Map of all the confirmed impact craters on Earth (PASSC, 2014)

According to published estimates, only about 30% of the predicted population of NEOs larger than 100 m has been discovered (MPC data and Mainzer et al. 2011). At current rates it will take 15 years to discover the rest, 18 years to characterize their orbits, and around 190 years to complete spectroscopy. (Beeson, 2013)

The purpose of our team's research is to develop a set of characterization models that use different algorithms to characterize as many asteroid parameters as possible and, possibly, to predict new orbit types. By our rough estimate, such models could save NASA decades of extra analysis and research work.

Data and Data Preparation

First of all, to start developing the machine learning models we had to find suitable data: labeled data that could be used for training and testing the models, and data that could then be classified based on the results of the training stage. The two main data sets we use at the moment were both found on the MPC web page.

Properties data

It can be downloaded here.

This is a large compilation of all the known characteristics of around 680 000 asteroids. It covers orbits, albedos and sizes, colors and spins. In total, there are approximately 50 different characteristics (pic. 2).

Pic. 2, Orbital Elements Diagram (MPC, 2015)

With this data we performed several actions:

1) Writing a Split program that splits the 3 GB document into 7 smaller files, so that it can be uploaded to MongoDB for primary data analysis

2) Searching for the characteristics to be classified and the characteristics to be used for training the model

3) Searching for suitable asteroids to be used in our research

4) Generating cross-data using other sources

The code of the Split program can be found here.
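As a rough illustration of the splitting idea only (this is not the team's Split program), the following Python sketch cuts a large text file into a fixed number of parts. It assumes that the file holds one record per line, so that cutting at line boundaries is safe; the function name and the output file names are hypothetical.

```python
import os


def split_file(path, parts, out_dir):
    """Split a text file into `parts` files of roughly equal byte size,
    cutting only at line boundaries. Returns the list of output paths."""
    total = os.path.getsize(path)
    target = max(1, -(-total // parts))  # ceiling division: bytes per part
    out_paths = []
    with open(path, "rb") as src:
        for index in range(parts):
            out_path = os.path.join(out_dir, f"part_{index + 1}.txt")
            written = 0
            with open(out_path, "wb") as dst:
                # The last part takes everything that is left.
                while index == parts - 1 or written < target:
                    line = src.readline()
                    if not line:
                        break
                    dst.write(line)
                    written += len(line)
            out_paths.append(out_path)
    return out_paths
```

Reading line by line keeps memory use small, which matters for a multi-gigabyte input; concatenating the parts in order reproduces the original file.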

There are several possible orbit types at the moment in the data:

0 Unclassified (mostly Main Belters)

1 Atiras

2 Atens

3 Apollos

4 Amors

5 Mars Crossers

6 Hungarias

8 Hildas

9 Jupiter Trojans

10 Distant Objects

14, 15, 16 - no description was given for some of these. Based on their characteristics, we assume that they are distant objects

As the description indicates, most unclassified objects are main belters, so they are not Atiras, Amors, Atens, Apollos, Mars Crossers, Jupiter Trojans, or distant objects. Generally, these asteroids could be Hungarias, Phocaeas, or members of new classes.

After analysing the data and consulting with NASA representatives, we decided to concentrate on unclassified main-belt asteroids, trying to predict their orbit types and possibly to find new orbit types among them. Once this work was done, we started developing a model for diameters.

However, this data alone might not be enough to train a model, so we searched for more informative characteristics in order to build a large cross-data set to feed to the model.

LightCurves data

Lightcurves data can be downloaded here.

This is an informative data set that can actually be used for making more precise predictions of the properties (pic. 3). It has around 23 300 observations for 2 423 asteroids.

Pic. 3, Sample of LightCurve data for 1036 Ganymed (MPC, 2015)

With this data, we performed the following actions:

1) Coding a Converter program, so that the data could be transformed to JSON format and uploaded to MongoDB

2) Structuring the data

3) Comparing it with the properties data, searching for asteroids with the same number/name/designation

4) Producing the cross-data

The code of the ALCDEF-to-JSON converter can be found here.
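To give an idea of what such a conversion involves, here is a minimal Python sketch; it is not the team's converter. It assumes the usual ALCDEF layout, in which each lightcurve block has KEY=VALUE metadata lines between STARTMETADATA and ENDMETADATA, followed by DATA= lines with delimiter-separated numbers and a closing ENDDATA. The function name, the output structure and the sample values are made up for illustration.

```python
import json


def alcdef_to_records(text, delimiter="|"):
    """Parse ALCDEF-style text into a list of dicts, one per lightcurve block."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "STARTMETADATA":
            current = {"metadata": {}, "data": []}
        elif line == "ENDDATA":
            if current is not None:
                records.append(current)
            current = None
        elif current is None or line == "ENDMETADATA":
            continue  # ignore block separators and anything outside a block
        elif line.startswith("DATA="):
            fields = line[len("DATA="):].split(delimiter)
            current["data"].append([float(x) for x in fields if x != ""])
        elif "=" in line:
            key, value = line.split("=", 1)
            current["metadata"][key.strip()] = value.strip()
    return records


# Made-up sample block for illustration; the values are not real measurements.
sample = """STARTMETADATA
OBJECTNUMBER=1036
OBJECTNAME=Ganymed
ENDMETADATA
DATA=2455000.5|9.45|0.02
DATA=2455000.6|9.50|0.02
ENDDATA
"""

records = alcdef_to_records(sample)
print(json.dumps(records, indent=2))
```

Each resulting record is a plain dictionary, so it can be serialized with json.dumps and loaded into a document store such as MongoDB.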

The main product of the data-preparation stage, the cross-data, was produced next.

Cross-Data

This is one of the significant results of our work. Using the large data sets available on the MPC web site, we produced cross-data that combines properties and lightcurve parameters and is suitable for feeding to categorization models. The data can be downloaded here.

The data was generated by matching objects with the same object number (pic. 4).

Pic. 4, Cross-Data merging algorithm
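The matching on object number can be sketched in Python as follows. This is an illustration of the join only, not the team's merging code; the field names "number" and "magnitudes" are hypothetical.

```python
def merge_by_number(properties, lightcurves):
    """Join lightcurve observations with asteroid properties on object number."""
    by_number = {p["number"]: p for p in properties}
    merged = []
    for obs in lightcurves:
        props = by_number.get(obs["number"])
        if props is None:
            continue  # no properties record with this number: skip it
        row = dict(props)  # copy, so the input records stay untouched
        row["lightcurve"] = obs["magnitudes"]
        merged.append(row)
    return merged
```

Building a dictionary keyed by object number first means that each observation is matched in constant time, rather than by scanning the whole properties list for every observation.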

With this data, we performed the following actions:

1) Searching for asteroids to be part of the data

2) Merging Properties and LightCurves data sets together

3) Data optimization

Data optimization deserves emphasis, as it was one of the most challenging stages. Different asteroids, and even different observations of the same asteroid, can have different numbers of LightCurve measurements, while machine-learning algorithms require every observation in a data set to have the same number of columns.

To solve this, we wrote a small data normalization algorithm. We calculated a weighted average of the numbers of LightCurve measurements using MongoDB and then adjusted the number of measurements per observation to this value (in our case, 36 measurements per observation). Extra measurements were cut down proportionally, and for files with fewer measurements simple imputation techniques were used to fill in the missing values. Observations with fewer than 26 measurements were excluded from the data set.

The code of the normalization algorithm can be found here.

In the end, we had a data set suitable for analysis with machine learning techniques, and the first part of our research, which took 2 weeks of continuous work, was done.
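A minimal Python sketch of this kind of length normalization is shown below. It is an illustration under stated assumptions rather than the team's algorithm: long observations are thinned by keeping evenly spaced measurements, and short ones are padded with the mean of the observed values, which stands in for the unspecified "simple imputation techniques". The function name is hypothetical.

```python
def normalize_length(values, target=36, minimum=26):
    """Return `values` adjusted to exactly `target` items, or None if too short."""
    n = len(values)
    if n < minimum:
        return None  # such observations are excluded from the data set
    if n >= target:
        # Proportional thinning: keep `target` evenly spaced measurements.
        return [values[i * n // target] for i in range(target)]
    # Too few measurements: pad with the mean of the observed values.
    mean = sum(values) / n
    return list(values) + [mean] * (target - n)
```

After this step every observation has the same number of columns, which is what the learning algorithms require.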

Infrared Measurements by NEOWISE

This data can be downloaded here.

At the moment, while the 'data analysis' team is working with the cross-data, the 'data preparation' team is working with NEOWISE infrared measurements. To increase accuracy and the number of parameters we can predict, infrared data could be used separately or in combination with the existing data. This is one possible direction for the future development of our project.

Characterization Models' Algorithms

For our machine-learning analysis, we chose several algorithms: neural networks, random forests, conditional inference trees, logistic regression, k-nearest neighbors and k-means clustering.

They are briefly explained in the following subsections, along with other techniques used during the research.

Cross-validation

This was the most important technique we used while optimizing the models, and it contributed significantly to their accuracy. The data is divided into n parts. The model is then trained on n-1 parts and tested on the remaining part. The procedure is repeated n times, so that each of the n parts is used for testing exactly once. The result is an assessment of the effectiveness of the chosen model that uses the available data as evenly as possible. (Starkweather, 2011)

This technique is used in all the models we are testing.
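The n-fold procedure described above can be sketched in a few lines of Python. This is a generic illustration, not the team's R code; the function names are hypothetical, the fit and predict callables are supplied by the caller, and no shuffling is done, so the data is assumed to be in random order already.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each of the n_folds folds."""
    indices = list(range(n_samples))
    for fold in range(n_folds):
        test = indices[fold::n_folds]  # every n_folds-th sample, offset by fold
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test


def cross_validate(samples, labels, n_folds, fit, predict):
    """Average accuracy of a model over n_folds train/test splits."""
    scores = []
    for train, test in k_fold_indices(len(samples), n_folds):
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(model, samples[i]) == labels[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)
```

Every sample lands in exactly one test fold, so each is used for testing once and for training n-1 times.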

Neural network

Artificial neural networks are adaptive models that can be trained on given data and then produce output based on what they have learnt. Nowadays they are mostly used in medicine, stock market prediction and image recognition. The general structure is shown in pic. 5.

Pic. 5, Neural Network Structure

A network consists of input, hidden and output layers. The circles in the image represent neurons, which process the data using the chosen algorithm. The number of layers can differ; it is one of the parameters, together with various algorithm-optimization schemes, that we will adjust to get the highest possible accuracy. (Cheng & Titterington, 1994)

However, neural networks are highly prone to over-fitting, which makes them not the most accurate model in many cases.

Random forest

Random forest is one of the algorithms on which we rely the most. The idea is to use a large collection of de-correlated decision trees. The scheme of the method is shown in pic. 6. (Breiman, 2001)

Pic. 6, Random Forest Tree Scheme

This technique suits both classification and regression problems; it also has high predictive accuracy and is fairly resistant to over-fitting.

Conditional inference tree

Conditional inference trees are quite similar to the trees used in the random forest model; however, they use a significance test procedure to select variables instead of selecting the variable that maximizes an information measure (e.g. the Gini coefficient). (Hothorn et al., 2006)

Logistic regression

This model provides a method for modelling a binary response variable, which takes the values 1 and 0. It classifies by fitting the data to a logistic curve. (Peng et al., 2010)

It is one of the simplest classification algorithms and it is computationally inexpensive; however, it can seriously under-fit the data set and therefore fail to capture variance that other algorithms capture easily.

In our case we use it as a baseline: we compare the results of the other techniques to it while working on improvements.
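For readers unfamiliar with the logistic curve, the following Python sketch shows how a fitted logistic model turns features into a 0/1 prediction. It is a generic illustration, not the team's R code; the function names, weights, bias and threshold are hypothetical, and fitting the weights is not shown.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function: maps any real number into the range 0..1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # this form avoids overflow for large negative z
    return e / (1.0 + e)


def predict_class(features, weights, bias, threshold=0.5):
    """Binary prediction from a fitted logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if logistic(z) >= threshold else 0
```

The model is linear in its inputs before the curve is applied, which is why it is cheap to compute and also why it can under-fit data with non-linear structure.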

K-nearest neighbors (KNN)

The last algorithm we use for categorization is the nearest neighbors model. We also use it for imputing missing values in data sets. The basic principle is that an object is assigned to the class that is most common among its neighbors. (Lu et al., 2012)

Neighbors are taken from a set of objects whose classes are already known; based on the method's key parameter 'K', the most numerous class among the K nearest of them is determined (pic. 7).

Pic. 7, K-nearest Neighbors Basic Scheme. The 'red' circle will be classified as 'green' based on the class of its nearest neighbors

This algorithm has a very large set of optimization techniques (Parzen windows, etc.). We are therefore working with it as well, to ensure that all promising or potentially promising techniques are taken into consideration.
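The basic majority-vote principle can be sketched in Python as follows. This is a plain, unoptimized illustration using Euclidean distance, not the team's implementation; the function name is hypothetical and math.dist requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    """Assign `query` the most common label among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]
```

Sorting the whole training set is the simplest approach but scales poorly; with hundreds of thousands of objects a spatial index would normally replace it.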

K-means clustering

This algorithm was used after we had identified the most significant parameters for classifying orbit types. It was applied to divide unclassified main-belters into several subcategories.

The algorithm seeks to minimize the total deviation of the points from the centers of their clusters. At each iteration the center of mass is re-calculated for each cluster obtained in the previous step; the vectors are then partitioned into clusters again according to which of the new centers is closest under the chosen metric. (Wagstaff et al., 2001)
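The two alternating steps described above can be sketched in Python as follows. This is a generic illustration of the standard iteration with Euclidean distance, not the team's R code; the function name is hypothetical, and the initial centers are passed in explicitly so that the result is deterministic.

```python
import math


def k_means(points, centers, iterations=10):
    """Alternate between assigning points to centers and updating the centers."""
    centers = [tuple(c) for c in centers]
    assignment = []
    for _ in range(iterations):
        # Assignment step: index of the nearest center for every point.
        assignment = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment
```

The outcome depends on the starting centers and on the number of clusters, which is why running the clustering with different cluster counts and comparing the results is a common practice.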

R Implementation

Implementation is done in two steps: data processing and training the actual models.

Data processing

Data processing steps:

1) Excluding variables with 70% or more unknown values

2) Imputing missing values using KNN imputation with K=3

3) Normalizing the data: centering and scaling

4) Applying principal component analysis, retaining 99% of variance
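Steps 1 and 3 of the list above can be illustrated with a short Python sketch; KNN imputation and principal component analysis are left out to keep it small. This is not the team's R code: the function names are hypothetical, missing values are assumed to be represented as None, and scaling uses the sample standard deviation.

```python
import statistics


def drop_sparse_columns(rows, threshold=0.7):
    """Remove columns in which at least `threshold` of the values are None."""
    n_rows = len(rows)
    keep = [
        j for j in range(len(rows[0]))
        if sum(row[j] is None for row in rows) / n_rows < threshold
    ]
    return [[row[j] for j in keep] for row in rows]


def center_and_scale(column):
    """Standardize a column: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)  # sample standard deviation
    if sd == 0:
        return [0.0] * len(column)  # a constant column carries no information
    return [(x - mean) / sd for x in column]
```

Centering and scaling put all variables on a comparable footing, which matters for distance-based methods such as KNN and for principal component analysis.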

For example, Table 1 summarizes the orbit information for the cross-data: the upper row lists the orbit types and the bottom row gives the number of observations of each orbit type in our data set.

Table 1. Cross-data orbits information

There is only 1 observation of orbit type "1", so we are unable to predict this orbit type. That observation was dropped.

We split the data into a training set (labeled data) and a set for classification (unlabeled data, orbits unclassified).

After discovering that we could actually use the full properties data for all 680 000 asteroids, we applied the same processing to it. Table 2 shows the distribution of known asteroids across all the possible categories.

Table 2, Properties data orbits information

Training the actual models

The R implementation code can be found here.

Results

So far our team has achieved several valuable results that are worth listing here.

ALCDEF-to-JSON converter

All the high-quality, informative lightcurve data on the MPC web site is stored in a dedicated format called ALCDEF (the Asteroid Lightcurve Data Exchange Format). To be able to analyse and process this data, we had to write a converter to a more suitable and useful format.

The converter currently exists as a beta version with just a couple of settings and will be improved further.

Asteroids' Cross-Data

One of our main results is the cross-data produced from the properties and lightcurve data of asteroids. This data set can be used by other researchers who want to perform a study using machine-learning techniques. It is already in a suitable format, with adjusted data, and can be fed directly to a learning algorithm. Cross-data.

Algorithms

Since orbit types are already known for nearly all asteroids, we built a categorization model that can be used for future predictions, and we also used it to extract key variables while searching for new orbit types.

With diameters, we have been working more on the model itself, so that we can predict the diameters of asteroids for which they are unknown. Among all the tested algorithms, random forest gave the best results for predicting diameters (a different model may suit each parameter best), with a root-mean-square error (RMSE) of 0.4397909. Here are the results for all the tested models:

1) Generalised linear model, RMSE = 8.416011

2) Bayesian Regularized Neural Network, RMSE = 2.075327

3) Neural Network, RMSE = 1.795411

4) Conditional inference tree, RMSE = 1.642696

5) Random forest, RMSE = 0.4397909
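For reference, RMSE is the square root of the mean squared difference between predicted and actual values; lower is better. A minimal Python sketch (the function name is hypothetical, and this is not the team's R code):

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.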

Continuing our work, we noticed that random forests gave the best results in all the cases we have studied so far. We would therefore advise other researchers analysing asteroid data to concentrate on the random forest model.

Finally, we were able to build a model that predicts asteroids' orbit types from lightcurves only, with an accuracy of 93%, using the random forest model (detailed explanation in the 'Predictions' section).

Predictions

Our work proceeds in stages. First we worked with orbits and orbit types; once that was done, we moved on to studying diameters.

Our work with orbit types consisted of several stages:

1) Training a categorization model for future classification based on lightcurves

2) Search for the key parameters using cross-data

3) Performing k-means clustering

4) Classifying data

5) In case of new possible orbit types, identifying their characteristics

First, we made several predictions based on the cross-data in order to find the key parameters for predicting orbit type. The results are shown in pic. 8.

Pic. 8, Variable importance in Random Forest model for orbit types (full-size picture is available on project's github)

We noticed that the key parameters come not from lightcurves but from the properties data. Nevertheless, we decided to try building several models based only on information from lightcurves. Using the random forest algorithm, we were able to train a model that predicts an asteroid's orbit type with 93% accuracy from lightcurve information only. The key variables are shown in pic. 9.

Pic. 9, Prediction of orbits based on lightcurves, key variables

This is one of the main results of our work, as it makes it possible to identify an asteroid's type using only simple lightcurve measurements. The implementation of the algorithm can be found on the project's GitHub. We will continue working with this model in order to increase its accuracy even more.

Also, we have unexpectedly classified one Russell's Teapot. We don't know where this measurement comes from.

Still, as we noticed, the most important parameters come from the properties data. We decided to work with the whole data set of 680 000 asteroids on its own in order to produce further interesting results not required by the challenge description. The following variables were chosen for further research:

1) Perihelion distance

2) Eccentricity

3) Semimajor axis

4) Aphelion distance

5) Mean daily motion

6) Period

7) Tisserand jupiter (true/false)

8) Absolute magnitude

Before starting k-means clustering, we visualized the data for all known orbit types in order to see how they are divided by these parameters. In total, there are 26 different graphs (available on the project's GitHub); here we show one of the most informative ones, mean daily motion vs. eccentricity (pic. 10).

Pic. 10, Mean daily motion vs. Eccentricity. Graph a - all known orbit types, graph b - unclassified main-belters

After checking these results, we decided to run 2 sets of clustering for unclassified main-belters: one with 3 clusters and one with 2 clusters. The results are shown in pic. 11.

Pic. 11, New subclasses for unclassified main-belters. Graph c - division into two subclasses, graph d - division into 3 subclasses

The principles used for dividing these subclasses are presented as decision trees. Pic. 12 shows the decision tree for defining 3 subclasses and pic. 13 shows the decision tree for defining 2 subclasses.

Pic. 12, Decision tree for defining one of the subclasses for 3-subclasses case

Pic. 13, Decision tree for defining one of the subclasses for 2-subclasses case

Also, by visualizing all the graphs, we noticed that there are several groups of asteroids separated from the main part. They can be seen on many of the graphs; pic. 14 is just one example.

Pic. 14, Possible subclasses based on visual investigation

Generally, this gives NASA and MPC an example of how these unclassified main-belters could be divided into several subcategories, which can make their classification and analysis easier and more effective. We ourselves prefer the division into 3 subclasses, as the algorithm uses more essential parameters for this division. Currently we are working with NASA representatives on analysing these models and deciding the future direction of the research.

Then we moved on to diameters. The distribution of diameters can be represented by a power law, which is visualized in pic. 15. We noticed that in our case it agrees well with the power law information from Mainzer et al. (2011).

Pic. 15, Diameters' power law

After training the model, we discovered that the key variables for predicting diameters all come from NEOWISE data, and that diameters are already known for all asteroids that have these parameters (pic. 16).

Pic. 16, Variable Importance for diameters (full-size image is available on project's github)

It is impossible to predict the diameters of other asteroids directly, as none of these parameters are known for them, and the asteroids for which they are known already have their diameters. Currently, we are looking for the best way to find a correlation between lightcurve information and the key variables for predicting diameters, so that a suitable prediction of diameter can be obtained from lightcurves and a few other parameters.

Further development

For now, we are analysing the existing results in order to find the most promising directions for further development. The ability to predict orbit types from lightcurve data only is already quite a valuable achievement; however, we are fairly sure that working with such data using machine-learning techniques could give even more impressive results.

At the same time, we are working with other data sets that can make our models more accurate and increase the number of characteristics we can predict. One of the main ones at the moment is the NEOWISE data of infrared measurements. We are also looking for other possible sources of information, such as radiometry and spectrometry.

Once a basic set of prediction models is ready and gives satisfactory accuracy, we are going to implement a web-based interface so that amateur and professional astronomers can make their own predictions. Since the most important goal of prediction research is to achieve the highest possible accuracy, the user interface is planned as the last stage of the project.

Overall

Predicting each of the 50 parameters and studying all the orbit types requires training separate algorithms and deep data mining, and is highly time- and resource-consuming. There are also other suitable data sets (e.g. NEOWISE data) that could be processed for use in prediction and categorization with our models.

However, this goes far beyond the scope of the challenge project and could be classified as full-scale scientific research, though with extremely promising benefits. So, if NASA is pleased with the results we show during the SpaceApps weekend, our team is ready to continue this research using other available data and predicting other parameters with our models, officially as a research group or as part of a research group. If so, we can be contacted at [email protected]

Project Information

License: MIT license (MIT)

Source Code/Project URL: https://github.com/spaceshield

Resources

LightCurve data - http://www.minorplanetcenter.net/light_curve2/light_curve.php
Properties data - http://minorplanetcenter.net/web_service
Infrared data - http://neo.jpl.nasa.gov/programs/neowise.html
SpaceShield Hackpad - https://spaceshield.hackpad.com/SpaceShield-MW4Tamjypzr
Beezon, 2013 - http://thurj.org/research/2013/05/4458/
Mainzer et al. 2011 - http://iopscience.iop.org/0004-637X/743/2/156
Starkweather, 2011 - https://www.unt.edu/rss/class/Jon/Benchmarks/CrossValidation1_JDS_May2011.pdf
Cheng & Titterington, 1994 - https://www.biostat.wisc.edu/~kbroman/teaching/statgen/2004/refs/ct.pdf
Breiman, 2001 - http://oz.berkeley.edu/~breiman/randomforest2001.pdf
Hothorn et al., 2006 - http://cran.r-project.org/web/packages/partykit/vignettes/ctree.pdf
Peng et al., 2010 - https://datajobs.com/data-science-repo/Logistic-Regression-[Peng-et-al].pdf
Lu et al., 2012 - http://vldb.org/pvldb/vol5/p1016_weilu_vldb2012.pdf
Wagstaff et al., 2001 - http://nichol.as/papers/Wagstaff/Constrained%20k-means%20clustering%20with%20background.pdf

Team