For scientific modeling, space and time are key features. It is crucial to understand where and when effects are likely to occur. As such, when handling scientific data one must always be cognizant of how geospatial and temporal information is stored. One problem is that datasets, both across and within scientific domains, often follow different standards for storing this information. For example, November 15, 2021 can be written as any of:
- ...and on and on it goes
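
To make the variety concrete, here is a minimal sketch that renders that one date with a handful of `strftime` patterns from Python's standard library. The patterns are picked arbitrarily for illustration and are not the list of formats Geotime Classify is trained on.

```python
from datetime import date, datetime

d = date(2021, 11, 15)

# Illustrative patterns only; real datasets use many more variants.
formats = ["%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y", "%B %d, %Y", "%d %b %Y", "%Y%m%d"]

renderings = [d.strftime(fmt) for fmt in formats]
for fmt, text in zip(formats, renderings):
    print(f"{fmt:12} {text}")

# Each rendering parses back to the same date, but only if the format is known.
round_trips = [datetime.strptime(text, fmt).date() for fmt, text in zip(formats, renderings)]
```

Parsing only works here because the format string is known in advance. Inferring it from the raw column values is the part that needs a classifier.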
Similarly, latitudes and longitudes can be stored in innumerable ways (lat, latitude, Y, etc.). To address this issue, we created and open sourced Geotime Classify (https://github.com/jataware/geotime_classify). Geotime Classify is a Python library that can automatically detect location and time features in a dataset and infer their correct types.
This model is a type of recurrent neural network that uses an LSTM to learn text classification. It is trained on fake data generated by Faker. The goal: given a spreadsheet where we expect some kind of geospatial and temporal columns, can we automatically infer things like:
- Admin levels (0 through 3)
- Timestamp (from arbitrary formats)
- Which column likely contains the "feature value"
- Which column likely contains a modifier on the feature
To do this, we collected example data from Faker along with additional locally generated data. The model was built using PyTorch. We used a padded embedding, an LSTM cell, a linear layer, and finally a LogSoftmax layer. The model was trained with a dropout of 0.2 to reduce overfitting and improve performance.

    self.embedding = nn.Embedding(vocab_size, embedding_dim)
    self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=1)
    self.hidden2out = nn.Linear(hidden_dim, output_size)
    self.softmax = nn.LogSoftmax(dim=1)
    self.dropout_layer = nn.Dropout(p=0.2)

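
The layer definitions above show only the constructor body. As a rough sketch of how they could fit together, the example below wraps them in a module with a forward pass and runs it on a few strings. Several parts are assumptions rather than the library's actual code: the character-level vocabulary, the `encode_batch` helper, the class name, the layer sizes, and the use of `pack_padded_sequence` to skip padded positions.

```python
import string

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

# Hypothetical character vocabulary: index 0 is padding, the last index is "unknown".
CHARS = string.printable
CHAR_TO_IDX = {ch: i + 1 for i, ch in enumerate(CHARS)}
UNK_IDX = len(CHARS) + 1
VOCAB_SIZE = len(CHARS) + 2


def encode_batch(texts):
    """Turn non-empty strings into a zero-padded (max_len, batch) tensor plus lengths."""
    lengths = torch.tensor([len(t) for t in texts])
    batch = torch.zeros(int(lengths.max()), len(texts), dtype=torch.long)
    for col, text in enumerate(texts):
        ids = [CHAR_TO_IDX.get(ch, UNK_IDX) for ch in text]
        batch[: len(ids), col] = torch.tensor(ids)
    return batch, lengths


class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=1)
        self.hidden2out = nn.Linear(hidden_dim, output_size)
        self.softmax = nn.LogSoftmax(dim=1)
        self.dropout_layer = nn.Dropout(p=0.2)

    def forward(self, batch, lengths):
        embeds = self.embedding(batch)  # (max_len, batch, embedding_dim)
        # Packing lets the LSTM stop at each sequence's true length.
        packed = pack_padded_sequence(embeds, lengths, enforce_sorted=False)
        _, (ht, _) = self.lstm(packed)  # ht: (num_layers, batch, hidden_dim)
        out = self.dropout_layer(ht[-1])  # final hidden state per sequence
        out = self.hidden2out(out)  # (batch, output_size)
        return self.softmax(out)  # log-probabilities over categories


# Layer sizes are placeholders; 57 matches the number of categories in the text.
model = LSTMClassifier(VOCAB_SIZE, embedding_dim=16, hidden_dim=32, output_size=57)
model.eval()  # disables dropout for inference

batch, lengths = encode_batch(["2021-11-15", "Nairobi", "-1.2921"])
with torch.no_grad():
    log_probs = model(batch, lengths)
predicted = log_probs.argmax(dim=1)  # untrained weights, so predictions are arbitrary
print(log_probs.shape)
```

Because the last layer is a LogSoftmax, training would pair this output with `nn.NLLLoss`. Using the final hidden state as the summary of the whole string is one common choice for sequence classification, not necessarily the one made in the library.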
After a few iterations, the model performed well enough, with accuracy hovering around 91 percent across 57 categories. Confusion Matrix: