Commit e838bf4 (parent: 323a711)

Schema v1 to Aardvark migrator

- Handle elements without crosswalk (via lookup tables)
- Support migrating collections in dct_isPartOf_sm
- Convert single to multivalued fields where appropriate
- Retain custom fields and remove deprecated fields

Closes #121

5 files changed: 169 additions, 28 deletions

README.md (30 additions, 4 deletions)
````diff
@@ -1,11 +1,11 @@
 # GeoCombine
 
-![CI](https://github.com/OpenGeoMetadata/GeoCombine/actions/workflows/ruby.yml/badge.svg)
+![CI](https://github.com/OpenGeoMetadata/GeoCombine/actions/workflows/ruby.yml/badge.svg)
 | [![Coverage Status](https://img.shields.io/badge/coverage-95%25-brightgreen)]()
 | [![Gem Version](https://img.shields.io/gem/v/geo_combine.svg)](https://github.com/OpenGeoMetadata/GeoCombine/releases)
 
-
 A Ruby toolkit for managing geospatial metadata, including:
+
 - tasks for cloning, updating, and indexing OpenGeoMetadata metadata
 - library for converting metadata between standards
@@ -43,6 +43,32 @@ Or install it yourself as:
 > iso_metadata.to_html
 ```
 
+### Migrating metadata
+
+You can use the `GeoCombine::Migrators` to migrate metadata from one schema to another.
+
+Currently, the only migrator is `GeoCombine::Migrators::V1AardvarkMigrator`, which migrates from the [GeoBlacklight v1 schema](https://github.com/OpenGeoMetadata/opengeometadata.github.io/blob/main/docs/gbl-1.0.md) to the [Aardvark schema](https://github.com/OpenGeoMetadata/opengeometadata.github.io/blob/main/docs/ogm-aardvark.md).
+
+```ruby
+# Load a record in geoblacklight v1 schema
+record = JSON.parse(File.read('spec/fixtures/docs/full_geoblacklight.json'))
+
+# Migrate it to Aardvark schema
+GeoCombine::Migrators::V1AardvarkMigrator.new(v1_hash: record).run
+```
+
+Some fields cannot be migrated automatically. To handle the migration of collection names to IDs when migrating from v1 to Aardvark, you can provide a mapping of collection names to IDs to the migrator:
+
+```ruby
+# You can store this mapping as a JSON or CSV file and load it into a hash
+id_map = {
+  'My Collection 1' => 'institution:my-collection-1',
+  'My Collection 2' => 'institution:my-collection-2'
+}
+
+GeoCombine::Migrators::V1AardvarkMigrator.new(v1_hash: record, collection_id_map: id_map).run
+```
+
 ### OpenGeoMetadata
 
 #### Clone OpenGeoMetadata repositories locally
@@ -63,7 +89,7 @@ You can also specify a single repository:
 $ bundle exec rake geocombine:clone[edu.stanford.purl]
 ```
 
-*Note: If you are using zsh, you will need to use escape characters in front of the brackets:*
+_Note: If you are using zsh, you will need to use escape characters in front of the brackets:_
 
 ```sh
 $ bundle exec rake geocombine:clone\[edu.stanford.purl\]
@@ -83,7 +109,7 @@ You can also specify a single repository:
 $ bundle exec rake geocombine:pull[edu.stanford.purl]
 ```
 
-*Note: If you are using zsh, you will need to use escape characters in front of the brackets:*
+_Note: If you are using zsh, you will need to use escape characters in front of the brackets:_
 
 ```sh
 $ bundle exec rake geocombine:pull\[edu.stanford.purl\]
````
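The new README text says the collection-name-to-ID mapping can be stored as a JSON or CSV file and loaded into a hash, but doesn't show the loading step. A minimal sketch of such a loader, assuming a flat JSON object or a two-column (name, id) CSV file; the `load_id_map` helper is illustrative and not part of the gem:

```ruby
require 'csv'
require 'json'

# Illustrative helper (not part of GeoCombine): build a collection_id_map
# hash from a flat JSON object or a two-column (name, id) CSV file.
def load_id_map(path)
  case File.extname(path)
  when '.json'
    JSON.parse(File.read(path))
  when '.csv'
    CSV.read(path).to_h
  else
    raise ArgumentError, "unsupported mapping format: #{path}"
  end
end
```

The resulting hash can be passed directly as `collection_id_map:` to `V1AardvarkMigrator.new`.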
lib/geo_combine/migrators/v1_aardvark_migrator.rb (76 additions, 10 deletions)
```diff
@@ -1,41 +1,86 @@
 # frozen_string_literal: true
 
+require 'active_support'
+
 module GeoCombine
   module Migrators
-    # TODO: WARNING! This class is not fully implemented and should not be used in
-    # production. See https://github.com/OpenGeoMetadata/GeoCombine/issues/121
-    # for remaining work.
-    #
     # migrates the v1 schema to the aardvark schema
     class V1AardvarkMigrator
       attr_reader :v1_hash
 
       # @param v1_hash [Hash] parsed json in the v1 schema
-      def initialize(v1_hash:)
+      # @param collection_id_map [Hash] a hash mapping collection names to ids for converting dct_isPartOf_sm
+      def initialize(v1_hash:, collection_id_map: {})
         @v1_hash = v1_hash
+        @v2_hash = v1_hash
+        @collection_id_map = collection_id_map
       end
 
       def run
-        v2_hash = convert_keys
-        v2_hash['gbl_mdVersion_s'] = 'Aardvark'
-        v2_hash
+        # Return unchanged if already in the aardvark schema
+        return @v2_hash if @v2_hash['gbl_mdVersion_s'] == 'Aardvark'
+
+        # Convert the record
+        convert_keys
+        convert_single_to_multi_valued_fields
+        convert_non_crosswalked_fields
+        remove_deprecated_fields
+
+        # Mark the record as converted and return it
+        @v2_hash['gbl_mdVersion_s'] = 'Aardvark'
+        @v2_hash
       end
 
+      # Namespace and URI changes to fields
       def convert_keys
-        v1_hash.transform_keys do |k|
+        @v2_hash.transform_keys! do |k|
           SCHEMA_FIELD_MAP[k] || k
         end
       end
 
+      # Fields that need to be converted from single to multi-valued
+      def convert_single_to_multi_valued_fields
+        @v2_hash = @v2_hash.each_with_object({}) do |(k, v), h|
+          h[k] = if !v.is_a?(Array) && k.match?(/.*_[s|i]m/)
+                   [v]
+                 else
+                   v
+                 end
+        end
+      end
+
+      # Convert non-crosswalked fields via lookup tables
+      def convert_non_crosswalked_fields
+        # Keys may or may not include whitespace, so we normalize them.
+        # Resource class is required so we default to "Other"; resource type is not required.
+        @v2_hash['gbl_resourceClass_sm'] = RESOURCE_CLASS_MAP[@v1_hash['dc_type_s']&.gsub(/\s+/, '')] || ['Other']
+        resource_type = RESOURCE_TYPE_MAP[@v1_hash['layer_geom_type_s']&.gsub(/\s+/, '')]
+        @v2_hash['gbl_resourceType_sm'] = resource_type unless resource_type.nil?
+
+        # If the user specified a collection id map, use it to convert the collection names to ids
+        is_part_of = @v1_hash['dct_isPartOf_sm']&.map { |name| @collection_id_map[name] }&.compact
+        if is_part_of.present?
+          @v2_hash['dct_isPartOf_sm'] = is_part_of
+        else
+          @v2_hash.delete('dct_isPartOf_sm')
+        end
+      end
+
+      # Remove fields that are no longer used
+      def remove_deprecated_fields
+        @v2_hash = @v2_hash.except(*SCHEMA_FIELD_MAP.keys, 'dc_type_s', 'layer_geom_type_s')
+      end
+
       SCHEMA_FIELD_MAP = {
         'dc_title_s' => 'dct_title_s', # new namespace
         'dc_description_s' => 'dct_description_sm', # new namespace; single to multi-valued
         'dc_language_s' => 'dct_language_sm', # new namespace; single to multi-valued
-        'dc_language_sm' => 'dct_language_sm', # new namespace; single to multi-valued
+        'dc_language_sm' => 'dct_language_sm', # new namespace
         'dc_creator_sm' => 'dct_creator_sm', # new namespace
         'dc_publisher_s' => 'dct_publisher_sm', # new namespace; single to multi-valued
         'dct_provenance_s' => 'schema_provider_s', # new URI name
         'dc_subject_sm' => 'dct_subject_sm', # new namespace
+        'solr_geom' => 'dcat_bbox', # new URI name
         'solr_year_i' => 'gbl_indexYear_im', # new URI name; single to multi-valued
         'dc_source_sm' => 'dct_source_sm', # new namespace
         'dc_rights_s' => 'dct_accessRights_s', # new URI name
@@ -47,6 +92,27 @@ def convert_keys
         'geoblacklight_version' => 'gbl_mdVersion_s', # new URI name
         'suppressed_b' => 'gbl_suppressed_b' # new namespace
       }.freeze
+
+      # Map Dublin Core types to Aardvark resource class sets
+      # See: https://github.com/OpenGeoMetadata/opengeometadata.github.io/blob/main/docs/ogm-aardvark/resource-class.md
+      RESOURCE_CLASS_MAP = {
+        'Collection' => ['Collections'],
+        'Dataset' => ['Datasets'],
+        'Image' => ['Imagery'],
+        'InteractiveResource' => ['Websites'],
+        'Service' => ['Web services'],
+        'StillImage' => ['Imagery']
+      }.freeze
+
+      # Map geometry types to Aardvark resource type sets
+      # See: https://github.com/OpenGeoMetadata/opengeometadata.github.io/blob/main/docs/ogm-aardvark/resource-type.md
+      RESOURCE_TYPE_MAP = {
+        'Point' => ['Point data'],
+        'Line' => ['Line data'],
+        'Polygon' => ['Polygon data'],
+        'Raster' => ['Raster data'],
+        'Table' => ['Table data']
+      }.freeze
     end
   end
 end
```
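The single-to-multi-valued step above keys entirely off the Solr dynamic-field suffix: any non-array value stored under a key ending in `_sm` or `_im` is wrapped in an array. That rule can be sketched standalone like this (using an anchored suffix pattern; the commit's own regex is the looser `/.*_[s|i]m/`):

```ruby
# Anchored variant of the commit's multivalued-suffix check.
MULTIVALUED_SUFFIX = /_(s|i)m\z/

# Wrap any non-array value whose Solr suffix marks the field as multivalued;
# everything else passes through unchanged.
def wrap_multivalued(record)
  record.to_h do |key, value|
    needs_wrap = !value.is_a?(Array) && key.match?(MULTIVALUED_SUFFIX)
    [key, needs_wrap ? [value] : value]
  end
end

wrap_multivalued('dct_language_sm' => 'English', 'dct_title_s' => 'Map')
# => { 'dct_language_sm' => ['English'], 'dct_title_s' => 'Map' }
```

Values that are already arrays (e.g. `dct_creator_sm`) are left alone, which is why the migrator can run over mixed v1 records safely.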

spec/fixtures/docs/full_geoblacklight.json (8 additions, 1 deletion)
```diff
@@ -28,6 +28,13 @@
   "dct_spatial_sm":[
     "Uganda"
   ],
+  "dct_isPartOf_sm":[
+    "Uganda GIS Maps and Data, 2000-2010"
+  ],
+  "dc_source_sm": [
+    "stanford-rb371kw9607"
+  ],
   "solr_geom":"ENVELOPE(29.572742, 35.000308, 4.234077, -1.478794)",
-  "solr_year_i":2005
+  "solr_year_i":2005,
+  "suppressed_b":false
 }
```
spec/fixtures/docs/full_geoblacklight_aardvark.json (26 additions, 8 deletions)
```diff
@@ -1,19 +1,31 @@
 {
   "gbl_mdVersion_s":"Aardvark",
-  "dct_identifier_sm":"http://purl.stanford.edu/cz128vq0535",
+  "dct_identifier_sm":[
+    "http://purl.stanford.edu/cz128vq0535"
+  ],
   "dct_title_s":"2005 Rural Poverty GIS Database: Uganda",
-  "dct_description_sm":"This polygon shapefile contains 2005 poverty data for 855 rural subcounties in Uganda. These data are intended for researchers, students, policy makers and the general public for reference and mapping purposes, and may be used for basic applications such as viewing, querying, and map output production.",
+  "dct_description_sm":[
+    "This polygon shapefile contains 2005 poverty data for 855 rural subcounties in Uganda. These data are intended for researchers, students, policy makers and the general public for reference and mapping purposes, and may be used for basic applications such as viewing, querying, and map output production."
+  ],
   "dct_accessRights_s":"Public",
   "schema_provider_s":"Stanford",
   "dct_references_s":"{\"http://schema.org/url\":\"http://purl.stanford.edu/cz128vq0535\",\"http://schema.org/downloadUrl\":\"http://stacks.stanford.edu/file/druid:cz128vq0535/data.zip\",\"http://www.loc.gov/mods/v3\":\"http://purl.stanford.edu/cz128vq0535.mods\",\"http://www.isotc211.org/schemas/2005/gmd/\":\"http://opengeometadata.stanford.edu/metadata/edu.stanford.purl/druid:cz128vq0535/iso19139.xml\",\"http://www.w3.org/1999/xhtml\":\"http://opengeometadata.stanford.edu/metadata/edu.stanford.purl/druid:cz128vq0535/default.html\",\"http://www.opengis.net/def/serviceType/ogc/wfs\":\"https://geowebservices.stanford.edu/geoserver/wfs\",\"http://www.opengis.net/def/serviceType/ogc/wms\":\"https://geowebservices.stanford.edu/geoserver/wms\"}",
   "gbl_wxsIdentifier_s":"druid:cz128vq0535",
   "id":"stanford-cz128vq0535",
-  "layer_geom_type_s":"Polygon",
+  "gbl_resourceType_sm": [
+    "Polygon data"
+  ],
   "gbl_mdModified_dt":"2015-01-13T18:46:38Z",
   "dct_format_s":"Shapefile",
-  "dct_language_sm":"English",
-  "dc_type_s":"Dataset",
-  "dct_publisher_sm":"Uganda Bureau of Statistics",
+  "dct_language_sm":[
+    "English"
+  ],
+  "gbl_resourceClass_sm":[
+    "Datasets"
+  ],
+  "dct_publisher_sm":[
+    "Uganda Bureau of Statistics"
+  ],
   "dct_creator_sm":[
     "Uganda Bureau of Statistics"
   ],
@@ -28,6 +40,12 @@
   "dct_spatial_sm":[
     "Uganda"
   ],
-  "solr_geom":"ENVELOPE(29.572742, 35.000308, 4.234077, -1.478794)",
-  "gbl_indexYear_im":2005
+  "dct_source_sm": [
+    "stanford-rb371kw9607"
+  ],
+  "dcat_bbox":"ENVELOPE(29.572742, 35.000308, 4.234077, -1.478794)",
+  "gbl_indexYear_im":[
+    2005
+  ],
+  "gbl_suppressed_b":false
 }
```

spec/lib/geo_combine/migrators/v1_aardvark_migrator_spec.rb (29 additions, 5 deletions)
```diff
@@ -6,17 +6,41 @@
   include JsonDocs
 
   describe '#run' do
-    it 'migrates keys' do
+    it 'migrates fields to new names and types' do
       input_hash = JSON.parse(full_geoblacklight)
-      # TODO: Note that this fixture has not yet been fully converted to
-      # aardvark. See https://github.com/OpenGeoMetadata/GeoCombine/issues/121
-      # for remaining work.
       expected_output = JSON.parse(full_geoblacklight_aardvark)
       expect(described_class.new(v1_hash: input_hash).run).to eq(expected_output)
     end
 
+    it 'removes deprecated fields' do
+      input_hash = JSON.parse(full_geoblacklight)
+      output = described_class.new(v1_hash: input_hash).run
+      expect(output.keys).not_to include(described_class::SCHEMA_FIELD_MAP.keys)
+      expect(output.keys).not_to include('dc_type_s')
+      expect(output.keys).not_to include('layer_geom_type_s')
+    end
+
+    it 'leaves custom fields unchanged' do
+      input_hash = JSON.parse(full_geoblacklight)
+      input_hash['custom_field'] = 'custom_value'
+      output = described_class.new(v1_hash: input_hash).run
+      expect(output['custom_field']).to eq('custom_value')
+    end
+
     context 'when the given record is already in aardvark schema' do
-      xit 'returns the record unchanged'
+      it 'returns the record unchanged' do
+        input_hash = JSON.parse(full_geoblacklight_aardvark)
+        expect(described_class.new(v1_hash: input_hash).run).to eq(input_hash)
+      end
+    end
+
+    context 'when the user supplies a mapping for collection names to ids' do
+      it 'converts the collection names to ids' do
+        input_hash = JSON.parse(full_geoblacklight)
+        collection_id_map = { 'Uganda GIS Maps and Data, 2000-2010' => 'stanford-rb371kw9607' }
+        output = described_class.new(v1_hash: input_hash, collection_id_map: collection_id_map).run
+        expect(output['dct_isPartOf_sm']).to eq(['stanford-rb371kw9607'])
+      end
     end
   end
 end
```
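One behavior the specs don't exercise is the whitespace normalization in `convert_non_crosswalked_fields`: `dc_type_s` values are stripped of all whitespace before the `RESOURCE_CLASS_MAP` lookup, so "Interactive Resource" and "InteractiveResource" both resolve, and any unmapped or missing type falls back to the required default. A self-contained sketch of that lookup rule (map truncated to two entries for illustration):

```ruby
# Subset of the commit's RESOURCE_CLASS_MAP, for illustration only.
RESOURCE_CLASS_MAP = {
  'Dataset' => ['Datasets'],
  'InteractiveResource' => ['Websites']
}.freeze

# Strip all whitespace before the lookup; resource class is required in
# Aardvark, so unmapped (or nil) types default to ['Other'].
def resource_class_for(dc_type)
  RESOURCE_CLASS_MAP[dc_type&.gsub(/\s+/, '')] || ['Other']
end

resource_class_for('Interactive Resource') # => ['Websites']
resource_class_for(nil)                    # => ['Other']
```

Resource type gets the same normalization but no default, since `gbl_resourceType_sm` is optional in the schema.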
