Handling multiple, large datasets using GIS: current progress
Possibly the biggest challenge built into the EngLaId project is how we bring together and synthesise the diversely recorded datasets that we are using. Whilst some consistencies exist between the recording methods used by different data sources, a considerable amount of diversity remains. English Heritage (EH), England's 84 Historic Environment Records (HERs), the Portable Antiquities Scheme (PAS) and other data providers / researchers all keep their own separate databases, each recorded in a different way, and their entries relate partly to the same objects and partly to different ones.
As a result, there is considerable duplication between the different data sources, and it is not at all easy to extract. Where data objects have names, as in the case of many larger sites, these can be used to assess duplication (assuming all datasets use the same name for an object), but this does not apply to the much more common case of objects with no assigned name.
Therefore, the best way to discover duplication and to attempt a synthesis between different datasets is to test for spatial similarity. In other words, if a Roman villa is present in the same space within two different datasets, we can assume that it is the same villa. However, this is in turn complicated by the fact that different data sources record to different levels of spatial precision and using different data types (e.g. points vs polygons). The approach I am experimenting with to get around this problem is to apply a tessellation of grid squares over the map, test the input datasets for which objects fall within each square, record their type and period, and aggregate across datasets to assess the presence or absence of each site type for each period.
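As a minimal illustration of the underlying test (a sketch using the open-source shapely library rather than our actual ArcGIS workflow; the coordinates and features are invented), a point-recorded villa from one dataset and a polygon-recorded villa from another can both register presence in the same grid square:

```python
# Illustrative sketch only (not the project's ArcGIS workflow): testing
# which grid cells a feature overlaps, using the shapely library.
from shapely.geometry import Point, Polygon, box

# A hypothetical 1km x 1km grid cell and two input features (OSGB metres).
cell = box(430000, 110000, 431000, 111000)  # minx, miny, maxx, maxy
villa_point = Point(430500, 110500)         # point-recorded villa from one dataset
villa_polygon = Polygon([(430900, 110900), (431200, 110900),
                         (431200, 111200), (430900, 111200)])  # polygon from another

# Presence is recorded if a feature intersects the cell at all, so the two
# differently recorded villas both register in the same square.
print(cell.intersects(villa_point))    # True
print(cell.intersects(villa_polygon))  # True
```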
The first stage is to simplify the terms used in the input dataset down to a set of (currently) eight output terms (these are not yet fully defined and the number of output terms will undoubtedly grow). This is partly so that the output text fields do not exceed the 254 character limit for ArcGIS shapefiles (I will be working on a solution to this, probably involving moving to the geodatabase format), and partly so that we can identify objects of similar type recorded using different terminologies. This is accomplished through the use of a Python script.
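The real script and its term list are still in flux, but the principle is a simple lookup from the varied input thesaurus terms to the small set of output terms. A hypothetical sketch (the mappings shown are illustrative only):

```python
# Hypothetical sketch of the term simplification step: the actual script and
# its (still evolving) term list differ, but the principle is a lookup from
# the many input thesaurus terms to a small set of broad output terms.
SIMPLIFICATION = {
    'VILLA': 'SETTLEMENT',
    'ROUNDHOUSE': 'SETTLEMENT',
    'TOWN': 'SETTLEMENT',
    'BARROW': 'FUNERARY',
    'CEMETERY': 'FUNERARY',
    'FIELD SYSTEM': 'AGRICULTURE',
    # ... the full mapping would cover the whole input thesaurus
}

def simplify(input_term):
    """Map an input thesaurus term to a broad output term, if known."""
    return SIMPLIFICATION.get(input_term.upper())

print(simplify('Villa'))  # SETTLEMENT
```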
The grid square tessellations were created using the tools provided as part of the Geospatial Modelling Environment software, which is free to download and use. So far, I have created tessellations at resolutions of 1km x 1km, 2km x 2km, and 5km x 5km to cover the different scales of analysis to be undertaken (and ultimately to give flexibility in the resolution of outputs for publishing purposes with regard to the varying requirements of our data providers). These were then cut down to the extent of England using a spatial query.
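For illustration, the same kind of tessellation could be generated in a few lines of Python (we used Geospatial Modelling Environment in practice; the extent and cell size below are invented):

```python
# Sketch of generating a square tessellation over a bounding box; the project
# used Geospatial Modelling Environment for this step, so this is purely
# illustrative. Coordinates are hypothetical OSGB metres.
from shapely.geometry import box

def make_grid(minx, miny, maxx, maxy, cell_size):
    """Yield (cell_id, polygon) pairs for a regular square grid."""
    cell_id = 0
    y = miny
    while y < maxy:
        x = minx
        while x < maxx:
            yield cell_id, box(x, y, x + cell_size, y + cell_size)
            cell_id += 1
            x += cell_size
        y += cell_size

# e.g. a 5km x 5km grid over a 100km x 100km test extent:
grid = dict(make_grid(400000, 100000, 500000, 200000, 5000))
print(len(grid))  # 400 cells
```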
ArcGIS's Identity tool was then used to extract which input objects fell within which grid square (or squares, in the case of large polygons and long lines). The attribute tables for these identity layers were then exported and run through another Python script to aggregate the entries for each grid square and to eliminate duplication within each square. The table output by the script (containing the cell identifier, a text string of periods, and a text string of types per period) was then joined to the grid square tessellation layer based upon the identifier for each cell. The result is a layer consisting of a series of grid squares, each of which carries a text string attribute recording the broad categories of site type (by period) falling within it.
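A hedged sketch of that aggregation step (the real script's field names and file formats will differ; this assumes the identity output has been exported to a CSV with hypothetical CELL_ID, PERIOD and TYPE columns, one row per object/square overlap):

```python
# Sketch of aggregating exported identity-layer rows per grid square.
# Column names and filenames are assumptions, not the project's actual ones.
import csv
from collections import defaultdict

cells = defaultdict(lambda: defaultdict(set))  # cell -> period -> set of types

with open('identity_output.csv', newline='') as f:
    for row in csv.DictReader(f):
        # Using a set per period discards duplicate records within each square.
        cells[row['CELL_ID']][row['PERIOD']].add(row['TYPE'])

with open('aggregated.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['CELL_ID', 'PERIODS', 'TYPES'])
    for cell_id, periods in cells.items():
        # e.g. TYPES might read 'BA_FUNERARY RO_SETTLEMENT' for one square.
        types = ' '.join('%s_%s' % (p, t) for p in sorted(periods)
                         for t in sorted(periods[p]))
        writer.writerow([cell_id, ' '.join(sorted(periods)), types])
```

The aggregated table can then be joined back to the tessellation layer on the cell identifier, as described above.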
This methodology means that we can bring together different datasets within a single schema. Input objects that overlap more than one output square can record their presence within several output squares* (assuming they are represented in the GIS as polygons / lines of appropriate extent). Querying the data to produce broad-scale maps of our different periods and/or categories of data is simple (using the ArcMap attribute query system's 'LIKE' query, remembering to use appropriate wildcards [% for shapefiles] to catch the full set of terms within each text field**). The analysis can also be redone using different resolutions of grid tessellation, depending on the quality of input data and the spatial scale of research question considered (e.g. 1km x 1km or 2km x 2km or 5km x 5km squares).
So far, this methodology has only been tested using EH's National Record of the Historic Environment (NRHE) data (as seen online at PastScape: the process described above also captures the relevant identifiers to link through to PastScape, with an eye on linked data output in our final website), and using an initial, rather arbitrary, set of simplification terms to produce test results, but it should be straightforward to extend this system to encompass the various other datasets that we are in the process of gathering. As an example of the output produced, here is a map of Roman settlement sites in the south west of England (settlement being defined here as entries containing any of the words: villa, house, settlement, hut, roundhouse, room, burh, town, barn, building, floor, mosaic; some of these terms obviously do not apply to the Roman period and the list will be subject to revision before final outputs are produced):
[Map: Roman settlement sites in the south west of England]
As can be seen, at the scale of a region the output is both clear and instructive. The result shows the presence or absence of a type of site within each cell, with no quantification of how many of each type (as we will ultimately not know whether a total count reflects duplication or genuine multiplicity). This picture will only get better once we have fully defined the terms used in our simplification process and once we start building in more data from our other data sources.
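To make the settlement definition above concrete, here is a minimal sketch of the keyword test it implies (the word list is the provisional one given above; the function name and the test entry are invented for illustration):

```python
# Sketch of the keyword test behind the example map: an entry counts as
# Roman settlement if its description contains any of these words (the
# list is provisional, as noted above).
SETTLEMENT_WORDS = ['villa', 'house', 'settlement', 'hut', 'roundhouse',
                    'room', 'burh', 'town', 'barn', 'building', 'floor',
                    'mosaic']

def is_settlement(description):
    text = description.lower()
    return any(word in text for word in SETTLEMENT_WORDS)

print(is_settlement('Romano-British courtyard villa with bath house'))  # True
```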
I shall be presenting a paper on this subject at CAA in Southampton in March.
Chris Green
* Whether this is appropriate, or whether they should instead fall only within the square containing the majority of the polygon, is still open to debate. I feel that under a strict rationale of presence / absence they should appear in all squares they overlap, but this could present a misleading picture in cases where, for example, a small site overlapped the junction of four large grid squares.
** e.g. [Term] LIKE '%RO_SETTLEMENT%'