In Class SQL Workshop

On Monday we wrote a SQL for determining the most culturally diverse neighborhoods in Chicago as an exercise to demonstrate the potential and functionality of SQL in PGAdmin.

Continue reading


Tutorials and Resources for Spatial Analytics in PostGIS SQL and QGIS

This post is a repository of useful tutorials, workshops, demonstrations, and reference for spatial analytics and modeling in the Postgres, PostGIS, and QGIS. Please add links to this post as you discover useful sites (include a brief description of the site).

  1. w3schools – a very useful site on the fundamentals of (non-spatial) SQL queries, including tutorials on joins and SQL functions such as AVG()
  2. boundlessgeo – an excellent primer on many of the PostGIS spatial functions, great place to get your feet wet with PostGIS spatial analytics/modeling
  3. spatialthoughts – this site provides tutorials in a wide range of techniques in QGIS
  4. linfinity – this site provides a number of tutorials clearly illustrating spatial analytic approaches using QGIS
  5. Geospatial Analysis – 4th Edition” by de Smith, Goodchild, Longley – this textbook provides the conceptual and geometric principles of spatial analytics and modeling – an excellent resource.


Laster semester we utilize two kinds of clustering  algorithms to do our analyze. The first one is distance based clustering, the second  one is grid based clustering. Although logically they are very similar, both of them are forming clusters based on distances, they are different in doing this, and results can be different. Below is the logic of these 2 algorithms.

A. distance based  clustering:

1. Buffering every single points with a distance which can be set by analyzers.

LB cluster1

2. Merging circles which have larger overlaps than the setting number into clusters.

LB cluster2

B. Grid based clustering

1. Set the distance of grid lines. Divide the target area by grid.

grid cluster1

2. Locate points into cells, then look at neighbor cells of target cell. If there is point in theses neighbor cells, merge these points as core of a cluster.

grid cluster2

3. Making convex hulls based on these cores of cluster. There is a parameter through which you can control the size of clusters.

grid cluster3

Blow is the SQL for Grid based clustering

WITH clstrtags AS ( SELECT *, tag.geom as tgeom FROM gridcluster(30,’urbantag’,’geom’) as grid
JOIN urbantag as tag
ON st_contains(st_setsrid(grid.geom,3435),st_setsrid(tag.geom,3435))
ORDER BY rid,cid
counts AS (SELECT count(tagid) as count, clusterid, activity FROM clstrtags GROUP BY clusterid, activity),
countss AS (SELECT count(tagid) as count, clusterid FROM clstrtags GROUP BY clusterid)

select counts.clusterid, counts.activity as act, counts.count as actct,countss.count as tagid, counts.count/(countss.count + 0.00) as percentage
from counts join countss
on counts.clusterid=countss.clusterid
where countss.count>1
order by clusterid

Linking Point Data to Polygon Data Using QGIS and PostGIS SQL

This post explains two ways to use spatial joins between tables that have point and polygon geom fields. The first approach uses built in tools in QGIS to quickly make basic joins for initial analysis. The second approach uses PostGIS SQL queries with the potential to do far more sophisticated joins and data analysis.

QGIS Vector/Geometry Tools:

Screen Shot 2014-01-26 at 9.47.30 PM

QGIS provides a number of tools for analyzing and rebuilding vector layers in the “Vector” tab. This includes the “Geometry Tools/Multipart to Singleparts” command, which breaks a multipolygon into individual polygon elements. The “Analysis Tools/Points In Polygons” command is shown above. When a polygon layer and a points layer are selected, it will produce a new shapefile with the polygons + a new field that is a count of the number of points in each polygon. Here is a detailed walkthrough for this tool.

PostgreSQL Queries:

These queries use an INNER JOIN on rows where the result of the “st_within()” postgis function == TRUE. This function returns TRUE if the first geometry (“pts.geom”) is completely within the second geometry (“blocks.geom”).

This code will create a new table with all the fields from the point table for all of the points that fall within one of the polygons, and add an additional field that shows an ID for the polygon in which that point falls:

SELECT pts.*, blocks.gid as blockID
FROM “Detroit_BP1995_point” AS pts
INNER JOIN “Detroit_CityMap3_region_individualPoly” AS blocks
ON st_within(pts.geom, blocks.geom) AS result

This code joins the point table created by the SQL above to the original polygon table, generating a new polygon table with a count of points within each polygon. This allows for more sophisticated calculations, for instance, normalizing counts against the size of each polygon, etc. (This SQL needs some more work):

SELECT blockID, number_of_permits, total_cost, average_cost, geom FROM(
SELECT pts.blockID, COUNT(blockID) AS number_of_permits, SUM(“COSTBP”) AS total_cost, AVG(“COSTBP”) AS average_cost
FROM detroitTest04b as pts
LEFT JOIN “Detroit_CityMap3_region_individualPoly” as blocks
ON blocks.gid = pts.blockID
GROUP BY blockID) AS results
LEFT JOIN “Detroit_CityMap3_region_individualPoly” as blocks
ON results.blockID = blocks.gid

Making Routes and Visualizing in QGIS Using SQL

PgRouting analyzes graphs to find the shortest path between two nodes through edges. In a street network, you can think of nodes as intersections and edges as the streets themselves. I will be using the Chicago street network, which can be found here. I have also uploaded it to the server, named chicagostreets.

1. In order to develop data capable of using PgRouting a source and target geometry must be set up at the start and end of each road. These will serve as the start and end destination points. To perform this follow the “Creating a routable road network” section of this tutorial (blog also has many helpful QGIS tutorials). Another thing to get familiar with in SQL is the WITH Clause. This allows you to reference other SQLs within one query.

2. Once your network is made you can start routing. Follow this tutorial to get multiple options for routing. pgr_dijkstra and pgr_kdijkstraPath are two that work well for visualizing one path or one startpoint to many paths.

pgr_kdijkstraPath3. By joining the geometries in your network to the results of your routing you can import this into QGIS and see your paths.

To reference my SQL query from the tutorial in class, click link.

Creating Convex Hulls with SQL to QGIS: Part 2

We created clusters (convex hulls) of venues making distinctions between the venues by categories. We started off by creating a mathematical algorithm that combines multiple equations:

  • We needed to determine a method of comparison for all the venues. Our method used the average distance between all the venues. We used the average distance, which we calculated by comparing the distance from one venue to all other venues and doing this for all the venues and then dividing the number of venues. In essence, it is a permutation with repetition calculation.
  • After we calculated the average distance we set a threshold divider of the average distance to determine the size of the clusters that we wanted to use.
  • Then we had to match the venue points of comparison to cluster only by category whereby venue points would only cluster when one venue category was equal to another venue category.
  • The next step was to create the shape of the cluster creating a geometry polygon to import as a PostGIS shapefile.

Step 1:

CREATE TABLE avgdistances2 AS

SELECT * FROM  (SELECT f1.gid as v1id, f1.fsid, f2.gid as v2id, st_distance(f1.geom,f2.geom) AS dist, f1.upperlevelcat as f1ul, f2.upperlevelcat as f2ul

FROM fsvenuewithupperlevelcat2 as f1

CROSS JOIN fsvenuewithupperlevelcat2 as f2  WHERE f1.checkinct > 200 and f1.gid<>f2.gid) AS q1

WHERE q1.dist < ( SELECT SUM(avgdist)/count(hack) as avgdisttotal    

FROM     (SELECT f1.gid, 1 as hack, SUM(st_distance(f1.geom,f2.geom))/COUNT(f1.gid) AS avgdist

FROM fsvenue as f1

CROSS JOIN fsvenue as f2

WHERE  f1.gid <> f2.gid     GROUP BY f1.gid) as subq    GROUP BY subq.hack)/4 AND f1ul=f2ul  ORDER BY q1.v1id;

Step 2:

CREATE TABLE fsvenue_cathulls(   gid serial NOT NULL,   fsid character varying(256),   CONSTRAINT fsvenue_hulls_pkey PRIMARY KEY (gid) ) WITH (   OIDS=FALSE );

ALTER TABLE fsvenue_hulls  OWNER TO poweruser; GRANT ALL ON TABLE hullmp3 TO poweruser;

SELECT AddGeometryColumn(‘fsvenue_hulls’,’geom’,3435,’MULTIPOINT‘,2);

INSERT INTO fsvenue_hulls (fsid,geom) SELECT ad.fsid, st_collect(v.geom)  FROM avgdistances2 AS ad Left JOIN fsvenue AS v ON ad.v2id = v.gid WHERE v1id <> 42 GROUP BY ad.fsid ORDER BY ad.fsid;

CREATE VIEW fsvenue_cathulls AS SELECT v.gid, v.fsid, v.upperlevelcat, st_convexhull(h.geom)::geometry(‘Polygon’, 3435) as geom FROM fsvenue_hulls as h LEFT JOIN fsvenuewithupperlevelcat2 as v ON h.fsid = v.fsid WHERE st_npoints(h.geom) > 3;

SELECT * FROM fsvenue_cathulls;

Step 3:

Import PostGIS file to QGIS

Result from category 1: Arts & Entertainment

Bucktown & Wicker Park



SQL and PgRouting Continued

In order to get pg routing to work, first a transportation network must be formed with a source and target column created to specify which start and end points are used. The SQL below creates a network  in a buffer of a 1 mile radius and examines the Damen CTA Stop. To alter the name of the table created, change the buffer distance, or to change the CTA Stop, change the red values in the SQL below.

CREATE TABLE damennetwork AS
WITH ctastop AS
(SELECT * FROM cta_railstations
WHERE gid = 32),
transpo AS
row_number() OVER (ORDER BY transportation.gid)::integer AS gid,
transportation.gid AS id, 
transportation.street_nam AS name,
transportation.length AS cost,
st_startpoint(ST_LineMerge(transportation.geom)) as source,
st_endpoint(ST_LineMerge(transportation.geom)) as target
FROM ctastop
LEFT JOIN transportation
ON st_within(transportation.geom, 
st_setsrid(st_buffer(ctastop.geom, 5280), 3435))),
nodes AS
(SELECT DISTINCT transpo.source AS gid 
FROM transpo
FROM transpo)
SELECT transpo.gid,,, transpo.cost, transpo.geom, as source, as target
FROM transpo
(SELECT row_number() OVER (ORDER BY nodes.gid)::integer AS id, 
nodes.gid AS geom
FROM nodes
GROUP BY nodes.gid)
AS source ON transpo.source = source.geom
(SELECT row_number() OVER (ORDER BY nodes.gid)::integer AS id, 
nodes.gid AS geom
FROM nodes
GROUP BY nodes.gid) 
AS target ON = target.geom

To be able to find a route with this SQL use the below SQL. Input the name of your table used in the previous SQL where there are the “damennetwork” in red. To change to start and end points, change to “502” and “341” respectfully.

SELECT seq,id1 as node, id2 as edge, route.cost, as streetname, damennetwork.geom 
FROM pgr_dijkstra('
 SELECT gid AS id,
 cost::double precision AS cost
 FROM damennetwork', 502, 341, false, false ) 
AS route
LEFT JOIN damennetwork
ON route.id2 = damennetwork.gid;

What we have been working on is using the CTA stop as a start point and all the venues as the end points. The first thing I did was take all the venues in the same buffer as the network and corresponded them to their closest edge (the target). Different buffers and train stops can be chosen by the the red numbers.

WITH ctastop AS
(SELECT * FROM cta_railstations
WHERE gid = 32),
p AS
venues.gid AS gid, 
venues."legal name" AS company,
venues."doing busi" AS name,
venues.address AS address,
venues.latitude AS lat,
venues.longitude AS lon,
venues.geom AS geom
LEFT JOIN venues
ON st_within(venues.geom, 
st_buffer(ctastop.geom, 5280)))
SELECT rcost.*, p_to_l.targetID, p_to_l.geom FROM
(SELECT p.gid as restaurantID,
min(st_distance(p.geom,l.geom)) as distance
FROM p, damennetwork AS l
GROUP BY restaurantid) 
AS rcost
(SELECT p.gid as restaurantID, as targetID, l.geom,
st_distance(p.geom,l.geom) as distance
FROM p, damennetwork AS l) 
AS p_to_l
ON rcost.distance = p_to_l.distance
AND rcost.restaurantID = p_to_l.restaurantID

The same must be done to find the start point (source) which is the CTA stop. Change to red value to select a different trainstop.

(SELECT * FROM cta_railstations
WHERE gid = 32)
SELECT row_number() OVER (ORDER BY sourceid.distance)::integer AS gid,
sourceid.* FROM (
SELECT p.gid as restaurantID, l.source as sourceID, l.geom,
st_distance(p.geom,l.geom) as distance
FROM ctastop AS p, damennetwork AS l) as sourceid) as source
WHERE source.gid = 1

This gave me a result of a 292 start point, which I will be plugging into the next SQL (in red). The other numbers are the results from the venue target table formed two SQLs ago.

CREATE TABLE routetestwm AS
SELECT seq,id1 as path, id2 as node, id3 as edge, route.cost,, damennetwork.geom
FROM pgr_kdijkstraPath('
 SELECT gid AS id,
 cost::double precision AS cost
 FROM damennetwork', 292, array[336,457,227,179,179,291,351,515,424,244,427,250,349,457,138,291,291,263,230,117,115,208,325,594,284,301,368,116,91,471,543,398,251,251,251,230,230,365,138,463,404,301,356,212,342,203,203,342,478,286,292,519,41,242,242,242,242,242,242,356,57,57,289,310,286,452,277,251,212,427,251,182,242,471,445,251,251,93,291,286,305,262,46,262,46,262,288,288,371,117,398,105,105,286], 
 false, false ) 
AS route
LEFT JOIN damennetwork
ON route.id3 = damennetwork.gid;

This table can be brought into qgis as the network of paths from a CTA stop to any venue.

All Routes