API Documentation

Graph building and troubleshooting

MissingLinks.semi_to_fully_nodedFunction
semi_to_fully_noded(data...; snap_tolerance=1e-6, split_tolerance=1e-6)

Convert GDAL data sources where the sources are "semi-noded" - that is, coincident ends, or an end coincident with a line segment, count as intersections - into a fully noded graph with nodes at each intersection, ready to be built into a graph.

Data is a vararg, so arbitrarily many layers may be supplied. They will be merged into a single output layer.

The snap tolerance is how close an end has to be to a line to be considered connected. The split tolerance is how close two splits must be to each other to be considered the same split, and should generally be much smaller, to avoid a situation where two ends snap to a line at points near each other, the line is split at one of them, and the split is more than snap_tolerance distance from the other one. I further recommend setting the tolerance when building the graph to snap_tolerance + split_tolerance. These should both be quite small; larger gaps in the network should be addressed with add_short_edges!.

source
MissingLinks.graph_from_gdalFunction
graph_from_gdal(layers...; tolerance=1e-6, max_edge_length=250)

Build a graph from one or more GeoDataFrame layers.

This builds a graph from a fully-noded GIS network. You can pass in an arbitrary number of GeoDataFrame layers, which will be combined into a single graph. They should all be in the same coordinate system.

The tolerance specifies how far apart nodes can be and still be considered connected. This should be smaller; larger gaps should be closed by add_short_edges!; see discussion there for why.

We assume that potential links are always where two edges pass closest to one another. As the length of the edges increases, this assumption gets worse. max_edge_length controls how long edges can be; any edges longer than this will be split.

Both tolerance and max_edge_length are in the same units as the underlying data.

source
MissingLinks.graph_to_graphmlFunction
graph_to_graphml(outfile, graph; pretty=false)

Export the graph to a graphml file specified in outfile, for visualization or manipulation with other tools.

If pretty is set to true, the XML file will be pretty-printed.

source
MissingLinks.graph_to_gisFunction
graph_to_gis(fn, G)

Write graph G to GIS file fn.

The file type is determined by extension; I recommend .gpkg for GeoPackage output.

source
MissingLinks.remove_tiny_islandsFunction
remove_tiny_islands(graph, min_vertices_to_retain)

Remove islands from graph with fewer than min_vertices_to_retain, and return a modified graph.

Often, due to bad data, there will be small islands isolated from the rest of the network. This identifies all components of the graph, and returns a copy of the graph with only those components with at least min_vertices_to_retain.

This should already work for directed graphs, as for directed graphs it uses strongly connected components to determine vertices to remove. See extensive discussion here to see why this is relevant.

source
MissingLinks.add_short_edges!Function
add_short_edges!(graph, max_edge_length)

Add edges to graph in-place between all nodes that are closer than max_edge_length to one another, and currently more than 2 * maxedgelength apart via the network.

You might think, as I did, that snapping nodes during graph build would be an easier solution, but that can be prone to errors because it results in nodes being actually moved. Consider this situation:

---a (0.3 m) b--3.1m---c (0.2m) d----

Assuming the vertices are read in a b c d order, b and c will get snapped to a, and d will not get snapped to anything.

Instead, we recommend only snapping a very small distance during graph build (default 1e-6 meters). We then add edges between nodes close together.

source
MissingLinks.find_disconnected_crossingsFunction
find_disconnected_crossings(G, dmat; tolerance=10)

Find locations where edges of graph G cross without intersecting, and where the network distance between the intersecting points is more than tolerance.

These are often graph errors, but may also be overpasses, tunnels, etc. Returns a GeoDataFrames-compatible point DataFrame with the locations, for further examination in GIS.

source
MissingLinks.find_dead_endsFunction
find_dead_ends(G)

Find locations in graph G where there is a dead end.

Returns a GeoDataFrame. This is useful for network data creation as by opening this layer in GIS you can easily see where there are existing dead ends and determine if they are correct or not.

source
MissingLinks.remove_elevation!Function
remove_elevation!(geom)

Remove the Z component of a geometry, and change the ArchGDAL wrapper type to match.

This does mutate its argument, but in order to re-wrap in a new ArchGDAL type a new wrapper must be returned. So you should use the return value of the function.

Workaround for https://github.com/yeesian/ArchGDAL.jl/issues/333

source

Opportunity data

MissingLinks.create_graph_weightsFunction
create_graph_weights(graph, geodata, weightcols, distance)

Using a graph and a spatial data frame, assign weights to graph nodes.

Each row in the spatial data frame should have a weight associated with it (if they are all equivalent, the weight can be set to a column of ones). For each row, all graph edges within distance of the geometry of that row will be identified (the spatial data frame should be in the same projection as the network data used to create the graph). An equal amount of weight will be assigned to the nodes at each end of all identified edges. (In a corner, twice as much weight will be assigned to the node at the corner, as it will receive weight from each of the links adjacent to the property. The effect may be even more pronounced if the streets on the other side of the intersection from the row are also within distance).

weightcols specifies which columns to use as weights. It must be a vector. Multiple weight columns can be specified. The function will return an n x w matrix, where n is the number of nodes in the graph, and w is the number of weightcols specified. This will contain the weight for each node for each weightcol.

source
MissingLinks.fill_distance_matrix!Function
fill_distance_matrix!(G, mtx::AbstractMatrix{T}; maxdist=5000, origins=1:nv(G))

Fill an already created distance matrix mtx with shortest-path distances based on graph G.

The units are the same as the underlying data, and will be rounded to the resolution of whatever the element type T of mtx is. We usually use UInt16 meters as these can represent quite long trips with reasonable accuracy while minimizing memory consumption. Using other data types is not tested.

To make this faster, you can set a maximum distance maxdist; for destinations beyond this distance (or destinations that are unreachable altogether) the matrix will contain typemax(T).

Will use multiple threads if Julia is started with multiple threads.

source
MissingLinks.identify_potential_missing_linksFunction
identify_potential_missing_links(graph, distance_matrix, max_link_dist, min_net_dist)

Identify locations in graph that are less than max_link_dist apart geographically, but at least min_net_dist apart in network distance (or disconnected entirely).

dmat should be a distance matrix for all nodes in the graph, generally created by fill_distance_matrix!

source
MissingLinks.deduplicate_linksFunction
deduplicate_links(list, distance_matrix, sphere_of_influence_radius)

Deduplicate links using a heuristic.

Arguments are a list of links (e.g. as produced by identify_potential_missing_links), a distance matrix between nodes for the graph used to identify those links, and the radius of the sphere of influence (see below).

Consider the situation below. Lines are existing roads. Here, you have the ends of two blocks in different subdivisions, that do not connect.

--a--o    o--d--
     |    |
     c    e
     |    |
--b--o    o--f--     

Since there are three edges in each subdivision in the graph above, all of which pass within (say) 100m of each other, there are nine possible candidate links: ad, ae, af, bd, be, bf, cd, ce, cf. Clearly, all of these provide essentially the same level of access, so we want to deduplicate them.

We consider two candidate links to be duplicates if they connect pairs of locations within a sphere_of_influence_radius of one another (we use 100m in the paper). To identify these links, we use the following greedy algorithm. Note that since it is a greedy algorithm, the order the links are identified in matters. Currently, link identification is not multithreaded, so the order of the links should be deterministic, and therefore you should get the same results from this deduplication whenever it is run.

For the first candidate link, we identify the "sphere of influence" of that link; the sphere of influence is all nodes that are within 100m of either end of the candidate link. We calculate this using the node-to-node distance matrix and the distances of the link from the start and end of the edges it connects. We do not need to define separate spheres of influence for the start and end of the link—as long as the radius of the sphere of influence is less than half the minimum network distance for proposing a link, any candidate link that connects two nodes in the sphere of influence must have one end near each end of the original link that defined the sphere of influence.

For subsequent candidate links, we determine if the link is within an already-identified sphere of influence. We check to see if one of the ends of each of the edges the candidate link connects is in an already identified sphere of influence. If it is, we further check to see if the network distances between the ends of this candidate link and the corresponding ends of candidate link that defined the sphere of influence are less than the threshold for both ends of the links (note that links are undirected, so we explicitly consider links duplicates even if the start of one is near the end of the other and vice versa). If these network distances are both less than the radius of the sphere of influence, we add the link to this sphere of influence and continue to the next link. If they are not, we continue to the next sphere of influence. If a link is not within any existing sphere of influence, we define a new sphere of influence based on the link.

We then return the link with the shortest geographic distance from each sphere of influence. Note that there may still be links that are close to one another; if two links define adjacent or even overlapping spheres of influence, but with neither link in either sphere of influence, the best links in the two spheres of influence may be very similar ones in between the original two links. You could iterate the algorithm if you wanted to to prevent this, but this could result in links being supplanted by ones more than 100m away.

source
MissingLinks.score_linksFunction
score_links(decay_function, links, distance_matrix, origin_weights, dest_weights, decay_cutoff_m)

Score the contribution of each link in links to aggregate accessibility, using the decay function and weights specified.

decay_function is a Julia function that takes a single argument, a distance, and returns a weight; smaller weights mean less influence. For a two-mile cumulative opportunities function, this could just be e.g. x -> x < 3218. It is the first argument to allow use of Julia do-block syntax.

links is the vector of links (e.g. from deduplicate_links). The distance matrix is the same one used throughout the analysis. origin_weights and dest_weights are vectors of weights associated with each node, e.g. computed by create_graph_weight.

decay_cutoff_m is the distance at which decay_function goes to zero (for a smooth function, we recommend reimplementing as piecewise with a dropoff to zero at some point where additional access is largely immaterial). It is in meters if the input data were in meters (which we recommend as we have not tested with other units to ensure units are not hardcoded anywhere. Just use meters. Be a scientist.)

For each link, we identify how much making this connection would increase the aggregate number of origin-weighted destinations (i.e. the sum across origins of the number of destinations accessible within a threshold distance). We do this in several steps:

  1. Identify all origins and destinations within the threshold of the start of the proposed new link
  2. For each origin and destination: a. generate distances from the origin to the destination via the new link by summing - Distance from origin to link - Length of link - Distance from link to each destination b. if the distance via the new link is longer than the existing distance, continue to the next pair c. otherwise, compute the distance weight using the weighting function in the accessibility metric for the distance via the new link and the existing distance d. take the difference of these weights e. multiply by the origin and destination weights
  3. Sum the results
  4. Reverse the start and end of the link, and repeat
  5. Sum the results
source
MissingLinks.links_to_gisFunction
links_to_gis(graph, links, pairs...)

Convenience function that converts a vector of links (and associated graph) into a GeoDataFrame, suitable for export to a GIS file.

pairs` should be pairs of :col_name=>values that will be added to the output. For example, you will often want to add scores to your output.

source

Service area analysis

MissingLinks.service_areaFunction

Compute a service area starting at a particular location, based on a precomputed distance matrix, graph, and an optional list of links to include.

Return the street segments within threshold distance of the location as a GIS layer.

Supply edgeindex=<result of indexgraph_edges(G)> if you precomputed the edge spatial index to avoid recomputing each time the function is run.

source

Internal

MissingLinks.CandidateLinkType

A CandidateLink represents a link between two existing edges. It stores the vertex codes (not labels) from each end of each edge, as well as the distances from each vertex at the point where the link is.

source
MissingLinks.add_geom_to_graph!Function

This adds edge(s) representing geom to the graph. It may add multiple edges in cases where two edges would otherwise be duplicates because they connect the same intersections (see #2).

Returns a tuple of tuples (frv, tov) - usually just one, but some edges may be split before adding to graph.

source
MissingLinks.add_unless_typemaxFunction

Add y to x, but only if x is not already typemax (used to indicate unreachable). This assumes that if x < typemax(typeof(T)), x + y < typemax(typeof(T)), which is generally reasonable as x and y are in meters and anything we're adding is well less than 65km, unless it's unreachable, which should only be the case for x (coming from the distance matrix). However, we do use a checked_add to throw an error if that assumption is violated.

If x == typemax(T), return x If x < typemax(T) and x + y <= typemax(T), return x + y If x < typemax(T) and x + y > typemax(T), throw OverflowError

source
MissingLinks.find_point_on_edgeFunction

Snap the point origin to the nearest edge in G, returning a namedtuple with fields:

  • src: Source vertex label
  • tgt: Target vertex label
  • distfromsrc: distance from the source to the closest point on the edge
  • distfromtgt: distance from the target to the closest point on the edge
  • perpendicular_dist: distance from the requested point to the closest point on the edge

or nothing if there is no edge within snap_tolerance

source
Base.reverseFunction

Create a reversed version of a candidate link, used in evaluating accessibility to calculate accessibility in both directions.

source