Autogenerated Documentation for utilities

Simple Utilities

Contains (i) nonspatial helper functions, (ii) simple spatial processing functions, and (iii) a couple of shapely-based manipulations

TXHousing.utilities.simple.convert_to_hex(rgba_color)[source]

Converts rgba colors to hexcodes. Adapted from https://stackoverflow.com/questions/35516318/plot-colored-polygons-with-geodataframe-in-folium
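
As a rough illustration of this kind of conversion, the sketch below assumes the input is a matplotlib-style tuple of floats in [0, 1] and that the alpha channel is simply dropped. Neither assumption is confirmed by the documentation above, and rgba_to_hex_sketch is a hypothetical name, not the library's function.

```python
def rgba_to_hex_sketch(rgba_color):
    """Hypothetical stand-in: convert (r, g, b, a) floats in [0, 1] to '#rrggbb'."""
    # Assumption: channels are floats in [0, 1]; the alpha channel is ignored.
    red, green, blue = (int(round(channel * 255)) for channel in rgba_color[:3])
    return "#{:02x}{:02x}{:02x}".format(red, green, blue)


print(rgba_to_hex_sketch((1.0, 0.0, 0.0, 1.0)))
```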

TXHousing.utilities.simple.fragment(polygon, horiz=10, vert=10)[source]

Fragment polygon into smaller pieces. This is used to (vastly) improve spatial tree efficiency on large polygons.

Parameters:
  • polygon (shapely polygon) – Polygon to fragment
  • horiz (int) – Number of horizontal fragments, defaults to 10
  • vert (int) – Number of vertical fragments, defaults to 10
Returns:

A list of smaller polygons which are a partition of the input polygon.
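
The grid idea behind fragmenting can be sketched in plain Python. The example below only splits a bounding box into horiz x vert rectangles; presumably the real function then clips each rectangle to the polygon (for instance with shapely's intersection), which is not shown here. The name grid_cells and the tuple layout are assumptions made for illustration.

```python
def grid_cells(bounds, horiz=10, vert=10):
    """Split a bounding box (minx, miny, maxx, maxy) into horiz x vert rectangles."""
    minx, miny, maxx, maxy = bounds
    width = (maxx - minx) / horiz
    height = (maxy - miny) / vert
    cells = []
    for i in range(horiz):
        for j in range(vert):
            # Each cell is itself a (minx, miny, maxx, maxy) tuple.
            cells.append((minx + i * width, miny + j * height,
                          minx + (i + 1) * width, miny + (j + 1) * height))
    return cells


cells = grid_cells((0, 0, 10, 10))
print(len(cells))
```

Smaller pieces have tighter bounding boxes, which is why a spatial index returns fewer false candidates for them.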

TXHousing.utilities.simple.get_urban_core(lat, long, radius, scale=5280, newproj='epsg:2277')[source]

Create a polygon representing the urban core of a city. Based on shapely’s buffer method, but combined with crs transformations so you get to pick the units.

Parameters:
  • lat – The latitude of the center of the city.
  • long – The longitude of the center of the city.
  • radius – The radius in units of your choice; see the scale and newproj parameters.
  • scale – Defaults to 5280, feet per mile.
  • newproj – The new projection to use to calculate this distance (by default epsg:2277, which is in feet).
Returns:

a geopandas geodataframe with a single column (geometry) of length one (polygon) which represents the urban core.

TXHousing.utilities.simple.make_point_grid(gdf, horiz=20, vert=20, factor=None, by='mean', geometry_column='geometry')[source]

Given a geodataframe of points, partition them into a rectangular grid and calculate either the number of points in each rectangle or the mean or median of a factor associated with the points for each rectangle. This is used to make choropleths out of point data.

Parameters:
  • gdf – Geodataframe, with point geometry presumably.
  • horiz – Number of horizontal boxes
  • vert – Number of vertical boxes
  • factor – The (continuous) value with which to take the mean/median of the points. Defaults to None.
  • by – ‘mean’ or ‘median’. Meaningless unless you have the factor column.
  • geometry_column – The column the points are contained in (these should be shapely points).
Returns:

geodataframe with grid geometry and a ‘value’ column
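
The binning step described above can be sketched without geopandas. This is a minimal illustration, assuming points are plain (x, y) tuples and the factor is a parallel list of numbers; point_grid_sketch is a hypothetical helper and it returns a dictionary keyed by grid cell rather than a geodataframe.

```python
from collections import defaultdict
from statistics import mean, median


def point_grid_sketch(points, horiz=20, vert=20, values=None, by="mean"):
    """Bin (x, y) points into a horiz x vert grid; return {(col, row): summary}."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    minx, miny = min(xs), min(ys)
    # Fall back to 1.0 when all points share a coordinate, to avoid dividing by zero.
    width = (max(xs) - minx) / horiz or 1.0
    height = (max(ys) - miny) / vert or 1.0
    cells = defaultdict(list)
    for index, (x, y) in enumerate(points):
        # Points on the upper edge are clamped into the last column/row.
        col = min(int((x - minx) / width), horiz - 1)
        row = min(int((y - miny) / height), vert - 1)
        cells[(col, row)].append(1 if values is None else values[index])
    if values is None:
        return {cell: len(members) for cell, members in cells.items()}
    summarize = mean if by == "mean" else median
    return {cell: summarize(members) for cell, members in cells.items()}


points = [(0, 0), (1, 1), (9, 9), (10, 10)]
print(point_grid_sketch(points, horiz=2, vert=2))
```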

TXHousing.utilities.simple.process_geometry(gdf, geometry_column='geometry', drop_multipolygons=True)[source]

Processing for polygon-based gdfs: makes geometries valid and possibly drops multipolygons.

TXHousing.utilities.simple.process_points(points, geometry_column='geometry')[source]

Processing for point-based gdfs: ignores invalid points

TXHousing.utilities.simple.retrieve_coords(point)[source]

Retrieves coords and reverses their order for shapely point. (Reverses because folium and GeoPandas use opposite lat/long conventions).

TXHousing.utilities.simple.will_it_float(text)[source]

Checks whether an object can be converted to a float.
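
A check of this kind is usually a try/except around float(). The sketch below is one plausible way to write it, with the hypothetical name can_be_float; the library's actual implementation may differ.

```python
def can_be_float(value):
    """Return True if float(value) succeeds, False otherwise."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        # TypeError covers objects like None; ValueError covers strings like "abc".
        return False


print(can_be_float("3.14"), can_be_float("abc"))
```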

Spatial Join Utilities

Fast intersection functions

TXHousing.utilities.spatial_joins.fast_polygon_intersection(small_polygon_gdf, large_polygon_gdf, small_points_spatial_index=None, small_geometry_column='geometry', large_geometry_column='geometry', large_name_column=None, **kwargs)[source]

Given a gdf of small polygons (e.g. parcels) and a gdf of large polygons (e.g. municipal boundaries), calculates the large polygon in which each small polygon lies. This function is based on the points_intersect_multiple_polygons function and therefore assumes that each small polygon will lie in at most one of the large polygons.

Parameters:
  • small_polygon_gdf – A gdf of small polygons (e.g. parcels)
  • large_polygon_gdf – A gdf of large polygons (e.g. municipal boundaries)
  • small_points_spatial_index – The spatial index for the centroids of the small_polygon_gdf. Note that passing the spatial index for the polygons of the small_polygon_gdf is different and could lead to unexpected results. This is optional, and will be generated by the underlying points_intersect_multiple_polygons call if not supplied.
  • small_geometry_column – The geometry column of the small_polygon_gdf.
  • large_geometry_column – The geometry column of the large_polygon_gdf.
  • large_name_column – Column for the names of each large polygon; if none will use the index of the large_polygon_gdf.
  • kwargs – Kwargs to pass to the “fragment” function in the TXHousing.utilities.simple module. Fragmenting polygons speeds up the computation for all but very small polygons and does not affect the results. To disable fragmenting, pass in horiz = 1 and vert = 1 as kwargs.
Returns:

A pandas series mapping the index of the small polygons to the names of the large polygons. If an index does not appear in the returned series, that is because the small polygon corresponding to that index did not lie inside any of the large polygons.

TXHousing.utilities.spatial_joins.get_averages_by_area(data_source, other_geometries, features, density_flag=False, data_source_geometry_column='geometry', other_geometries_column='geometry', drop_multipolygons=True, account_method=None, horiz=1, vert=1)[source]

Get averages of features from data_source by area. Data_source and other_geometries should have the same crs initially. This is a wrapper for polygons_intersect_single_polygon and is therefore quite accurate.

Parameters:
  • data_source (GeoDataFrame) – The data source, usually block data. Must have polygon geometry.
  • other_geometries – Will calculate features for each row of this gdf from the data source. Must have polygon geometry.
  • features (str or list) – The feature in question. Can also be a list of features, e.g. [‘B01001e1’, ‘B01001e2’]
  • density_flag (Boolean) – Default False. If True, will assume that the ‘feature’ is already units per area and will not divide the feature by the area of the data source polygons.
  • data_source_geometry_column – geometry column for data_source
  • other_geometries_column – geometry column for other_geometries
  • account_method – The method by which to account for the % of an area which is not residential (this prevents population-related estimates from being too low). Can either be None, ‘percent_residential’, or ‘percent_land’. Defaults to None (although wrappers of this function may have different defaults).
  • horiz – When fragmenting polygons, number of horizontal fragments to make. Defaults to 1.
  • vert – When fragmenting polygons, number of vertical fragments to make. Defaults to 1.
Returns:

other_geometries but with a new column, feature, which has the averages by area.

TXHousing.utilities.spatial_joins.points_intersect_multiple_polygons(points_gdf, polygons_gdf, points_spatial_index=None, points_geometry_column='geometry', polygons_geometry_column='geometry', polygons_names_column=None, **kwargs)[source]

Given a gdf of points and a gdf of polygons, calculates the polygon in which each point lies. This function assumes that each point will lie in at most one of the polygons. If that assumption is not true, use instead the points_intersect_single_polygon function and apply it to the geometry column of a polygon gdf.

Parameters:
  • points_gdf – A geodataframe of points data.
  • polygons_gdf – A geodataframe of polygons data.
  • points_spatial_index – Optional; the spatial_index of the points geodataframe. If not supplied, the function will automatically generate the spatial index.
  • points_geometry_column – Geometry column for the points data.
  • polygons_geometry_column – Geometry column for the polygon data.
  • polygons_names_column – Column for the names of each polygon; if none will use the index of the polygons_gdf.
  • kwargs – Kwargs to pass to the “fragment” function in the TXHousing.utilities.simple module. Fragmenting polygons speeds up the computation for all but very small polygons and does not affect the results. To disable fragmenting, pass in horiz = 1 and vert = 1 as kwargs.
Returns:

A pandas series mapping the index of the points_gdf to the names of the polygons. If an index does not appear in the returned series, that is because the point corresponding to that index did not lie inside any of the polygons.
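
The "at most one polygon per point" contract can be illustrated in plain Python. The sketch below uses a standard ray-casting test instead of geopandas and a spatial index, so it shows the shape of the result rather than the library's method; point_in_polygon and assign_points are hypothetical names, and points and polygons are assumed to be simple dictionaries.

```python
def point_in_polygon(point, vertices):
    """Ray-casting test; vertices is a list of (x, y) tuples for a simple polygon."""
    x, y = point
    inside = False
    count = len(vertices)
    for i in range(count):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % count]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line.
            crossing_x = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < crossing_x:
                inside = not inside
    return inside


def assign_points(points, polygons):
    """points: {index: (x, y)}; polygons: {name: vertices}. Returns {index: name}."""
    result = {}
    for index, point in points.items():
        for name, vertices in polygons.items():
            if point_in_polygon(point, vertices):
                result[index] = name
                break  # each point is assumed to lie in at most one polygon
    return result


polygons = {"west": [(0, 0), (1, 0), (1, 1), (0, 1)],
            "east": [(1, 0), (2, 0), (2, 1), (1, 1)]}
points = {"a": (0.5, 0.5), "b": (1.5, 0.5), "c": (5, 5)}
print(assign_points(points, polygons))
```

As in the return value described above, the point "c" lies in no polygon and is therefore absent from the result.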

TXHousing.utilities.spatial_joins.points_intersect_single_polygon(points, polygon, spatial_index, points_geometry_column='geometry', factors=None, categorical=True, by='mean', **kwargs)[source]

Given many points and a polygon, finds one of three things. (1) If factors = None, the number of points inside the polygon, (2) if factors is not None and categorical = True, the number of points inside the polygon conditional on a group of categorical factors, (3) if factors is not None and categorical = False, the summarized value (mean/median) of factors associated with each point inside the polygon.

Parameters:
  • points – A GDF with a points geometry column
  • polygon – The polygon to see whether the points are inside.
  • spatial_index – The spatial index of the points
  • factors – The factors to average over (if continuous) or subset by the cartesian product of (if categorical). This may be a list or a string.
  • categorical – If True, then the factor should be treated as a categorical variable.
  • by – If categorical is False, can either summarize using by = ‘mean’ or by = ‘median’
  • kwargs – Kwargs to pass to the “fragment” function in the TXHousing.utilities.simple module. Fragmenting polygons speeds up the computation for all but very small polygons and does not affect the results. To disable fragmenting, pass in horiz = 1 and vert = 1 as kwargs.
Returns:

Pandas series

Note: it is often useful to apply this function to an entire gdf of polygons.

TXHousing.utilities.spatial_joins.polygons_intersect_single_polygon(small_polygons, polygon, spatial_index, geometry_column='geometry', factors=None, categorical=True, account_for_area=True, divide_area_by='polygon', by='mean', **kwargs)[source]

Given many polygons (e.g. parcels) and a larger polygon (e.g. county boundary), finds one of three things. (1) If factors = None, the percent area of the large polygon that is covered by the small polygons, (2) if factors is not None and categorical = True, the percent area of the large polygon that is covered by the small polygons conditional on the factors, (3) if factors is not None and categorical = False, the summarized value (mean/median) of the factors associated with each small polygon inside the large polygon.

Parameters:
  • small_polygons – A GDF with a polygon geometry column
  • polygon – The polygon to see whether the small_polygons are inside.
  • spatial_index – The spatial index of the small_polygons
  • factors – The factors to average over (if continuous) or subset by the cartesian product of (if categorical).
  • categorical – If True, factors will be treated as categorical variables.
  • by – If categorical is False, can summarize with by = ‘mean’ or by = ‘median’
  • account_for_area – Default True. If True, instead of returning the mean of the factor, this will return the dot product of the factor values and the areas of each small_polygon that intersects the large polygon, divided by the area of the large polygon (happens if categorical is False, by = ‘mean’, and account_for_area = True). Also, if factors = None, divides answer by area of polygon.
  • divide_area_by – Defaults to ‘polygon’. This parameter determines what to divide the result by. If divide_area_by = ‘polygon’, then this divides by the area of the polygon. If divide_area_by = ‘nonempty’, it will divide by the total area of the intersection between the polygon and small_polygons. Else, it will simply return without dividing.
  • kwargs – Kwargs to pass to the “fragment” function in the TXHousing.utilities.simple module. Fragmenting polygons speeds up the computation for all but very small polygons and does not affect the results. To disable fragmenting, pass in horiz = 1 and vert = 1 as kwargs.
Returns:

Float if factors is None, else Pandas Series

Note: it is often useful to apply this function to an entire gdf of polygons.
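
The interaction between account_for_area and divide_area_by can be sketched numerically. The example below is one reading of the parameter descriptions above, not the library's code: it assumes the intersection areas have already been computed and takes them as plain lists. summarize_by_area is a hypothetical name.

```python
def summarize_by_area(intersection_areas, values, polygon_area, divide_area_by="polygon"):
    """Area-weighted summary of a continuous factor over intersecting small polygons."""
    # Dot product of the intersection areas and the factor values.
    weighted_total = sum(area * value for area, value in zip(intersection_areas, values))
    if divide_area_by == "polygon":
        return weighted_total / polygon_area
    if divide_area_by == "nonempty":
        # Divide only by the part of the polygon actually covered by small polygons.
        return weighted_total / sum(intersection_areas)
    return weighted_total  # any other value: no division


areas = [1.0, 3.0]   # area of each small polygon inside the large polygon
values = [10, 20]    # factor value attached to each small polygon
print(summarize_by_area(areas, values, polygon_area=8.0))
print(summarize_by_area(areas, values, polygon_area=8.0, divide_area_by="nonempty"))
```

Dividing by the covered area rather than the full polygon area avoids diluting the average when the small polygons cover only part of the large one.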

Measurement Utilities

Functions which focus on efficiently measuring distances and areas.

TXHousing.utilities.measurements.calculate_dist_to_center(gdf, lat, long, drop_centroids=True, geometry_column='geometry')[source]

Calculates distance to the center of the city using haversine on the centroids of objects

Parameters:
  • gdf – A GeoDataFrame, with lat/long crs. Can either have point or polygon geometry.
  • lat – Latitude of the center of the city
  • long – Longitude of the center of the city.
  • drop_centroids – Boolean, default true. If true, drop the centroids inplace after calculation.
Returns:

Pandas Series of floats (distances from center).

TXHousing.utilities.measurements.get_area_in_units(gdf, geometry_column='geometry', newproj='epsg:2277', scale=3.58701e-08, name='area', final_projection=None, reproject=True)[source]

Get the area of each polygon of a geodataframe in units of your choice (defaults to square miles). This function relies on crs transformations, so for large/complex gdfs, this function is very computationally expensive.

Parameters:
  • gdf – Geodataframe with polygons in the geometry column.
  • geometry_column – Geometry column of the geodataframe, defaults to ‘geometry’
  • newproj – The new projection to use to calculate units. Defaults to epsg:2277, which is probably fine for Austin/Dallas/Houston and is in feet.
  • scale – A scale to multiply by. Defaults to 3.58701*10**(-8) which is the number of square miles in a square foot.
  • name – The name of the new column that will be created to store the area information. Defaults to ‘area’.
  • final_projection – The final projection that the returned gdf should be in. Defaults to the gdf’s current crs.
  • reproject (Boolean) – If False, do not reproject the data after calculating area (this is useful to save time in specific cases).
Returns:

The geodataframe with a column named name (defaults to ‘area’) which has the area of each polygon in the desired units.
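
The default scale follows from 5280 feet per mile: one square foot is 1 / 5280**2 square miles. The sketch below checks that arithmetic and applies it to a polygon whose coordinates are already in feet, using the shoelace formula in place of a geopandas area calculation. The helper names are hypothetical, and no reprojection is performed.

```python
FEET_PER_MILE = 5280
SQUARE_MILES_PER_SQUARE_FOOT = 1 / FEET_PER_MILE ** 2  # approximately 3.58701e-08


def shoelace_area(vertices):
    """Area of a simple polygon given as a list of (x, y) vertices."""
    total = 0.0
    count = len(vertices)
    for i in range(count):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % count]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


def area_in_square_miles(vertices_in_feet, scale=SQUARE_MILES_PER_SQUARE_FOOT):
    """Multiply an area in square feet by the scale to get square miles."""
    return shoelace_area(vertices_in_feet) * scale


one_mile_square = [(0, 0), (5280, 0), (5280, 5280), (0, 5280)]
print(area_in_square_miles(one_mile_square))
```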

TXHousing.utilities.measurements.haversine(point1, point2, lon1=None, lat1=None, lon2=None, lat2=None)[source]

Haversine function calculates distance (in miles) between two points in lat/long coordinates. See https://gis.stackexchange.com/questions/279109/calculate-distance-between-a-coordinate-and-a-county-in-geopandas

Parameters:
  • point1 – Shapely point. Long then lat.
  • point2 – Shapely point. Long then lat.
  • lon1, lat1, lon2, lat2 – Alternatively, supply the longitudes and latitudes manually.
Returns:

Distance in miles.
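
For reference, the haversine formula itself is short. The sketch below is a generic version that takes the four coordinates directly; the Earth radius of 3958.8 miles is an assumed constant, and haversine_miles is a hypothetical name rather than the library function, which may use a slightly different radius.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def haversine_miles(lon1, lat1, lon2, lat2):
    """Great-circle distance in miles between two lon/lat points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# One degree of latitude is roughly 69 miles.
print(haversine_miles(-97.74, 30.0, -97.74, 31.0))
```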

TXHousing.utilities.measurements.order_radii(data, inplace=True, feature=None)[source]

Helper function which properly orders wacky indexes/features for pandas dataframes. If inplace = False, works with a copy of the data to prevent global effects.

TXHousing.utilities.measurements.points_intersect_rings(gdf, lat, long, factor=None, step=1, categorical=True, by='mean', geometry_column='geometry', per_square_mile=True, maximum=None)[source]

Given a gdf of points, calculates the distance of each point from the center of the city. Can also group by a categorical variable or alternatively calculate the mean/median of a continuous variable.

Parameters:
  • gdf – GDF in points geometry.
  • lat – The latitude of the center of the rings.
  • long – The longitude of the center of the rings.
  • factor – A factor of the gdf to condition on or calculate means/medians of, e.g. ‘Race’ or ‘Population’
  • step – Number of miles where the ring radiates outwards.
  • categorical – If true, will only calculate percent land (or % of points) in the radius, conditional on the factor if factor is not None.
  • by – Defaults to “mean”. If categorical = False, use “by” to determine how to calculate averages over points.
  • geometry_column – name of geometry column. Default geometry.
  • per_square_mile – if true, divide by the area of the ring.
  • maximum – Float, defaults to None. Max radius (miles). If not None, will group everything greater than this maximum into a single category.
Returns:

If factor is None, a pd Series which lists the number of points by distance from city center. If categorical = True and factor is not None, a pandas DataFrame which lists the number of points by distance from the city center (index) against their categorical value from the factor (columns). Lastly, if categorical = False and factor is not None, returns a pd Series of the mean/median of the factor conditional on distance to city center.
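
The ring grouping for the factor = None case can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: distances are assumed to be precomputed in miles, rings are labelled by their inner radius, and the catch-all group beyond maximum is left as a raw count because its area is unbounded. count_points_by_ring is a hypothetical name.

```python
import math
from collections import Counter


def count_points_by_ring(distances, step=1, maximum=None, per_square_mile=True):
    """Group distances (miles) into rings of width step; return {inner_radius: value}."""
    counts = Counter()
    for distance in distances:
        ring = int(distance // step) * step  # inner radius of the ring
        if maximum is not None and ring >= maximum:
            ring = maximum  # everything beyond the maximum shares one group
        counts[ring] += 1
    result = {}
    for ring, count in sorted(counts.items()):
        if per_square_mile and (maximum is None or ring < maximum):
            # Area of the annulus between radius ring and ring + step.
            ring_area = math.pi * ((ring + step) ** 2 - ring ** 2)
            result[ring] = count / ring_area
        else:
            result[ring] = count
    return result


distances = [0.2, 0.7, 1.5, 4.0, 9.0]
print(count_points_by_ring(distances, maximum=3, per_square_mile=False))
```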

TXHousing.utilities.measurements.polygons_intersect_rings(gdf, lat, long, factor=None, newproj='epsg:2277', step=1, maximum=20, categorical=True, geometry_column='geometry', group_outliers=True, outlier_maximum=35, city=None)[source]

Given a gdf of polygons, groups the polygons by distance from the center of the city and calculates the percent of area of the city that the polygons cover. Can also group by a categorical variable or alternatively calculate the mean/median of a continuous variable (adjusting for the area of the polygons).

Parameters:
  • gdf – Geopandas GeoDataFrame, in polygon geometry.
  • factor – A factor of the gdf to condition on or calculate means/medians of e.g. ‘Race’ or ‘Population’
  • lat – The latitude of the center of the rings.
  • long – The longitude of the center of the rings.
  • newproj – The projection to use for the distance and area calculations. Defaults to epsg:2277, which is good for Austin and fine for Texas. Note that units in this projection are in feet.
  • step – Number of miles where the ring radiates outwards.
  • maximum – Max radius (miles)
  • geometry_column – name of geometry column. Default geometry.
  • categorical – If true, will only calculate percent land (or % of points) in the radius. Else will calculate mean by area.
  • city – If city is not None and factor is None, will read the shapefile of the boundaries of the city to ensure more accurate calculations. (Otherwise, for a ring of size 12, the area of the circle might be greater than the area of the city inside the circle).
  • group_outliers – Boolean, defaults to true. If true, group everything with a distance greater than the maximum into one group (of maximum size).
  • outlier_maximum – Float, defaults to 35. For computational efficiency, this function will not consider outliers farther than this distance from the center of the city.
Returns:

Dataframe or Series

Note:

To ensure accurate results, this function will break up polygons which straddle the boundary between being (for example) 5-6 miles from the city center as opposed to 6-7 miles from the city center; this makes it computationally expensive. To calculate similar results more cheaply with slightly less accuracy, just take the centroids of the geodataframe and apply points_intersect_rings.