Autogenerated Documentation for data_processing
Boundary Data Processing
Paths and methods for boundary files (uas, csa, counties, blocks, etc.)
class TXHousing.data_processing.boundaries.BlockBoundaries(data_layers, cities=None, get_percent_residential=True)
Class for block data; wraps the Boundaries class with a substantially different __init__ method. After initialization, self.data is a gdf with a 'geometry' column and a variety of block data columns; if cities is not None, it also has a 'place' column.
Parameters: - data_layers – Iterable of codes for the data layers of the geodatabase.
- cities – A city name (e.g. 'Austin') or an iterable of city names to filter by (e.g. ['Austin', 'Dallas']). The cities need to be in Texas or this won't work. Defaults to None.
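Example (a minimal sketch; the layer codes below are hypothetical and depend on the census geodatabase in use):
from TXHousing.data_processing.boundaries import BlockBoundaries
# Hypothetical data layer codes; cities='Austin' filters blocks to Austin
block_data = BlockBoundaries(data_layers=['X01', 'X19'], cities='Austin')
block_data.data.head()  # gdf with 'geometry', block data columns, and a 'place' column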
class TXHousing.data_processing.boundaries.Boundaries(path, crs=None, index_col=None, index_type=None, subset_by=None, subset_to=None, bounding_counties=None, bounding_polygon=None, to_latlong=True)
A parent class for boundary files (i.e. counties, places, uas). Note that block data has a different __init__ method, because block data comes as a geodatabase rather than a shapefile.
Parameters: - path – The path of the boundaries shapefile.
- crs – Default None. If the CRS cannot be read from the shapefile itself, pass it here.
- index_col – The column in the shapefile to use as the index. Defaults to None.
- index_type – Change the index to this type, e.g. 'str'.
- subset_by – Subset by this column. Can be ‘index’ or another column name.
- subset_to – Subset to one of these values.
- bounding_counties – Default None. A list of counties. If not None, __init__ will subset the data to only include boundaries intersecting these counties.
- bounding_polygon – Default None. A shapely polygon. If not None, __init__ will subset the data to only include boundaries intersecting the polygon.
- to_latlong – Default True. Initially transform the data to lat/long.
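Example (a sketch; the shapefile path and its STATEFP/NAME columns are assumptions about the underlying data):
from TXHousing.data_processing.boundaries import Boundaries
# Index by county name and subset to Texas counties (state FIPS code '48')
counties = Boundaries(path='data/cb_2017_us_county_500k.shp',
                      index_col='NAME',
                      subset_by='STATEFP',
                      subset_to=['48'])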
fast_intersection(gdf, geometry_column='geometry', **kwargs)
Given a gdf of polygons or points, calculates which boundary each polygon/point lies inside. Assumes that if the gdf contains polygon data, each polygon will only intersect a single boundary.
Parameters: - gdf – A geodataframe, in the same crs as the boundaries data.
- geometry_column – The geometry column of the gdf.
- kwargs – kwargs to pass to the underlying fast_polygon_intersection or points_intersect_multiple_polygons functions.
Returns: A pandas series mapping the index of the gdf to the indexes/names of the boundaries. If an index does not appear in the returned series, that is because the point/polygon for that index did not lie inside any of the boundaries polygons.
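Example (continuing the sketch above; parcels_gdf is an assumed GeoDataFrame of points in the same crs as the boundaries):
# Maps each point's index to the index/name of the boundary containing it
county_map = counties.fast_intersection(parcels_gdf)
parcels_gdf['county'] = county_map  # indexes absent from county_map fell outside every boundary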
process_external_gdf(gdf, geometry_column='geometry', **kwargs)
Processes an external gdf's geometry/crs. Passes kwargs to the underlying process_geometry or process_points call.
pull_features(gdf, features, geometry_column='geometry', **kwargs)
Given features in a non-boundaries gdf, this "pulls" (calculates) those features for the boundary data attribute. This is a slower alternative to the fast_intersection method.
Parameters: - gdf (GeoDataFrame) – A geodataframe of polygon geometry.
- features – The features to calculate from the gdf.
- geometry_column – The geometry column of the gdf.
- kwargs – kwargs to pass to the underlying get_averages_by_area function.
Returns: None, but modifies self.data so that it includes the new feature columns.
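Example (a sketch; block_gdf and its 'pop_density' column are assumptions):
# Computes area-weighted averages via the underlying get_averages_by_area call
counties.pull_features(block_gdf, features=['pop_density'])
counties.data['pop_density'].head()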
push_features(gdf, features, geometry_column='geometry', **kwargs)
Given features in the boundaries data, this "pushes" (calculates) those features onto another gdf of (usually smaller) polygons. This is effectively a slower alternative to the fast_intersection method.
Parameters: - gdf (GeoDataFrame) – A geodataframe of polygon geometry.
- features – The features to calculate.
- geometry_column – The geometry column of the gdf.
- kwargs – kwargs to pass to the underlying get_averages_by_area function.
Returns: The gdf, but with the new feature columns.
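Example (the converse of pull_features; 'median_income' is an assumed column of the boundaries data):
block_gdf = counties.push_features(block_gdf, features=['median_income'])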
class TXHousing.data_processing.boundaries.ZipBoundaries(ziplist=None, bounding_counties=None, bounding_polygon=None, **kwargs)
Class for zip code boundaries data; wraps the Boundaries class. You cannot pass subset_by or subset_to if you pass the ziplist parameter, but otherwise all the init parameters are the same.
Parameters: - ziplist – A list of zip codes to subset to.
add_property_data(property_input, features=None, **kwargs)
Adds demand features from Realtor/Zillow inputs.
Parameters: - property_input – A Property_Input instance.
- features – Optional features to subset to.
- kwargs – kwargs to pass to the .join method. It is often useful to pass an rsuffix to prevent ValueErrors.
Returns: None, but changes self.data.
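Example (a sketch; the zip codes are arbitrary, and the csv path is borrowed from the process_property_data example below):
from TXHousing.data_processing.boundaries import ZipBoundaries
from TXHousing.data_processing.property import Property_Input
zips = ZipBoundaries(ziplist=['78701', '78702'])
realtor_input = Property_Input(path='data/RDC_MarketHotness_Monthly_Zip.csv', source='Realtor')
zips.add_property_data(realtor_input, rsuffix='_realtor')  # rsuffix guards against ValueErrors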
Zoning Data Processing
Processes zoning data for Austin, Dallas, and all of North Texas.
class TXHousing.data_processing.zoning.Zoning_Input(path, feature, separator, base_zones, proj4string=None, lat=0, long=0, zoom=10, crs=None, regulations_path=None)
A simple class which binds information about zoning datasets. The lat/long attributes define the city center for each city.
process_zoning_shapefile(overlay={'-MU': 'Multifamily'}, broaden=True, parse_base_zones=True, regulation_features=None, to_latlong=True, quietly=False)
Parameters: - overlay – A dictionary which maps overlay strings to broadened outputs. This is mostly useful for dealing with mixed-use overlays in Austin: there are no mixed-use base zones, so passing in {'-MU': 'Multifamily'} helps the function recognize that mixed-use zones are multifamily zones.
- broaden (Boolean) – Default True. If True, decodes zoning data into a broader classification (e.g. sf, mf) as specified by the zoning input class; the processed classification goes in a column labelled 'broad_zone'.
- parse_base_zones (Boolean) – Default True. If True, uses the regulations path provided by the input to parse the base zones of the zoning data.
- regulation_features – A list of features to retrieve from the regulation data. Default None.
- to_latlong – If True, will try to transform the data to lat/long.
- quietly (Boolean) – Default False. If True, suppresses all print statements and makes the function "quiet".
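Example (a sketch; 'austin_inputs' is assumed to be a Zoning_Input instance defined in this module, and the exact name may differ):
from TXHousing.data_processing import zoning
# Suppress print statements; 'broad_zone' holds the broadened classification
austin_zones = zoning.austin_inputs.process_zoning_shapefile(quietly=True)
austin_zones['broad_zone'].value_counts()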
TXHousing.data_processing.zoning.get_austin_surrounding_zones()
Based on the zoning inputs in this file, returns a geodataframe of zoning polygons for Austin AND select surrounding areas, with two columns: broad_zone and geometry. This is purely a convenience function which wraps a variety of Zoning_Input.process_zoning_shapefile calls.
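Example:
from TXHousing.data_processing.zoning import get_austin_surrounding_zones
austin_area_zones = get_austin_surrounding_zones()  # columns: broad_zone, geometry
austin_area_zones.groupby('broad_zone').size()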
Property Data Processing
Processes property data supplied by Realtor and Zillow.
class TXHousing.data_processing.property.Property_Input(path, source, geo_filter=None, geo_filter_values=None, index_col=None)
Parameters: - path – Path of the dataset of property data. Data should be in a csv.
- source – Source of the data; can either be “Zillow” or “Realtor”
- feature – Name of the feature. In the case of Realtor data, this must be the column of the data.
- geo_filter – A kwarg. This is the geography level (e.g. city, state, county) by which to filter data, and it should be a column of the dataset. Default depends on the source.
- geo_filter_values – A list of values to subset the data to (by geo_filter). Defaults to some flavor of ['Austin', 'Houston', 'Dallas'], but the specific strings depend on the source. If geo_filter_values == 'all', then the data will not be filtered.
- index_col – A kwarg. The column to use as the index. Default depends on the source.
graph(style='line', date=datetime.date(2018, 4, 1), **kwargs)
Parameters: - style – Either 'bar' or 'line'. Note that line graphs are not supported for Realtor data.
- date – If doing a bar graph of Zillow data, the date to graph.
- plot – If false, do not actually show the plot of the data.
- kwargs – kwargs to pass to the process_property_data function.
Returns: None
process_property_data(features=None, geo_filter_values=None)
Parameters: - features – For Realtor data, subset to only include these features. Defaults to None.
- geo_filter_values – For convenience, you can override the self.geo_filter_values value with a new value, e.g. 'all' if you do not want to subset the data.
Returns: Two pandas dataframes, data and metadata (in that order). For Zillow inputs, data will have geographies in the index and a time series in the columns, and metadata will have a variety of information about zip codes. For Realtor data, the data will have geographies in the index and a variety of features in the columns.
Examples:
realtor_hotness_data = Property_Input(path="data/RDC_MarketHotness_Monthly_Zip.csv", source='Realtor')
data, metadata = realtor_hotness_data.process_property_data(features=['Median DOM (vs CBSA)', 'Views Per Property (vs CBSA)'])
Parcel Data Processing
class TXHousing.data_processing.parcel.Parcel(path, account_col, county=None, processing_function=None, geometry_column='geometry', crs=None, name=None, **kwargs)
Class for parcel data.
Parameters: - path – The path of the data.
- account_col – The name of a column containing unique ids for each property.
- county – The county that the parcel data is in. Can also be a column of the data which lists the county of each parcel. Defaults to None, but this param is highly recommended.
- processing_function – A custom processing function whose output is set as self.data.
- geometry_column – The geometry column of the data. Defaults to 'geometry'.
- crs – If necessary, set the crs of the data.
- name – A name for the entire dataset, used for identification when printing. Defaults to None, in which case the path of the dataset is used as its name (if the path is also None, then the name is just None).
- kwargs – kwargs to pass to the processing_function.
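Example (a sketch; the path and column name are hypothetical):
from TXHousing.data_processing.parcel import Parcel
parcels = Parcel(path='data/parcels/travis_parcels.shp',
                 account_col='PROP_ID',
                 county='Travis',
                 name='Travis County Parcels')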
measure_parcels(lat, long, area_feature=None)
Calculates the distance to the city center as well as the area and centroid of each parcel. In effect, it adds new columns to self.data: lat, long, centroids, area_sqft, and dist_to_center. It also transforms the parcels to lat/long.
merge_multiple(merge_paths, right_keys, left_keys=None, **kwargs)
A wrapper of the geopandas merge functionality; quickly merges the data with multiple other csvs.
Parameters: - merge_paths – If the geodata must be merged with another datasource, this should be an iterable containing the paths of the data to merge it with.
- left_keys – If the geodata must be merged with another datasource, this should be an iterable containing the left keys used to merge the data, in the same order as the 'merge_paths'. Defaults to None, in which case the merge is performed using the account_feature as the merge key.
- right_keys – If the geodata must be merged with another datasource, this should be an iterable containing the right keys used to merge the data, in the same order as the 'merge_paths'.
- kwargs – kwargs to pass to the .merge call.
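Example (continuing the sketch above; the csv paths and right keys are hypothetical):
# left_keys defaults to the account column supplied at initialization
parcels.merge_multiple(merge_paths=['data/values.csv', 'data/exemptions.csv'],
                       right_keys=['ACCOUNT', 'ACCT_NUM'])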
parse_broad_zone(description_column, broad_zone_dictionary)
Parses broad zones of parcel data. Updates the self.data attribute to include a 'broad_zone' column.
Parameters: - description_column – The column of descriptions to parse.
- broad_zone_dictionary – A dictionary mapping broad zones (e.g. 'Single Family') to a list of strings which signal that a description means that zone. Ex: {'Single Family': ['sf', 'single f'], 'Multifamily': ['mf']}. Note that the order of this dictionary DOES matter: the function will return the FIRST key which has a match.
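Example (the dictionary mirrors the one documented above; the description column name is hypothetical):
parcels.parse_broad_zone(description_column='LAND_USE_DESC',
                         broad_zone_dictionary={'Single Family': ['sf', 'single f'],
                                                'Multifamily': ['mf']})
parcels.data['broad_zone'].value_counts()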
process_parcel_data(broad_zone_feature, broad_zone_dictionary, zoning_input, bounding_counties, area_feature=None, merge_paths=None, left_keys=None, right_keys=None, save_path=None, geo_save_path=None)
A wrapper which calls self.merge_multiple, self.parse_broad_zone, self.measure_parcels, and self.pull_geographic_information, in that order. After initializing the parcel data and calling this function, the data should have the following features:
account, broad_zone, zone_feature, dist_to_center, area_sqft, lat, long, county, zipcode, place, ua (urban area)
as well as a host of other features that may have been joined with, or were initially part of, the data.
Parameters: - broad_zone_feature – The feature used to parse broad_zones. State cds codes are preferred for consistency.
- broad_zone_dictionary – Dictionary mapping broad zones to keywords that will be used to parse the broad_zone_feature and obtain broad zones.
- zoning_input – The zoning input for the city of interest; used to calculate distance from the city center.
- bounding_counties – A list of counties which might intersect the data. Adding extra counties to this list is computationally relatively inexpensive.
- area_feature – Default None. If the data already lists its area in square feet, then the area will not be recalculated (this saves a great deal of time, because crs transformations are very expensive for parcel data).
- merge_paths – If the geodata must be merged with another datasource, this should be an iterable containing the paths of the data to merge it with.
- left_keys – If the geodata must be merged with another datasource, this should be an iterable containing the left keys used to merge the data, in the same order as the 'merge_paths'. Defaults to None, in which case the merge is performed using 'account', which is parsed by the init function, as the merge key.
- right_keys – If the geodata must be merged with another datasource, this should be an iterable containing the right keys used to merge the data, in the same order as the 'merge_paths'.
- save_path – A csv path at which to save the data. Defaults to None, in which case it will not save the data.
- geo_save_path – A .shp path at which to save the data. Defaults to None, in which case it will not save the data.
Returns: None, but modifies parcel data in place.
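Example (a heavily hedged sketch; 'austin_inputs', the column name, and the save path are all assumptions):
from TXHousing.data_processing import zoning
parcels.process_parcel_data(broad_zone_feature='LAND_USE_DESC',
                            broad_zone_dictionary={'Single Family': ['sf'], 'Multifamily': ['mf']},
                            zoning_input=zoning.austin_inputs,
                            bounding_counties=['Travis', 'Williamson', 'Hays'],
                            save_path='data/processed/travis_parcels.csv')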
TXHousing.data_processing.parcel.cache_all_parcel_data()
Creates csvs which store the centroids, area, and other relevant features about each parcel for all of the surrounding counties of each core municipality.
TXHousing.data_processing.parcel.cache_municipal_parcel_data()
Creates csvs which store the centroids, area, and other relevant features about each parcel for municipalities.
TXHousing.data_processing.parcel.cache_north_texas_zoning_data()
Because the North Texas zoning data is used to validate parcel results, this applies the parcel processing functions to the North Texas zoning data and then caches the results. (This is a bit hacky, because you would not expect zoning data to fit under the parcel class, but so be it.)
TXHousing.data_processing.parcel.process_austin_parcel_data()
Reads Austin parcel data and processes base zones.
TXHousing.data_processing.parcel.process_houston_parcel_data(feature_files=['data/Parcel/Harris_Parcel_Land_Features/building_res.txt'], feature_columns_list=[['ACCOUNT', 'USE_CODE', 'BUILDING_NUMBER', 'IMPRV_TYPE', 'BUILDING_STYLE_CODE', 'CLASS_STRUCTURE', 'CLASS_STRUC_DESCRIPTION', 'DEPRECIATION_VALUE', 'CAMA_REPLACEMENT_COST', 'ACCRUED_DEPR_PCT', 'QUALITY', 'QUALITY_DESCRIPTION', 'DATE_ERECTED', 'EFFECTIVE_DATE', 'YR_REMODEL', 'YR_ROLL', 'APPRAISED_BY', 'APPRAISED_DATE', 'NOTE', 'IMPR_SQ_FT', 'ACTUAL_AREA', 'HEAT_AREA', 'GROSS_AREA', 'EFFECTIVE_AREA', 'BASE_AREA', 'PERIMETER', 'PERCENT_COMPLETE', 'NBHD_FACTOR', 'RCNLD', 'SIZE_INDEX', 'LUMP_SUM_ADJ']], process_crs=True, county_level=False)
Merges Houston or Harris County parcel data with Harris County feature data.
Parameters: - feature_files – A list of paths of feature files to merge with the data. Defaults to the building_res file path. Can also be a string of one file (it does not have to be a list). If this is set to None, the function will just return the parcel shape data with no other data attached.
- feature_columns_list – A list of column headers for each file; these can be found in the Harris_Parcel_Feature_Columns Microsoft database under data.
- process_crs (bool) – Default True. If True, transform parcel data to latlong. This is very expensive.
- county_level – Default False. If False, subset the data to only consider the parcels within the Houston municipal boundaries.
Returns: GeoDataFrame with the parcel shapes and the merged data.
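Example (uses only the documented defaults; skipping the crs transform saves a great deal of time):
from TXHousing.data_processing.parcel import process_houston_parcel_data
houston_parcels = process_houston_parcel_data(process_crs=False)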
Permit Data Processing
TXHousing.data_processing.permit.correct_dallas_permit_data(api_key=None, address_cache_path=None, dpm_save_path=None, re_geocode=True)
Corrects and caches Dallas construction permit data. This function worked the first time but is untested since then. Running it again is not recommended, as it requires a Google Maps account, which has been getting harder to obtain over time. Instead, it is easier to pull the geocoded addresses from https://github.com/amspector100/TXHousing/tree/reorganization/shared_data
Parameters: - address_cache_path – The path at which to cache the geocoded addresses.
- dpm_save_path – The path at which to cache the fully corrected permit data.
- re_geocode (Boolean) – If True, re-pull all the data from the Google Maps API. Else, just use the cached address path.
Returns: None
TXHousing.data_processing.permit.get_corrected_dallas_permit_data(path='shared_data/dallas_corrected_permits/dallas_permits_corrected.shp')
Gets processed and corrected Dallas permit data.
Parameters: - path – The path to read the data from.
Returns: GeoDataFrame of construction permit data.
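Example:
from TXHousing.data_processing.permit import get_corrected_dallas_permit_data
dallas_permits = get_corrected_dallas_permit_data()  # reads from the default shared_data path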
TXHousing.data_processing.permit.process_austin_permit_data(searchfor, permittypedesc=None, workclass=None, earliest=None, latest=None)
Parameters: - searchfor – List of strings to search for in the 'PermitClass' and 'Description' columns. It is often worth explicitly passing in permit classes, e.g. searchfor=['101 single family houses'].
- permittypedesc – The permittypedesc to match. Ex: “Building Permit.”
- workclass – Workclass to match. Ex: “New”
- earliest – Earliest date to consider (inclusive). Data runs from 1971-2018.
- latest – Latest date to consider (inclusive). Data runs from 1971-2018.
Returns: GeoDataFrame of subsetted permit data.
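Example (the argument values come from the parameter descriptions above; the year format for earliest is an assumption):
from TXHousing.data_processing.permit import process_austin_permit_data
austin_permits = process_austin_permit_data(searchfor=['101 single family houses'],
                                            permittypedesc='Building Permit',
                                            workclass='New',
                                            earliest=2013)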
TXHousing.data_processing.permit.process_dallas_permit_data(permit_types, earliest=None, latest=None)
Initially processes raw Dallas permit data. However, the raw permit data has some inaccuracies, and it is best to simply read the corrected data from dpm_save_path.
Parameters: - permit_types – List of permit types to filter for. Will only consider rows where the permit type is one of these permit types.
- earliest – Earliest date to consider. Data runs from 2011-2016. Defaults to None.
- latest – Latest date to consider. Data runs from 2011-2016. Defaults to None.
Raises: UserWarning; some permit data is incorrect. Construction data has been corrected, and the corrected data is stored at ‘dpm_save_path’ as listed in the inputs.py file.
Returns: GeoDataFrame of subsetted permit data.
TXHousing.data_processing.permit.process_houston_permit_data(searchfor=['NEW S.F.', 'NEW SF', 'NEW SINGLE', 'NEW TOWNHOUSE'], searchin=['PROJ_DESC'], kind='structural', earliest=None, latest=None)
Processes Houston permit data. Note that Houston permit data does not specify whether housing is new or not, so this requires parsing the descriptions to figure out whether housing is new sf/new mf housing. Thankfully the descriptions are formulaic, so this is not too hard to do.
Parameters: - searchfor – A list of keywords to search for, e.g. ['NEW S.F.', 'NEW SF', 'NEW SINGLE', 'NEW TOWNHOUSE']. To subset to only include new construction, each string should start with 'NEW '.
- searchin – The columns to search in. Defaults to ['PROJ_DESC'], the project description. Note that the function will return rows where ANY of the columns specified by 'searchin' contain ANY of the keywords in 'searchfor'.
- kind – The kind of permit data to read in. Defaults to 'structural'; can either be 'structural' or 'demolition'.
- earliest – Earliest date to consider.
- latest – Latest date to consider.
Returns: GeoDataFrame of subsetted permit data.
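Example (relies on the documented defaults; the year format for earliest is an assumption):
from TXHousing.data_processing.permit import process_houston_permit_data
houston_permits = process_houston_permit_data(kind='structural', earliest=2013)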
TXHousing.data_processing.permit.scrape_houston_permit_data(target_path='shared_data/houston_permit_statuses.csv', kind='structural', backup_path='shared_data/houston_permit_statuses_backup.csv')
Scrapes Houston permit approval data and writes it to the target_path as well as the backup_path. It will never overwrite any pre-existing csv files; it only appends information to csvs. This requires proper installation of the headless Chrome webdriver and takes a lot of time (~2 hours), so it is probably best to pull the information from https://github.com/amspector100/TXHousing/tree/reorganization/shared_data instead of calling this function.
Parameters: - target_path – The path to write the scraped data to.
- kind – The kind of permit data to scrape. Can either be ‘structural’ or ‘demolition’.
- backup_path – The path to write the scraped data to as a backup in case something happens to the original.
Returns: None.