Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: utility to apply a function to each feature in a vector #1595

Open
emlys opened this issue Jun 26, 2024 · 3 comments
Open

Proposal: utility to apply a function to each feature in a vector #1595

emlys opened this issue Jun 26, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@emlys
Copy link
Member

emlys commented Jun 26, 2024

It's a very common pattern that we iterate over each feature in a vector, do something with the feature's attributes and/or geometry, and write new attributes to the feature. A lot of the details of this process could be abstracted away with a wrapper function, something like

def vector_apply(vector_path, op, new_fields=[]):
    vector = gdal.OpenEx(vector_path, gdal.OF_VECTOR | gdal.GA_Update)
    layer = vector.GetLayer()

    for field_name, field_type in new_fields:
        field_defn = ogr.FieldDefn(field_name, field_type)
        field_defn.SetWidth(24)
        field_defn.SetPrecision(11)
        layer.CreateField(field_defn)

    layer.ResetReading()
    layer.StartTransaction()

    # add error handling - finally write to file and save
    # add timed logging
    for feature in layer:
        op(feature)
        layer.SetFeature(feature)

    layer.CommitTransaction()
    layer.SyncToDisk()

Which could be used like this (simple example from AWY):

def compute_water_yield_volume(watershed_results_vector_path):
    vol_name = 'wyield_vol'
    def wyield_vol_op(feat):
        wyield_mn = feat.GetField('wyield_mn')
        # there won't be a wyield_mn value if the polygon feature only
        # covers nodata raster values, so check before doing math.
        if wyield_mn is not None:
            geom = feat.GetGeometryRef()
            # Calculate water yield volume,
            # 1000 is for converting the mm of wyield to meters
            vol = wyield_mn * geom.Area() / 1000
            # Get the volume field index and add value
            feat.SetField(vol_name, vol)

    utils.vector_apply(
        watershed_results_vector_path, wyield_vol_op,
        new_fields=[(vol_name, ogr.OFTReal)])

Additional features could be

  • a boolean arg enumerated, which if True, enables enumeration of the features. If enumerated, op would be called with (index, feature) rather than just (feature). I saw a couple of cases where this would be useful.
  • Error handling to make sure that progress is saved and written to disk if op raises an error
  • Timed progress logging
  • Handle setting shapefile field width and precision for different dtypes

I count several instances in invest where this pattern could simplify existing code, for example:

  • AWY compute_water_yield_volume
  • AWY compute_watershed_valuation
  • AWY compute_rsupply_volume
  • UCM calculate_uhi_result_vector
  • UCM calculate_energy_savings
  • Forest carbon _aggregate_carbon_map

Benefits would be to reduce redundant code and to make sure we're consistently using the best patterns for working with GDAL vectors (opening and closing correctly, saving to disk, using exceptions: #638). While preserving memory efficiency, since features are processed one at a time.

This would be sort-of parallel to pygeoprocessing's raster utilities, which are much more developed than our vector utilities.

@emlys emlys added the enhancement New feature or request label Jun 26, 2024
@phargogh
Copy link
Member

phargogh commented Jun 26, 2024

+1 from me to create a helper function for this! It reminds me a bit of pygeoprocessing.geoprocessing.shapely_geometry_to_vector (at least the attribute portion of it), but with a focus on setting of attributes.

But also, the apply terminology makes me think of pandas/geopandas. At this point, especially since we are using conda for our package management, should we consider moving vector operations like this to geopandas?

@emlys
Copy link
Member Author

emlys commented Jun 27, 2024

My only concern with geopandas is memory usage, it doesn't seem to be designed for efficiency on arbitrarily large vectors. It's possible to read in a subset of a vector with the rows option to geopandas.read_file, but I'm not aware of any way to work with a whole geodataframe without reading it all into memory

@dcdenu4
Copy link
Member

dcdenu4 commented Jul 10, 2024

Thanks for this Emily, it's a +1 from me generally. I would also be concerned with Pandas performance and efficiencies too, as the backend of GeoPandas. Here's a page where they talk about scaling. But maybe it would be interesting to do a side by side comparison with a small, medium, and large vector use case.

This does make me reflect on some conversations we've had about standardizing / building out a vector API in pygeoprocessing. We haven't given vectors the treatment that we've given rasters. I'm not saying we need to tackle that now, but am curious about whether something like this makes sense to live in PGP from the start?

My only feature recommendation would be to provide an optional "copy" argument. I could see not wanting to edit the vector directly but instead make a copy. This could be a separate step beforehand, but might be a nice convenience feature too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants