Skip to content

Add address_id and x,y to permit data #5

@caged

Description

@caged

@ksmpdx Is there a way to get the address_id, and potentially the x,y from the city's address data added to the building permits data? I'd like to see the geo location of the permits, but the data currently doesn't include this information. I've been trying to sync these up for a better part of a day by comparing the full addresses to each other, however I've run into multiple issues.

The first challenge was a set of small issues that I bumped into before actually trying to compare the addresses.

  • Upper case header names with spaces - Not a huge deal, but generally easier to work with snake_case_names.
  • Leading spaces in header names - Some CSV parses don't strip these and you wonder why your comparisons don't work. " Case Number" is different than "Case Number".
  • Not UTF-8 encoded - Easy to work around if CSV parser accepts an encoding flag.
  • Parsing error in line 145573 of the address data - Naive parses will fail on this line. Looks like a quoting issue.

The next challenge in trying to sync these up manually is that there are 615,752 permits in the permit dataset and 416,424 address points. For each permit, we need to potentially scan every address to find one that matches, which is 615752 * 416424. That's a potential 256,413,910,848 iterations in a worst-case scenario.

There are ways to reduce the number of comparisons that have to be made, such as caching the address for the permit permit_case_number, since there are multiple entries for a single permit case number as the particular property has to go through multiple inspections.

Now the problem moves on to comparing the actual addresses with each other. The major problem here is that it's a very finicky process and using a tool like csvjoin requires an exact match. I'm currently in this part of the process and I've hit a couple of snags pretty quickly.

There are double spaces in the full address strings. This can be difficult to spot and doesn't result in an exact match. It's easy enough to collapse spaces for the entire file. After doing this I ended up getting matches for roughly 40k permits, or about 9% of addresses and 6% of permits.

After getting the first batch of matches, I noticed that the reason more addresses aren't matching is because most of the permit data addresses end with " PORTLAND," however the address data file doesn't follow this pattern.

And, that's where I left things. I kinda felt like I went pretty far down the rabbit hole already, and maybe I should stop to see if I could get someone at the city to add this information if it's possible.

Thanks!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions