I've been playing around with the data made available [1] by the states that are part of the Streamlined Sales Tax system [2].
The boundary file address records include, among other things:
address range start
address range end
odd/even/both flag
directional prefix
street name
street suffix
directional postfix
zip code
The address range specifies the range of numerical addresses that the record applies to, and the odd/even/both flag indicates if it is for just the odd addresses, just the even addresses, or both.
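The range-plus-flag check is simple to express in code. Here's a minimal sketch; the flag values 'O', 'E', and 'B' are my assumption about how odd/even/both is encoded, not something taken from the SST spec:

```python
def in_range(number, low, high, flag):
    """True if a street number falls within a boundary record's
    address range, respecting the odd/even/both flag.
    flag is assumed to be 'O' (odd only), 'E' (even only), or 'B' (both)."""
    if not (low <= number <= high):
        return False
    if flag == 'B':
        return True
    # number must be odd when flag is 'O', even when flag is 'E'
    return (number % 2 == 1) == (flag == 'O')
```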
The directional prefix and postfix are supposed to be one of {blank, N, S, E, W, NE, NW, SE, SW}, but some states botch that and I've seen EA, EB, NO, SO, and WE end up in there.
The suffix field is for things like LANE, AVE, ST, and the like. It is a 4-character field, so a street that residents think of as ELM AVENUE would be name=ELM suffix=AVE in the boundary file. Or it would be name="ELM AVENUE" with no suffix. (There are even some with AVENUE in the name and something else in the suffix, such as ROOSEVELT AVENUE ALY in Albany, GA. The ALY is the suffix). (There are 318 distinct suffixes across all the SST states, for the curious [3]).
Anyway, my point is that the data is kind of loose.
I want to be able to take what a user puts on an address form:
address
city
state
zip code
and find a good match in the boundary files. Most people manage to put something reasonable in the city, state, and zip code fields, and manage to put their street address at the start of the address field, so I can pretty reliably get the zip code and the street number. It's the rest of the address line that is problematic.
At first I was trying to figure out what in there might be prefix, postfix, and suffix to isolate just the street name. But because of both the aforementioned looseness of the boundary data, and differences between how the user might think of their address and how their state thinks of it, this didn't work too well.
I was considering keeping an index that maps words that occur in street names to those street names, and then looking for the longest word in the user supplied address and looking up all streets in the user's zip code that contain that word and that include the user's street number. If that returned more than one match, the idea was then to look at the prefix, postfix, and suffix and see how well those matched with things in the user address. Looking at a few addresses that had given me trouble, though, it looked like this plan to resolve them would still be tricky and unreliable.
So then I decided to try a Levenshtein distance approach. Given zip, street number, street, where street is whatever the user included in the "address" field after the street number, I pull up all records in my boundary DB that match on zip code and street number. I don't even include street name in the query.
Then for each result, I take the prefix, name, suffix, and postfix and join them separated by space. I then simply take the one with the smallest Levenshtein distance from what the user supplied. With this, I no longer care if something is name=ELM, suffix=AVE or name="ELM AVE", suffix="".
This has worked remarkably well. Even if the user throws on stuff that doesn't belong (some people manage to put the city, state, and/or zip in the address box) it usually doesn't throw it off. It raises the Levenshtein distance of the correct match, sometimes by a lot...but it also raises the incorrect match distances too, so the correct one still wins.
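The whole matching step fits in a few lines. Here's a sketch of it against a hypothetical cleaned-up sqlite table; the table and column names (boundaries, zip, low, high, oebf, prefix, name, suffix, postfix) and the 'O'/'E'/'B' flag encoding are my inventions for illustration, not the actual SST layout:

```python
import sqlite3

def levenshtein(a, b):
    # Classic dynamic-programming edit distance, O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def best_match(db, zip_code, number, street):
    """street is whatever the user typed after the street number.
    Query by zip and street number only, then rank candidates by
    edit distance from the joined prefix/name/suffix/postfix."""
    street = street.upper()
    rows = db.execute(
        """SELECT prefix, name, suffix, postfix FROM boundaries
           WHERE zip = ? AND low <= ? AND high >= ?
             AND (oebf = 'B' OR oebf = ?)""",
        (zip_code, number, number, 'O' if number % 2 else 'E'),
    ).fetchall()

    def dist(row):
        # Joining the fields with spaces means name=ELM/suffix=AVE and
        # name="ELM AVE"/suffix="" compare identically.
        joined = " ".join(part for part in row if part)
        return levenshtein(street, joined)

    return min(rows, key=dist, default=None)
```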
The boundary files also have secondary address fields, for things like apartments, units, suites, and the like. These are an abbreviation (APT, e.g.) and low and high ends of a range, and an odd/even/both flag. Unfortunately the ranges for these aren't always numeric. E.g., there is 695 E MORRILL AVE in Columbus, OH, which has secondary fields of "AP", "A", "M", and the odd/even/both flag set to both. I assume that means Apartments A-M.
So far, I've ignored the secondary address fields. Every time I've seen more than one result for the lowest Levenshtein distance, the matches have been close enough to each other to be in the same taxing districts, so would have the same rate and the tax would be allocated to the same entities, so further resolution is not necessary.
At some point I'll take my scripts that clean up the raw rate and boundary files and transform them into a form more suitable for DB use, that build sqlite and mysql databases from those, and that do the address matching and sales tax lookups by address, and put 'em on GitHub, but that will probably not be very soon.
[1] https://www.streamlinedsalestax.org/Shared-Pages/rate-and-bo...
[2] https://www.streamlinedsalestax.org/
[3] https://pastebin.com/tQLGTGS9