OpenStreetMap Data Case

Completed By: Trenton J. McKinney

Date: 2017/08/10


OSM Map Area


Portland OR, United States (Portland Metro Area)

I live in the Portland metropolitan area and am interested in what kinds of information can be gleaned from its OSM file. The map below depicts the area covered by the OSM file (black dots); each purple dot represents a unique zip code found in the ways_tags and nodes_tags tables.

Black dots outline the area of the OSM data & Purple dots are postcodes from ways_tags and nodes_tags

Notebook to generate map

Before / After Comparison of Corrected City Names

  • project_fix_city_name.py
  • The table shows the types of errors associated with the city names and the result of correction.
def fix_city_name(name, mapping=MAPPING):
    """Checks the whole value of tag.attrib['v'] against MAPPING.
    If there's a key match, the mapped value is returned; otherwise the
    name is returned unchanged."""

    # MAPPING (uncorrected name -> corrected name) is defined elsewhere in the project.
    return mapping.get(name, name)

Before / After Comparison of Corrected Zip Codes

  • project_fix_zip_code.py
  • The table shows the types of errors associated with the zip codes and the result of correction.
import re

# Raw string so that \d is not treated as a string escape sequence.
ZIP_CODE_RE = re.compile(r'\d{5}')

def fix_zip_codes(zip_codes):
    """Expects a string.  Searches the string for 5 consecutive digits and
    returns the first match as the zip code, or '' if there's no match."""

    match = ZIP_CODE_RE.search(zip_codes)
    return match.group(0) if match else ''
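
Before correcting zip codes, it can help to tally which kinds of formatting problems occur. The following is a minimal sketch of such an audit, not the code used in project_fix_zip_code.py; the names classify_zip and audit_zip_codes, the category labels, and the sample values in the usage note are all hypothetical.

```python
import re
from collections import Counter

FIVE_DIGIT = re.compile(r'^\d{5}$')
ZIP_PLUS_4 = re.compile(r'^\d{5}-\d{4}$')


def classify_zip(value):
    """Return a coarse label describing the format of a postcode string."""
    value = value.strip()
    if FIVE_DIGIT.match(value):
        return 'ok'            # already a plain 5-digit zip code
    if ZIP_PLUS_4.match(value):
        return 'zip+4'         # 5 digits, hyphen, 4 digits
    if re.search(r'\d{5}', value):
        return 'embedded'      # 5 digits mixed with other text
    return 'no_zip'            # nothing that looks like a zip code


def audit_zip_codes(values):
    """Count how many postcode strings fall into each category."""
    return Counter(classify_zip(v) for v in values)
```

For example, calling audit_zip_codes(['97214', '97214-1234', 'OR 97214', 'Portland']) on these made-up values yields one entry in each of the four categories.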

Before / After Comparison of Street Names

def fix_street_name(name, mapping=MAPPING):
    """Splits tag.attrib['v'] on whitespace and checks each word against
    MAPPING. If there's a match, the word is changed to the new value."""
    # Replace word by word: name.replace(word, ...) would also alter other
    # words that merely contain the abbreviation (e.g. 'St' inside 'Stark').
    words = name.split()
    return ' '.join(mapping.get(word, word) for word in words)

Sample of Corrected Street Names

Additional Cleaning

SELECT value
FROM (SELECT * FROM nodes_tags UNION ALL
    SELECT * FROM ways_tags) tags
WHERE key='phone'
GROUP BY value;

The table below shows the various formats phone numbers come in. They should be corrected to a standard format for consistency.
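
One way the phone numbers might be standardized is sketched below, in the same style as the other fix functions. This is an assumption-laden example rather than part of the project: the function name fix_phone_number and the target format XXX-XXX-XXXX are hypothetical choices, and the sketch only handles values that contain a single 10-digit US number (optionally prefixed with the country code 1).

```python
import re


def fix_phone_number(phone):
    """Expects a string. Returns the number as 'XXX-XXX-XXXX', or '' if the
    string does not contain exactly one 10-digit US phone number."""
    # Drop everything that is not a digit: spaces, dots, dashes, parentheses, '+'.
    digits = re.sub(r'\D', '', phone)

    # Remove a leading US country code if present.
    if len(digits) == 11 and digits.startswith('1'):
        digits = digits[1:]

    # Anything else (extensions, several numbers in one tag) is left blank
    # so that it can be reviewed by hand.
    if len(digits) != 10:
        return ''

    return '{}-{}-{}'.format(digits[:3], digits[3:6], digits[6:])
```

Returning an empty string for values that cannot be parsed mirrors the behavior of fix_zip_codes; a real cleaning pass might instead keep the original value and log it for review.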

File & Database Overview


File Stats

Number of Nodes

SELECT COUNT(*) FROM nodes;

6,627,751

Number of Ways

SELECT COUNT(*) FROM ways;

865,354

Number of Distinct Contributors

SELECT COUNT(DISTINCT(users.uid))
FROM (SELECT uid FROM nodes UNION ALL
    SELECT uid FROM ways) users;

1,392

Database Exploration


  • This section highlights the basic topics of exploration from the dataset and the associated SQLite queries.

City Name Count

  • The OSM file encompasses 74 cities.
SELECT tags.value, COUNT(*) as count
FROM (SELECT * FROM nodes_tags UNION ALL
    SELECT * FROM ways_tags) tags
WHERE tags.key LIKE 'city'
GROUP BY tags.value
ORDER BY count DESC;

Zip Code Count

  • The OSM file encompasses 116 zip codes.
SELECT tags.value, COUNT(*) as count
FROM (SELECT * FROM nodes_tags
    UNION ALL
        SELECT * FROM ways_tags) tags
WHERE tags.key='postcode'
GROUP BY tags.value
ORDER BY count DESC;

Top 10 Contributors

  • Total user contributions: 7,493,105 entries (nodes plus ways) by 1,392 users.
  • The top 2 contributors account for 51.5% of the entries and the top 11 for 88.7%.
SELECT contrib.user, COUNT(*) as count
FROM (SELECT user FROM nodes
    UNION ALL SELECT user FROM ways) contrib
GROUP BY contrib.user
ORDER BY count DESC
LIMIT 10;
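
The percentage share held by the top contributors can be derived from the same query with the LIMIT removed. The sketch below uses Python's sqlite3 module and assumes a database with nodes and ways tables that each have a user column, as in the query above; the function name top_share is hypothetical.

```python
import sqlite3

# Same aggregation as the query above, without LIMIT, so that the total
# over all users is available.
QUERY = """
SELECT contrib.user, COUNT(*) AS count
FROM (SELECT user FROM nodes
      UNION ALL SELECT user FROM ways) contrib
GROUP BY contrib.user
ORDER BY count DESC;
"""


def top_share(conn, top_n):
    """Return the percentage of all node and way entries made by the
    top_n contributors."""
    counts = [row[1] for row in conn.execute(QUERY)]
    total = sum(counts)
    if total == 0:
        return 0.0
    return 100.0 * sum(counts[:top_n]) / total
```

Given an open connection conn to the project database, top_share(conn, 2) would give the share of the top 2 contributors.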

Interesting Explorations


  • Delving into the data shows how much Portland appreciates parking, biking and coffee. Apparently we like swimming too, even though it's only sunny for 3 months of the year.

Top Amenities

SELECT tags.value, COUNT(*) as count
FROM (SELECT * FROM nodes_tags UNION ALL
    SELECT * FROM ways_tags) tags
WHERE tags.key='amenity'
GROUP BY tags.value
ORDER BY count DESC;

Top Cuisine

SELECT value, COUNT(*) as count
FROM (SELECT * FROM nodes_tags UNION ALL
    SELECT * FROM ways_tags) tags
WHERE key='cuisine'
GROUP BY value
ORDER BY count DESC;

Sports Facilities

SELECT value, COUNT(*) as count
FROM (SELECT * FROM nodes_tags UNION ALL
    SELECT * FROM ways_tags) tags
WHERE key='sport'
GROUP BY value
ORDER BY count DESC;

Other Ideas About the Dataset


Improving the Dataset

  • Increase the number of contributors, particularly in rural or less frequented locations. Top 10 Contributors shows that most of the data comes from the top 11 users, and City Name Count shows that, of the 74 cities in the dataset, the vast majority of the data is for Portland, while some of the smaller cities have a count of only 1. OSM describes itself as "... a map of the world, created by people like you and free to use under an open license." I had never heard of OSM prior to this project requirement, so some type of local outreach like Meetup: OpenStreetMap Portland, but in other communities, might increase the user base.
  • Another idea for improving OSM is to import large datasets from other applications that have many users and geospatial data, such as Google Maps, Apple Maps or Pokemon Go, to name a few.

Benefits:

  • The single most obvious benefit is that more users means more data.

Potential Issues

  • The main issue with attracting more users is probably the process of reaching people that may be interested.
    • Meetups are mostly free, but the volume is low.
    • People have a tendency to ignore website ads
    • Commercials cost money
  • Once a potential new user is found, there are additional roadblocks
  • Large data imports from outside sources:
    • Goes against the idea of a community based map
    • "We are only interested in 'free' data. We must be able to release the data with our OpenStreetMap License"
    • There are additional technical hurdles related to importing data

If You're Interested in Contributing to OpenStreetMap

Conclusion


Based upon the collected data, as shown in Corrected OSM File Issues, there is a relatively small number of issues. Specifically, only 40 city names and 50 zip codes required standardization. Additionally, fewer than 240 street names were transformed from short form to long form.

As mentioned in Other Ideas About the Dataset, the Portland data is very thorough, but the more rural communities surrounding Portland would benefit from more users and data. Bringing awareness of the OSM project and its benefits in terms of data availability to potential new users seems to be an integral component of the continued success of OSM.