By now, you would think a 'nearby button' would be a basic feature online. But it's still hard to get all the relevant local information in one place, whether you're making a small decision like where to eat lunch or a big one like where to live.
We’re solving this problem by providing a platform that can distribute local information, insights, news and events from hundreds of media partners to any site or app.
It’s actually quite complicated to surface the right content, in the right place, about the right topics, with the right sentiment. The solution has required the development of proprietary machine learning software, advancing what is already being done in the field.
Here’s a closer look at how our machine learning system works today as part of our platform.
First, we partnered with more than 200 media publications, so we can contextualize and distribute their content to other platforms. Then we localized each article through the following layers of metadata:
- Location: down to the latitude and longitude of the exact place mentioned in the article
- Shelf life: one day, one month, or indefinitely
- Topics: main categories such as Arts & Culture, Politics & Government and Crime & Safety, plus more than 50 subcategories
- User persona: local, traveler and global
- Sentiment: positive, neutral and negative
- People: list of people mentioned in the article
- Organizations: list of organizations mentioned in the article
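As a rough sketch, the metadata layers above could be modeled like this. The field names, defaults, and coordinates are illustrative only, not our actual production schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative model of the metadata layers described above.
@dataclass
class ArticleTags:
    latitude: float                  # exact place mentioned in the article
    longitude: float
    shelf_life_days: Optional[int]   # None means indefinite
    category: str                    # e.g. "Crime & Safety"
    subcategories: List[str] = field(default_factory=list)
    persona: str = "local"           # "local", "traveler", or "global"
    sentiment: str = "neutral"       # "positive", "neutral", or "negative"
    people: List[str] = field(default_factory=list)
    organizations: List[str] = field(default_factory=list)

# Example: a month-long real estate story (coordinates are placeholders).
tags = ArticleTags(39.7047, -105.0344, shelf_life_days=30,
                   category="Real Estate & Development")
print(tags.persona)  # defaults to "local"
```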
What The Right Tags Look Like
Before we dive further into our process, take a look at some of the correctly tagged articles already on our platform.
Location: The Westwood area of southwest Denver, Colo.
Category: Real Estate & Development
Published date: May 4, 2017
Expiration date: September 30, 2017
We not only tagged the correct neighborhood, sentiment, persona and category, but also estimated the shelf life of the post down to when the park rebuilding project is supposed to be finished (the end of summer or beginning of fall). To be conservative, we tagged September 30 of this year.
Publisher: The San Jose Mercury News
Locations: 377 Santana Row, San Jose, CA and Pruneyard Shopping Center - 1875 S Bascom Ave, Campbell, CA
Category: Food & Drink
Published Date: May 9, 2017
Expiration Date: January 1, 2018
This article was a bit complicated to tag because it contained two relevant locations where the restaurant is planning to open (future sites on Santana Row in San Jose and nearby in Campbell), and two other locations that are not relevant (Mendocino and Los Angeles). Our model sorted them out correctly, and because this post is about a new restaurant opening, it also classified the sentiment as positive and marked the post as appropriate only for locals.
Publisher: 6ABC Action News (WPVI-TV)
Locations: 4700 block of Richmond Street, 2700 block of Jenks Street and 2800 block of Ash Street, Philadelphia, PA
Category: Crime & Safety
Published Date: May 9, 2017
Expiration Date: May 9, 2017
While the story was mostly in video format, we used the text to identify the main blocks named in the article, the local-only focus, and the short shelf life (given the nature of the incident, we only gave it 12 hours on the platform).
Location Extraction At The Neighborhood Level
Coming from an academic background in machine learning, I knew that most generalized problems related to metadata extraction have well-understood solutions already—all you need is a lot of quality data to work with.
The harder part is creating a solution that addresses every part of the problem you're facing. For this reason, we developed our own machine learning infrastructure built on top of open source software and proprietary algorithms. It has already scaled to apply 180,000 tags per hour to the articles we ingest.
We now have proprietary methods for tagging an article's exact location (latitude and longitude down to a 0.2-mile radius) and shelf life, two concepts that, for our purposes, have not been widely addressed by the machine learning community in the local news domain.
Here’s a closer look at how we approached location tagging to make sure the right stories get to the right users. To do this, we needed to extract the appropriate places mentioned in the text of an article, while cutting out any location that was mentioned but was not the focus of the story.
First, it's important to understand that geotagging news articles is not a new problem, at least at the citywide level. For example, researchers at the MIT Center for Civic Media describe using a tool called CLAVIN, combined with their own data pipeline tool called CLIFF, to determine an article's focus city and country. In our tests of the system they described, we reached 85 percent accuracy on focus city extraction.
But we want more than city-level focus: we want to extract exact locations at the neighborhood level. Our secret sauce combines advanced machine learning for finding the right set of locations in the article text with location-point density heuristics calculated on our proprietary GIS.
Finding location names in an article is a machine learning task called Location Entity Extraction from unstructured text. There are several online tools to solve this problem (Google NLP, IBM Watson), but none of them is optimized for our local/news domain.
To solve this, our human team at Hoodline manually provided location tags for a set of articles that we could use to train and improve the machine learning model. This showed what kinds of words and phrases were relevant to improve the model for our domain.
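For illustration, one labeled example in such a training set might look like the following. The character-span format mirrors common NER annotation conventions; it is not our actual internal schema:

```python
# A hypothetical labeled training example: article text plus the
# character span of each location mention an editor marked.
example = {
    "text": ("Action News is told a truck struck a hydrant in the "
             "4700 block of Richmond Street."),
    "locations": [(52, 81)],  # character span of the location mention
}

# Recover the annotated location string from the span.
start, end = example["locations"][0]
print(example["text"][start:end])  # → 4700 block of Richmond Street
```

A model trained on many such examples learns which surrounding words and phrases signal a genuine location mention in local news prose.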
You can see our approach in the examples above. The text in the fire hydrant story read:
Action News is told a truck struck a hydrant in the 4700 block of Richmond Street, sending thousands of gallons of water gushing into the street.
Our system knows that a phrase like “the [address number] block of [street name]” is a strong signal of story location, but to be sure it is an exact location, we try to resolve the string by querying our GIS system, Atlas. This is also the system we use to know that Bridesburg is a neighborhood in Philadelphia, so we can extract the city as well.
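As an illustration only, that single block-address signal could be approximated with a regular expression over the quoted sentence. The real system uses a learned model, and the Atlas GIS resolution step is omitted here:

```python
import re

# Simplified stand-in for one learned signal: the
# "the [address number] block of [street name]" pattern.
BLOCK_RE = re.compile(
    r"the (\d{1,5}) block of ([A-Z][\w']*(?: [A-Z][\w']*)*)"
)

text = ("Action News is told a truck struck a hydrant in the 4700 block "
        "of Richmond Street, sending thousands of gallons of water "
        "gushing into the street.")

match = BLOCK_RE.search(text)
print(match.group(1), match.group(2))  # → 4700 Richmond Street
```

Note that the lowercase “the street” at the end does not match, which is the point: the pattern fires only on block-style addresses, and a GIS lookup would then confirm the string resolves to a real place.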
This Named Entity Extraction technique is just one piece of the location extraction engine. We also had to implement toponym resolution based on our GIS system, and focus determination based on heuristics more complex than a simple mention count.
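A minimal sketch of what “more complex than a simple mention count” could mean: weight each mention by where it appears in the article, so early mentions count more. The weighting scheme here is hypothetical, not our production heuristic:

```python
from collections import Counter

def focus_location(mentions):
    """Pick a focus location from (name, paragraph_index) pairs,
    weighting mentions in earlier paragraphs more heavily.
    Illustrative heuristic only."""
    scores = Counter()
    for name, paragraph in mentions:
        scores[name] += 1.0 / (1 + paragraph)  # earlier = heavier
    return scores.most_common(1)[0][0]

mentions = [
    ("Richmond Street, Philadelphia", 0),  # lead paragraph
    ("Jenks Street, Philadelphia", 1),
    ("Richmond Street, Philadelphia", 2),
]
print(focus_location(mentions))  # → Richmond Street, Philadelphia
```

Under a raw count, a place mentioned many times in passing could outrank the story's actual focus; position weighting is one simple corrective.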
The image above visualizes the correlations between words across our news article database.
Conclusion: Getting Better All The Time
For most of the tags we need, such as topics, sentiment and persona, we found that average human accuracy was lower than we expected. Our technology not only outperforms human accuracy by 15 percent, but it also allows us to tag in real time and at scale.
By testing and iterating on our location model, we've built an algorithm that today reaches 90 percent accuracy when predicting the exact focus city, and that can also extract all relevant exact locations mentioned in an article within a city.
Our system also provides a complete feedback loop: editors improve the tags, machines learn from the new tags, and the bar keeps rising.
This has allowed us to scale the process to automatically tag thousands of articles about thousands of locations coming from hundreds of publishers, distributing the right information to the relevant site or app.
While we're happy with our progress so far, it's just the start of what we have planned. We're working on more sophisticated localization algorithms, deeper analysis of local data, and much more.