People don’t just want to read local news on the web, mobile devices or social networks. We’re finding that they want to read it on all sorts of online and mobile platforms that might not be so obvious, from real estate sites to local shopping and dining apps, and more.
So we’ve built a technology solution that lets us automatically distribute thousands of articles per day from hundreds of media partners, organized by location, to cities across the country.
Here’s more about the problem and how we’ve solved it.
The need for distributed local news
Local news articles about businesses, things to do, city issues and culture are a key way that people research how they spend their money and time.
We’ve been learning this in a couple of ways in recent years. On the Hoodline side, we built the Neighborhood Kit product, which we launched last year. It’s an embed or API-accessible feed of local news stories, organized by topic and freshness, for neighborhoods across San Francisco.
On the Ripple side, we began distributing feeds of content from ourselves and partner news organizations across the country on tech platforms, like the Uber test we talked about last fall.
Through conversations with technology companies that provide local-related services, we’ve realized that this concept is a lot bigger. We’ve been scaling up this notion of local news on demand for all sorts of use cases across the country, and testing it with tech partners.
Today, you can see articles from Hoodline and local media partners linked on real estate listing sites around the city, helping prospective buyers and renters learn more about the places they’re considering moving to. And you can see Hoodline articles appearing in Eventbrite’s weekly emails to its Bay Area users.
This required us to scale our API to include:
- Feeds from more than 200 partner local news organizations
- Organized across 20 cities
- Including metadata about locations, topics and sentiment
- Partner control over metadata (do they want only food news? only crime news?)
- In formats that could be presented well on partner web and mobile interfaces
The content pipeline
RSS is the most uniform and ubiquitous standard available for receiving updates about new stories from publishers (even if the technology is still unfamiliar to most internet users).
So for each publisher relationship, we ask partners for all the feeds that they want us to include in our system.
Our RSS aggregator constantly polls all the available feeds for updates. Every time a new story appears in a feed, we ingest specific information such as the link, title, description, and main image. We also enrich each link with information the partner provides on the article page, like Open Graph meta tags, the canonical URL, and the AMP link.
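The ingestion step can be sketched roughly like this. This is a minimal illustration, not our actual pipeline: the sample feed, field names, and `ingest_items` function are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Toy RSS 2.0 feed standing in for a real partner feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Local News</title>
  <item>
    <title>New Cafe Opens Downtown</title>
    <link>https://example.com/new-cafe</link>
    <description>A new cafe has opened on Main Street.</description>
  </item>
</channel></rss>"""

def ingest_items(feed_xml):
    """Extract the basic fields tracked for each new story in a feed."""
    root = ET.fromstring(feed_xml)
    stories = []
    for item in root.iter("item"):
        stories.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "description": item.findtext("description"),
        })
    return stories

stories = ingest_items(SAMPLE_FEED)
print(stories[0]["title"])  # New Cafe Opens Downtown
```

In a real system, each ingested link would then be fetched so the article page’s Open Graph tags, canonical URL, and AMP link can be read and attached to the record.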
We also differentiate between post types. For example, a partner like ABC News, which publishes both video and text-based stories, has its content categorized differently according to the format of each published story.
The moment a story is normalized and added to our system, it goes through our tagging pipeline. First, we try to tag the story using machine learning models (which we’ll share more about in a separate post).
If that’s not possible, the story goes to a second step, where real people read and tag it using predefined guidelines.
Take this ABC story, for example: "Horseback Rides Offered At San Francisco's Golden Gate Park." It describes something you can do in a particular location, so within our taxonomy we set its category to "Events / Things to do", with "positive" sentiment, relevant to someone living in or traveling to the city.
The event will happen in Golden Gate Park, so that is the story's focus location: not a particular latitude/longitude, but a polygon representing the park's boundaries.
The shelf life, or how long the story stays relevant, is defined by the event's end date, in this case May 20th, 2017. After that date, the story is considered expired in our system.
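A story record along these lines, with a polygon focus location and an expiry date, might be checked like this. The polygon coordinates are toy values, not Golden Gate Park's real boundaries, and the point-in-polygon test is a standard ray-casting sketch rather than our actual geo stack.

```python
from datetime import date

# Toy rectangle standing in for a real park boundary polygon.
GOLDEN_GATE_PARK = [(0.0, 0.0), (4.0, 0.0), (4.0, 1.0), (0.0, 1.0)]

def point_in_polygon(pt, polygon):
    """Ray-casting test: is the point inside the polygon?"""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count edge crossings of a horizontal ray from the point.
        if (y1 > y) != (y2 > y):
            if x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                inside = not inside
    return inside

def is_expired(story, today):
    """A story past its shelf-life end date no longer surfaces."""
    return today > story["expires"]

story = {"title": "Horseback Rides Offered At Golden Gate Park",
         "focus_polygon": GOLDEN_GATE_PARK,
         "expires": date(2017, 5, 20)}

print(point_in_polygon((2.0, 0.5), story["focus_polygon"]))  # True
print(is_expired(story, date(2017, 6, 1)))                   # True
```

Representing the focus location as a polygon, rather than a point, is what lets a query like "stories about this park" match regardless of where inside the park the event happens.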
Stories tagged by humans are then used in future machine learning training/test sets, continuously improving the system.
As soon as a story is tagged, it becomes available to partners via our public API, which lets them slice and dice the entire content repository with complex queries like: show me all crime stories from the last year within half a mile of San Francisco City Hall; or, show me all the positive stories published in the last 48 hours in downtown Manhattan.
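The kind of slicing those queries describe can be sketched as a simple filter over tagged records. The parameter names, the in-memory repository, and the precomputed distance field are all illustrative, not our real API's shape.

```python
from datetime import datetime

# Tiny in-memory stand-in for the tagged content repository.
REPOSITORY = [
    {"title": "Burglary near City Hall", "category": "Crime",
     "published": datetime(2017, 5, 1), "miles_from_city_hall": 0.3},
    {"title": "New park opens", "category": "Events / Things to do",
     "published": datetime(2017, 5, 2), "miles_from_city_hall": 2.1},
]

def query(repo, category=None, max_miles=None, since=None):
    """Filter stories by category, distance, and recency."""
    results = repo
    if category is not None:
        results = [s for s in results if s["category"] == category]
    if max_miles is not None:
        results = [s for s in results if s["miles_from_city_hall"] <= max_miles]
    if since is not None:
        results = [s for s in results if s["published"] >= since]
    return results

# "All crime stories within 0.5 miles of San Francisco City Hall."
hits = query(REPOSITORY, category="Crime", max_miles=0.5)
print([s["title"] for s in hits])  # ['Burglary near City Hall']
```

A production version would push these filters into a geo-aware query engine rather than scanning lists, but the composable-filter shape is the same.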
Here at Hoodline, we believe that providing better location information can help people understand the world around them, get involved in their communities, and make better decisions for themselves and the world.
Stay tuned for more, including ways we'll be bringing more stories to each location, and how we're digging into local data to figure out how cities really work.
Main technologies and services used in our new platform:
Heroku: Deploying, managing, and scaling servers should not consume the time of a small team like ours. We are trying to move fast and focus on our current product, and Heroku lets us do that.
Github + CircleCI + Codecov: We should be able to develop, iterate, and deploy reliably at any time, ensuring quality and zero disruption. These services are the cornerstone of our development and deployment process.