Transitland, towards a more robust open-source transit software stack — interview with Drew Dara-Abrams


Transitland is opening up doors to conveniently find and use transit data (see Thomas’s earlier June 2016 Trillium blog post on Transitland). To get a full understanding of the Transitland project, its ambitions, and the opportunities it is opening, I talked with Drew Dara-Abrams (Twitter: @drewdaraabrams), head of mobility products at Mapzen. Below is the full text of our interview. Thank you, Drew, for your time and your contributions to this impressive and important project!


What are the goals of Transitland?

Aaron: Thanks for participating in this interview! Transitland is very ambitious. What are its goals in the broadest possible sense?

Drew: Transitland is coming onto this scene of transit data, the GTFS spec and a whole ecosystem of transit apps and software for transit agencies that goes back a good ten years if not more. Our goals with Transitland are to take advantage of all this shared interest of producing and consuming transit data and build a center of gravity both in terms of technical functionality and in bringing together a community that can collaboratively improve data in one place.

The GTFS [General Transit Feed Specification] has become very well-adopted, but for better or worse the ease of adoption has also meant that these .ZIP files are just floating around the internet. We’ve seen a number of projects over the years to provide some structure; Transitland is a continuation of those projects. We aim to help aggregate GTFS feeds from authoritative sources, make them queryable, make it possible for people to contribute crowd-sourced information, and really just build a center of gravity for all these important activities.

Aaron: So what I hear you saying is that a broad goal of Transitland is to make it less necessary for everyone to keep reinventing wheels. That was one of the original purposes of GTFS. Still, those of us working with transit data are building the same tools again and again. Is that an apt summary?

Drew: Yes, that’s a fair way of putting it. We can think of this in terms of layers in a stack. The GTFS has really worked very well for data interoperability. It’s to all our benefit that TriMet, Google and partners made that specification freely available. What’s been an open question is infrastructure for collecting, aggregating, validating and improving those feeds. Google and others have built systems like this internally; they’ve each solved the problems of aggregation as it relates to their product and their business needs, and it’s totally understandable that these organizations keep their administrative systems private. Google is more than generous in sharing the GTFS. With Transitland, we are looking for opportunities and needs where we see these same costs being incurred to do data wrangling across the industry.

Every time an app developer has wanted to build a transit app they’ve put in a lot of energy to build up this infrastructure. Manual merging is necessary to bring multiple feeds together, along with building the scripts and the server architectures necessary to do it repeatedly. It’s all straightforward, but it does involve engineering effort and it’s a real cost. That’s not where any unique functionality of an app or a product comes from. So we see a pattern of either small efforts that do well for some time and then fizzle out or successful startups that grow and then are acquired by another company and then go quiet.

The proposition of Mapzen is to bring together a bunch of partners and solve this together. Let’s work together and share the burden with open-source software and open data. Then we can all get on with doing what makes each of our products unique in terms of user experience or functionality, or the unique ways we offer our services on top of this common foundation.

What is the strategy and vision behind Transitland?

Aaron: So Mapzen is reducing the barrier to entry for transit applications — making tools available that a lot of other companies are developing and using internally. Why is Mapzen doing that? What are the broadest goals of the organization?

Drew: We want to up everyone’s game. Users can build on top of Transitland. We can go beyond another app to see when the next bus is coming and can start tackling more sophisticated questions like looking at transit frequency across entire metropolitan regions. This is a little harder to describe in words (although Jarrett Walker is the most articulate explainer of these concepts); it’s much more interesting to look at some visuals:

NYC bus - saturday frequency

[See blog posts: Made using Transitland: An interactive visualization of New York City transit frequency and Transit dimensions: Announcing the Datastore schedule API.]

Drew: At the same time Mapzen is building its own products that make use of transit data. We offer a complementary product called Mapzen Turn-by-Turn. It’s a hosted multi-modal routing engine that makes use of transit data from Transitland. It’s powered by its own open source project called Valhalla.

Our goal of equipping external users to build more sophisticated apps, analyses, visualizations and maps is complementary with our effort to build up the infrastructure necessary to offer some of our own services. This is no different from what companies like MapQuest, Google, Apple, Microsoft and HERE have done already. It’s just a bit unique that Mapzen offers plug-in points at many different levels of the stack rather than only a tightly integrated consumer app. Google Maps and similar applications are just the tip of the iceberg; we want to open up more of the infrastructure that powers those types of applications.

Aaron: Right, and many companies are competing based on their access to transit and street network data but it doesn’t sound as though that is Mapzen’s model. I understand Samsung funds Mapzen. Can you comment more generally on that model and the strategy and where you see this taking the organization?

Drew: Mapzen is a startup, about three years old at this time, and our mandate is to build all the pieces of a stack that developers would need to power their own applications for the web or mobile devices. We’re part of Samsung’s accelerator program. Samsung has been working over the past few years in the U.S. to fund or work with startups that are pursuing new approaches to software. Samsung has many strengths across the range of hardware and components and the many parts of the company are now getting more creative at how to build interesting new pieces of software, applications, and experiences. Mapzen operates independently. We work with a number of other corporate partners, so it’s a unique combination. It means that we can be building software and data projects fully in the open so that they can be used by any number of our corporate partners as well as hobbyists and smaller developers and startups that are looking for this kind of tooling.

I’ll also add that the geo-world is big. Open-source and open data solve a lot of problems but at the same time there still is some real value in super-polished, super-targeted experiences like Google’s and Apple’s and Microsoft’s and Esri’s. Mapzen’s motto is “start where you are.” We are trying to meet developers where they are and it means we can do a lot more work together but it may not be as polished or fulfill the same exact needs as proprietary approaches.

How does Transitland help support scale and make transit data more seamless?

Aaron: I’m hearing that it’s not your goal to do everything and be all things to all people. Transitland is aiming to fill some gaps and overcome some barriers. What are those gaps and barriers? What are the specific barriers to the proliferation of transit information apps right now, or rather, what are the barriers to developing novel and improved functions and apps?

Drew: I think a lot of the challenges we’re trying to address with Transitland come from situations where you want to work with more than one GTFS feed at a time. With Transitland we have worked with a number of research groups and organizations to figure out what we call the Onestop ID scheme. It’s an experimental approach for identifying and merging stop locations that are served by multiple transit operators, and for giving globally unique IDs to feeds, stop locations, routes and some of the other aspects of GTFS datasets that are currently only unique in the context of one feed. Things get more involved when you want to bring together multiple or hundreds or thousands of feeds. Transitland helps with this. To get an idea of the magnitude of effort we need to help support, Google currently reports that transit information for 18,000 cities in the world is available in Google Maps.

The Onestop ID scheme and the merging logic behind it can be really useful, but it’s not yet a full solution and it doesn’t always work. We’re currently building out some quality checks and editing tools. We’re looking forward to sharing that with more users.
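
To make the idea of a globally unique, location-aware ID more concrete, here is a minimal Python sketch. It is not Transitland's implementation: the ID layout (an entity prefix, a geohash of the coordinates, and a normalized name) and the function names geohash and make_stop_id are assumptions chosen for illustration, and the real merging logic has to cope with many cases this ignores.

```python
import re

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash(lat: float, lon: float, precision: int = 7) -> str:
    """Encode a latitude/longitude pair as a standard geohash string."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, value = [], 0, 0
    use_lon = True  # bits are interleaved, starting with longitude
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            value <<= 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


def make_stop_id(lat: float, lon: float, name: str) -> str:
    """Hypothetical global stop ID: 's-<geohash>-<normalized name>'."""
    slug = re.sub(r"[^a-z0-9]", "", name.lower())
    return f"s-{geohash(lat, lon)}-{slug}"


# Two feeds describing the same physical stop with different local IDs
# and slightly different spellings end up with the same global ID.
feed_a_stop = make_stop_id(37.8035, -122.2717, "12th St. Oakland")
feed_b_stop = make_stop_id(37.8035, -122.2717, "12th st oakland")
```

Because the ID is derived from location and name rather than from a feed's own stop_id column, it stays meaningful outside the feed it came from. The trade-off is that stops near a cell boundary, or with very different names, will not match without extra logic.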

To support scale, we’ve also been putting effort into figuring out the legal patchwork of licensing around transit data feeds. This is an issue with open data in general, and we see some particular challenges with transit. We engaged a lawyer who focuses just on open data and reviewed licenses for a couple dozen of the major agencies and found how many unique variants there are.

There are a couple out there that just aren’t permissive — you aren’t able to do that much with that feed. And that’s well within the agency’s rights but it poses complications if you want to bring lots of agencies together. Maybe you want to bring ten feeds in at one time and they each have some varied permissions, so we’ve been putting in effort to at least catalog what is possible with each feed legally. We’ve also been continuing a number of conversations with organizations that played a long game of nudging agencies in a positive way towards open data licenses (see model license).

Aaron: That makes a lot of sense. It seems as though, for most agencies, if they release open data under a particular license, they likely desire to see applications created and to see that data be put to use. But if licenses are all unique without any standardization, then that patchwork is creating a barrier, and, in a sense, defeating the purpose of open data. It sounds like you’re working and contributing to a larger effort to solve that problem.

Drew: Yeah, that is one case where we felt that it’s important for us to make use of some of our resources to do the research to better identify the [licensing] issue. We want to make sure that everything we do is seen as positive and helpful by the agencies that make this data; I’m sure you know better than I do but I imagine many of these are transit agency staff who already have too many demands on them and don’t need consumers of their data making their jobs harder.

What can we do with the Datastore API?

Aaron: Let’s talk in more depth about Transitland’s Datastore API. Am I correct in understanding that the Datastore API really is a central piece of Transitland? What else is in there, what are some of its unique functions that other API’s don’t have?

Drew: The core of the Datastore is aggregation of feeds and operators, and these can be browsed through a couple of different frontend interfaces. We have one app called the Feed Registry, a catalog of many feeds with a mechanism to add more. There’s a really straightforward process for community members or transit agencies to come in and add their feed. This includes an abundance of feeds supplied by Trillium (list of feeds outputted in JSON). So that gives the baseline catalog. The Datastore has background jobs that run on that catalog throughout the day for checking for new versions of the feeds. The “source of truth” still resides on an agency server. We go out and check to see if any changes have been made and then download those feeds and import the data. That’s where the power of the API comes from. Then, you can drill down into stops, routes, or route variants. We do some processing on the geometry to pull out all the different stop patterns within a route, and schedule data. We actually abstract some aspects of the GTFS models, so we’re presenting all of the same data but we’re making it even easier to query.
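
The check-for-new-versions step can be pictured roughly as follows. This is a sketch under assumptions, not the Datastore's code: it supposes that a feed version is recognized by a hash of the downloaded ZIP bytes, and the names feed_fingerprint and needs_import are hypothetical. Downloading from the agency's URL is left out so the example stays self-contained.

```python
import hashlib
from typing import Optional


def feed_fingerprint(feed_bytes: bytes) -> str:
    """Return a SHA-1 hex digest identifying this exact feed file."""
    return hashlib.sha1(feed_bytes).hexdigest()


def needs_import(feed_bytes: bytes, last_fingerprint: Optional[str]) -> bool:
    """True when the fetched file differs from the last imported version."""
    return feed_fingerprint(feed_bytes) != last_fingerprint


# Simulated fetches: the agency's file is unchanged, then updated.
stored = feed_fingerprint(b"feed contents, version 1")
unchanged = needs_import(b"feed contents, version 1", stored)
updated = needs_import(b"feed contents, version 2", stored)
```

A background job built this way only re-imports when the bytes on the agency server actually change, which keeps the agency's copy as the source of truth.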

In a raw GTFS feed if you wanted to answer certain questions you might have to do a lot of joined queries, working through all the trips that a particular route takes and you’d have to look up all their stop times. The Transitland Datastore abstracts that model and lets you run queries that cut some corners.
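
As an illustration of the joins Drew describes, the Python sketch below answers "which stops does Route 3 serve?" directly from raw GTFS tables. The column names (route_id, route_short_name, trip_id, stop_id, stop_sequence) come from the GTFS specification; the tiny inline tables and the helper stops_for_route are made up for the example.

```python
import csv
import io

# Minimal stand-ins for routes.txt, trips.txt and stop_times.txt.
ROUTES = "route_id,route_short_name\nR3,3\nR7,7\n"
TRIPS = "route_id,service_id,trip_id\nR3,WKDY,T1\nR3,WKDY,T2\nR7,WKDY,T3\n"
STOP_TIMES = (
    "trip_id,departure_time,stop_id,stop_sequence\n"
    "T1,07:00:00,S1,1\nT1,07:05:00,S2,2\nT1,07:10:00,S3,3\n"
    "T2,07:30:00,S1,1\nT2,07:38:00,S3,2\n"
    "T3,08:00:00,S9,1\n"
)


def read(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def stops_for_route(short_name):
    """Join routes -> trips -> stop_times to list a route's stop IDs."""
    route_ids = {r["route_id"] for r in read(ROUTES)
                 if r["route_short_name"] == short_name}
    trip_ids = {t["trip_id"] for t in read(TRIPS)
                if t["route_id"] in route_ids}
    return sorted({st["stop_id"] for st in read(STOP_TIMES)
                   if st["trip_id"] in trip_ids})


route_3_stops = stops_for_route("3")
```

Even this simple question touches three files. An API that pre-computes the relationship lets a client ask for a route's stops in one request instead of repeating the join.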

Aaron: So give me some examples: can I ask what’s the span and frequency for Route 3? Can I ask at what point the frequency changes on Route 3?

Drew: Those questions you just ticked off are actually things on our roadmap right now. We’ve written that kind of aggregate frequency analysis with example scripts, and that’s what’s used to build a frequent network map. We have plans to bake that into the API itself, so you don’t have to run the little script locally.

Aaron: Very cool.

Drew: I’m looking forward to that because the idea of a map of the most frequent transit service in an area — a backbone of dependable service — it’s a good concept with legs.
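
A span-and-frequency calculation of the kind discussed here can be sketched in a few lines of Python. This is not the example script Drew mentions; it is a simplified illustration that assumes you already have the departure times for one route at one stop, in the GTFS HH:MM:SS format (where hours may exceed 24 for trips that run past midnight). The function names are hypothetical.

```python
def to_seconds(gtfs_time):
    """Convert a GTFS 'HH:MM:SS' time (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in gtfs_time.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def span_and_headway(departures):
    """Return (service span, average headway), both in minutes.

    The average headway is None when there are fewer than two departures.
    """
    secs = sorted(to_seconds(t) for t in departures)
    gaps = [later - earlier for earlier, later in zip(secs, secs[1:])]
    span_minutes = (secs[-1] - secs[0]) / 60
    avg_headway = sum(gaps) / len(gaps) / 60 if gaps else None
    return span_minutes, avg_headway


span, headway = span_and_headway(
    ["07:00:00", "07:15:00", "07:30:00", "08:00:00"]
)
```

Running the same calculation per hour of the day, rather than over the whole span, is one way to see where a route's frequency changes.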

Aaron: Right now, could I use a Datastore API query to discover that a route runs in two directions, and then query the set of stops based on direction? Or could I determine a route’s primary pattern and deviations? Give us a feel for other responses the API can return.

Drew: Right now, the API does support drilling down through all of the stop patterns within a route. One of our engineers has also built out a cool frontend interface called the Route Spotter that demos that functionality.
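
One simple way to think about stop patterns is to group a route's trips by the exact ordered sequence of stops they visit. The Python sketch below does only that, as an illustration; the Datastore also works with route geometry, and the function name stop_patterns and the input shape are assumptions.

```python
from collections import defaultdict


def stop_patterns(stop_times):
    """Group trips by their ordered stop sequence.

    stop_times: iterable of (trip_id, stop_sequence, stop_id) tuples,
    with stop_sequence already converted to int.
    Returns {tuple_of_stop_ids: [trip_id, ...]}.
    """
    by_trip = defaultdict(list)
    for trip_id, sequence, stop_id in stop_times:
        by_trip[trip_id].append((sequence, stop_id))

    patterns = defaultdict(list)
    for trip_id, rows in by_trip.items():
        # Sort by stop_sequence so input row order does not matter.
        pattern = tuple(stop_id for _, stop_id in sorted(rows))
        patterns[pattern].append(trip_id)
    return dict(patterns)


rows = [
    ("T1", 1, "A"), ("T1", 2, "B"), ("T1", 3, "C"),
    ("T2", 1, "A"), ("T2", 2, "B"), ("T2", 3, "C"),
    ("T3", 2, "C"), ("T3", 1, "A"),  # short-turn variant, rows out of order
]
patterns = stop_patterns(rows)
```

The pattern shared by the most trips is a reasonable first guess at a route's primary pattern, and the rest are its deviations.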

The API also currently enables things like looking for routes and for stops that are wheelchair-accessible or allow bikes. Those flags are stored in GTFS. It makes for an interesting exploration on top of the aggregate data and with Onestop IDs. It means that you can cook up a query and pretty easily move back and forth just by using the same IDs.

Aaron: Fantastic! It’s really cool to think about. One hypothetical application this brings to mind was described in my GTFS Today and Tomorrow blog post: there is a need to survey and compare GTFS usage practice. There are a number of optional and proposed fields in GTFS. Sometimes there are also many different possible approaches to describe the same transit schedule — do you use frequencies.txt or write out all of the schedule in the trips.txt and stop_times.txt files, for example? I’m wondering, do you see Transitland as a tool to do these large-scale comparisons of GTFS usage or is that a separate project? I do know that some of the big organizations have their own internal tool for doing just that, but everyone outside of those organizations often has to resort to educated speculation, selective sampling, or manual brute force in order to draw conclusions on GTFS usage.

Drew: Some of our engineers and some of our collaborators have been slowly putting in descriptive overviews of feed content, validations, and quality checks. I would love to see Transitland serve many more needs in this regard because the platform has the right plug-in points to make that happen. So far we’re just figuring out: there are certain types of frequency-based schedules that Transitland is not yet able to fully handle and others that it is. We also go through and figure out whether there are shapes or not.

Transitland really has all the plug-in points to provide these types of statistics, but it would be great to have input and participation from the people who know exactly which statistics they need so we can build those things together rather than sketch something that might be collecting the wrong stats. It would be great to do it in an incremental way.

Who is using Transitland?

Aaron: Are you seeing some applications that are really actively using Transitland and starting to put the rubber on the road, so to speak, or is that still ramping up?

Drew: So far, we have seen an interesting variety of projects. One we are very intentionally pursuing within Mapzen is to make use of Transitland to power our turn-by-turn routing service, so now you can route against a couple hundred feeds from Transitland. I mentioned that first just because we found it really helpful to be our own data consumers with that routing engine. It’s a pretty hefty customer, so it really pounds the Transitland API and that’s forced us to do optimization early.

We’ve had a steady stream of interesting pleasant surprises. On the Transitland blog there’s a write up about an open data group in Italy that registered a number of local feeds with Transitland and then built a chat bot on top of that so you can chat back-and-forth with this bot — send it your location and get back service times (see blog post). This was such a pleasant surprise. The bot was built with Italy in mind so all the text was in Italian. We had to get one colleague who could read Italian to translate but we were able to start using the app here in the Bay Area and the East Coast because it’s hitting the Transitland API. It works anywhere there’s coverage.

Also some interesting analyses of transit frequency. Also on the Transitland blog, there’s a post by a talented developer who put together an overview of frequent service across the New York metro area. And a project in DC looking at transit stops versus bike share locations.

Aaron: It certainly whets the appetite to see these kinds of ideas coming up. You’re talking about the bot and I’m immediately thinking of what we can do with that, in conjunction with Transitland, with Amazon’s Alexa (see Alexa Voice Service) or Apple’s Siri (SiriKit), because I think they are planning to open that up to more third-party developers.

Drew: One quick final thought on your comment about exciting project ideas: At Mapzen we’ve cooked up so many ideas like that ourselves and then have actually consciously decided not to do a number of them because we figure the platform will be most successful if we give room to outsiders to take their own unique take on it rather than preclude all the possibilities.

Aaron: Yeah, that’s an important signal to send out to the industry.

Drew: We also took a bit of a gamble by not pre-seeding coverage. We built the system and made it work in the San Francisco and New York metro areas, which are pretty complex places with overlapping agencies, so we could test out the aggregation model and the Onestop IDs. Instead of hiring contractors to go out and build a huge amount of coverage we kind of deliberately just opened the door. I think part of why the Italian bot example is so great is because the open data group submitted the feed because they wanted to use it. So the patchwork coverage at Transitland shows not just where we happened to have found data but where our users want to use it. This is similar to the way OpenStreetMap has grown, too, and I think it has really thrived both because it’s technically useful and because it keeps engaging contributors.

Transitland and OpenStreetMap

Aaron: How do Transitland and OSM fit together?

Drew: Transitland is built to work together with OpenStreetMap. I remember back to a panel I organized at last year’s State of the Map conference where we had creators of a number of these types of projects that aren’t built within OSM but sit alongside OSM and have loose connections to OSM data. Part of that decision is driven by licensing reasons: OSM can be a challenge for certain types of uses within business settings. Also OSM is built for geographic data only, but a lot of transit data is also temporal. Though contributors do try to squeeze it into OSM tags, Transitland is really purpose-built for transit data.

One of the background processes we run is to conflate all stop locations with OSM, so for each stop you can get an ID for the OSM pedestrian way that’s closest to that stop. That’s used by Mapzen’s routing engine to connect together the transit routing graph and the pedestrian routing graph. We have also seen some interesting analyses and ideas come up that make use of both Transitland and OSM data. In any case this idea is to support a loose coupling so users are free to use both together but if they have licensing restrictions they can choose one or the other.
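
The conflation step can be pictured as a nearest-neighbor search from each stop to candidate OSM ways. The Python sketch below is only an illustration of that idea and not the process Mapzen runs: it assumes the ways are already loaded as lists of (lat, lon) nodes keyed by a way ID, it uses a flat-earth approximation that is reasonable only over short distances, and the names nearest_way and the sample IDs are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0


def _to_xy(lat, lon, ref_lat):
    """Project to local metres (equirectangular, fine for short distances)."""
    x = math.radians(lon) * EARTH_RADIUS_M * math.cos(math.radians(ref_lat))
    y = math.radians(lat) * EARTH_RADIUS_M
    return x, y


def _distance_to_segment(p, a, b):
    """Distance in metres from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Clamp the projection of p onto the segment to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def nearest_way(stop, ways):
    """Return the ID of the way closest to stop.

    stop: (lat, lon); ways: {way_id: [(lat, lon), ...]}.
    """
    ref_lat = stop[0]
    p = _to_xy(stop[0], stop[1], ref_lat)
    best_id, best_distance = None, float("inf")
    for way_id, nodes in ways.items():
        points = [_to_xy(lat, lon, ref_lat) for lat, lon in nodes]
        for a, b in zip(points, points[1:]):
            distance = _distance_to_segment(p, a, b)
            if distance < best_distance:
                best_id, best_distance = way_id, distance
    return best_id


ways = {
    "way/near": [(37.0001, -122.001), (37.0001, -121.999)],
    "way/far": [(37.0010, -122.001), (37.0010, -121.999)],
}
closest = nearest_way((37.0, -122.0), ways)
```

Storing only the resulting way ID alongside the stop keeps the two datasets loosely coupled, which matches the licensing point made above.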

Aaron: As an aside, I was wondering if a way of getting street names from OSM for a route or route alignment is part of the Datastore API?

Drew: That’s a great idea. That may actually be possible in the future because of two other Mapzen projects. One is Mapzen Turn-by-Turn, the routing engine that powers a number of transit routing apps. Behind the scenes, Remix calls Mapzen Turn-by-Turn whenever users are dragging and dropping to place a bus line on one of their maps.

Just in the last week, we started announcing some work we’ve done with Mapillary, a startup that does an open equivalent to Google Street View. They’ve extended the Valhalla project that powers Mapzen Turn-by-Turn, so now it can do matching: you can upload GPS tracks and get back geometry that traces an OSM road network.

Aaron: Right now Transitland is using static GTFS. Is there any thought of incorporating real time in some way or is that outside the scope of this project?

Drew: It’s been on our drawing board. One of Mapzen’s engineers has done some experiments. We have also chatted with a number of groups with more experience with real-time data. I definitely see a place for it and I hope we can do that in the future. I think that it will be a natural addition on top of this stable foundation of the Feed Registry and the Onestop ID scheme, and that as that grows and matures being able to share real-time data with the same ID reference scheme will be much more straightforward. Again, if we are going to do that in the future it would be this same type of model of aggregating from authoritative sources.

Aaron: The big theme I’m hearing here is you’re interested in building stable foundations first. We need a better foundation for static transit data even before we can tackle real time on a broad scale.

Drew: I agree. I think this is all about having parallel progress on specific necessary applications — where we all join together to figure out how to implement that. Then we step back from the individual applications and look at a foundation that will serve all of them at the same time. And we prioritize so that we’re appropriately ambitious but finding commonalities and identifying core infrastructure that’s going to be useful for a wide range of consumers and contributors.

Aaron: How do you see official and crowd-sourced transit data fitting together, either or both in Transitland and broadly? Is there a role for crowd-sourced data? What is it?

Drew: I think there are three different varieties of crowd-sourced data here. One is just the catalog of feeds itself; we are making use of crowd-sourcing mechanisms to do that. That’s part of why we can definitely say the Transitland Feed Registry is in the public domain because we asked everyone who contributes to that to agree to those terms. That does not include the feed files, but that catalog. That’s also how the OpenAddresses project has grown — building that kind of a registry.

Then I think there is a place for data contributed where there is no authoritative coverage. Transitland is moving in the direction of offering tools that can be used to do very simple additions — here’s a stop that is not represented in a GTFS feed. This is not about building up an authoritative feed or doing the level of detail necessary for what an agency needs. But we are working towards that sort of “greenfield” crowd-sourcing. I think you’re right to put your finger on a combo approach — an agency with an authoritative feed that community contributors want to improve or fix — that’s an interesting and useful middle ground.

Aaron: I think there’s compelling need for it.

Drew: Yes. So an important fundamental decision of the Transitland architecture is an assumption that the source of truth always resides on an agency-run web-server, so they retain ownership and management of the feed file. That file ideally lives in a nice stable URL. When it’s brought into the Transitland Datastore, we’re doing a certain amount of automated work to merge that together with other feeds. That’s what the Onestop ID scheme is, and so it’s a bit more about building an overlay on top of that feed. We also have some tools that can be used to fix route geometries and stop locations that are very much out of alignment and can’t snap together. It’s about adjusting the data in this overlay, preserving IDs to match it back up with its original and authoritative source, but making it so that within the Transitland Datastore it can blend well with the other feed files.
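
The overlay idea can be sketched as a set of corrections keyed by the feed's original IDs and applied on top of the imported records. This is a simplified illustration under assumptions, not the Datastore's data model; the record fields and the function apply_overlay are hypothetical.

```python
def apply_overlay(authoritative, overlay):
    """Return stop records with overlay corrections applied.

    Both arguments map the feed's original stop ID to a dict of fields.
    The authoritative records are not modified, and overlay entries for
    IDs that are not in the feed are ignored.
    """
    merged = {}
    for stop_id, record in authoritative.items():
        # Overlay values win, but only for the fields they mention.
        merged[stop_id] = {**record, **overlay.get(stop_id, {})}
    return merged


agency_stops = {
    "S1": {"name": "Main St", "lat": 37.0000, "lon": -122.0000},
    "S2": {"name": "Oak Ave", "lat": 37.0100, "lon": -122.0100},
}
corrections = {"S1": {"lat": 37.0002}}  # nudge a misplaced stop
blended = apply_overlay(agency_stops, corrections)
```

Because the original stop ID is preserved as the key, any correction can be traced back to the record in the agency's own file, which remains the source of truth.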

We’re reaching out to agencies when we see errors or places where we think they could improve their feed. You and I and others have talked about ways to scale that or automate that. I think that second path of the process (how do we take an improvement made within the Transitland Datastore and make it useful to the agency staff in some way that can scale?) is definitely an open question. We’re deliberately taking that question slowly because we want to first be sure that the overlay of data in Transitland is actually useful and valuable before we try to engage agency staff more often.

Getting involved and the future of Transitland

Aaron: Right, there’s a lot of fun stuff to think about there. So, how does one get involved? Let’s say one wants to use Transitland — should people depend upon the hosted version of Transitland or use the source code to install their own instance? What’s the best way of approaching the project?

Drew: It depends a bit on the type of user. For agency staff and local transit advocacy groups I’d recommend looking at the Transitland Feed Registry to see if their favorite agency is already represented. If not, there’s a pretty simple way to add it to that list.

If the feed is already in the system I’d recommend to planners or transportation domain experts that they try out the Transitland Playground. It’s a very simple user interface for quickly browsing stops and routes and information like that, for someone who is unfamiliar with the acronym “API.”

For developers, the API is the place to start. We have some documentation on the range of queries there.

And to your question asking if the hosted API is the one to use: yes. Transitland could be called a “web property” in the same way that OpenStreetMap is the central place to query and download street map data. Users wouldn’t likely run their own mirror unless they had an edge use-case.

For the sake of keeping the system sustainable, we impose some rate limits in the Datastore API. If users find themselves hitting that limit we’re glad to create custom keys and increase the limit. Right now it’s a manual process but we’ll be automating that because we really do want people to make free use of the hosted API. Mapzen has absolutely no plans to charge for that API use now or in the future. But there are needs that certain users have, such as bulk downloads, and that’s functionality we only have partially covered. I bet there will be situations where we will make it a bit easier to extract a certain part of Transitland, especially for people who want to run their own routing engine or do an analysis for a metro area.

Aaron: Is there a plan for Transitland to have a board and be a kind of organization like OpenStreetMap?

Drew: We think of Transitland as a project sponsored, but not owned, by Mapzen. We’ve always made sure to brand it separately from Mapzen. It has different terms of use than the hosted API, and it’s on a different domain. The right home for Transitland in the long run has been an open question, one that we want to consider with involvement from users and contributors, but not one that we’re going to rush.

Aaron: I think right now Transitland has a very stable home in Mapzen. Taking it out of its nest is obviously not a decision to be taken lightly at all.

Drew: Yeah. We want to publicly make it clear that it’s sponsored by Mapzen, not owned by Mapzen. We’ll continue to, in parallel, develop the platform using Mapzen’s resources and increase the number of other individuals and organizations contributing to the platform and its data coverage.

Aaron: I think it’s good to be patient there, but you guys are doing an awesome job of striking the right chords, sending the right signals.

Okay, did we miss anything important? Would you care to talk generally about where this project is headed and the future of transit data directories and API’s?

Drew: Off the top of my head I think it’s all about expanding the coverage geographically and the depth of functionality; the real advantage of a platform like this is expanding in one direction instantly scales in the other. That is, adding a new feed gets all the same functionality as the existing feeds, and adding one new type of query applies to all feeds.

Aaron: Thank you so much for your time, Drew, and for your work on this project.

Drew: Sure, my pleasure. Transitland is an exciting project to work on with many talented colleagues at Mapzen and many outside partners like Trillium. It’s always fun to get a chance to talk about our work.

Aaron is the founding principal of Trillium Solutions, Inc. He brings experience that includes 12 years of web-development with 8 years in public transportation, with knowledge of fixed-route transportation, paratransit, rural transportation, and active transportation modes. Aaron is a recognized expert in developing data standards, web-application design, digital communications, and online marketing strategy. He originally developed Trillium’s GTFS Manager, and has played a key role in the development of the GTFS data specification since 2007.