GTFS: A Standard That Needs a New Vision
Reflecting on 20 years of transit data evolution, GTFS and its surrounding specifications have taken some major steps forward in open transportation data standards. But have we stopped innovating and pushing the standard?
When GTFS came out in 2006, it fit the needs of agencies: a compromise between what was technologically possible and a workforce that had access to a limited set of tools. The initial format was meant only to solve the routing problem agencies faced on modern routing engines like the newly revealed Google Maps.
The technology landscape had barely heard of cloud computing, and it wouldn't be until around 2011 that the term became synonymous with the future of software. REST was barely standardized, PostGIS had seen its first stable release just a year earlier, and "SaaS" wasn't even a mainstream term yet. The specification was a solution born of necessity and convenience in the absence of better tools. There were no companies offering GTFS exports, and those that produced schedules at all produced ones that were imperfect for SaaS companies to integrate with.
What GTFS Got Right
I personally built GTFS by hand in 2012 using ArcMap, Excel, and QGIS, since ESRI did not include line explosion with our standard license. Before this, PART had never documented all of the stop locations and routing for any of our routes. And we were one of the lucky ones, having expertise in house.
At the time, that lowest common denominator was exactly what the industry needed and what our customers were asking for: a digital way for riders to get the most up-to-date schedule.
On the other side, we as agencies had to fully standardize our service, something smaller agencies did not necessarily do in the past, where routing could vary depending on the driver performing the service.
20 Years Later: What’s Changed
Since then we have seen global adoption of this standard, created real-time information standards that latch onto the existing fixed-route schedules, and inspired others to build standardized static files to go along with it. But until we break free from the static-file standard, we will not be able to move forward.
Five Reasons GTFS Isn’t Enough Anymore
1. It’s Not Built for Querying Over Time
Transit schedules don't change only on a regular cadence. Sometimes humans get something wrong, a tweak needs to be made, or the agency is responding to a dramatic event. New trips are added. Stops move. Routes are realigned. And yet GTFS, at its core, offers no built-in way to track, append, amend, or query those changes over time.
GTFS was built for routing engines to completely replace the schedule database. Dropping all of the rows and pulling in new ones makes sense for a full update, but it is overkill and counterintuitive when you want to create a living record. There's no version history, no change log, no temporal context beyond the start and end dates baked into the feed, and those dates can overlap, returning multiple stops, routes, and trips, confusing the user and causing improper calculations.
If you're trying to answer questions like:
- "When did we move this stop?"
- "What was the span of service on this route last fall?"
- "How many times has this schedule been updated in the past year?"
You're left manually saving feeds and adjusting their start and end dates, breaking them out by schedule change. If you try to reconstruct what actually happened, it is going to be difficult, and there is no guidance on how to do it.
That’s a problem. Modern agencies need to understand how their service has evolved, not just where it is today. They need to analyze trends, report on service delivery, and audit past decisions. And GTFS gives them no real tools to do that.
GTFS doesn’t remember. And that’s becoming one of its biggest limitations.
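To make the gap concrete, here is a minimal sketch of what an effective-dated stop record could look like, with a query that answers "when did we move this stop?" The `StopVersion` structure and field names are hypothetical, not part of GTFS; the point is that each row carries a validity window instead of being overwritten on every schedule change.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class StopVersion:
    stop_id: str               # canonical ID, persistent across feeds
    lat: float
    lon: float
    valid_from: date
    valid_to: Optional[date]   # None means the version is still current

def relocation_dates(history: list[StopVersion]) -> list[date]:
    """Return the dates on which the stop's coordinates changed."""
    ordered = sorted(history, key=lambda v: v.valid_from)
    moves = []
    for prev, cur in zip(ordered, ordered[1:]):
        if (prev.lat, prev.lon) != (cur.lat, cur.lon):
            moves.append(cur.valid_from)
    return moves

# Illustrative history: the stop was shifted once, in August 2023.
history = [
    StopVersion("stop_42", 36.0700, -79.7900, date(2021, 1, 4), date(2023, 8, 13)),
    StopVersion("stop_42", 36.0700, -79.8000, date(2023, 8, 14), None),
]
print(relocation_dates(history))  # [datetime.date(2023, 8, 14)]
```

With records shaped like this, "what was the span of service last fall?" becomes a filter on validity windows instead of an archaeology project across zipped feeds.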
2. No Canonical ID System
Assigning IDs to routes, stops, trips, and shapes is great for joining data across schedule tables, but those IDs are not enforced across schedule changes. Some IDs, like route_id, should be standardized across schedule changes so service can be tracked over time. There's no required global or persistent identifier schema. One schedule update can completely change the IDs for routes or stops, even if the physical service hasn't changed at all. Without that structure and those rules, it is impossible to get a realistic answer to:
- “How many times has this stop been relocated?”
- “Has this route changed its span of service over the past two years?”
- “What did this shape look like in Q2 before the detour?”
Instead, agencies and developers often resort to hacks: matching on stop names, geographic coordinates, or route long names. Personally I often rely on route_long_name rather than route_id to keep things in order when running queries over time. A renamed route, a misspelled stop, or an overlapping set of coordinates can break your entire system.
The result is a brittle ecosystem where tools are constantly guessing about identity instead of trusting it.
If GTFS wants to grow up, it needs to act like a database — and that means supporting canonical, persistent IDs across time.
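Here is what that guessing looks like in practice: a sketch of matching stops across two feeds by proximity, because the vendor regenerated every stop_id at the schedule change. All IDs, names, and coordinates below are illustrative.

```python
import math

# Two feeds from the same agency, one schedule change apart. The stop_ids
# were regenerated even though the physical stops never moved.
old_feed = [("1001", "Elm St at Main", 36.0726, -79.7920)]
new_feed = [("S-88", "Elm St & Main", 36.0726, -79.7919)]

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in meters."""
    r = 6371000
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def match_stops(old, new, tolerance_m=15):
    """Guess identity by proximity because the IDs can't be trusted."""
    pairs = []
    for oid, _, olat, olon in old:
        for nid, _, nlat, nlon in new:
            if haversine_m(olat, olon, nlat, nlon) <= tolerance_m:
                pairs.append((oid, nid))
    return pairs

print(match_stops(old_feed, new_feed))  # [('1001', 'S-88')]
```

Note the fragility: if the new feed nudged the coordinates past the tolerance, or two stops sit across the street from each other, this heuristic silently produces the wrong lineage. Persistent canonical IDs would make this code unnecessary.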
3. It Can’t Handle Overlapping Schedules
Typically, a GTFS feed you create contains either two schedules, the current and the upcoming, so that the new schedule is published before it goes into effect, or a single schedule carrying an amendment that takes effect immediately.
While these schedules have start and end dates, those dates do nothing to help agencies use GTFS as anything other than a routing dataset. There is no mechanism to:
- Prevent overlapping date ranges across multiple feeds,
- Handle conflicting trips for the same route,
- Or designate which version of a feed is “active,” “archived,” or “in testing.”
Instead, you’re left with a patchwork of assumptions. Agencies often overwrite old feeds with new ones. Vendors might publish future GTFS datasets in completely separate folders, or embed preview data inside a current feed. There’s no official way to label a schedule as "proposed," "active," or "deprecated."
This leads to:
- Version confusion between stakeholders,
- Routing errors when multiple overlapping feeds are ingested,
- And no reliable way to stage or QA a future schedule rollout.
GTFS assumes there's one truth at a time. But transit agencies live in a world where multiple versions of “truth” need to coexist.
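Even basic conflict detection has to be built by hand. A minimal sketch, assuming the calendar.txt date ranges have already been extracted from each feed (the filenames and dates here are made up):

```python
from datetime import date

# Service windows pulled from two feeds the agency published for the same
# service; GTFS itself never flags the two-week overlap between them.
feeds = {
    "current.zip":  (date(2025, 1, 6), date(2025, 6, 15)),
    "upcoming.zip": (date(2025, 6, 1), date(2025, 12, 31)),
}

def overlapping_feeds(feeds):
    """Return pairs of feeds whose service date ranges overlap."""
    items = list(feeds.items())
    conflicts = []
    for i, (name_a, (start_a, end_a)) in enumerate(items):
        for name_b, (start_b, end_b) in items[i + 1:]:
            # Two ranges overlap when each starts before the other ends.
            if start_a <= end_b and start_b <= end_a:
                conflicts.append((name_a, name_b))
    return conflicts

print(overlapping_feeds(feeds))  # [('current.zip', 'upcoming.zip')]
```

Whether that overlap is an error or an intentional staged rollout is exactly what the spec gives you no way to say: there is no "proposed," "active," or "deprecated" label to disambiguate it.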
4. It’s Not a Database
It's just a bunch of zipped CSVs, designed to be "simple" enough to create with accessible tools. That's fine if you only ever need to answer the question: "How do I get from A to B today?" But so many applications have expanded beyond this original goal, and they should be addressed. Agencies are looking to use this data to:
- Analyze trends in service delivery,
- Track route performance across years,
- Validate service changes before they go live,
- Support federal reporting and internal accountability.
All of that requires a database. A real one. With persistent IDs, relational joins, temporal queries, and audit trails.
The irony is that most agencies treat GTFS as a database — they just don’t have the tools or standards to make it act like one.
It's time to change that. Instead of forcing a flat-file format to impersonate a database, we need an enforceable schema with migrations that can update, append, overwrite, or amend anything in the schedule, rather than going scorched earth on the database tables.
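A minimal sketch of the direction, using SQLite: two of the core GTFS tables with enforced foreign keys, plus a hypothetical feed_version column so a schedule change appends a new version instead of wiping the tables. The table shapes and version labels are assumptions, not part of any spec.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE routes (
    route_id        TEXT,
    feed_version    TEXT,   -- hypothetical column: which schedule change
    route_long_name TEXT,
    PRIMARY KEY (route_id, feed_version)
);
CREATE TABLE trips (
    trip_id      TEXT,
    route_id     TEXT,
    feed_version TEXT,
    PRIMARY KEY (trip_id, feed_version),
    FOREIGN KEY (route_id, feed_version) REFERENCES routes
);
""")
conn.executemany("INSERT INTO routes VALUES (?, ?, ?)", [
    ("r10", "2025-01", "Crosstown"),
    ("r10", "2025-06", "Crosstown via Depot"),  # amended, not overwritten
])
conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", [
    ("t1", "r10", "2025-01"),
    ("t1", "r10", "2025-06"),
])

# The temporal question flat files can't answer: how has this route changed?
rows = conn.execute("""
    SELECT feed_version, route_long_name FROM routes
    WHERE route_id = 'r10' ORDER BY feed_version
""").fetchall()
print(rows)  # [('2025-01', 'Crosstown'), ('2025-06', 'Crosstown via Depot')]
```

The same relational joins and constraints that routing engines already rebuild privately would, in a schema like this, be shared, auditable, and queryable over time.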
5. Passing Around Static Files Is Antiquated
I think the most apt comparison for how far we have come, and how far behind the times we remain, is the mighty shapefile. A shapefile is just a bunch of static files that together hold vectorized locations and the attributes for those features. The standard was meant to serve as a means of sharing GIS files between systems, and it is interoperable between QGIS and ArcMap.
But the standard has hard limits on speed and size. It can be emailed back and forth, but the slightest error, or someone forgetting to zip up all of the files, can break the whole system.
Instead, the GIS world has moved forward with OGC APIs that standardize how an API is queried by time and by its database table columns.
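For contrast with emailing zip files, this is the shape of an OGC API - Features request. The server URL, collection name, and bounding box are placeholders, but the `bbox` and `datetime` query parameters are defined by the OGC API - Features specification:

```python
from urllib.parse import urlencode

# Hypothetical endpoint serving transit stops as an OGC API - Features
# collection; only the query parameter names come from the spec.
base = "https://example.org/ogcapi/collections/stops/items"
params = {
    "bbox": "-79.85,36.03,-79.75,36.12",                      # area of interest
    "datetime": "2024-01-01T00:00:00Z/2024-12-31T23:59:59Z",  # time interval
    "limit": 100,
}
url = f"{base}?{urlencode(params)}"
print(url)
```

One request, filtered by place and time, against a live source of truth; no unzipping, no guessing which copy of the file is current.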
Data Standards Are Meant to Guide Agencies on How to Structure Their Data
We have achieved saturation across the world with the GTFS standard; the FTA even requires it for fixed-route service for reporting purposes. That type of saturation is great, but we should not settle for GTFS being used only by routing engines. It should be an interconnected way to document service over time.
When GTFS was first introduced, I feel the promise was a standard that could be used across all of our systems, but we have stopped thinking beyond static files and ad hoc updates.