gtfs-utils

Utilities to process GTFS data sets.


Design goals

streaming/iterative on sorted data

As public transportation systems hopefully become more integrated over time, GTFS datasets will often be several gigabytes in size. GTFS processing should also work in memory-constrained environments, such as a Raspberry Pi or a FaaS platform.

Whenever possible, all gtfs-utils tools read as little data into memory as possible. For this to work, the individual files in a GTFS dataset need to be sorted in a way that allows iterative processing.

Read more in the performance section.

data-source-agnostic

gtfs-utils does not make assumptions about where you read the GTFS data from. Although it has a built-in tool to read CSV from files on disk, anything is possible: .zip archives, HTTP requests, in-memory buffers, dat/IPFS, etc.

There are too many half-done, slightly opinionated GTFS processing tools out there, so gtfs-utils tries to be as universal as possible.

correct

Aside from bugs, and from new features of the ever-expanding GTFS spec that change the expected behavior of existing ones, gtfs-utils tries to follow the spec closely.

For example, when computing the absolute timestamp/instant of an arrival at a stop, it will always take stop_timezone (or the user-supplied timezone) into account, because stop_times.txt contains "wall clock times".
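
To illustrate the rule: GTFS measures stop_times relative to "noon minus 12 hours" on the service day, in the relevant timezone. The following is a minimal sketch of such a computation using luxon; toTimestamp is a hypothetical helper, not part of the gtfs-utils API.

const {DateTime} = require('luxon')

// GTFS measures stop_times from "noon minus 12 hours" on the service day,
// in the stop's (or agency's) timezone; this way, times like 25:30:00 are
// valid, and days with DST transitions are handled correctly.
const toTimestamp = (serviceDay, gtfsTime, timezone) => {
	const [hours, minutes, seconds] = gtfsTime.split(':').map(Number)
	return DateTime
	.fromISO(serviceDay, {zone: timezone}) // midnight of the service day
	.set({hour: 12}) // noon
	.minus({hours: 12}) // "noon minus 12 hours"
	.plus({hours, minutes, seconds})
	.toSeconds()
}

toTimestamp('2019-05-15', '25:30:00', 'Europe/Berlin') // 1557963000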

Installing

npm install gtfs-utils

Usage

API documentation

sorted GTFS files

gtfs-utils assumes that the files in your GTFS dataset are sorted in a particular way. This allows it to compute some data aggregations more memory-efficiently, which means that you can use it to process very large datasets. For example, if trips.txt and stop_times.txt are both sorted by trip_id, computeStopovers() can read both files incrementally, processing only the rows of one trip_id at a time.
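
As an illustration of why sorting matters, here is a minimal sketch (not gtfs-utils' actual code) of grouping rows by trip_id while keeping only one group in memory:

// Group consecutive rows of an async-iterable source by trip_id.
// This only yields correct groups if the rows are sorted by trip_id,
// but in exchange, it keeps just one trip's rows in memory at a time.
async function* groupByTripId (rows) {
	let currentTripId = null
	let group = []
	for await (const row of rows) {
		if (row.trip_id !== currentTripId && group.length > 0) {
			yield group
			group = []
		}
		currentTripId = row.trip_id
		group.push(row)
	}
	if (group.length > 0) yield group
}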

Miller and sponge work very well for sorting the files:

mlr --csv sort -f agency_id agency.txt | sponge agency.txt
mlr --csv sort -f parent_station -nr location_type stops.txt | sponge stops.txt
mlr --csv sort -f route_id routes.txt | sponge routes.txt
mlr --csv sort -f trip_id trips.txt | sponge trips.txt
mlr --csv sort -f trip_id -n stop_sequence stop_times.txt | sponge stop_times.txt
mlr --csv sort -f service_id calendar.txt | sponge calendar.txt
mlr --csv sort -f service_id,date calendar_dates.txt | sponge calendar_dates.txt
mlr --csv sort -f trip_id,start_time frequencies.txt | sponge frequencies.txt

There's also a sort.sh script included in the npm package, which executes the commands above.

Note: For read-only sources (like HTTP requests), sorting the files is not an option. You can solve this by spawning mlr and piping data through it.
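
For example, a sketch of this approach; sortByField is a hypothetical helper, and csv-parse is used here (instead of gtfs-utils' built-in read-csv) to parse the sorted output into row objects:

const {spawn} = require('child_process')
const {parse} = require('csv-parse')

// Pipe a raw CSV stream (e.g. an HTTP response body) through Miller (mlr),
// sorting it by the given column, and parse the result into row objects.
const sortByField = (input, field) => {
	const mlr = spawn('mlr', ['--csv', 'sort', '-f', field])
	input.pipe(mlr.stdin)
	return mlr.stdout.pipe(parse({columns: true}))
}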

Note: With a bit of extra code, you can also use gtfs-utils with a .zip archive or with a remote feed.
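
For example, a rough sketch of reading from a .zip archive, using the unzipper and csv-parse packages (the file name gtfs.zip is assumed):

const unzipper = require('unzipper')
const {parse} = require('csv-parse')

// inside an async function:
// Open the archive's central directory once, then read individual
// files out of it on demand.
const archive = await unzipper.Open.file('gtfs.zip')
const readFile = (name) => {
	const file = archive.files.find(f => f.path === name + '.txt')
	if (!file) throw new Error(name + '.txt not found in the archive')
	return file.stream().pipe(parse({columns: true}))
}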

basic example

Given our sample GTFS dataset, we'll answer the following question: On a specific day, which vehicles of which lines stop at a specific station?

We define a function readFile that, given a file's name, returns a readable stream/async iterable of its rows. In this case we'll read CSV files from disk using the built-in readCsv helper:

const readCsv = require('gtfs-utils/read-csv')

const readFile = (file) => {
	return readCsv(require.resolve('sample-gtfs-feed/gtfs/' + file + '.txt'))
}

computeStopovers() will read calendar.txt, calendar_dates.txt, trips.txt, stop_times.txt & frequencies.txt, and compute all stopovers of all trips across the full time frame of the dataset.

computeStopovers() is an async generator function, so calling it returns an async iterable, which we can consume using for await.

In the following example, we're going to print all stopovers at airport on the 15th of May 2019:

const {DateTime} = require('luxon')
const computeStopovers = require('gtfs-utils/compute-stopovers')

const day = '2019-05-15'
const isOnDay = (t) => {
	const iso = DateTime.fromMillis(t * 1000, {zone: 'Europe/Berlin'}).toISO()
	return iso.slice(0, day.length) === day
}

const stopovers = computeStopovers(readFile, 'Europe/Berlin')
for await (const stopover of stopovers) {
	if (stopover.stop_id !== 'airport') continue
	if (!isOnDay(stopover.arrival)) continue
	console.log(stopover)
}

This will print:

{
	stop_id: 'airport',
	trip_id: 'a-downtown-all-day',
	service_id: 'all-day',
	route_id: 'A',
	start_of_trip: 1557871200,
	arrival: 1557926580,
	departure: 1557926640,
}
{
	stop_id: 'airport',
	trip_id: 'a-outbound-all-day',
	service_id: 'all-day',
	route_id: 'A',
	start_of_trip: 1557871200,
	arrival: 1557933900,
	departure: 1557933960,
}
// …
{
	stop_id: 'airport',
	trip_id: 'c-downtown-all-day',
	service_id: 'all-day',
	route_id: 'C',
	start_of_trip: 1557871200,
	arrival: 1557926820,
	departure: 1557926880,
}

For more examples, check the API documentation.

Performance

By default, gtfs-utils verifies that the input files are sorted correctly. You can disable this check, which slightly improves performance, by setting the CHECK_GTFS_SORTING=false environment variable.
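
For example (assuming your script is called compute-stopovers.js):

CHECK_GTFS_SORTING=false node compute-stopovers.js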

gtfs-utils should be fast enough for small to medium-sized GTFS datasets. It won't be as fast as other GTFS tools though, because it trades raw speed for low memory consumption and a streaming, data-source-agnostic approach (see the design goals).

On my M1 MacBook Air, with the 180MB 2022-02-03 HVV GTFS dataset (17k stops.txt rows, 91k trips.txt rows, 2m stop_times.txt rows, ~500m stopovers), computeStopovers computes 18k stopovers per second and finishes within several hours.

Note: If you want a faster way to query and transform GTFS datasets, I suggest using gtfs-via-postgres to leverage PostgreSQL's query optimizer. Once you have imported the data, queries are usually orders of magnitude faster.

Related

Contributing

If you have a question or have difficulties using gtfs-utils, please double-check your code and setup first. If you think you have found a bug or want to propose a feature, refer to the issues page.