# gtfs-utils

Utilities to process GTFS data sets.
- ✅ supports `frequencies.txt`
- ✅ works in the browser
- ✅ fully asynchronous/streaming
## Design goals
### streaming/iterative on sorted data
As public transportation systems hopefully become more integrated over time, GTFS datasets will often be multiple gigabytes in size. GTFS processing should also work in memory-constrained environments, such as a Raspberry Pi or a FaaS platform.
Whenever possible, `gtfs-utils` tools read as little data into memory as possible. For this, the individual files in a GTFS dataset need to be sorted in a way that allows iterative processing.
Read more in the [performance section](#performance).
### data-source-agnostic
`gtfs-utils` does not make assumptions about where you read the GTFS data from. Although it has a built-in tool to read CSV from files on disk, anything is possible: `.zip` archives, HTTP requests, in-memory buffers, dat/IPFS, etc.
There are too many half-done, slightly opinionated GTFS processing tools out there, so `gtfs-utils` tries to be as universal as possible.
### correct
Aside from new features of the ever-evolving GTFS spec that change the expected behavior of old ones (and bugs, of course), `gtfs-utils` tries to follow the spec closely. For example, when computing the absolute timestamp/instant of an arrival at a stop, it will always take `stop_timezone` (or the user-supplied timezone) into account, because `stop_times.txt` uses "wall clock time".
## Installing

```shell
npm install gtfs-utils
```
## Usage
### sorted GTFS files
`gtfs-utils` assumes that the files in your GTFS dataset are sorted in a particular way; this allows it to compute some data aggregations more memory-efficiently, which means that you can use it to process very large datasets. For example, if `trips.txt` and `stop_times.txt` are both sorted by `trip_id`, `computeStopovers()` can read each file incrementally, processing only the rows for one `trip_id` at a time, as illustrated by the sketch below.
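As a rough illustration (this is not `gtfs-utils`' actual internals; `groupByTripId` is a hypothetical helper): once a file is sorted by `trip_id`, all rows of a trip are consecutive, so a grouping pass never needs to buffer more than one trip's rows:

```js
// Group a stop_times.txt row stream (sorted by trip_id) by trip_id.
// Because all rows of a trip are consecutive, a group ends exactly
// where the next trip_id begins; no lookahead or full-file buffering.
const groupByTripId = async function* (rows) {
	let tripId = null
	let group = []
	for await (const row of rows) {
		if (row.trip_id !== tripId) {
			if (group.length > 0) yield group
			tripId = row.trip_id
			group = []
		}
		group.push(row)
	}
	if (group.length > 0) yield group
}
```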
[Miller](https://github.com/johnkerl/miller) and [sponge](https://joeyh.name/code/moreutils/) work very well for sorting the files:
```shell
mlr --csv sort -f agency_id agency.txt | sponge agency.txt
mlr --csv sort -f parent_station -nr location_type stops.txt | sponge stops.txt
mlr --csv sort -f route_id routes.txt | sponge routes.txt
mlr --csv sort -f trip_id trips.txt | sponge trips.txt
mlr --csv sort -f trip_id -n stop_sequence stop_times.txt | sponge stop_times.txt
mlr --csv sort -f service_id calendar.txt | sponge calendar.txt
mlr --csv sort -f service_id,date calendar_dates.txt | sponge calendar_dates.txt
mlr --csv sort -f trip_id,start_time frequencies.txt | sponge frequencies.txt
```
There's also a `sort.sh` script included in the npm package, which executes the commands above.
Note: For read-only sources (like HTTP requests), sorting the files beforehand is not an option. You can solve this by spawning `mlr` and piping data through it, as sketched below.
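A minimal sketch of that approach (assuming `mlr` is installed and in `$PATH`; `fetchSorted` is a hypothetical helper, and you would still parse the sorted CSV afterwards):

```js
const {spawn} = require('child_process')
const https = require('https')

// Fetch a GTFS file via HTTP and sort it on the fly by piping it
// through Miller; returns a readable stream of the sorted CSV.
const fetchSorted = (url, sortArgs) => {
	const mlr = spawn('mlr', ['--csv', 'sort', ...sortArgs])
	https.get(url, (res) => {
		res.pipe(mlr.stdin)
	})
	return mlr.stdout
}

// e.g. stop_times.txt, sorted by trip_id & stop_sequence:
const sorted = fetchSorted('https://example.org/gtfs/stop_times.txt', [
	'-f', 'trip_id', '-n', 'stop_sequence',
])
```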
Note: With a bit of extra code, you can also use `gtfs-utils` with a `.zip` archive or with a remote feed; see the sketch below for one way to read files from an archive.
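For example, here is one way (a sketch using the `yauzl` package; `readFileFromZip` is a hypothetical helper) to obtain a readable stream for a single file inside a GTFS `.zip` archive; parsing the CSV stream is then up to you:

```js
const yauzl = require('yauzl')

// Open a .zip archive and resolve with a readable stream for one file
// inside it. With lazyEntries, yauzl only reads entries on demand, as
// readEntry() is called, so the whole archive is never buffered.
const readFileFromZip = (archivePath, fileName) => {
	return new Promise((resolve, reject) => {
		yauzl.open(archivePath, {lazyEntries: true}, (err, zipfile) => {
			if (err) return reject(err)
			zipfile.on('entry', (entry) => {
				if (entry.fileName !== fileName) return zipfile.readEntry()
				zipfile.openReadStream(entry, (err, stream) => {
					if (err) reject(err)
					else resolve(stream)
				})
			})
			zipfile.on('end', () => {
				reject(new Error(fileName + ' not found in archive'))
			})
			zipfile.on('error', reject)
			zipfile.readEntry()
		})
	})
}
```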
### basic example
Given our sample GTFS dataset, we'll answer the following question: On a specific day, which vehicles of which lines stop at a specific station?
We define a function `readFile` that reads our GTFS data into a readable stream/async iterable. In this case, we'll read CSV files from disk using the built-in `readCsv` helper:
```js
const readCsv = require('gtfs-utils/read-csv')

const readFile = (file) => {
	return readCsv(require.resolve('sample-gtfs-feed/gtfs/' + file + '.txt'))
}
```
`computeStopovers()` will read `calendar.txt`, `calendar_dates.txt`, `trips.txt`, `stop_times.txt` & `frequencies.txt`, and return all stopovers of all trips across the full time frame of the dataset.
It is an async generator function, so it returns an async iterable, which we can iterate over using `for await`.
In the following example, we're going to print all stopovers at `airport` on the 15th of May 2019:
```js
const {DateTime} = require('luxon')
const computeStopovers = require('gtfs-utils/compute-stopovers')

const day = '2019-05-15'
const isOnDay = (t) => {
	const iso = DateTime.fromMillis(t * 1000, {zone: 'Europe/Berlin'}).toISO()
	return iso.slice(0, day.length) === day
}

const stopovers = computeStopovers(readFile, 'Europe/Berlin')
for await (const stopover of stopovers) {
	if (stopover.stop_id !== 'airport') continue
	if (!isOnDay(stopover.arrival)) continue
	console.log(stopover)
}
```
```js
{
	stop_id: 'airport',
	trip_id: 'a-downtown-all-day',
	service_id: 'all-day',
	route_id: 'A',
	start_of_trip: 1557871200,
	arrival: 1557926580,
	departure: 1557926640,
}
{
	stop_id: 'airport',
	trip_id: 'a-outbound-all-day',
	service_id: 'all-day',
	route_id: 'A',
	start_of_trip: 1557871200,
	arrival: 1557933900,
	departure: 1557933960,
}
// …
{
	stop_id: 'airport',
	trip_id: 'c-downtown-all-day',
	service_id: 'all-day',
	route_id: 'C',
	start_of_trip: 1557871200,
	arrival: 1557926820,
	departure: 1557926880,
}
```
For more examples, check the API documentation.
## Performance
By default, `gtfs-utils` verifies that the input files are sorted correctly. You can disable this check to improve performance slightly, by running your program with the `CHECK_GTFS_SORTING=false` environment variable.
`gtfs-utils` should be fast enough for small to medium-sized GTFS datasets. It won't be as fast as other GTFS tools though, because it

- uses async iteration extensively for memory efficiency and ease of use, which currently has significant performance penalties in V8,
- is written in JavaScript, so it cannot optimise the memory layout of its data structures,
- parses all columns of each file it needs information from into JavaScript objects.
On my M1 MacBook Air, with the 180MB `2022-02-03` HVV GTFS dataset (17k `stops.txt` rows, 91k `trips.txt` rows, 2m `stop_times.txt` rows, ~500m stopovers), `computeStopovers` computes 18k stopovers per second and finishes in several hours.
Note: If you want a faster way to query and transform GTFS datasets, I suggest using `gtfs-via-postgres` to leverage PostgreSQL's query optimizer. Once you have imported the data, it is usually orders of magnitude faster.
## Related
- gtfstidy – Go command line tool for validating and tidying GTFS feeds.
- gtfs-stream – Streaming GTFS and GTFS-RT parser for Node.js.
- mapzen-gtfs – Python library for reading and writing GTFS feeds.
- gtfspy – Public transport network analysis using Python
- extract-gtfs-shapes – Command-line tool to extract shapes from a GTFS dataset.
- extract-gtfs-pathways – Command-line tool to extract pathways from a GTFS dataset.
- Awesome GTFS: Frameworks and Libraries – A collection of libraries for working with GTFS.
## Contributing
If you have a question or have difficulties using `gtfs-utils`, please double-check your code and setup first. If you think you have found a bug or want to propose a feature, refer to the issues page.