Nice!
> However, for reasons unknown to me, they wrap these neatly separated rows with brackets ([ and ]) and add a comma to each line
Well, the reason (misguided or not) is as you say, I imagine:
> so it’s a valid, JSON array containing 100+ million items.
> We are not going to attempt to load a this massive array. Instead, we’re running this command:
That's one approach - I'm always a little wary of treating a rich format like JSON as <something>-delimited text - I'd be curious whether using jq in streaming mode differs much in run-time. I believe this snippet, the core of which we lifted from Stack Overflow or somewhere, does the same thing: split a valid JSON array into ndjson (with tweaks to hopefully generate similar splits). Note: on macOS, zcat might not be gunzip, hence the change.

It's too bad there aren't more streaming JSON parsers like oboe.js[1]. It would be nice if parsing libraries always supplied an event-based approach like this in addition to parsers that build up the entire data structure in memory.
[1]: https://github.com/jimhigson/oboe.js
EDIT: looking around a bit, I found json-stream ( https://github.com/dgraham/json-stream ) for Ruby.
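The snippet itself isn't shown in the thread, but a sketch of what such a jq invocation could look like (file names are hypothetical) would be something along these lines - the first form is simple but parses the whole array into memory, the second uses jq's `--stream` mode so memory stays roughly constant even for a 100M-element array:

```shell
# Simple form: -c emits one compact object per line, but '.[]' still
# requires jq to parse the entire top-level array into memory first.
gunzip -c items.json.gz | jq -c '.[]' > items.ndjson

# Streaming form: --stream parses incrementally; truncate_stream drops
# the top-level array index so fromstream rebuilds each element on its own.
gunzip -c items.json.gz \
  | jq -cn --stream 'fromstream(1 | truncate_stream(inputs))' \
  > items.ndjson
```

Whether the streaming form is actually faster is worth benchmarking - it trades memory for per-event parsing overhead.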
I recently looked a little at streaming large JSON files in Ruby - but ran into some problems trying to stream from and to gzipped files by layering Ruby IO objects. In theory it should just be a matter of stacking streams, but in practice it was convoluted, a little brittle, and quite slow.
There are even more alternatives listed at the bottom: https://github.com/dgraham/json-stream?tab=readme-ov-file#al...
Interesting article - I think this is the first time I've seen someone pick Ractors over the Parallel gem. Cool!
I love seeing these quick-and-dirty Ruby scripts used for data processing, filtering, or whatever - this is what the language is good at!
Thanks! This is a near perfect use case for Ractors since we chunked all the files and there’s no need for the file processing function to share any context.
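A minimal sketch of that shape (the chunk paths and per-chunk work are hypothetical stand-ins, not the article's actual code): each chunk gets its own Ractor, each Ractor receives only the path it needs, and the main thread collects the results - no shared mutable state anywhere.

```ruby
# Hypothetical chunk files; counting lines stands in for the real
# per-chunk processing function.
chunk_paths = Dir.glob('chunks/*.ndjson')

ractors = chunk_paths.map do |path|
  # Arguments passed to Ractor.new are moved/copied into the Ractor,
  # so nothing is shared between the workers.
  Ractor.new(path) do |p|
    File.foreach(p).count
  end
end

totals = ractors.map(&:take) # blocks until each Ractor finishes
```

Because the chunks are independent files, this sidesteps the usual Ractor pain point of making shared objects shareable - each worker only ever sees its own path and its own result.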
Hey, cool article, thanks! Might be time to finally dive into DuckDB.
DuckDB is amazing. I've been using it over the last few weeks to analyze data I generate with Datalog/Soufflé, and I was completely blown away by the performance and QoL features. I seriously don't understand how it can be this fast...