Hey HN, Arroyo is a modern, open-source stream processing engine that lets anyone write complex queries over event streams just by writing SQL: windowing, aggregating, and joining events with sub-second latency.

Today, data processing typically happens in batch data warehouses like BigQuery and Snowflake, even though most of that data arrives as streams. Data teams have to build complex orchestration systems to handle late-arriving data and job failures while trying to minimize latency. Stream processing offers an alternative approach: the query is compiled into a streaming program that updates continuously as new data comes in, delivering low-latency results as soon as the data is available.

I started the Arroyo project after spending the past five years building real-time platforms at Lyft and Splunk. I saw firsthand how hard it is for users to build correct, reliable pipelines on top of existing systems like Flink and Spark Streaming, and how hard those pipelines are for infra teams to operate. That experience convinced me of the need for a new system, one easy enough for any data team to adopt, built on modern foundations and on the lessons of the past decade of research and industry development.

Arroyo works by taking SQL queries and compiling them into an optimized streaming dataflow program: a distributed DAG of computation whose nodes read from sources (like Kafka), perform stateful computations, and eventually write results to sinks. That state is consistently snapshotted using a variation of the Chandy-Lamport checkpointing algorithm, which provides fault tolerance and enables fast rescaling and updates of running pipelines. The entire system is easy to self-host on Kubernetes or Nomad.

See it in action here: https://www.youtube.com/watch?v=X1Nv0gQy9TA or follow the getting started guide ( https://doc.arroyo.dev/getting-started ) to run it locally.
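To give a feel for the SQL interface, here's a sketch of the kind of windowed aggregation you can express. The `events` source, its fields, and the `tumble` window function are illustrative (tumble/hop-style windows are a common streaming-SQL convention), so the exact names may differ from Arroyo's dialect; see the getting started guide linked above for the real syntax.

    -- Count page views per user in 1-minute tumbling windows.
    -- `events` is a hypothetical Kafka-backed source table.
    SELECT
        user_id,
        tumble(interval '1 minute') AS window,
        count(*) AS views
    FROM events
    WHERE event_type = 'page_view'
    GROUP BY user_id, window;

Rather than re-scanning the data in batch, the compiled pipeline emits updated counts for each window as new events arrive.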
Story Published at: June 6, 2023 at 04:31PM