Data Engineer
Software Technology Inc. - Dallas, TX
Job Description
We encourage candidates who are able to work on a W2 basis to apply for this position.

Overview:
• Their team supports data movement for several key processes/workflows: cost processing, S3 data, network traffic (VPC flow logs), and CloudTrail data (API activity captured as records).
• Their pipelines collect this data, enrich it with human information, and load it into a unified data store in ClickHouse for reporting and visualization.
• Current project needs are split between net-new development of pipelines and optimization and maintenance of existing ones.
• Pipelines are built in Scala and PySpark; some are being moved to Lambdas (Python-backed) where possible. Net-new work might involve Lambda development.
• Spark clusters handle all the different types of data, including various structured and unstructured data sets.
• Data pipelines run on EMR infrastructure; candidates should understand EMR from the perspective of data distribution, scalability, and performance.
• The majority of pipelines are real-time streaming; cost processing is predominantly batch.
• Candidates should have strong experience not only in building pipelines but also in suggesting performance enhancements for them.
• All code is integrated into their CI/CD pipeline, orchestrated by Jenkins.
• Monitoring is through CloudWatch, with some Ganglia (nice to have).

Must Have:
• Scala
• PySpark
• Data pipeline engineering and optimization
• AWS (specifically Lambdas and EMR)
• SQL

Nice to Have:
• ClickHouse database experience
• Ganglia
Created: 2024-10-21