Etched is building AI chips that are hard-coded for individual model architectures, delivering an order of magnitude more throughput and lower latency than existing solutions like the B200. This enables new product possibilities such as real-time video generation and extremely deep, highly parallel chain-of-thought reasoning agents.
Requirements
- Proficiency in C++ or Rust.
- Understanding of performance-sensitive or distributed systems software, such as Linux internals, accelerator architectures (e.g. GPUs, TPUs), compilers, or high-speed interconnects (e.g. NVLink, InfiniBand).
- Familiarity with PyTorch or JAX.
- Experience porting applications to non-standard accelerators or hardware platforms.
- Experience developing low-latency, high-performance applications using both kernel-level and user-space networking stacks.
- Deep understanding of distributed systems concepts, algorithms, and challenges, including consensus protocols, consistency models, and communication patterns.
- Solid grasp of Transformer architectures, particularly Mixture-of-Experts (MoE).
- Experience building applications with extensive SIMD (Single Instruction, Multiple Data) optimizations on performance-critical paths.
Responsibilities
- Support porting state-of-the-art models to our architecture. Help build programming abstractions and testing capabilities to rapidly iterate on model porting.
- Build, enhance, and scale Sohu’s runtime, including multi-node inference, intra-node execution, state management, and robust error handling.
- Optimize routing and communication layers using Sohu’s collectives.
- Utilize performance profiling and debugging tools to identify bottlenecks and correctness issues.
Other
- We are a fully in-person team based in San Jose and Taipei, and we place great value on engineering skill.
- We do not have boundaries between engineering and research, and we expect all of our technical staff to contribute to both as needed.
- Relocation support is available for those who are moving.