Ensure the seamless deployment, monitoring, and optimization of AI models in production for Sev1Tech LLC.
Requirements
- Proven experience deploying models in production using MLflow, Kubeflow, or cloud platforms (AWS SageMaker, Azure ML).
- Hands-on experience with observability tools like Prometheus, Grafana, or Datadog for real-time monitoring.
- Proficiency in Python and SQL; familiarity with JavaScript or Go is a plus.
- Expertise in containerization (Docker, Kubernetes) and CI/CD tools (GitHub Actions, Jenkins).
- Knowledge of time-series databases (e.g., InfluxDB, TimescaleDB) and logging and telemetry frameworks (e.g., ELK Stack, OpenTelemetry).
- Experience with drift detection tools (e.g., Evidently AI, Alibi Detect) and visualization libraries (e.g., Plotly, Seaborn).
- Understanding of model performance metrics (e.g., precision, recall, AUC) and drift detection methods (e.g., KS test, PSI).
Responsibilities
- Deploy and manage machine learning models in production using tools like MLflow, Kubeflow, or AWS SageMaker, ensuring scalability and low latency.
- Build and maintain dashboards in Grafana or Kibana, backed by metrics stores such as Prometheus, to track real-time model health (e.g., accuracy, latency) and historical trends.
- Implement drift detection pipelines using tools like Evidently AI or Alibi Detect to identify shifts in data distributions and trigger alerts or retraining.
- Set up centralized logging with ELK Stack or OpenTelemetry to capture AI inference events, errors, and audit trails for debugging and compliance.
- Develop CI/CD pipelines with GitHub Actions or Jenkins to automate model updates, testing, and deployment.
- Apply secure-by-design principles to protect data pipelines and models, using encryption and access controls, and maintain compliance with regulations and frameworks such as GDPR and the NIST AI RMF.
- Optimize models for production (e.g., via quantization or pruning) and ensure efficient resource usage on cloud platforms like AWS, Azure, or Google Cloud.
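For candidates unfamiliar with the drift checks named above, the following is a minimal Python sketch of a Population Stability Index (PSI) check that could feed an alert or retraining trigger. It is illustrative only and does not describe Sev1Tech's pipeline: the function names `psi` and `needs_retraining`, the bin count, and the 0.25 threshold (a common rule of thumb) are all assumptions.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference sample and a live sample.

    Bin edges are equal-width over the reference range (an assumption;
    quantile bins are another common choice).
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values to edge bins
            counts[idx] += 1
        # Floor each fraction at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e = fractions(expected)
    a = fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def needs_retraining(expected, actual, threshold=0.25):
    """Hypothetical trigger: flag drift when PSI exceeds the assumed threshold."""
    return psi(expected, actual) > threshold


# Example with synthetic data: a reference feature and a shifted copy of it.
reference = [i / 100 for i in range(100)]
shifted = [v + 0.5 for v in reference]
drift_flag = needs_retraining(reference, shifted)
```

In practice a tool such as Evidently AI or Alibi Detect would typically replace the hand-written statistic, and the flag would feed an alerting or retraining workflow.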
Other
- 5+ years of experience in MLOps, DevOps, or software engineering with a focus on AI/ML systems.
- Strong problem-solving and debugging skills for resolving pipeline and monitoring issues.
- Excellent collaboration and communication skills to work with cross-functional teams.
- Attention to detail to ensure that dashboard reporting is accurate and secure.
- Experience with LLM monitoring tools like LangSmith or Helicone for generative AI applications.