Site Reliability Engineer - Cloud Infrastructure (Big Data) - Dublin

  • Dublin
  • TikTok
TikTok is the leading destination for short-form mobile video. At TikTok, our mission is to inspire creativity and bring joy. TikTok's global headquarters are in Los Angeles and Singapore, and its offices include New York, London, Dublin, Paris, Berlin, Dubai, Jakarta, Seoul, and Tokyo.

Why Join Us
Creation is the core of TikTok's purpose. Our platform is built to help imaginations thrive. This is doubly true of the teams that make TikTok possible. Together, we inspire creativity and bring joy - a mission we all believe in and aim towards achieving every day. To us, every challenge, no matter how difficult, is an opportunity: to learn, to innovate, and to grow as one team. Status quo? Never. Courage? Always. At TikTok, we create together and grow together. That's how we drive impact - for ourselves, our company, and the communities we serve. Join us.

The SRE team is responsible for managing TikTok's entire video infrastructure and applications. Our mission is to ensure all production systems can support our fast-growing worldwide user base while keeping those systems efficient and cost-effective. We manage deployments, system capacity, traffic scheduling, fault tolerance, disaster recovery, emergency response, automation, operations platform development, and more. Our team is diverse, with members in Singapore, the USA, and Australia, and we are now extending to Ireland. We look forward to new talent joining our team and helping TikTok grow together.

Responsibilities:
You will be responsible for the foundational engineering of ByteDance infrastructure products and components, focusing on O&M architecture optimization, research and development of automated O&M platforms, and data-driven, intelligent O&M. Applying software engineering and data-intelligence methodologies to the O&M requirements of these products and components, you will help build a layered, systematic O&M platform that solves the challenge of managing ultra-large-scale clusters. Our goals: provide efficient, low-cost serverless infrastructure for the mid-platform, and become the leading SRE team in the industry.
- Reliability: Ensure the stability and reliability of our core infrastructure, with a keen emphasis on high availability, performance, and capacity management, particularly in the context of Big Data technologies. Establish and enforce O&M standards and SOP processes to maintain system reliability.
- Troubleshooting and Optimization: Proactively troubleshoot technical issues; collaborate with cross-functional teams to develop and implement capacity planning, performance testing, anomaly analysis, and fault diagnosis strategies specific to Big Data systems. Optimize the performance of Hadoop, HDFS, YARN, Apache Flink, and other relevant Big Data components.
- Efficiency: Research and evaluate large-scale system architectures and cutting-edge technologies, focusing on Big Data solutions, to enhance existing systems and processes. Utilize tools and methodologies to streamline data processing, storage, and analysis workflows in Big Data environments.
- Automated O&M: Design and deploy advanced O&M platforms tailored to Big Data infrastructure, incorporating intelligent algorithms and automation techniques to streamline system maintenance tasks. Develop automated monitoring and alerting systems for detecting and addressing issues in real time.
- Cost Optimization: Develop and enforce delivery standards for production systems at massive scale, with a specific focus on optimizing costs associated with Big Data infrastructure. Implement efficient resource allocation and utilization strategies to minimize expenses while maximizing performance.
- Compliance: Design and implement compliant IDC environments, including robust data protection plans specific to Big Data systems, to meet industry standards and regulatory requirements. Ensure data governance and security measures are implemented effectively across all Big Data platforms.

Minimum Qualifications:
- Bachelor's or Master's degree in Computer Science or a related major, with at least 3 years of relevant experience.
- Solid foundational knowledge of computer software, including an understanding of the Linux operating system, storage, network I/O, and related principles.
- Familiarity with one or more programming languages, such as Python, Go, or Java. Knowledge of design patterns and coding principles is necessary.

Preferred Qualifications:
- Experience with storage systems and relevant technologies such as KV stores, relational databases (e.g., MySQL), NoSQL databases (e.g., MongoDB), and messaging queues (e.g., Kafka).
- Expertise in computing and big data technologies, including Hadoop, HDFS (Hadoop Distributed File System), YARN (Yet Another Resource Negotiator), Apache Flink for stream processing, Elasticsearch (ES) for indexing and search, and metrics collection and monitoring tools.