Site Reliability Engineer - Government Digital Service - G7
Government Digital & Data
Location
Bristol, London, Manchester
About the job
Job summary
The Government Digital Service (GDS) is the digital centre of government. We are responsible for setting, leading and delivering the vision for a modern digital government.
Our priorities are to drive a modern digital government, by:
- joining up public sector services
- harnessing the power of AI for the public good
- strengthening and extending our digital and data public infrastructure
- elevating leadership and investing in talent
- funding for outcomes and procuring for growth and innovation
- committing to transparency and driving accountability
We are home to the Incubator for Artificial Intelligence (I.AI) and the world-leading GOV.UK, and we are at the forefront of coordinating the UK’s geospatial strategy and activity. We lead the Government Digital and Data function and champion the work of digital teams across government.
We’re part of the Department for Science, Innovation and Technology (DSIT) and employ more than 1,000 people all over the UK, with hubs in Manchester, London and Bristol.
The Government Digital Service is where talent translates into impact. From your first day, you’ll be working with some of the world’s most highly-skilled digital professionals, all contributing their knowledge to make change on a national scale.
Join us for rewarding work that makes a difference across the UK. You'll solve some of the nation’s highest-priority digital challenges, helping millions of people access the services they need.
Sometimes described as the most strategic programme in government, GOV.UK One Login represents a once-in-a-career opportunity to work on a software product that will be used by the majority of the people living in the UK. It’s a fast-paced, dynamic and challenging environment that is sure to offer you career satisfaction as well as a chance to develop and enhance your skills.
GOV.UK One Login is being designed and built for the many, not the few. It will unite services across government, revolutionising the way government departments interact digitally with users. One Login will deliver an accessible and essential function that will change lives and help millions.
If this sounds like the next role for you on your career journey then we’d love to hear from you.
Find out more at the GDS Blog.
Job description
Site Reliability Engineers in One Login develop infrastructure and support application teams. This involves working with a diverse range of other technologists and non-technical stakeholders, so communication and empathy are as important as technical capability.
As a Site Reliability Engineer you'll:
- be part of one of our multidisciplinary service teams working with and supporting front-end and back-end developers, delivery and product managers, tech writers and architects
- build and maintain resilient, highly available and secure systems to meet the needs of our users
- take responsibility for solving complex and interesting problems
- create infrastructure as code to ensure our infrastructure and deployment pipelines are reusable, repeatable and reliable
- ensure our systems are appropriately monitored and instrumented to enable our teams to identify and respond to operational issues quickly and effectively
- build CI/CD pipelines to enable our developers to get their code into production as quickly and safely as possible
- act as a digital ambassador, sharing experiences through public speaking and blog posts
- participate in our in-hours 2nd line and out-of-hours support rotas to gain empathy for users and awareness of operational concerns
- share knowledge of tools and practices with your wider team and peers to drive consistency and maintain our high engineering standards
Person specification
We're interested in people who have:
- a high level of proficiency in at least one programming language (we use Java, TypeScript and Python), familiarity with modern development standards, awareness of development process optimisation and strong Git skills
- a working knowledge of Agile best practices and the benefits of focussing on user needs, and experience of using quantitative and qualitative data about users to turn user focus into outcomes
- a deep understanding of Linux operating system internals and confidence working with Linux virtual machines or containers, including the ability to identify, locate and fix service faults, and knowledge of capacity management and availability strategies
- strong experience of supporting large production services with infrastructure technologies including databases, web servers, DNS, CDNs, reverse proxies, message queues and load balancers, along with systems design experience: working with well-understood technology, identifying appropriate patterns, and establishing and iterating design patterns
- extensive experience of building and maintaining services in the cloud (preferably AWS), creating infrastructure as code using Terraform and CloudFormation, and using container orchestration systems like Kubernetes or ECS, or serverless application design with AWS Lambda
- knowledge of setting up pipelines in a CI/CD tool like GitHub Actions or AWS CodePipeline, and experience of building and testing simple interfaces between systems and working on more complex integrations as part of a wider team
- a good understanding of information security principles and how to keep large operational services secure