drew jess

Monitoring mTLS Protected APIs Without a Client Certificate

drew — Mon, 17 Jun 2024 16:30:41 GMT

We run a bunch of APIs secured by mTLS at my current place. Third parties establish connections to our API webservers by presenting a client certificate. We check whether we consider that certificate valid and, if so, permit access to the relevant resources.

As our part of the mTLS handshake, we present our server certificate when clients connect. We've got a vested interest in knowing how long is left on that certificate's validity, so that we can plan to renew it in plenty of time.

We're big users of DataDog, and for most TLS monitoring, we've found the tls check that ships with the DataDog Agent to be superb. It connects to an endpoint and calculates tls.days_left, which we can use to set up a monitor that alerts us in plenty of time. It even supports mTLS, with options provided for tls_cert and tls_private_key to allow our agent to present a client cert. Great, right?

Couple of problems here though:

Issuing yourself a client cert to perform "monitoring" is one more set of credentials that might have access to your app - effectively a "back door". Sure, you could write some logic into your app or webserver config to treat certain certs differently. But it's a bit clunky and easy to mess up.
Monitoring expiry of a server cert with a client cert? Now you have two things to monitor. How long is your monitoring cert valid for? Don't say forever! You need to also make sure you're monitoring and renewing that client cert appropriately alongside your server cert. Failure to do so means your monitoring solution is kaput. Twice the work.

Using the DataDog tls check without tls_cert and tls_private_key to monitor an mTLS API isn't an option, either. Try it and you'll see the following:

$ datadog-agent check tls
...
{
  "check": "tls.cert_validation",
  ...
  "status": 2,
  "message": "[SSL: SSLV3_ALERT_BAD_CERTIFICATE] sslv3 alert bad certificate (_ssl.c:1129)",
  ...
}

And it makes sense on the face of it. DataDog's not keen on reporting the success of the TLS connection until a full handshake's been made. But for mTLS APIs, we need that client cert to complete the handshake. So the tls check is no good here.

Where do we go from here? Well, the openssl CLI tool seems to be able to fetch a server certificate just fine when you point it at the same API:

$ openssl s_client -showcerts -servername foo.com \
  -connect foo.com:443 2>/dev/null

Fetching a server cert from an mTLS enabled server without a client cert (source)

Why can openssl do this where DataDog can't?

I think it's due to the ordering of the handshake operation. Namely these steps:

1. Client connects to server
2. Server presents its TLS certificate (ding, ding, ding!)
3. Client verifies the server's certificate
4. Client presents its TLS certificate (here's where we fall down...)
5. ...etc...
6. Handshake established!

All the info we need is in step 2. We don't need to get to step 6. The openssl tool clearly knows that and doesn't press the issue. So how do we get Datadog to do the same?

Well, we can write a DataDog custom check to submit our own metric(s). This framework makes it super-easy to write some Python to run, calculate some metric and submit the value up to DataDog for easy use. What code to write, though?

Luckily, the embedded DataDog Agent's Python comes with pyOpenSSL available to use. This is a Python library that's a wrapper around openssl, and operates differently enough to the standard Python ssl library (which DataDog and its tls check use) to be useful to us when writing a custom check.

Namely, we can attempt to make a connection to an mTLS enabled server, catch the sslv3 alert bad certificate exception that will undoubtedly result, then use the information we've gathered already without worrying about the overall status of the handshake - we weren't going to meaningfully send any data to the server, anyway. We've already been sent the server cert and only care about its validity - so let's use it.

We have everything we need to decide whether we trust the presented cert already, too - our local CA truststore. We don't need a full handshake for trust purposes.

A straight-up Python script to do this might look something like:

#!/usr/bin/python3
import OpenSSL
import socket
import datetime

host = "foo.com"
port = 443

sock = socket.create_connection((host, port))
context = OpenSSL.SSL.Context(OpenSSL.SSL.TLS_METHOD)
connection = OpenSSL.SSL.Connection(context, sock)
connection.set_connect_state()

# think of the below as the `-servername` arg to `openssl`
# if you need to adjust it, feel free to!
connection.set_tlsext_host_name(host.encode('utf-8'))

# catch the expected 'sslv3 alert bad certificate' error on mTLS APIs
# and continue regardless, knowing we've already been sent the server cert
try:
  connection.do_handshake()
except OpenSSL.SSL.Error:
  # the handshake failed, but the server cert arrived before it did
  pass
cert = connection.get_peer_certificate()

# calculate expiry and output to screen
now = datetime.datetime.now(datetime.timezone.utc)
exp = datetime.datetime.strptime(
    cert.get_notAfter().decode('ascii'),
    '%Y%m%d%H%M%SZ'
  ).replace(tzinfo=datetime.timezone.utc)  # notAfter is expressed in UTC

left_secs = int((exp - now).total_seconds())
left_days = left_secs / 60 / 60 / 24

print(cert.get_subject().commonName)
print(left_secs)
print(left_days)

No handshake? No problem! Let's proceed regardless (source)

Translate this into a DataDog check using their docs, and ship it up as a metric with self.gauge() and you've got your validity metrics for an mTLS API's server cert without having to use a client cert. Use these metrics in your cert monitors.
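As a rough illustration of that translation, here's a minimal sketch of what such a custom check might look like. The class name, the `mtls.cert.seconds_left` and `mtls.cert.days_left` metric names, the tags and the `host`/`port` instance keys are all my own assumptions rather than anything DataDog prescribes. The small fallback `AgentCheck` stub exists only so the file can be imported outside the Agent.

```python
import datetime
import socket

try:
    # Present in the Agent's embedded Python.
    from datadog_checks.base import AgentCheck
except ImportError:
    # Assumption: a tiny stand-in so this sketch can be imported outside the Agent.
    class AgentCheck:
        def __init__(self, *args, **kwargs):
            self.submitted = []

        def gauge(self, name, value, tags=None):
            self.submitted.append((name, value, tags))


def seconds_left(not_after, now=None):
    """Seconds until expiry, given the bytes returned by cert.get_notAfter()."""
    expiry = datetime.datetime.strptime(
        not_after.decode("ascii"), "%Y%m%d%H%M%SZ"
    ).replace(tzinfo=datetime.timezone.utc)
    now = now or datetime.datetime.now(datetime.timezone.utc)
    return int((expiry - now).total_seconds())


def fetch_not_after(host, port):
    """Fetch the server cert's notAfter, tolerating a failed mTLS handshake."""
    import OpenSSL  # imported lazily; pyOpenSSL ships with the Agent

    sock = socket.create_connection((host, port), timeout=10)
    # Back to blocking mode: pyOpenSSL can raise WantReadError on sockets
    # that still have a timeout set.
    sock.setblocking(True)
    try:
        context = OpenSSL.SSL.Context(OpenSSL.SSL.TLS_METHOD)
        connection = OpenSSL.SSL.Connection(context, sock)
        connection.set_connect_state()
        connection.set_tlsext_host_name(host.encode("utf-8"))
        try:
            connection.do_handshake()
        except OpenSSL.SSL.Error:
            pass  # expected on mTLS APIs; the server cert has already arrived
        cert = connection.get_peer_certificate()
        if cert is None:
            raise RuntimeError("no server certificate received from %s" % host)
        return cert.get_notAfter()
    finally:
        sock.close()


class MtlsCertExpiryCheck(AgentCheck):
    """Hypothetical custom check; metric names and instance keys are made up."""

    def check(self, instance):
        host = instance["host"]
        port = int(instance.get("port", 443))
        left = seconds_left(fetch_not_after(host, port))
        tags = ["server_hostname:%s" % host, "port:%d" % port]
        self.gauge("mtls.cert.seconds_left", left, tags=tags)
        self.gauge("mtls.cert.days_left", left / 86400, tags=tags)
```

Keeping the date arithmetic in its own `seconds_left()` function means it can be tested without a network connection or an Agent install.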

It'll work with non-mTLS APIs too (where the handshake completes), but in those cases I'd probably recommend just using the tls check - it's likely more bulletproof.

Hope this helps make someone else's life a little easier.



Launching Encrypted EBS Volumes via CloudFormation
drew — Mon, 07 Dec 2020 15:43:57 GMT
Today, I was launching a CloudFormation template which contained an EBS volume which I wanted to be encrypted with an AWS Key Management Service (KMS) Customer Managed Key (CMK).
The resource in the template looked like this:
"EncryptedVolume": {
  "Type": "AWS::EC2::Volume",
  "DeletionPolicy": "Snapshot",
  "Properties": {
    "AvailabilityZone": "eu-west-1a",
    "Encrypted": true,
    "KmsKeyId": "arn:aws:kms:eu-west-1:123456780123:key/blah-blah",
    "Size": 64,
    "VolumeType": "gp2"
  }
}

I created the stack containing this resource and waited for the resource to create.
It seemed to take a while. And when CloudFormation resource creations take a long time, you can bet your bottom dollar that it's not a good thing. Half-an-hour later, my stack entered the ROLLBACK_COMPLETE state. My only hints:

EncryptedVolume: Volume vol-12345678abcd1234 is still creating
CREATE_FAILED: Resource creation cancelled
The following resource(s) failed to create: EncryptedVolume. Rollback requested by user.

What happened?  It wasn't clear.  Checking CloudTrail showed that the ec2:CreateVolume call definitely fired from CloudFormation.  I even had a volume ID returned.  But checking the EC2 -> Volumes console for that ID showed up nothing. It's as if it didn't exist.
I've experienced in the past that AWS services sometimes act really weird if they're asked to use a KMS key that they don't have permission to use. For example, if you ask an AutoScaling Group to launch instances with boot volumes encrypted by an AWS KMS CMK, you're gonna have a bad time unless you configure your key policy properly. But this at least is relatively well documented through troubleshooting documentation that you can find when searching for the instance launch failure.
This time around, I didn't even have an error to work with. But while searching for something completely unrelated (volume deletion policies, actually), I happily and quite accidentally found some relevant looking information in the AWS Cloud Development Kit (CDK) EC2 Volume docs.
Turns out that for EC2 to create an EBS Volume encrypted with a CMK, the principal creating the volume has to have permission to call the following policy actions (note the conditions required to lock these permissions down using the principle of least privilege):
{
...
  "Action": [
    "kms:DescribeKey",
    "kms:GenerateDataKeyWithoutPlaintext"
  ],
  "Condition": {
    "StringEquals": {
      "kms:ViaService": "ec2.<region>.amazonaws.com",
      "kms:CallerAccount": "<account-id>"
    }
  }
...
}

I definitely had the IAM permissions in place to do this, as I was running as a user with the AdministratorAccess policy attached.  However, I remembered that in order to allow IAM to work on a KMS CMK, the CMK's key policy had to be configured to grant those actions to the root principal of the account that the key was created in.
However, I'd created my KMS CMK via the AWS Cloud Development Kit (CDK) which should contain a sensible, out-of-the-box key policy to allow the key to be used by resources via IAM.
I checked the key policy of my KMS CMK that I had created.  It turns out CDK had created the following key policy:
{
    "Effect": "Allow",
    "Principal": {
        "AWS": "arn:aws:iam::123456789012:root"
    },
    "Action": [
        "kms:Create*",
        "kms:Describe*",
        "kms:Enable*",
        "kms:List*",
        "kms:Put*",
        "kms:Update*",
        "kms:Revoke*",
        "kms:Disable*",
        "kms:Get*",
        "kms:Delete*",
        "kms:ScheduleKeyDeletion",
        "kms:CancelKeyDeletion",
        "kms:GenerateDataKey",
        "kms:TagResource",
        "kms:UntagResource"
    ],
    "Resource": "*"
}

Sure enough, we were missing the GenerateDataKeyWithoutPlaintext policy action in the key policy. I followed the documentation I had found to add the relevant policy action (with the correct conditions, as written earlier in this post). After doing that, recreating the stack yielded a freshly minted, encrypted volume. Problem solved.
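For anyone wanting to check their own key policy before waiting half an hour on a stack, here's a minimal sketch that reports which of the two required actions an Allow statement fails to cover. The function name and the idea of feeding it a single statement as a dict are my own assumptions. It only looks at the Action list, so it ignores principals, conditions and any other statements in the policy.

```python
from fnmatch import fnmatchcase

REQUIRED_ACTIONS = ["kms:DescribeKey", "kms:GenerateDataKeyWithoutPlaintext"]


def missing_actions(statement, required=REQUIRED_ACTIONS):
    """Return the required actions that an Allow statement's Action list does not cover."""
    if statement.get("Effect") != "Allow":
        return list(required)
    granted = statement.get("Action", [])
    if isinstance(granted, str):
        granted = [granted]
    # IAM action names are case-insensitive and may contain * and ? wildcards.
    patterns = [g.lower() for g in granted]
    return [
        action for action in required
        if not any(fnmatchcase(action.lower(), p) for p in patterns)
    ]


# The statement CDK generated for my key, as shown above.
cdk_statement = {
    "Effect": "Allow",
    "Action": [
        "kms:Create*", "kms:Describe*", "kms:Enable*", "kms:List*",
        "kms:Put*", "kms:Update*", "kms:Revoke*", "kms:Disable*",
        "kms:Get*", "kms:Delete*", "kms:ScheduleKeyDeletion",
        "kms:CancelKeyDeletion", "kms:GenerateDataKey",
        "kms:TagResource", "kms:UntagResource",
    ],
}

print(missing_actions(cdk_statement))
# → ['kms:GenerateDataKeyWithoutPlaintext']
```

Note that `kms:Describe*` covers `kms:DescribeKey`, but `kms:GenerateDataKey` has no wildcard, so it does not cover the longer action name.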
I'm hoping that in the future, AWS can implement faster failing (without being stuck in CREATE_IN_PROGRESS for half-an-hour) and better error reporting (even a searchable string!) for this type of error.  


AWS AutoScalingPlans, RollingUpdates and CloudFormation
drew — Mon, 18 May 2020 00:00:00 GMT
We’ve been exploring the world of AWS AutoScaling Plans recently. Being captivated by the tasty-looking Predictive Scaling for EC2, we were wooed into replacing all of our AWS::AutoScaling::ScalingPolicy resources for equivalent, shiny new AWS::AutoScalingPlans::ScalingPlan resources.
“There’s scaling,” says Jeff Barr, “all under one roof!”
These seemed great at first, but we ended up hitting a lot of gotchas with how AutoScalingPlans interact with AutoScalingGroups, particularly where CloudFormation was concerned.
TL;DR
This article will interest you if you’re considering moving to ScalingPlans and are a heavy user of AWS AutoScaling RollingUpdates. And hopefully save you some time and heartache.
You may also have ended up here because you’ve seen any of the below error messages as part of a CloudFormation update:
UPDATE_ROLLBACK_IN_PROGRESS
The following resource(s) failed to update: [AutoScalingPlan].

UPDATE_FAILED
Scaling  plan has been updated but failed to be applied to all resources.  Problems were encountered for 1 resource. See scaling plan resources for  the failure details.
And subsequently maybe also the following in the AWS AutoScaling console:
ActiveWithProblems
The resource was not updated because it was found to have different min/max capacity than what the scaling instruction indicates
Read on to find out more. Spoiler alert: there’s no happy ending.
Getting Started
On the face of it, moving to ScalingPlans seemed simple. We aren’t super-complicated people; we’d implemented TargetTracking policies on some of our services to track and add/remove instances in accordance with our AutoScalingGroup’s ASGTotalCPUUtilization metric. This old policy mapped perfectly to an equivalent ScalingInstruction TargetTrackingConfiguration property in the new resource. So we had feature-parity immediately.
Setting the PredictiveScalingMode to ForecastOnly allowed AWS Auto Scaling to start generating pretty looking graphs using Machine Learning™!
You too might be able to be as efficient as AWS suggest you can be!
So, we set the PredictiveScalingMode to ForecastAndScale, basking in wonder at the dozens of AutoScaling ScheduledActions that were implemented on our AutoScalingGroup. Wow! Every hour, the size of the AutoScalingGroup would be enforced! You could even use the ScheduledActionBufferTime property to scale a set amount of time before the hour, in case you usually get really busy when the clock strikes 12, just like we do! Amazing!
For one reason or another, we went back to ForecastOnly. Our services are frequently victims of short, sudden bursts of traffic. So the reality was that the capacity levels that AWS suggested were fine for regular traffic, but not enough if we saw an unexpected, unpredictable spike of traffic.
We just ended up setting our MinCapacity ever-higher until we were overprovisioned enough to deal with these spikes. This led to some ugly PredictiveScaling graphs — the blue line in the image above was continually flatlined above the orange line. A constant reminder of “wasted” compute.
So yeah, we turned predictive scaling off, thinking that we’d maybe enable it again sometime. We kept the new resources, because they were newer (hence better) and we still had the TargetTracking policies set up, so had feature parity, right? This was true, but we encountered some unexpected gotchas...
Changing AutoScalingGroup Capacity
We frequently want to change the size of our AutoScalingGroups. In production, we might scale up for a once-off event where we expect greater traffic. In our performance testing environments, we want to scale down at the end of the day to save money.
We also try to be good systems engineers. That means keeping our AWS infrastructure definitions within CloudFormation templates, for all the good reasons that entails (maintainability, reviewability, recreatability, etc).
So when implementing the new ScalingPlans, we did so by removing our old AWS::AutoScaling::ScalingPolicy CloudFormation resources and replacing them with AWS::AutoScalingPlans::ScalingPlan resources. The latter requires that you provide a MinCapacity and a MaxCapacity for the ScalingPlan to enforce upon the AutoScalingGroup:
"AutoScalingPlan": {
      "Type" : "AWS::AutoScalingPlans::ScalingPlan",
      "Properties" : {
        <...snip...>
        "ScalingInstructions" : [ {
          "DisableDynamicScaling" : false,
          "MaxCapacity" : ,
          "MinCapacity" : ,
        <...snip...>
}
This looked familiar, as we also set MinSize and MaxSize on our AutoScalingGroup resource using a couple of CloudFormation Parameters:
"AutoScalingGroup": {
      "Type": "AWS::AutoScaling::AutoScalingGroup",
      "Properties": {
        <...snip...>
        "MinSize": { "Ref": "MinSize" },
        "MaxSize": { "Ref": "MaxSize" },
        <...snip...>
}
We could just reuse those, like so:
"AutoScalingPlan": {
      "Type" : "AWS::AutoScalingPlans::ScalingPlan",
      "Properties" : {
        <...snip...>
        "ScalingInstructions" : [ {
          "DisableDynamicScaling" : false,
          "MaxCapacity" : { "Ref": "MaxSize" },
          "MinCapacity" : { "Ref": "MinSize" },
        <...snip...>
}
Now, to change the number of instances running, it should be as simple as adjusting the value of MinSize and MaxSize CloudFormation Parameters. Let’s do it!
Wait, what? At least… it’s… not.. red?
I think CloudFormation just told me what it really thinks of me.
The above images show the result of adjusting either of the shared MinSize or MaxSize CloudFormation parameter values and then performing a CloudFormation update. What happens seems to be the following:
The AutoScalingPlan resource transitions into UPDATE_IN_PROGRESS.
In the AWS AutoScaling console, the ScalingPlan reports that it is: 
ActiveWithProblems — The resource was not updated because it was found to have different min/max capacity than what the scaling instruction indicates.
The AutoScalingPlan resource transitions into UPDATE_FAILED and the CloudFormation Stack begins to roll back.
When the stack enters UPDATE_ROLLBACK_COMPLETE, the MinSize/MaxSize parameter you changed is set back to its original value. The size of the AutoScalingGroup also hasn’t changed.
This is not great — let’s see what AWS has to say on the matter in the Other Considerations in the Best Practices for AWS Auto Scaling Plans documentation…
Your customized settings for minimum and maximum capacity, along with other settings used for dynamic scaling, show up in other consoles. However, we recommend that after you create a scaling plan, you do not modify these settings from other consoles because your scaling plan does not receive the updates from other consoles.
Emphasis mine. If we go back to our CloudFormation Events above we can see that the AutoScalingGroup resource was updated by CloudFormation before the AutoScalingPlan resource was. That’ll be a modification from an “other console”, then.
Sound of Separation
OK, fine. It’s a bit messy, but we can use separate CloudFormation parameters:
AutoScalingGroupMinSize: MinSize of the AutoScalingGroup resource
AutoScalingGroupMaxSize: MaxSize of the AutoScalingGroup resource
AutoScalingPlanMinSize: MinCapacity of the AutoScalingPlan resource
AutoScalingPlanMaxSize: MaxCapacity of the AutoScalingPlan resource
We can’t omit any; MinSize and MaxSize are Required according to the AWS::AutoScaling::AutoScalingGroup documentation. Same deal for MinCapacity and MaxCapacity of AWS::AutoScalingPlans::ScalingPlan ScalingInstructions.
We’ll just set AutoScalingGroupMinSize and AutoScalingGroupMaxSize to 0, then scale up our service by adjusting AutoScalingPlanMinSize and AutoScalingPlanMaxSize. That works just fine.
At face value, this has fixed our issue. We can now adjust AutoScalingPlanMinSize and AutoScalingPlanMaxSize freely. This updates the size of the AutoScalingGroup without adjusting it via CloudFormation.
Aww yiss.
“And they all lived happily ever after”, right? Here’s where I tell you the follow-up story of how we’re heavy users of AWS AutoScaling RollingUpdates. When we deploy a new version of a service, it looks like this:
We create a new Amazon Machine Image (AMI) which contains a new version of our application.
We update our AWS::AutoScaling::LaunchConfiguration resource with this new AMI ID as part of a CloudFormation stack update. This forces a recreation of the LaunchConfiguration resource which in turn cascades an update to the AutoScalingGroup resource.
Our AutoScalingGroup has an UpdatePolicy configured to perform RollingUpdates. Replacing the LaunchConfiguration triggers the replacement of each one of the AutoScalingGroup’s running instances, replacing the old instances with new instances. When the RollingUpdate is complete, so is our deployment.
RollingUpdates and ScalingPlans appear to co-exist, in that you won’t see a RollingUpdate fail. It’ll succeed just fine. Then, when next you come to adjust the size of your AutoScalingGroup via the AutoScalingPlanMinSize and AutoScalingPlanMaxSize parameters, CloudFormation will scream at you:
Why do you hate me, ScalingPlan?
UPDATE_FAILED
Scaling plan has been updated but failed to be applied to all resources. Problems were encountered for 1 resource. See scaling plan resources for the failure details.
It’ll rollback immediately. If you’re quick, you’ll see the same status messages in the AWS AutoScaling console from when we tried using the shared MinSize/MaxSize parameters:
ActiveWithProblems — The resource was not updated because it was found to have different min/max capacity than what the scaling instruction indicates.
What happened here? Well, during the first stack update where we performed the RollingUpdate, I observed the following:
Before doing anything, note the size of your AutoScalingGroup. It should match AutoScalingPlanMinSize and AutoScalingPlanMaxSize.
Start a CloudFormation update that triggers a RollingUpdate (e.g. alter the AMI ID of the LaunchConfiguration). Watch it succeed.
Note the size of your AutoScalingGroup at this point. You’ll notice that it has changed to match AutoScalingGroupMinSize and AutoScalingGroupMaxSize.
This is unexpected. And confirms that a RollingUpdate will reapply whatever properties your CloudFormation template’s AutoScalingGroup has set.
The next time you come along to adjust the capacity of your AutoScalingGroup (be that in five minutes or in five months) by adjusting AutoScalingPlanMinSize or AutoScalingPlanMaxSize, you’ll be set up to fail from the get-go.
As far as the AutoScalingPlan is concerned, your AutoScalingGroup has changed from underneath it. From an “other console”, as the Best Practices for AWS Auto Scaling Plans liked to put it.
The AutoScalingPlan is never updated about this change, and will never automatically fix itself. In practice, the only way to fix this is to either:
Manually adjust the size of the AutoScalingGroup back to the same values as the last known adjustment that the AutoScalingPlan made.
or
Delete and recreate the ScalingPlan resource. 
(Note: when you do this, your AutoScalingGroup size will revert to AutoScalingGroupMinSize and AutoScalingGroupMaxSize. 
Sure hope you hadn’t configured those as 0…)
Speaking as a guy who’s deleted a few dozen ScalingPlans to fix these issues, let me tell you that deleting them is what you’ll be doing most often.
Why? Well, while you can determine what the current size of an AutoScalingGroup is, and while you can determine what the current capacities of an AutoScalingPlan are, you cannot determine what an AutoScalingPlan thinks that the AutoScalingGroup should be set to. That information is not exposed to us mere mortals.
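The two things you can determine are at least easy to compare. Here's a minimal sketch that does so, taking plain dicts whose key names are assumed to mirror the AWS API shapes (MinSize/MaxSize on the group, MinCapacity/MaxCapacity on the scaling instruction). The function name is made up. A mismatch only hints that the next plan update may fail. It says nothing about what the plan internally believes.

```python
def capacity_drift(asg, instruction):
    """Compare an ASG's min/max with a scaling instruction's min/max capacity.

    Returns a dict of (asg_value, plan_value) pairs for each bound that
    differs, or an empty dict when the two agree.
    """
    drift = {}
    if asg["MinSize"] != instruction["MinCapacity"]:
        drift["min"] = (asg["MinSize"], instruction["MinCapacity"])
    if asg["MaxSize"] != instruction["MaxCapacity"]:
        drift["max"] = (asg["MaxSize"], instruction["MaxCapacity"])
    return drift


# Hypothetical values: a RollingUpdate has reset the group to 0/0
# while the plan is still configured for 3/6.
print(capacity_drift(
    {"MinSize": 0, "MaxSize": 0},
    {"MinCapacity": 3, "MaxCapacity": 6},
))
# → {'min': (0, 3), 'max': (0, 6)}
```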
Conclusions
It’s a tale as old as time — new AWS services, existing AWS services and CloudFormation don’t always mix or fit tightly enough together. In this case, there’s a lack of integration between AutoScalingGroups and AutoScalingPlans, and an unreasonable expectation that AutoScalingPlans should be the only resources permitted to change the size of an AutoScalingGroup.
The root cause of our problem is that CloudFormation always rolls back. In reality, all we’d need to fix our issue is one additional property named ForceCurrentCapacity on an AWS::AutoScalingPlans::ScalingPlan ScalingInstruction resource. When true, the AutoScalingPlan could say: “yeah, I’m gonna enforce whatever my MinCapacity and MaxCapacity are on the AutoScalingGroup, regardless of whether they seem to have changed”.
But we don’t have that. And instead, due to continued rollbacks while trying to adjust the size of our AutoScalingGroup resources following service deployments, we reluctantly went back to AutoScaling::ScalingPolicy resources. Although they’re not as fancy, they work without any real fuss and scale our services when we need it.
And while we don’t have the option of turning PredictiveScaling back on or seeing the pretty graphs that the AWS Auto Scaling console provides, I’ll take working deployments and predictable CloudFormation updates any day.




Declarative Jenkinsfiles: Dynamic Parallel Stages
drew — Wed, 29 Apr 2020 00:00:00 GMT
Been working on replacing a crusty old Jenkins job recently that’s used for scaling up a load-testing environment. Currently, it’s a good ol’ handcranked freestyle job with a 300-line bash script pasted into the Execute Shell field. When it needs debugged or changed it’s fun for all the family.
I’ve wanted to rewrite this for a while. Right now, the script has a bunch of configuration at the top of the file which dictates how many instances to launch, what sizes of instances to use, etc. And it’s pretty repetitive:
# config for standard load tests
SERVICEONE_NORMALLOAD_INSTANCETYPE=c5.xlarge
SERVICEONE_NORMALLOAD_MINSIZE=3
SERVICEONE_NORMALLOAD_MAXSIZE=3
SERVICETWO_NORMALLOAD_INSTANCETYPE=c5.large
SERVICETWO_NORMALLOAD_MINSIZE=6
SERVICETWO_NORMALLOAD_MAXSIZE=6

# config for high load tests
SERVICEONE_HIGHLOAD_INSTANCETYPE=c5.xlarge
SERVICEONE_HIGHLOAD_MINSIZE=6
SERVICEONE_HIGHLOAD_MAXSIZE=6
SERVICETWO_HIGHLOAD_INSTANCETYPE=c5.large
SERVICETWO_HIGHLOAD_MINSIZE=9
SERVICETWO_HIGHLOAD_MAXSIZE=9

if [[ $SCALING_LEVEL = 'NORMAL' ]] ; then
    # perform scaling actions with the first set of config...
    ...

elif [[ $SCALING_LEVEL = 'HIGH' ]] ; then
    # perform scaling actions with the second set of config...
    ...

fi
We don’t have just two services, of course. Or just two load settings. So you can see how this script gets pretty gnarly, pretty quickly. Adding a new service? Sure hope you got all the variable names right. Something not working? Better step through every single command to see what’s not as you expected.
Additionally, this script works through every service sequentially. This increases the time to scale the environment drastically if you have a bunch of services. The total time taken is the product of the number of services you have, and how long it takes to scale a service. Worse if one takes longer than another. There are no actions performed in parallel.
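To put rough numbers on that, here's a tiny worked example. The per-service durations are invented for illustration, not measurements from our environment.

```python
# Hypothetical per-service scaling times in minutes; not real measurements.
durations = {"service-1": 8, "service-2": 12, "service-3": 5}

# One service after another: the times add up.
sequential_minutes = sum(durations.values())

# All services at once: bounded by the slowest one.
parallel_minutes = max(durations.values())

print(sequential_minutes, parallel_minutes)
# → 25 12
```

The gap only widens as more services are added, since the sequential total keeps growing while the parallel total stays pinned to the slowest service.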
How could I make this better? For config, I envisioned something like this:
def SERVICES = [
    "service-1": [
        "NORMAL": [ 
            "InstanceType": "c5.xlarge", 
            "MinSize": 3, 
            "MaxSize": 3
        ],
        "HIGH": [
            "InstanceType": "c5.xlarge", 
            "MinSize": 6, 
            "MaxSize": 6
        ]
    ],
    "service-2": [
        "NORMAL": [ 
            "InstanceType": "c5.large", 
            "MinSize": 6, 
            "MaxSize": 6
        ],
        "HIGH": [
            "InstanceType": "c5.large", 
            "MinSize": 9, 
            "MaxSize": 9
        ]
    ]
]
Should be achievable in a Jenkins Declarative Pipeline, right? It’s easy enough to define a map of nested maps. By doing this, I can set all of my configuration values at the top of my Jenkinsfile, once per service. And people can come along and easily see what’s set to what. Great.
It got me thinking though. Every service scales up in the same way. We update a Launch Configuration and an Autoscaling Group resource in a CloudFormation Stack. Do we have to do this sequentially? Can we kick off each service’s scaling job in parallel to save time?
I ended up doing just this, implementing the parallelism with Parallel Stages in my Jenkinsfile. However, instead of writing out each stage individually, I wrote a function that generated one stage per item in my configuration map. Then, I could use the parallel directive to kick off every stage at once, at the same time.
My Jenkinsfile looked something like this:
// set up a config map as per the above snippet
def SERVICES = [ ...see above... ] 

// generate a pipeline stage for each item in the config map,
// keyed by service name so each parallel branch gets a readable label
def parallelStagesMap = SERVICES.collectEntries {
    [ "${it.key}": generateStage(it) ]
}

// here's how we generate identical stages
def generateStage(service) {
    def service_name = service.key
    return {
        stage("Service: ${service_name}") {
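            // 'def' keeps these local to each stage; without it they would be
            // shared script-level variables that parallel stages overwrite
            def size = service.value[env.LEVEL]['InstanceType']
            def max = service.value[env.LEVEL]['MaxSize']
            def min = service.value[env.LEVEL]['MinSize']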

            // insert your scaling commands in here
            sh( script: """
                    echo "Action: ${env.LEVEL}"    && \
                    echo "Service: ${service_name}" && \
                    echo "InstanceType: ${size}" && \
                    echo "MaxSize: ${max}" && \
                    echo "MinSize: ${min}"
                """
            )
        }
    }
}

pipeline {
    agent any

    parameters {
        choice(
            name: 'LEVEL',
            description: 'What level do you want to scale to?',
            choices: [
                'NORMAL',
                'HIGH'
            ]
        )
    }
    stages {
        stage('Scale services') {
            steps {
                script {
                    // kick off all the stages we generated
                    parallel parallelStagesMap
                }
            }
        }
    }
}
When we run the job, it works and looks great:
Each item in our configuration hash has its own stage, executed in parallel.
Or if you prefer the fancier BlueOcean pipeline view…
The logs for each stage are pretty useful too.
This method is especially useful if one particular service has a problem scaling up — the error will show clearly on the stage related to that service, not halfway down a shell script somewhere that you have to dig into the logs of.
The configuration hash at the top also serves as self-documentation for our scaling levels. How many levels do we scale to? Oh, there’s two levels defined in the hash. How many services do we scale? They’re all there, unobscured and easy to see. No hidden settings, or secret numbers.
When our scaling levels change due to a projected increase in traffic or otherwise, we’re now one Pull Request away from having the updated levels ready-to-go — peer reviewed, merged once. New service launched? Cool, raise a PR. No more need to hunt through multiple Jenkins jobs, wondering which one is live and which one isn’t. The next run of the job will pull in the latest Jenkinsfile, along with the latest config.
This also opens us up to being able to schedule the scaling up and scaling down of our load testing environment using the Parameterized Scheduler Plugin and the ubiquitous cron syntax. In the pipeline section of our Jenkinsfile, we can set something like this (assuming a ZERO level has also been defined in the config map and in the LEVEL choices):
triggers {
    parameterizedCron('''
        # scale up in the morning, scale down at night (weekdays)
        H 6 * * 1-5 % LEVEL=NORMAL        
        H 18 * * 1-5 % LEVEL=ZERO
       ''')
    }
This takes the legwork out of manually making sure the environment is available each morning and ensures we’re saving as much as we can on instance costs during evenings and weekends. We also don’t have to mess around with AutoScalingGroup ScheduledActions, which is nice.
You can see some extremely abstract versions of Jenkinsfiles I’ve been working on recently at my GitHub, recorded because I realise it’s probably not the last time I’ll need to write something like this.
Happy Jenkinsfile-ing!




Macs, Control Characters, and iTerm2
drew — Tue, 31 Mar 2020 00:00:00 GMT
I always forget how to do this. And I always lose Brad Erickson’s excellent article that taught me how to fix it. So here it is, for posterity.
Whenever I switch to a new Mac, be that through replacement of machine or job, I tend to wrestle with my laptop for a while. Mostly, this is due to Mac systems using the Command (⌘) key for shortcuts (copy, paste, etc.) while my favoured terminal emulator, iTerm2, sensibly uses Ctrl (^) as the modifier key for sending control characters (^C, etc).
I’m also a stubborn critic of modern laptop keyboard layouts in general. I don’t want the bottom-left key on my laptop to be Fn. I want it to be Ctrl, like how it was when I grew up, dammit! So of course, I use an external keyboard which has everything laid out the way I like it.
So, in summary, I want my keyboard to do the following, for example:
Use Ctrl + c to copy in most applications.
Use Ctrl + c to send control characters in iTerm2.
This presents a small challenge, because these are different things. One uses Command (⌘) by default. The other uses Ctrl (^).
Fix one thing, break another
When I first connect my external keyboard to a new Mac, copying, pasting, etc. requires that I use Super (“Windows”) + c for copy, for example. But iTerm2 works perfectly as I expect it to. No good.
I know what you’re thinking. Just go to System Preferences → Keyboard → Modifier Keys… and make sure you select the correct keyboard from the drop-down box, then swap the keys around like this:
This should work, right?
This fixes my first requirement. I can now copy, paste, etc. with what my keyboard considers Ctrl. But it’s broken the second one! Now I need to use Super (“Windows”) + c to send a control character via iTerm2. Oh no!
Is there a setting for that?
How about the iTerm2 → Preferences → Keys → Remap Modifiers menu?
As soon as you start playing with this, iTerm2 asks for Accessibility Access (Events) privileges on your laptop.
Okay.
If you grant this, you’re presented with a menu that looks suspiciously similar to the one that we messed around with earlier:
This looks promising, but ultimately proves futile.
For one, this is a bit of a mind-melter for me. What if I’ve adjusted the System Preferences menu from earlier? Does this override those? Perform another remap on top? Something else? My brain hurts just thinking about it.
For two, no matter what I adjust here, it doesn’t actually do anything to my external keyboard. I’m left to assume that this menu is hard-coded to adjust settings for the laptop’s built-in keyboard, as it seems to be lacking the Select keyboard drop-down box from the earlier menu.
Let’s not mess around with this, and let’s find another solution.
iTerm2 Key Bindings to the rescue!
Brad Erickson seems to be the only other person on the entire Internet that this behaviour frustrates, and he wrote a great article on how to fix it using iTerm2’s Key Bindings configuration.
Here’s how I got it working to my liking:
Hit up iTerm2 → Preferences → Keys. The default tab in this window is Key Bindings. In the bottom left corner, press the small plus sign (+) to add a new binding. That’ll bring up a menu asking you for a shortcut and an action.
Here’s where I press Ctrl + c, which now shows up as Command + c (⌘c) after my previous adventures in System Preferences. As for an action, I want to Send Hex Code. Then, I can use Wikipedia’s handy ASCII Control Characters table to select the hex code that represents the control character that I want to send. I want ^C, which has a hex code of 03, formatted properly as 0x3 as far as iTerm2 is concerned. Turns out this control character is actually called the End-of-Text character. TIL.
Configuring a Key Binding to send the ^C control character in iTerm2
As soon as the OK button is pressed, my shell’s behaviour is as I expect.
How about other control characters?
You can use the same mechanism to add the control characters that you send most often. Over a year and a half, I’ve found that I only commonly use these:
^C (0x3): Ctrl + c (End-of-Text, or “send interrupt”)
^D (0x4): Ctrl + d (End-of-Transmission, or “logout”)
^Z (0x1a): Ctrl + z (Substitute, or “background job”)
^] (0x1d): Ctrl + ] (Group Separator, or “exit telnet”)
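If you don’t fancy looking each one up in the table, the hex code can be worked out: a control character is the ASCII code of its key with everything but the low five bits cleared. Here’s a minimal sketch of that in shell. Note that ctrl_hex is a name I’ve made up for illustration, not something iTerm2 provides.

```shell
# ctrl_hex is a hypothetical helper, not part of iTerm2.
# Prints the hex code that Ctrl + <key> sends, in the 0x.. form used above.
ctrl_hex() {
  # Control characters are defined against the upper-case column of ASCII.
  key=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  # A leading quote makes printf print the character's numeric code.
  code=$(printf '%d' "'$key")
  # Keep only the low five bits: C (0x43) becomes 0x3.
  printf '0x%x\n' $(( code & 0x1f ))
}

# Print the codes for the four keys listed above.
for key in c d z ']'; do
  printf '^%s = %s\n' "$key" "$(ctrl_hex "$key")"
done
```

The values it prints should match the four bindings in the list, which is a handy sanity check before typing them into the Send Hex Code box.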
I’ll be the first to admit that setting these up manually isn’t ideal, but given the infrequency with which I move machines, I’m happy to take the hit since it’s truly a “set and forget” solution.
Hopefully this will help someone out there. And hopefully now you can control your terminal properly and escape any stress that this situation was putting you under!




Timeouts Loading Data from S3 to Aurora
drew — Fri, 20 Apr 2018 00:00:00 GMT
Been working on a data migration recently which involves loading a pretty big tab-separated file into an Amazon Aurora (MySQL) database and hit an interesting issue in one of our sub-accounts.
We’d gone through the documented steps of creating an IAM Role, associating its ARN with the cluster using aws rds add-role-to-db-cluster and adding its ARN as the value of the aurora_load_from_s3_role DB cluster parameter.
We’d added the s3:GetObject and s3:ListBucket permissions to the bucket policy of the data source, using the IAM Role’s ARN as the principal (we decided to use a bucket policy for ease of clean-up after the migration had been completed).
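For reference, a bucket policy along those lines might look something like the sketch below. This is an illustration rather than the exact policy we used: the helper name, the role ARN and the bucket name are all placeholders. The one detail worth noting is that s3:ListBucket applies to the bucket itself, while s3:GetObject applies to the objects inside it.

```shell
# Hypothetical helper: prints a bucket policy granting the two permissions
# mentioned above. The role ARN and bucket name passed in are placeholders.
aurora_bucket_policy() {
  role_arn=$1
  bucket=$2
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "${role_arn}" },
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::${bucket}"
    },
    {
      "Effect": "Allow",
      "Principal": { "AWS": "${role_arn}" },
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::${bucket}/*"
    }
  ]
}
EOF
}

# Example with made-up values.
aurora_bucket_policy "arn:aws:iam::123456789012:role/example-aurora-s3" "example-migration-bucket"
```

Redirect the output to a file and it can be applied with aws s3api put-bucket-policy, then removed again once the migration is done.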
But still, no matter what we tried, our LOAD DATA FROM S3 FILE query would always hang for approximately five minutes and then throw us an error:
ERROR 1815 (HY000): Internal error: Unable to initialize S3Stream
What was going on? It felt like a timeout, due to the severe hang before the command bombed out. Checking our tables afterwards verified that no data had been written by the query, either.
An extra-strange thing is that the query worked just fine in our main AWS account (using a different S3 bucket and Role ARN, but with otherwise functionally identical configuration).
We only experienced the issue when trying to run the query from one of our sub-accounts, which we use for performance testing.
Double and triple checking our S3 and IAM setups didn’t flag up anything obvious. So we delved into the documentation, and the most interesting thing appeared to be the fifth point:
Configure your Aurora MySQL DB cluster to allow outbound connections to Amazon S3.
Checking the page linked to in this section implied that our cluster was likely to be misconfigured if we encountered an error like so:
ERROR 1873 (HY000): Lambda API returned error: Network Connection. Unable to connect to endpoint
However, this didn’t seem to be the case for us. We weren’t seeing this error. Still, my instinct was that we were hitting some sort of network problem so I decided to check our VPC Endpoint configuration in the console, accessed via VPC -> Endpoints.
Sure enough, I didn’t have a VPC Endpoint configured for the S3 service in this account. Checking our main account showed that we did have an S3 endpoint, which gave another hint as to why it worked there.
A VPC Endpoint is a resource within AWS that allows you to privately connect a VPC to a particular AWS service. This means that resources which normally live in private subnets, like database clusters, can be permitted to communicate with AWS services that expect to be reached via the Internet, like Amazon S3. Configuring a VPC Endpoint inserts a route into the route tables that you specify, meaning you can control exactly which subnets you are granting this access to.
I created a new endpoint, selecting com.amazonaws.us-east-1.s3 as the Service Name. I provided the VPC ID where my Amazon Aurora cluster was located. Finally, I provided the IDs of the route tables associated with the subnets used by my Amazon Aurora cluster, and decided to use the default policy that was provided.
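I did all of this in the console, but the same check and fix can be sketched with the AWS CLI. The helper below is hypothetical and only prints the commands so they can be reviewed first. The VPC and route table IDs are placeholders, and leaving the policy unspecified is assumed to give you the default one.

```shell
# Hypothetical helper: prints the AWS CLI commands for review instead of
# running them. The VPC and route table IDs used below are placeholders.
s3_endpoint_commands() {
  region=$1
  vpc_id=$2
  route_table_id=$3
  service="com.amazonaws.${region}.s3"

  # Is there already an S3 endpoint attached to this VPC?
  echo "aws ec2 describe-vpc-endpoints --region ${region}" \
       "--filters Name=vpc-id,Values=${vpc_id} Name=service-name,Values=${service}"

  # If not, create one and attach it to the cluster's route table.
  echo "aws ec2 create-vpc-endpoint --region ${region}" \
       "--vpc-id ${vpc_id} --service-name ${service} --route-table-ids ${route_table_id}"
}

s3_endpoint_commands us-east-1 vpc-0example rtb-0example
```

Running the describe command in both accounts would have shown the difference between them straight away.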
After creating the endpoint I reran the LOAD DATA FROM S3 query. Our query hit five minutes and didn’t bomb out like before. The anticipation started to build — our datafile was rather large (approximately 13m rows), so we expected it to take a bit of time. Then, finally:
Query OK, 12537768 rows affected (10 min 50.39 sec)
Records: 12537768 Deleted: 0 Skipped: 0 Warnings: 0
It worked! And it was pretty damn fast, too.
Lesson learnt: even if you really, really hope the problem isn’t your VPC configuration, sometimes it is. That, and sometimes you need to use a bit of imagination with AWS’s documentation.
Hopefully I’ll get to write a bit more about my experiences performing data migrations in AWS soon; in particular we’ve been having fun exporting data from Amazon Redshift to S3, where we then import that data into an Amazon Aurora cluster. I’ve been really impressed with the simplicity and speed that can be achieved and hope I never have to load data from anywhere else ever again.




Puppet CA Host Certificate Expiry
drew — Wed, 25 Oct 2017 00:01:00 GMT
So you read the title and thought “Hey, I just read that article!”, right? Nah, here’s another article about Puppet CA Server certificates. Similar, but different. I’m just being cheeky by submitting them a couple of hours apart.
CA certs… Host certs… What?!
So, when a normal node runs Puppet for the first time, it contacts its Puppet CA server for a certificate that proves it’s allowed to be a client of that master. A detailed step through of what happens is:
It generates a new SSL keypair ($::hostprivkey and $::hostpubkey)
Using its newly generated $::hostprivkey, it generates a certificate signing request ($::hostcsr) for a certificate and sends that off to whatever’s configured as its Puppet CA server ($::ca_server). No catalog is compiled at this point, because the node isn’t considered a trusted client of the Puppet master.
The CSR from the node ends up in the $::csrdir of the Puppet CA server. If the Puppet CA server is configured to allow auto-signing of the node that made the request, it signs it with its CA certificate ($::cacert) and pops the signed cert into $::signeddir. The CSR gets deleted.
On the node’s next Puppet run, it gets a copy of the signed certificate from the Puppet CA server and is allowed to continue with its run. It gets a catalog and nice things happen to it for the rest of its little life. Or for five years. Which is how long the signed certs are valid for, by default.
So we’ve described a situation where a node requests what I’ll call a host certificate from the CA, which gives one back signed by its CA cert.
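If it helps to see those steps outside of Puppet, here’s a rough stand-in using plain openssl. It isn’t what Puppet runs under the hood, just the same keypair, CSR and signing sequence. The function name, file names and subject names are all made up for illustration.

```shell
# Stand-in for the Puppet CSR flow using plain openssl (not Puppet's own code).
# The function name, file names and subjects are made up for illustration.
demo_csr_flow() {
  dir=$1

  # The CA's own keypair and self-signed certificate, valid for five years.
  openssl req -x509 -newkey rsa:2048 -nodes -days 1825 \
    -subj "/CN=Demo CA" \
    -keyout "$dir/ca_key.pem" -out "$dir/ca_crt.pem" 2>/dev/null || return 1

  # The node's keypair, then a CSR generated from its private key.
  openssl genrsa -out "$dir/host_key.pem" 2048 2>/dev/null || return 1
  openssl req -new -key "$dir/host_key.pem" \
    -subj "/CN=node1.example.com" -out "$dir/host_csr.pem" || return 1

  # The CA signs the CSR, producing the host certificate.
  openssl x509 -req -in "$dir/host_csr.pem" \
    -CA "$dir/ca_crt.pem" -CAkey "$dir/ca_key.pem" -CAcreateserial \
    -days 1825 -out "$dir/host_crt.pem" 2>/dev/null || return 1

  # Confirm the host certificate chains back to the CA certificate.
  openssl verify -CAfile "$dir/ca_crt.pem" "$dir/host_crt.pem"
}

# Example usage, writing everything into a scratch directory:
#   demo_csr_flow "$(mktemp -d)"
```

The point to take away is that there are two distinct certificates at the end: the CA’s and the host’s.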
Okay, okay. But what about the CA?
You might already know this, but the Puppet CA server is usually a client of itself.
So when it was first spun up, the Puppet CA generated itself an SSL keypair ($::cakey and $::capub) and a CA certificate ($::cacert). This certificate is self-signed, and is marked as a CA cert, meaning it’s allowed to sign other certificate requests.
That’s on the master side. But Puppet masters also run the Puppet agent.
When the Puppet CA ran puppet agent -t for the first time, it went through the steps we discussed above. Generated itself a new keypair ($::hostprivkey and $::hostpubkey)… Looked to see who its CA server was… (itself!) and submitted a CSR to them ($::hostcsr) and waited until it got a response back (from itself!).
So yeah, the upshot is that the Puppet CA server has two sets of certs. Its CA certs, and its host certs. We can illustrate that clearly here:
# puppet config print hostcert hostprivkey hostpubkey cacert cakey capub | awk '{print $NF}' | xargs md5sum
63c251f890893690df65f79d36eb5d62  /var/lib/puppet/ssl/certs/puppetca.example.com.pem
e2c7e9bb4b5a68c812989c2da702b971  /var/lib/puppet/ssl/private_keys/puppetca.example.com.pem
3b8874ae570cecb3470384b1fc307503  /var/lib/puppet/ssl/public_keys/puppetca.example.com.pem
f4288c68efa8a4f9bd07d0bac62710a6  /var/lib/puppet/ssl/ca/ca_crt.pem
af2f94391cefa3e8d296d0648ebeb5d5  /var/lib/puppet/ssl/ca/ca_key.pem
7c17a8d5cc920b1d923f4b18deefea51  /var/lib/puppet/ssl/ca/ca_pub.pem
In the vast majority of cases, these certificates will have been generated at roughly the same time (usually the same day, almost certainly the same week). That means that they also expire at roughly the same time. But be warned, the content of the two certs is completely different, as the checksums above show!
So you might have gone through the steps from my previous post, renewed your CA certs, managed to get everything working just nicely, performed your first Puppet run after the big scary change and been faced with this redness:
Warning: Certificate 'puppetca.example.com' will expire on 2017-11-22T22:08:37UTC
Believe it or not, this is referring to the host cert of your Puppet CA server. Check out the expiry of it if you don’t believe me:
# openssl x509 -noout -text -in $(puppet config print hostcert) | grep -A2 Valid
        Validity
            Not Before: Nov 22 22:08:37 2012 GMT
            Not After : Nov 22 22:08:37 2017 GMT
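If you’d rather have a script tell you than squint at dates, openssl’s -checkend option does the arithmetic for you. A minimal sketch follows. The helper name and the 30 day threshold are my own inventions, not anything Puppet ships.

```shell
# Hypothetical helper built on openssl's -checkend option.
# Succeeds (exit 0) if the certificate expires within the given number of days.
# Note: an unreadable file also counts as "expiring", which errs on the safe side.
cert_expires_within() {
  cert=$1
  days=$2
  # -checkend exits 0 when the cert is still valid that many seconds from now,
  # so the result is negated to mean "expiring soon".
  ! openssl x509 -noout -checkend $(( days * 86400 )) -in "$cert" >/dev/null 2>&1
}

# Example usage on a Puppet CA server (30 days is an arbitrary threshold):
#   if cert_expires_within "$(puppet config print hostcert)" 30; then
#     echo "host cert needs renewing soon"
#   fi
```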
Fine, sounds legit. What’s the impact?
Honestly? Not much. It means that once the “Not After” date has expired, the client cert used by your Puppet agent won’t be accepted by the CA any more. That means no catalogues. Boo.
On nodes that aren’t important, fixing this is a cinch. But not only is it a little bit embarrassing when your most important Puppet server can’t be an agent of itself, it comes with some additional complications. You can’t just nuke $::ssldir and walk away like you usually would.
What’s the big deal, then?
Usually when you want to regenerate a Puppet client’s certs you’ll do this:
root@client# rm -rf $(puppet config print ssldir)

root@puppetca# puppet cert clean client

root@client# puppet agent -t




This is a bad idea to run on your Puppet CA. Like, a really bad idea.
It’ll delete the entirety of your Puppet CA’s SSL directory. That means your CA certs, CA keys and all of your estate’s signed certificates gone, up in smoke. (Have I mentioned that it’s a good idea to back this up?)
It’s not so bad if you’ve already deleted all of your SSL data, but if you were to run puppet cert clean puppetca when you still had all of your certs, you’d revoke both your host certs and your CA certs. They’d get appended to your CA’s CRL and none of the nodes in your estate would be able to check in with your Puppet masters any more. This is because your Puppet CA would not believe its own certs are valid any more.
So yeah, don’t do any of that.
Instead, we can perform a bit of surgery on your Puppet CA’s SSL directory and only remove the keys and certs that we mean to. We can even make it so that we can revert things if we mess it up, somehow! Let’s go.
Alternative DNS Names
First up, let’s see what your CA thinks about its own certs:
root@puppetca# puppet cert list --all | grep $(uname -n)
+ "puppetca.example.com"                                         (SHA256) A8:7A:68:3E:E9:61:CB:7A:FF:E9:42:48:FC:57:04:F8:BF:2E:05:8C:4D:EA:AF:F7:E5:3C:D6:5B:EF:8C:E1:6C (alt names: "DNS:puppet", "DNS:puppet-1.example.com", "DNS:puppet.example.com")
The alt names section tells me that the cert that’s already issued has X.509 Subject Alternative Name extensions baked in. This is a list of alternative DNS names that are also valid for the cert that’s been signed. It’s used so that your clients could hit, for example, a friendly CNAME DNS record that references your Puppet CA instead of the CA’s FQDN itself.
We’re gonna make sure that these are configured on your Puppet agent, too. Just so that we’re replacing like-with-like.
root@puppetca# puppet config print dns_alt_names

root@puppetca#
If, like above, you don’t have any output, you’re missing a bit of config. dns_alt_names is configured in /etc/puppet/puppet.conf under the [main] section. So add something appropriate like this:
dns_alt_names=puppet,puppet-1.example.com,puppet.example.com
Re-running puppet config print dns_alt_names should have more success now.
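To save copying the names across by hand, the alt names from the puppet cert list output can be turned into that comma-separated form with a short pipeline. A sketch, assuming the output looks like the line shown earlier. The function name is made up.

```shell
# Hypothetical helper: reads a line of `puppet cert list` output on stdin and
# prints its alt names in the comma-separated form that dns_alt_names expects.
alt_names_to_config() {
  # Pull out each DNS:<name> token, drop the DNS: prefix, join with commas.
  grep -o 'DNS:[^",)]*' | sed 's/^DNS://' | paste -sd, -
}

echo '(alt names: "DNS:puppet", "DNS:puppet-1.example.com", "DNS:puppet.example.com")' \
  | alt_names_to_config
```

On the Puppet CA itself, the input would come from puppet cert list --all | grep $(uname -n) instead of the echo.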
Back it up, back it up!
Make sure and CYA by backing up everything you touch. And maybe even some stuff you don’t. I went a bit overboard:
# create backup directories
mkdir -p $(puppet config print ssldir)/backup/{host,ca,signed}

# back up our ca certs and crls (local and ca) because we're paranoid
cp $(puppet config print cacert) $(puppet config print ssldir)/backup/ca/cacert.pem
cp $(puppet config print cacrl) $(puppet config print ssldir)/backup/ca/cacrl.pem
cp $(puppet config print localcacert) $(puppet config print ssldir)/backup/ca/localcacert.pem
cp $(puppet config print hostcrl) $(puppet config print ssldir)/backup/ca/hostcrl.pem

# back up our agent-side certs
cp $(puppet config print hostcert) $(puppet config print ssldir)/backup/host/hostcert.pem
cp $(puppet config print hostprivkey) $(puppet config print ssldir)/backup/host/hostprivkey.pem
cp $(puppet config print hostpubkey) $(puppet config print ssldir)/backup/host/hostpubkey.pem

# back up our ca-side signed cert
cp $(puppet config print signeddir)/$(hostname -f).pem $(puppet config print ssldir)/backup/signed/signed.pem

# make sure they're there
find $(puppet config print ssldir)/backup | xargs ls -ld
Now you should have a pretty list of files backed up in $(puppet config print ssldir)/backup, should the worst happen.
Precision cut
Now for the action. For the avoidance of doubt, all of this is performed on your Puppet CA server.
# 1. stop puppet agent
service puppet stop 

# 2. move the host certs that Puppet Agent knows about
mv $(puppet config print hostcert) /tmp/hostcert.pem
mv $(puppet config print hostprivkey) /tmp/hostprivkey.pem
mv $(puppet config print hostpubkey) /tmp/hostpubkey.pem

# 3. move the signed version of the ca's host cert
mv $(puppet config print signeddir)/$(hostname -f).pem /tmp

# 4. regenerate our host ssl keypair, create csr and submit to ca
puppet agent -t

# 5. get the ca to sign our csr
puppet cert sign $(hostname -f)

# 6. get the agent to retrieve the signed cert and perform a run
puppet agent -t

# 7. restart puppet agent
service puppet start
All going well, you should have new host certs on your Puppet CA server, with the actual CA certs themselves being left well alone!
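If it doesn’t go well, the backups from earlier can be copied back into place. Here’s a sketch of what that revert could look like. It’s a hypothetical helper rather than anything Puppet provides, and it assumes the default layout under the SSL directory (including ca/signed for the signed certs), so check the paths against puppet config print on your own server first.

```shell
# Hypothetical rollback helper, not a Puppet command. It copies the backups
# made earlier back into place. The layout under ssldir is an assumption
# based on the default paths shown in the md5sum output above; ca/signed
# is assumed to be the value of signeddir.
restore_host_certs() {
  ssldir=$1
  certname=$2
  backup="$ssldir/backup"
  # Stop at the first failed copy so a partial restore is obvious.
  cp "$backup/host/hostcert.pem"    "$ssldir/certs/$certname.pem" &&
  cp "$backup/host/hostprivkey.pem" "$ssldir/private_keys/$certname.pem" &&
  cp "$backup/host/hostpubkey.pem"  "$ssldir/public_keys/$certname.pem" &&
  cp "$backup/signed/signed.pem"    "$ssldir/ca/signed/$certname.pem"
}

# Example usage on the Puppet CA, with the agent stopped:
#   restore_host_certs "$(puppet config print ssldir)" "$(hostname -f)"
```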
Now, you can check the validity of your new host cert.
# openssl x509 -noout -text -in $(puppet config print hostcert) | grep -A2 Valid
        Validity
            Not Before: Oct 18 11:55:40 2017 GMT
            Not After : Oct 18 11:55:40 2022 GMT
Compare it to the validity dates of your CA cert if you want to. They’ll differ:
# openssl x509 -noout -text -in $(puppet config print cacert) | grep -A2 Valid
        Validity
            Not Before: Oct 18 10:22:30 2017 GMT
            Not After : Oct 18 10:22:30 2022 GMT
Now we can hopefully put this sorry affair behind us. And get on with some real work wrangling some clouds, eh?
See you in five years!




Puppet CA Certificate Expiry
drew — Wed, 25 Oct 2017 00:00:00 GMT
Hard to believe it, but it’s coming up on five years since our company stood up its first Puppet master. No cake here. Instead, all we get is a lousy old CA certificate expiry.
At first, I inhaled sharply and mentally prepared myself for the impending chaos of our client certificates becoming invalid.
Instead, I was pleasantly surprised when I stumbled across puppetlabs/certregen which promised to make the whole sorry affair disappear painlessly. It’s even Puppet 3 compatible!
The secret-sauce is all around the use of a clever Puppet Face to provide the puppet certregen command on your CA server. This simplifies the regeneration of your master’s CA cert to one command, creating a replacement from the exact same key used to create your expiring cert. The new cert’s effectively a drop-in replacement!
The other piece of magic is a simple, super-portable class that manages the contents of $::localcacert on your client nodes to be the same as whatever’s in your master’s $settings::cacert file. It does this by calling the file() function on the Puppet master on catalog compilation, grabbing what the master believes its CA cert to be. So, we update the CA cert on the master and then the nodes take care of themselves!
Of course, because the replacement CA certificate is generated from the same keys as your expiring CA cert, this tool is not suitable for use when your CA’s keys have been compromised.
So how does it work in practice? The module’s README is pretty hot at explaining everything, but for the sake of completeness, here’s the process I used. It assumes that your CA cert hasn’t expired yet.
Install the module
We use Puppetfile, so I added the following to ours:
mod 'puppetlabs-certregen', '0.2.0'
Get the expiring cert’s serial
Running puppet certregen ca won’t do anything off the bat — it’ll error and spit out something like so:
$ sudo puppet certregen ca
Error: The serial number of the CA certificate to rotate must be provided. If you are sure that you want to rotate the CA certificate, rerun this command with --ca_serial 01
Actually regenerate the CA certificate
When you’re for-reals ready to regenerate your Puppet master’s CA cert, re-run the above command providing the serial number you gleaned previously:
$ sudo puppet certregen ca --ca_serial 01
Notice: Backing up current CA certificate to /var/lib/puppet/ssl/ca/ca_crt.1508247754.pem
Notice: Signed certificate request for ca
CA expiration is now 2022-10-16 13:42:34 UTC
CRL next update is now 2022-10-16 13:42:34 UTC
As you can see, this backs up the current (expiring) CA cert, just in case. Then it spits out the new expiration date of its replacement. Great!
To be clear — your CA server now has a new certificate, but all of the nodes in your Puppet deployment still have the old, expiring version. Let’s fix that!
Enforce node management of CA cert
Now we’ve gotta get some mechanism in place where your nodes can get the updated CA certificate. This is where the certregen::client class comes into play. It defines a File resource that manages each node’s CA certificate (whose path is determined by consulting the $::localcacert variable).
We use Roles & Profiles, so I added the following to our base profile:
include certregen::client
The important thing is that all of your nodes get this into their catalogs somehow, so you can use the main manifest or whatever.
This change needs to make its way into all of the Puppet directory environments that you care about. In my case, I made the change to our production environment and retroactively applied it to a handful of static environments that we care about. I consider pretty much every other environment as disposable.
Update compile masters
Got any non-CA, compile-only Puppet masters? Get them to complete a Puppet run against your CA as a priority. This’ll update the CA used by these masters to the new one, along with the associated CRL.
drew@pmaster-1 $ sudo puppet agent -t --server puppetca.example.com
...
Notice: /Stage[main]/Certregen::Client/File[/var/lib/puppet/ssl/certs/ca.pem]/content: content changed '{md5}35d987adf79eaaed26780cd0e69f7d52' to '{md5}ed6cb1e53b7ad78ac41629fb8e3e2c38'
Notice: /Stage[main]/Certregen::Client/File[/var/lib/puppet/ssl/crl.pem]/content: content changed '{md5}910a663110f54bf974a26b750c3a8a90' to '{md5}a774a54175b68251b11d0b2e264eec37'
...

Notice: Finished catalog run in 6.42 seconds
If you don’t do this first, any nodes that compile catalogs using these Puppet masters will get the CA cert that the compile master believes to be current — in this case, the old, expiring one! And that’s no good!
Don’t have any compile masters? Then you’re good.
Update CA cert on client nodes
As the nodes across your deployment check in with your Puppet masters, you’ll notice their CA certificate and associated CRL update to the new one:
drew@host1 $ sudo grep 'content changed' /var/log/syslog | grep 'Certregen::Client'
Oct 17 04:06:59 host1 puppet-agent[6556]: (/Stage[main]/Certregen::Client/File[/var/lib/puppet/ssl/certs/ca.pem]/content) content changed '{md5}35d987adf79eaaed26780cd0e69f7d52' to '{md5}ed6cb1e53b7ad78ac41629fb8e3e2c38'
Oct 17 04:06:59 host1 puppet-agent[6556]: (/Stage[main]/Certregen::Client/File[/var/lib/puppet/ssl/crl.pem]/content) content changed '{md5}910a663110f54bf974a26b750c3a8a90' to '{md5}a774a54175b68251b11d0b2e264eec37'
If in doubt, you can check that the CA certificate serials of your Puppet CA and your client nodes match!
root@puppetca # openssl x509 -in $(puppet config print cacert) -noout -text | grep Serial
        Serial Number: 7 (0x7)

root@node1 # openssl x509 -in $(puppet config print localcacert) -noout -text | grep Serial
        Serial Number: 7 (0x7)
And whew, you’ve just avoided what could have been a major problem. Just like that. Thanks, Puppet!
Things to remember (or is it “things I found out”?)
After including the ::certregen::client class, the content of each node’s $::localcacert and $::hostcrl files will be managed by Puppet, and kept up to date with the most recent versions located on your Puppet masters.
That means that if a certificate is revoked on your CA, an entry will be added to its CRL. And that new version of the CRL will subsequently be pushed out across all of the nodes with the ::certregen::client class included in their manifests.
More importantly, consider whether ::certregen::client is going to affect your Puppet CA’s own catalog. Consider where your Puppet CA gets its catalog from.
Is it itself? Then you’re good.
Is it a different master or a load balancer containing a number of compile masters? If so, things could get messy.
If your Puppet CA hits another master for its catalog, it may receive an older version of the CRL than it has itself. In reality, this just means that $::cacert and $::localcacert will differ on your Puppet CA (as will $::cacrl and $::hostcrl), which may add some confusion if you start inspecting certs.
For me, I had a rogue $::cacert living on one of my compile masters, which really added some spice into the mix. It all worked out in the end, though.
To avoid any of these potential issues, I’d make sure the value of server in /etc/puppet/puppet.conf for your Puppet CA is itself. Probably worth making sure your compile masters point at your Puppet CA for their catalogs, too.




Connecting the dots with scp
drew — Fri, 18 Nov 2016 00:00:00 GMT
Today, one of my colleagues was having trouble copying some files between Linux boxes. As a result, I learnt something about scp that I didn’t know before — by default, using scp between two remote hosts from a third party host means that SSH connections “leapfrog” through the box where the source file is located.
Let me set the scene — imagine three hosts:
A host called bastion that scp copies are initiated from.
A host called sourcebox that has a file you want to copy (/tmp/source_file).
A host called destbox that has a place you want to copy the file to (/tmp/dest_file).
Imagine that our private key (~/.ssh/id_rsa) is physically present on bastion, but not on either sourcebox or destbox.
Our corresponding public key (~/.ssh/id_rsa.pub) is configured in the ~/.ssh/authorized_keys file of all three hosts. We can therefore authenticate to each of these three boxes via ssh by using our private key.
Our /tmp/source_file exists on sourcebox and is accessible from bastion.
drew@bastion:~$ ssh -i ~/.ssh/id_rsa sourcebox 'cat /tmp/source_file'
content of my file
Likewise, we can tell that there’s no file called /tmp/dest_file on destbox.
drew@bastion:~$ ssh -i ~/.ssh/id_rsa destbox 'ls -ld /tmp/dest_file'
ls: cannot access /tmp/dest_file: No such file or directory
Okay, let’s try and copy our file by using scp from bastion:
drew@bastion:~$ scp -i ~/.ssh/id_rsa sourcebox:/tmp/source_file destbox:/tmp/dest_file
Host key verification failed.
lost connection
Huh? But… ssh worked before between bastion and both of our other hosts... What gives? Both of these hosts are located in the ~/.ssh/known_hosts file of bastion, too…
drew@bastion:~$ ssh-keygen -F sourcebox | grep found
# Host sourcebox found: line 3 type ECDSA

drew@bastion:~$ ssh-keygen -F destbox | grep found
# Host destbox found: line 4 type ECDSA
Let’s dig a bit deeper with scp's -vvv option… (I’ve omitted a lot of the noise with ellipses to make it easier to read)
drew@bastion:~$ scp -vvv -i ~/.ssh/id_rsa sourcebox:/tmp/source_file destbox:/tmp/dest_file
...
debug1: Host 'sourcebox' is known and matches the ECDSA host key.
...
debug1: Authentication succeeded (publickey).
Authenticated to sourcebox ([192.168.10.12]:22).
...
debug1: Connecting to destbox [192.168.10.13] port 22.
...
debug1: Sending command: scp -v /tmp/source_file destbox:/tmp/dest_file
...
Host key verification failed.
lost connection
So, it seems like when we invoke scp from bastion, we first connect to sourcebox and then initiate a connection from there to destbox.
As it stands, sourcebox doesn’t know about destbox, so we can fix that by adding destbox's host keys…
drew@sourcebox:~$ ssh-keyscan destbox >> ~/.ssh/known_hosts
# destbox SSH-2.0-OpenSSH_6.6p1 Ubuntu-2ubuntu1
# destbox SSH-2.0-OpenSSH_6.6p1 Ubuntu-2ubuntu1

drew@sourcebox:~$ ssh-keygen -F destbox | grep found
# Host destbox found: line 1 type RSA
# Host destbox found: line 2 type ECDSA
Let’s try again and see where we stand…
drew@bastion:~$ scp -i ~/.ssh/id_rsa sourcebox:/tmp/source_file destbox:/tmp/dest_file
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,password).
lost connection
Our host key problem is now resolved. Our sourcebox host now successfully validates the host key of destbox. But we have another problem…
Now we’re getting authentication failure messages. Basically, the private key that we’re using from bastion isn’t present on sourcebox, so we’re effectively trying to authenticate to destbox from sourcebox without a key or a password, an attempt which is swiftly rebuffed by destbox.
You can fix this from bastion by using a feature of ssh called agent forwarding. First, you load your private key into ssh-agent and tell scp to forward it by using -o ForwardAgent=yes...
drew@bastion:~$ eval $(ssh-agent)
Agent pid 3117

drew@bastion:~$ ssh-add
Identity added: /home/drew/.ssh/id_rsa (/home/drew/.ssh/id_rsa)

drew@bastion:~$ scp -o ForwardAgent=yes sourcebox:/tmp/source_file destbox:/tmp/dest_file

drew@bastion:~$ ssh destbox 'cat /tmp/dest_file'
content of my file
Another way around this without using ssh-agent or having to configure anything on sourcebox would be to use the -3 option that comes with scp. From scp(1):
-3 Copies between two remote hosts are transferred through the local host. Without this option the data is copied directly between the two remote hosts. Note that this option disables the progress meter.
This would change the connections that are made by scp from this:
bastion -> sourcebox    # initiate connection to sourcebox
sourcebox -> destbox    # copy from sourcebox to destbox
to this:
bastion -> sourcebox    # copy from sourcebox to bastion
bastion -> destbox      # copy from bastion to destbox
Both of these are a bit more convenient than manually setting up SSH tunnels or anything like that. And both options help to reduce the number of places you have to plonk your private key, which is always good.
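For completeness, here’s roughly what the -3 variant looks like when wrapped up for reuse. The wrapper name and the DRY_RUN switch are my own additions for illustration. The only real scp features in play are -3 and -i.

```shell
# Hypothetical wrapper around scp's -3 option. With DRY_RUN set it only
# prints the command it would run, which is handy for checking the arguments.
scp_via_local() {
  src=$1
  dst=$2
  # Relay the copy through the local host, authenticating with the local key.
  set -- scp -3 -i "$HOME/.ssh/id_rsa" "$src" "$dst"
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Print the command for our three-host example without running it.
DRY_RUN=1 scp_via_local sourcebox:/tmp/source_file destbox:/tmp/dest_file
```

Because both connections now start from bastion, neither the host keys nor the private key need to exist on sourcebox.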
I’ve been using these tools for a long time now, but for some reason in my head I always assumed that connections were made from the place that the command was invoked. TIL, I guess.




Deleting EBS Volumes While Snapshotting
drew — Tue, 26 Jul 2016 00:00:00 GMT
I’ve been working with AWS quite heavily recently. A lot of the time, I find myself thinking, “I wonder what happens if I do this…”
Sometimes, I can find the answer in Amazon’s fairly extensive documentation library. But sometimes, I want to perform events in a certain order where the consequences aren’t described clearly.
For example, “What happens if I delete an EBS volume while a snapshot operation is taking place?”. Advice that I could find implied that snapshots were asynchronous operations and that volumes were safe to use after the snapshot was initiated (incurring a performance penalty, of course). However, nothing was said about deletions.
I tested it out. Turns out that if you delete a volume while a snapshot’s being taken, the volume remains in a Deleting state until the snapshot’s completed. Once the snapshot is finished, the volume is fully deleted.
This makes sense. I’m glad it happens this way and we don’t end up with failed snapshots or anything silly like that. +1 for sensibility, AWS.




Shimming a disk into a VxVM disk group
drew — Fri, 25 Sep 2015 00:00:00 GMT
We had a problem with an Oracle database on a four node Veritas Cluster Server (VCS) cluster recently. The symptom of our problem was that the database had been brought down with VCS and just wouldn’t come back up again. The problem was ultimately traced down to the Veritas Volume Manager (VxVM) disk group level.
Initial analysis
The original diagnosis was that a SAN disk (LUN) had been accidentally unmapped from the host and put back hurriedly, resulting in it disappearing from the host. Volume migrations on our IBM XIV arrays have caused things like this in the past, and regular SCSI bus scans don’t result in the LUN coming back in.
So, we thought we’d take an outage, reboot and see if the LUN would come back in. It’d been up for a couple of years and did have a lot of SAN storage presented to it (subject to relatively frequent change). Maybe the host’s SCSI subsystem was being funky. We could patch while we were at it, too.
One reboot later and still no dice — the missing LUN was nowhere to be found. I had a closer look at the disk group’s configuration. Our missing LUN was only used in one of the VxVM volumes. It was a 256GB sub-disk tacked neatly onto the end of a healthy 1.4TB sub-disk.
There was only one data plex, so no mirroring in place. There’d be no way to recover any data that was on this disk if the disk group was deemed to be borked.
So, to the immediate question: “Where is the disk?”
If we could find it, reattaching it and recovering the volume would be a simple enough task. Through sheer luck, I found a text file from a couple of months back listing every LUN visible from the host. Searching for the DA name `xiv1_1234` threw up the LUN’s human-friendly name (as configured on the array) along with its size and serial number.
Next stop was the array. I queried the array for a LUN of the same name. Nothing. Same size? Nothing. Serial number? Nothing.
There were only a few possibilities at this point:
The LUN has been renamed and unmapped from the host.
The LUN has been deleted.
At this stage, I turned to our storage guys. I gave them all of the details I had gathered. Turns out that the LUN had been deleted erroneously as part of a decommission of a completely different database some months earlier! And as it was a development database, there were no full backups to speak of.
However, I did think it was strange that our database hadn’t noticed that one of its LUNs had disappeared! It had been running merrily for the past four months, oblivious. The cluster software had shut it down cleanly and deported its disks without a problem. It was only when we tried to bring its storage back in again that we encountered our problem.
I speculated that since the database had been running fine before it was shut down, it must not have had any of its data located on the deleted LUN. If the database had tried to issue any read or write operations to the deleted LUN, there’d have been I/O errors everywhere.
VxVM usually writes sequentially to its volumes. Our deleted LUN had been right at the end of our impacted volume. A colleague's `df` output increased my confidence in this, although it was a pretty close call!
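As a rough illustration of that reasoning: on a simple concatenated plex, the volume's address space runs through the first sub-disk before it reaches the second, so comparing the filesystem's used space with the size of the first sub-disk gives a hint. It is only a hint, since a filesystem is free to place blocks wherever it likes. The function name and the figures below are hypothetical.

```shell
# Heuristic only: compares used space (as reported by `df -k`) against the
# size of the first sub-disk in a concatenated volume. Both values in KiB.
# The function name and the numbers in the usage line are made up.
likely_untouched() (
    used_kib=$1
    first_subdisk_kib=$2
    if [ "$used_kib" -lt "$first_subdisk_kib" ]; then
        echo "used space fits within the first sub-disk"
    else
        echo "data may extend onto the second sub-disk"
        exit 1
    fi
)

# Example: roughly 1.3 TiB used against a 1.4 TiB first sub-disk
# likely_untouched 1395864371 1503238553
```

A passing check does not prove the second sub-disk was never written to; it just makes the shim idea less reckless.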
Given all of this information, I wondered: what if I grabbed a blank LUN of the same size and tried to fool VxVM into importing the disk group by using it as a shim? Worth a shot, right? So, I had the storage guys present a new LUN and take a snapshot of the surviving disk (in case I royally borked stuff up) and got to work.
You’ve gotta destroy to rebuild…
Because our volume had only one plex, it would be impossible for VxVM to rebuild any data that it thought was on our failed sub-disk. Therefore, standard disk recovery commands such as `vxrecover` wouldn’t help us.
However, Symantec describe a method of destroying and recreating a disk group in place for the purposes of changing its unique disk group ID (DGID). We could use this method to recreate our disk group, too! Of course, we’d be making a few adjustments…
VxVM stores a lot of metadata to describe the layout of its disk groups and volumes. For example, the sub-disk configuration contains the path to the block device of the multipath device that it’s stored on, which Veritas Dynamic Multipathing (VxDMP) then uses to distribute I/O across the storage paths. The metadata completely represents these data structures. If you have all of the metadata, it’s possible to recreate a disk group that’s been destroyed, as long as the data on the disks is undamaged.
Because I was able to partially import the disk group, I was able to dump the configuration of that disk group (we had one good LUN, remember?).
# vxprint -g datadb_DEV -hmvps >> /var/tmp/datadb_DEV.layout
Even if I hadn’t been able to do this, there’d be a good chance that `vxconfigbackupd` would have stored some backup configuration copies at some point in `/etc/vx/cbr/bk`.
My troublesome volume was made up of two sub-disks. What a lucky situation — since I had one good sub-disk and one bad sub-disk, I was able to compare the metadata information and figure out what I had to change to make my bad sub-disk look good again…
Great! I know what I need to add, and I know what I need to change to make a good sub-disk. I just needed a few more pieces of information, first.
One thing that I didn’t have available to me already was the `dev` string for my newly assigned shim disk. I needed the unique `guid` for the disk, too. These were easy to find, but I needed to temporarily configure the disk to find them out:
I also needed what VxVM believed to be the correct associations between Disk Access (DA) or Dynamic Multipath (DMP) names — the multipath block device — and DM names — a name assigned by humans to each disk.
Finally, I needed the public and private region offsets for each LUN in the disk group. These are required so that when we reinitialise our disks, we can tell VxVM where its disk header (private region) is, and where the data starts (the public region). Without these exact values, we run the risk of overwriting areas that used to contain data with new headers, instantly corrupting the disk. For good.
These values were simple to get from the healthy LUN (`vxdisk list`). However, for the failed LUN, I had to go through a bunch of old configuration copy backups located in `/etc/vx/cbr/bk` and hope that I’d find config for my failed disk. It was my lucky day, again.
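A minimal sketch of pulling those offsets out of `vxdisk list` output follows. It assumes the output carries `public:` and `private:` lines made up of `key=value` fields (for example `offset=` and `len=`); that format is an assumption here and should be checked against your own VxVM version. The function name is hypothetical.

```shell
# Reads `vxdisk list <disk>` output on stdin and prints the public and
# private region offsets and lengths as name=value lines.
# Assumed input format (verify on your system):
#   public:    slice=3 offset=65792 len=... disk_offset=0
#   private:   slice=3 offset=256 len=65536 disk_offset=0
region_offsets() {
    awk '
        $1 == "public:" || $1 == "private:" {
            region = $1
            sub(/:$/, "", region)
            for (i = 2; i <= NF; i++) {
                n = split($i, kv, "=")
                if (n == 2 && (kv[1] == "offset" || kv[1] == "len"))
                    printf "%s_%s=%s\n", region, kv[1], kv[2]
            }
        }
    '
}

# Usage: vxdisk list <da_name> | region_offsets
```

Keeping the values as plain `name=value` lines makes them easy to paste into notes before doing anything destructive.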
With my information in hand, I edited the following fields within my `/var/tmp/datadb_DEV.layout` file to match my newly assigned LUN. Here are the fields that I changed or added to my file:
I now had everything I needed to destroy and recreate my disk group.
Put it all together and what do you get?
Ensure the disk group is deported
Reinitialise all disks with the correct public and private region offsets
Use our updated metadata to recreate our disk group
Activate our VxVM volumes and `fsck` them
Mount our filesystems
???
Profit!
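The exact commands aren't reproduced in this post, so what follows is only a sketch of how those steps might map onto VxVM commands. It prints a plan rather than running anything. The DM and DA names, offsets, volume name and mount point are all placeholders supplied by the caller, and the `privoffset=`/`puboffset=` attributes to `vxdisk init` are my assumption about how the region offsets would be passed. Run for real with wrong values, a sequence like this destroys data.

```shell
# Prints (never executes) one plausible command sequence for the steps above.
# Arguments: disk group, layout file, volume, then one spec per disk in the
# form dm:da:privoffset:puboffset. All values are caller-supplied placeholders.
print_recreate_plan() (
    dg=$1 layout=$2 vol=$3
    shift 3
    echo "vxdg deport $dg"
    for spec in "$@"; do
        rest=${spec#*:}; da=${rest%%:*}
        rest=${rest#*:}; priv=${rest%%:*}; pub=${rest#*:}
        echo "vxdisk -f init $da privoffset=$priv puboffset=$pub"
    done
    verb="init $dg"                 # first disk creates the group
    for spec in "$@"; do
        dm=${spec%%:*}; rest=${spec#*:}; da=${rest%%:*}
        echo "vxdg $verb $dm=$da"
        verb="-g $dg adddisk"       # remaining disks are added to it
    done
    echo "vxmake -g $dg -d $layout"
    echo "vxvol -g $dg init active $vol"
    echo "fsck /dev/vx/dsk/$dg/$vol"
    echo "mount /dev/vx/dsk/$dg/$vol /mnt/$vol"
)

# Example with invented names and offsets:
# print_recreate_plan datadb_DEV /var/tmp/datadb_DEV.layout datavol \
#     disk01:xiv1_0001:256:65792 disk02:xiv1_5678:256:65792
```

Printing the plan first gives you something to review line by line, which seems wise for an operation with no undo.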
After this, the troublesome volume was marked as ENABLED. The filesystem checks returned okay. Everything mounted successfully. I asked a couple of our database guys to have a look and try to open the database instance.
The database opened! So, the shim worked! VxVM had no idea that the disk had been swapped out from under it!
I have to concede that in this situation, I was extremely lucky. What were the chances of an unused (but configured) disk being deleted from an array? And what were the chances of the database not trying to write to that disk for a number of months afterwards? Most amazingly, what were the chances of being able to pull off some brazen disk butchery without a hitch?
Time for a beer!




Using dlpiping to test VCS heartbeats
drew — Wed, 16 Sep 2015 00:00:00 GMT
Imagine you have a couple of hosts with Veritas Cluster Server (VCS) installed, configured with the recommended two heartbeat interfaces. Only, you’re not sure whether they’re able to communicate or not!
In the normal TCP/IP world, you’d just ping the hosts via the relevant network interfaces. However, VCS heartbeats run over the Low Latency Transport (LLT) protocol, which operates at layer 2 of the OSI model. Therefore, its interfaces don’t require IP addresses to be configured on them. Enter dlpiping! A utility that allows us to test connectivity using the LLT protocol!
The VRTSllt package installs the dlpiping command in /opt/VRTSllt, so it requires that you’ve gone through the VCS installation process.
First, decide the host that you want to test connectivity to. Identify one of your heartbeat interfaces. In my case, we’re using eth5. Grab the hardware address of the interface. Let’s imagine that’s 01:23:45:67:89:ab in our case.
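One way to grab that hardware address on a Linux host is to read it from sysfs. This is a small sketch; the function name is hypothetical, and its optional second argument (an alternative sysfs root) exists only so the function can be exercised without a real interface.

```shell
# Prints the hardware (MAC) address of a network interface by reading
# /sys/class/net/<iface>/address. The optional second argument overrides
# the sysfs root and is a testing convenience, not something you need.
hwaddr() {
    cat "${2:-/sys/class/net}/$1/address"
}

# Usage: hwaddr eth5
```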
Now, execute dlpiping in server mode.
root@host1# /opt/VRTSllt/dlpiping -vs eth5
using packet size = 78
dlpiping: binding ping SAP 0xf00e
Now, on the host you want to test connectivity from, send a request!
root@host2# /opt/VRTSllt/dlpiping -vc eth5 01:23:45:67:89:ab
using packet size = 78
dlpiping: sent a request to 01:23:45:67:89:ab
dlpiping: received a packet from 01:23:45:67:89:ab
01:23:45:67:89:ab is alive
Hooray! On your first host, you’ll see a response pop up, too.
dlpiping: sent a response
This is an invaluable tool in troubleshooting those times when your cluster just won’t come up, no matter what you do! This will give you reassurance that at least your network is working. Or alternatively, confirmation that you’ve got lower level issues!
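With two heartbeat interfaces to check, a small wrapper saves some typing. This is a sketch under a few assumptions: a dlpiping server is already running on the peer's matching interface, success is detected by the "is alive" line shown above rather than by exit status, and the `DLPIPING` override variable and function name are inventions for illustration.

```shell
# Runs the dlpiping client for each "interface=mac" pair and reports whether
# the "is alive" line appears in its output. Exits non-zero if any pair gets
# no reply. How dlpiping behaves on timeout is not handled here.
check_heartbeats() (
    dlpiping=${DLPIPING:-/opt/VRTSllt/dlpiping}
    rc=0
    for pair in "$@"; do
        iface=${pair%%=*}
        mac=${pair#*=}
        if "$dlpiping" -vc "$iface" "$mac" 2>&1 | grep -q "is alive"; then
            echo "$iface: OK"
        else
            echo "$iface: no reply from $mac"
            rc=1
        fi
    done
    exit "$rc"
)

# Usage (second interface and address are made up):
# check_heartbeats eth5=01:23:45:67:89:ab eth6=01:23:45:67:89:ac
```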




Capturing CDP Packets with tcpdump
drew — Thu, 03 Sep 2015 00:00:00 GMT
An oldie, but a goodie.
Using tcpdump, you can filter for the first CDP packet received by a given network interface. This can contain super-useful information for troubleshooting network issues.
tcpdump -nn -v -i eth0 -s 1500 -c 1 'ether[20:2] == 0x2000'
Big ones for me are Native VLAN ID and Port-ID, both invaluable when figuring out why you can’t hit that elusive default gateway!
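If the filter looks like magic: `ether[20:2]` means "two bytes starting at byte 20 of the Ethernet frame". CDP frames are 802.3 frames with an LLC/SNAP header, so after the 14-byte Ethernet header come three LLC bytes, the three-byte Cisco OUI and then the two-byte protocol ID, which is 0x2000 for CDP. The sketch below rebuilds that offset arithmetic on a hand-written frame header; the helper name and the source MAC are made up.

```shell
# Returns LEN bytes starting at byte OFFSET from a frame given as a hex
# string, mirroring what tcpdump's ether[OFFSET:LEN] accessor looks at.
ether_bytes() (
    hex=$(printf '%s' "$1" | tr -d ' :')
    start=$(( $2 * 2 + 1 ))
    end=$(( start + $3 * 2 - 1 ))
    printf '%s\n' "$hex" | cut -c"$start"-"$end"
)

# Start of a CDP frame: destination MAC 01:00:0c:cc:cc:cc, an example source
# MAC, 802.3 length, LLC (aa aa 03), SNAP OUI 00 00 0c, protocol ID 20 00.
frame="01000ccccccc 0123456789ab 0040 aaaa03 00000c 2000"
ether_bytes "$frame" 20 2    # prints 2000
```

Other Cisco protocols share the same destination MAC, which is why the filter matches on the protocol ID rather than the address.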




When VxVM snapshots fail
drew — Wed, 12 Aug 2015 00:00:00 GMT
So, I was casually using the snapshot function of Veritas Volume Manager (VxVM) the other day and learned something new. Let me tell you about it!
If you don’t know about them, VxVM snapshots allow you to create a temporary mirror of a volume’s primary data plex, the idea being that it’s going to be split off into a separate volume later on. This is useful when used in combination with vxdg split, effectively allowing you to clone a disk group with very little effort (as long as you have the disk space!).
You need a Storage Foundation Enterprise license to use the vxdg split functionality. But in practice, the workflow looks like:
Now, you have a new disk group containing a copy of your source volume. The disk group has a new disk group ID too, so there’s no fear of clashing. Both can reside happily on the same host, imported simultaneously.
However, I came across a gotcha when performing the snapshot operation:
No amount of searching Symantec’s documentation could shed light on this. The error message was obscure and didn’t seem to point me in any particular direction.
Thankfully, one of my colleagues (thanks, Susan W.!) pointed out the following little known piece of information regarding VxVM volumes:
Disk group and volume names can only contain up to 31 characters.
Huh. Let’s see…
Argh. My _new suffix had put me over the line! After changing my target volume name to something with fewer characters, everything worked as expected.
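A quick pre-flight check would have saved me the head-scratching. This sketch just counts characters against the 31-character limit mentioned above; the function name is hypothetical.

```shell
# Checks a proposed disk group or volume name against the 31-character
# limit described above. Returns non-zero if the name is too long.
check_vx_name() {
    if [ "${#1}" -gt 31 ]; then
        echo "$1: ${#1} characters, over the 31-character limit"
        return 1
    fi
    echo "$1: ${#1} characters, ok"
}

# Usage: check_vx_name "<source_volume_name>_new"
```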




Error in vxdg configuration copies
drew — Tue, 11 Aug 2015 00:00:00 GMT
A colleague and I experienced a pretty non-trivial issue with Veritas Volume Manager (VxVM) today. The host was running Red Hat Enterprise Linux 5.8 (an HP ProLiant DL380 G7, hooked up to some IBM XIV SAN storage via fibre channel).
It was a pretty old host and had been up for about 600 days. vxdisk -eo alldgs list was reporting that one of my standard volumes was actually a snapshot!
What the heck? Had someone added a read/write snapshot to the diskgroup and extended onto it? Surely not!
I checked the array. Both disks were standard volumes. Hmm. And whew!
My next thought went to VxVM. The host had been up a while. Chances were that the storage configuration had changed quite frequently without sysadmins performing the correct disk removal / reassignment commands typically associated with that sort of work.
My next thought was, “Let’s restart VxVM! That’s impactless, right?” Freezing my cluster service groups first, I performed the following:
Uhoh. That doesn’t look too hot — vxconfigd wouldn’t come back. Not a lot of info as to why here. Let’s check /var/log/messages…
Good ol’ Veritas data corruption protection. Somewhere along the line, one of the LUNs presented to this system had been unpresented and then represented with a new LUN ID, causing all kinds of havoc.
It’s always a pain to resolve, but I’m in a particular pickle here, as VxVM wants me to run vxdisk rm to remove the badly behaved multipath devices from its configuration. I can’t issue this command until vxconfigd is running and in enabled mode, but I also can’t get vxconfigd running until I do this! A true chicken and egg scenario…
Looking more closely at the log output, only three block devices were causing the problems: sdcy, sdcn and sdy. What disk group do they belong to?
A quick egrep of my disk group backups in /etc/vx/cbr/bk (caution with these; they can be out of date depending on how often administrative disk operations take place and how vxconfigbackupd is configured) showed that they were members of the disk group that had been causing me problems: the one with the standard volume and apparent ‘snapshot’.
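That search can be sketched roughly as follows. It assumes each disk group's backups live in their own sub-directory under the backup directory, which is how I understand the layout but worth confirming on your host; the function name is hypothetical. Matching whole words matters here, so that `sdy` doesn't also hit a longer device name that merely contains it.

```shell
# Lists the backup sub-directories (one per disk group, by assumption) whose
# files mention any of the given block device names as whole words.
# Usage: dgs_for_devices <backup_dir> <device>...
dgs_for_devices() (
    bkdir=${1%/}
    shift
    pattern=$(IFS='|'; printf '%s' "$*")     # e.g. sdcy|sdcn|sdy
    grep -rlwE -- "$pattern" "$bkdir" | while IFS= read -r file; do
        rel=${file#"$bkdir"/}
        printf '%s\n' "${rel%%/*}"           # keep only the top-level directory
    done | sort -u
)

# Usage: dgs_for_devices /etc/vx/cbr/bk sdcy sdcn sdy
```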
I knew for a fact that:
No I/O was passing through these disks (no filesystems were mounted).
Disks in this disk group had six redundant paths to storage each. This meant that each SAN volume had six block devices that the OS could use to route I/O requests.
I only wanted to get rid of three block devices.
Therefore, I was relatively happy performing the following:
After that, I gingerly started VxVM again and tried to run vxdisk -eo alldgs list again…
It’s back! And VxVM agrees that our ‘snapshot’ is now a standard volume! If we compared the output of vxdisk print xiv0_001 from before and after our action, we’d probably see that the names of the block devices that represented our storage devices had changed slightly, too.
Clearly something had been done at the storage layer with this particular disk. Exactly what will probably always remain a mystery. But if nothing else, this proves that performing sanity reboots before starting a migration is a pretty good idea.