Episode 18 - Benchmarking AWS Flow Logs
Welcome to Episode 18 of the “Head in the Clouds” Video Series. I am Ken Hartman, a SANS Certified Instructor and content creator for the SANS Cloud Security Curriculum.
Today’s episode is titled: “Benchmarking AWS Flow Logs”
In many of the courses in the SANS Cloud Curriculum, we have one or more labs that involve flow logs. However, a portion of the lab time is spent waiting for the flow logs to get provisioned. It’s kind of like watching and waiting for water to boil. Then, once the flow logs are provisioned, it takes time for the measured traffic to show up. Therefore, I thought it would be very interesting to determine what one should expect via experimentation and observation. Along the way, I made some very interesting discoveries, and I cannot wait to show them to you.
Objectives
Anytime we use a tool, technique, or service, we should be aware of its limitations and possible pitfalls. As such, here are the objectives for today:
- Experimentally determine the typical time to provision flow logs to an S3 bucket.
- Experimentally determine the typical time to provision flow logs to a CloudWatch logs group.
- Experimentally determine the typical delay for ongoing traffic to be captured in the flow log record.
- Contemplate the implications of our findings.
Setup
For this episode, we will assume that you have an AWS EC2 virtual machine running in a VPC that has a “Name” tag of “hitc-vpc.” The t2.micro type that is in the free tier will do just fine. I recommend that you use the Ubuntu 20.04 AMI for your region, as this will allow me to provide you with the exact commands to install some additional software later in this episode.
I used Terraform to deploy my environment as described in the HitC episode titled “Episode 10 - Demonstration of Terraform Modules; Deploy a VM in AWS, Azure and GCP at once,” as I had originally planned to benchmark all three cloud service providers. However, as I got deeper into my research into AWS flow logs, I decided to stay focused on just AWS and may circle back to Azure and GCP flow logs in one or more future episodes.
In this episode, we will be making extensive use of the AWS command line interface. This has two benefits:
- It will make it very easy for you to reproduce my work, and
- It will allow us to measure the time that it takes for things to happen.
The machine on which you are running the CLI commands will need `jq` in addition to the AWS command line interface.
OBJECTIVE 1: Experimentally determine the typical time to provision flow logs to an S3 bucket
To start with, we need to set some variables that will be used by our commands:
UNIQ_ID=$RANDOM; echo $UNIQ_ID
ACCOUNT=$(aws sts get-caller-identity | jq -r .Account); echo $ACCOUNT
YYYY=$(date -Idate | cut -d"-" -f1); echo $YYYY
MM=$(date -Idate | cut -d"-" -f2); echo $MM
DD=$(date -Idate | cut -d"-" -f3); echo $DD
REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]'); echo $REGION
We generate a unique ID to ensure that the S3 bucket that we create is unique. Linux has a variable called `$RANDOM` that we can use for this purpose.
The `YYYY`, `MM`, and `DD` are captured so we can look into the bucket for our data.
Lastly, we use a trick from StackOverflow.com to get the region, as yours may be different from mine.
Next, we will create a variety of “helper commands.”
Get the VPC ID
To get the VPC ID of the VPC that is named “hitc-vpc” we can use:
VPC_ID=$(aws ec2 describe-vpcs --filters Name=tag:Name,Values=hitc-vpc --query "Vpcs[0].VpcId" --output text); echo $VPC_ID
Create the Bucket for the Flow Logs
Next, we need to create an S3 bucket to receive our flow logs:
aws s3 mb s3://hitc-benchmark-flowlogs-$UNIQ_ID
Set up the Flow Logs for the VPC and Send to S3
Our next task is to make sure we have the command to set the flow logs for the VPC and send them to S3:
aws ec2 create-flow-logs \
--resource-type VPC \
--resource-ids $VPC_ID \
--traffic-type ALL \
--log-destination-type s3 \
--log-destination arn:aws:s3:::hitc-benchmark-flowlogs-$UNIQ_ID \
--max-aggregation-interval 600 \
--tag-specifications 'ResourceType=vpc-flow-log,Tags=[{Key=Name,Value=hitc-flow-log}]'
Describe the Flow Logs for the VPC
Having done that, we may find it helpful to be able to describe the flow logs for the VPC in question:
aws ec2 describe-flow-logs --filter "Name=resource-id,Values=$VPC_ID"
Delete the Flow Logs for the VPC
Next, we need to be able to delete the flow logs that we just created. To do this we need to know the Flow Log ID.
FLOW_LOG_ID=$(aws ec2 describe-flow-logs --filter "Name=resource-id,Values=$VPC_ID" --query 'FlowLogs[0].FlowLogId' --output text)
echo $FLOW_LOG_ID
aws ec2 delete-flow-logs --flow-log-ids $FLOW_LOG_ID
List the contents of the Bucket
To verify the presence of a flow log, we will need to be able to drill deep into the S3 bucket. This is facilitated by the variables we set earlier.
aws s3 ls s3://hitc-benchmark-flowlogs-$UNIQ_ID/AWSLogs/$ACCOUNT/vpcflowlogs/$REGION/$YYYY/$MM/$DD/
Empty the Bucket
We also need the ability to empty the bucket between successive iterations of our test:
aws s3 rm s3://hitc-benchmark-flowlogs-$UNIQ_ID --recursive
Remove the Bucket
It is also a good idea to remove the bucket when we are done with our testing:
aws s3 rb s3://hitc-benchmark-flowlogs-$UNIQ_ID
Generate Flow Log Traffic
Now that we’ve got our helper commands figured out and tested, we need to generate some traffic to get captured by our flow log. SSH into your EC2 instance and run the following command:
while true ; do wget --timeout=10 http://1.1.1.1:81; sleep 20; done
Note that I selected Cloudflare as the target because it is big enough that this traffic would not be considered abusive. By sending the traffic to port 81, wget will keep retrying until its retries are exhausted (20 attempts), and the loop will then start over after sleeping 20 seconds. Anything that repetitively generates traffic will work here.
TEST 1 - Time the provisioning of the flow Log until first flow log record in S3
With that done, we can now measure the time from the provisioning of the flow log until the first flow log record appears in S3. Every 20 seconds we will check the S3 bucket to see if a flow log has shown up yet. Our “TEST” checks whether the length of the response of the `aws s3 ls` command is zero. Once there is a flow log record in the bucket, the loop will exit, and we subtract the start time from the end time. Working in epoch time makes this easy.
NOTE: This assumes that the S3 bucket exists, but is empty
# Set up the Flow Logs for the VPC
aws ec2 create-flow-logs \
--resource-type VPC \
--resource-ids $VPC_ID \
--traffic-type ALL \
--log-destination-type s3 \
--log-destination arn:aws:s3:::hitc-benchmark-flowlogs-$UNIQ_ID \
--max-aggregation-interval 600 \
--tag-specifications 'ResourceType=vpc-flow-log,Tags=[{Key=Name,Value=hitc-flow-log}]'
# Watch for initial flow logs
printf "Waiting for the flow logs to apear in S3 bucket "
START_TIMER=$(date +%s)
TEST=""
while [ -z "$TEST" ] ; do
printf "."
sleep 20
TEST=$(aws s3 ls s3://hitc-benchmark-flowlogs-$UNIQ_ID/AWSLogs/$ACCOUNT/vpcflowlogs/$REGION/$YYYY/$MM/$DD/)
done
STOP_TIMER=$(date +%s)
echo "DONE"
echo $((STOP_TIMER-START_TIMER))" seconds have elapsed"
While this script executes, it will print a dot to stdout every 20 seconds. When it is done, the output will look like this:
772 seconds have elapsed
Ok, 772 seconds is 12.87 minutes. We now have our first TEST 1 measurement. Note that this measurement can be off by up to 20 seconds due to the sleep timer, which can be shortened or removed for finer resolution.
Tear down TEST 1
To repeat the test, we need to delete the flow log and empty the S3 bucket, but leave the S3 bucket in place.
# Delete the Flow Logs for the VPC
FLOW_LOG_ID=$(aws ec2 describe-flow-logs --filter "Name=resource-id,Values=$VPC_ID" --query 'FlowLogs[0].FlowLogId' --output text)
echo $FLOW_LOG_ID
aws ec2 delete-flow-logs --flow-log-ids $FLOW_LOG_ID
# Empty the Bucket
aws s3 rm s3://hitc-benchmark-flowlogs-$UNIQ_ID --recursive
TEST 1 Summary
This table shows the original result along with three repetitions of the test:
| Script Results | Minutes |
| --- | --- |
| 772 seconds have elapsed | 12.86 Minutes |
| 417 seconds have elapsed | 6.97 Minutes |
| 813 seconds have elapsed | 13.55 Minutes |
| 960 seconds have elapsed | 16.00 Minutes |
Wow, it can take 16 minutes to get your first flow logs!
So, what is up with this? Well, the catchphrase that the cloud service providers use is “eventual consistency.” We know that the big providers use queuing and focus intently on optimization, so I am guessing that is why there is a provisioning delay.
Final Cleanup TEST 1
Now that we have obtained our results, make sure that you delete the flow log and empty the bucket again, using the last set of commands and then remove the bucket, as we no longer need it.
# Remove the Bucket
aws s3 rb s3://hitc-benchmark-flowlogs-$UNIQ_ID
OBJECTIVE 2: Experimentally determine the typical time to provision flow logs to a CloudWatch Logs group
Now, let’s take a look at flow logs sent to CloudWatch Logs.
Create the CloudWatch Logs Group
First we need to create the CloudWatch Logs Group:
aws logs create-log-group --log-group-name hitc-flow-logs
Delete the Log Group
And, here is the corresponding command to delete the log group:
aws logs delete-log-group --log-group-name hitc-flow-logs
Create the CloudWatch Logs Role
Next, we need to create the CloudWatch Logs role. To do this, we will create the Trust Policy JSON file and the Permissions Policy JSON file, and then use them to create the role and attach the policy.
Create the Trust Policy JSON file. I like to use the “heredoc” technique, so that everything can be scripted:
cat << EOF > trust-policy.json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "vpc-flow-logs.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
EOF
Create the Permissions Policy JSON file:
cat << EOF > cw-flow-logs-policy.json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents",
"logs:DescribeLogGroups",
"logs:DescribeLogStreams"
],
"Resource": "*"
}
]
}
EOF
Next, run the following commands to create the policies and the role:
aws iam create-policy --policy-name hitc-cw-flow-logs --policy-document file://cw-flow-logs-policy.json
aws iam create-role --role-name hitc-cw-flow-logs --assume-role-policy-document file://trust-policy.json
aws iam attach-role-policy --policy-arn "arn:aws:iam::$ACCOUNT:policy/hitc-cw-flow-logs" --role-name hitc-cw-flow-logs
aws iam list-attached-role-policies --role-name hitc-cw-flow-logs
TEST 2 - Time the provisioning of the flow Log until first flow log record in CloudWatch
With our setup for Test 2 in place, we can now conduct it. This “TEST” checks whether the length of the response of the `aws logs describe-log-streams` command is zero. Once there is a flow log record in CloudWatch Logs, the loop will exit, and we subtract the start time from the end time.
NOTE: This assumes that the CloudWatch Log Group exists, but is empty.
# Set up the Flow Logs for the VPC
aws ec2 create-flow-logs \
--resource-type VPC \
--resource-ids $VPC_ID \
--traffic-type ALL \
--log-destination-type cloud-watch-logs \
--log-group-name hitc-flow-logs \
--max-aggregation-interval 600 \
--tag-specifications 'ResourceType=vpc-flow-log,Tags=[{Key=Name,Value=hitc-flow-log}]' \
--deliver-logs-permission-arn "arn:aws:iam::$ACCOUNT:role/hitc-cw-flow-logs"
# Watch for initial flow logs
printf "Waiting for the flow logs to appear in CW Logs Group "
START_TIMER=$(date +%s)
TEST=""
while [ -z "$TEST" ] ; do
printf "."
sleep 20
TEST=$(aws logs describe-log-streams --log-group-name hitc-flow-logs --output text)
done
STOP_TIMER=$(date +%s)
echo "DONE"
echo $((STOP_TIMER-START_TIMER))" seconds have elapsed"
Tear down TEST 2
To re-run the test, delete the flow log and reset the CW Logs Group.
# Delete the Flow Logs for the VPC
FLOW_LOG_ID=$(aws ec2 describe-flow-logs --filter "Name=resource-id,Values=$VPC_ID" --query 'FlowLogs[0].FlowLogId' --output text)
echo $FLOW_LOG_ID
aws ec2 delete-flow-logs --flow-log-ids $FLOW_LOG_ID
# Delete and Recreate the CW Logs Group
aws logs delete-log-group --log-group-name hitc-flow-logs
aws logs create-log-group --log-group-name hitc-flow-logs
Final Cleanup TEST 2
And when we are all done, we can run the following commands to delete the role and associated policy. But hold off for now as we will need this role for Test 3.
# CLI commands to tear down the role and IAM policy
aws iam detach-role-policy --role-name hitc-cw-flow-logs --policy-arn "arn:aws:iam::$ACCOUNT:policy/hitc-cw-flow-logs"
aws iam delete-role --role-name hitc-cw-flow-logs
aws iam delete-policy --policy-arn "arn:aws:iam::$ACCOUNT:policy/hitc-cw-flow-logs"
TEST 2 Summary
| Script Results | Minutes |
| --- | --- |
| 21 seconds have elapsed | 0.35 Minutes |
| 541 seconds have elapsed | 9.02 Minutes |
| 522 seconds have elapsed | 8.70 Minutes |
| 375 seconds have elapsed | 6.25 Minutes |
This is much better, but it can still take up to 10 minutes and is somewhat variable.
OBJECTIVE 3: Experimentally determine the typical delay for ongoing traffic to be captured in the flow log record.
Now, onto what is probably the most interesting part, with the biggest implications for security monitoring. We want to know how long it takes for traffic that occurs in the VPC to show up in the flow logs. In other words, “How much lag is there in any detective control that I implement based on flow logs?”
The security implications of this are fairly significant.
The Problem Statement
Wouldn’t it be great to time stamp a packet, send it across the virtual network of our VPC, look at the packet in CloudWatch Logs, and determine how much delay there was between the time the packet was sent, and when it was recorded?
That would be fantastic. The only problem is that a flow log contains only metadata, and not any payload. Here is a screenshot of what we have to work with:
As it turns out, about the only thing that we have to fiddle with is the Source port and the Destination port. So, I will use the source port to encode the timestamp and send it to an unchanging port (TCP port 7777) at a target IP address outside of my VPC.
Will this work?
Let’s do some math here. We know that there are 65536 (2 to the power of 16) possible TCP ports.
If we send one packet per second from a different port, we could do that for 1092 Minutes:
>>> (2 ** 16) / 60
1092.2666666666667
Or converting that to hours:
>>> (2 ** 16) / 60 / 60
18.204444444444444
We get 18 hours. Hence, we can send traffic at a rate of one packet per second from a unique source port for over 18 hours before using up all possible ports!
Converting back and forth from the source port to the epoch time
To calculate which port to use, we will take the current epoch time, divide it by 65536, and use the integer remainder as the source port. The math function that returns the remainder is called “modulo” and is represented by `%` in Python:
source port = int(current_epoch_time % 65536)
To convert back from the source port to the epoch time, take the current epoch time, divide it by 65536, and keep the integer quotient. Divide the source port by 65536, add that result to the quotient, and multiply the total by 65536; the result is the epoch time of when the data was sent.
epoch_timestamp = int((modulo / 65536.0 + int(current_epoch_time / 65536)) * 65536)
This conversion assumes that the current epoch time is within the 18 hour window represented by the source port value.
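To make the conversion concrete, here is a minimal sketch in Python (just an illustration of the two formulas above; it is not part of any script used in this episode) that encodes the current epoch time into a source port and then recovers it:
import time

current_epoch_time = int(time.time())

# Encode: the remainder of the current epoch time divided by 65536 becomes the source port
source_port = int(current_epoch_time % 65536)

# Decode: add the source port (as a fraction of 65536) back to the quotient and scale up,
# exactly as in the formula above; the integer equivalent is (current_epoch_time // 65536) * 65536 + source_port
epoch_timestamp = int((source_port / 65536.0 + int(current_epoch_time / 65536)) * 65536)

print(current_epoch_time, source_port, epoch_timestamp)  # epoch_timestamp equals current_epoch_time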
Introducing Scapy
To generate a packet per second with an ever-increasing source port that represents the time the packet was sent, I used Scapy. Scapy is written as a Python module and is ideal for programmatically crafting packets.
As we will be running Scapy on our EC2 instance to send traffic out of our VPC, we need to install Scapy, its dependencies, and the additional modules my Python program will use:
sudo apt update
sudo apt install -y python3-pip tcpdump python3-matplotlib
sudo pip3 install pytz
sudo pip3 install --pre scapy[complete]
Note that using `sudo` with `pip` to install software is generally a bad idea. See this link for a detailed discussion of the security issues. However, in this case, Scapy needs to run as root to be able to send packets, and we will destroy our EC2 instance at the end of our experimentation.
To launch Scapy, just type `sudo scapy` at the Ubuntu command line as shown below:
ubuntu@ip-10-0-1-112:~$ sudo scapy
INFO: PyX dependencies are not installed ! Please install TexLive or MikTeX.
aSPY//YASa
apyyyyCY//////////YCa |
sY//////YSpcs scpCY//Pp | Welcome to Scapy
ayp ayyyyyyySCP//Pp syY//C | Version 2.4.5
AYAsAYYYYYYYY///Ps cY//S |
pCCCCY//p cSSps y//Y | https://github.com/secdev/scapy
SPPPP///a pP///AC//Y |
A//A cyP////C | Have fun!
p///Ac sC///a |
P////YCpc A//A | We are in France, we say Skappee.
scccccp///pSP///p p//Y | OK? Merci.
sY/////////y caa S//P | -- Sebastien Chabal
cayCyayP//Ya pY/Ya |
sY/PsY////YCc aC//Yp
sc sccaCY//PCypaapyCP//YSs
spCPY//////YPSps
ccaacs
using IPython 8.1.1
>>>
I set up a target VM in another VPC with the security group set to allow inbound TCP port 7777 from the system with Scapy installed. As a test, I am running TCPdump on the target machine, using the following command:
sudo tcpdump -i eth0 'port 7777'
Next, use Scapy to send the target a packet. Enter the following commands at the Scapy IPython prompt, but substitute the IP address of your target for mine (35.168.8.211):
ip=IP(dst="35.168.8.211")
tcp=TCP(sport=1111,dport=7777,flags="S",options=[('Timestamp',(0,0))])
pay="This is Test3"
test3=ip/tcp/pay
send(test3)
The result should look like:
.
Sent 1 packets.
NOTE: If you get an “Operation not permitted” error, you forgot to run Scapy as root.
Exit back to the Ubuntu command line using `exit()`.
Looking over at the target system, we can see that TCPdump generated the following two records:
22:25:43.461121 IP ec2-34-201-148-40.compute-1.amazonaws.com.1111 > ip-172-31-48-150.ec2.internal.7777: Flags [S], seq 0:13, win 8192, options [TS val 0 ecr 0,eol], length 13
22:25:43.461156 IP ip-172-31-48-150.ec2.internal.7777 > ec2-34-201-148-40.compute-1.amazonaws.com.1111: Flags [R.], seq 0, ack 14, win 0, length 0
We sent a TCP SYN packet, and a RST (reset) was sent back because there is no service listening on port 7777.
Putting the theory to work
Ok, so now that we know the Scapy commands to craft a packet, we can write a program to determine the source port to use, craft the packet, and send it. We can also prove that the packet was received by the target system by watching the timestamps printed to standard output by TCPdump.
The program is called `generate-test-traffic.py` and can be downloaded from my GitHub repo. Copy this program to the EC2 instance that has Scapy installed:
wget https://raw.githubusercontent.com/Resistor52/hitc-benchmark-flow-logs/hitc-ep18/generate-test-traffic.py
Run the program as root using:
sudo python3 ./generate-test-traffic.py
Once the program runs, it will ask four questions as shown below and then start to send packets at a rate of one per second.
ubuntu@ip-10-0-1-112:~$ sudo python3 ./generate-test-traffic.py
Enter the name for the CSV log file (without the csv extension): test3A
Enter the desired target IP aggress in the format "111.111.111.111": 35.168.8.211
Enter the desired start time in the format "HH:MM" for local time zone: 19:00
Enter the desired end time in the format "HH:MM" for local time zone: 22:00
Traffic generation will start at: 19:00
Traffic generation will continue until: 22:00
**************** Start of Traffic **************************
.Epoch Time: 1646344800 Time: 2022-03-03 22:00:00 UTC+0000 srcPort: 14944 Remaining Time: 0 Seconds
***************** End of Traffic ***************************
In the example above, I am calling the log file test3A and provided 35.168.8.211 as the target IP. (Again, be sure to use the IP address of the target that you set up and on which TCPdump is listening.) I started the program at 19:00 hours UTC, simply because that was on the hour, and had it run for three hours, ending at 22:00. The program uses UTC because that is the time zone that AWS uses for all of its cloud services, including EC2.
Now, while we are waiting for the traffic generation to start, rerun TCPdump on the target system, this time capturing the packets to a file:
sudo tcpdump -i eth0 'port 7777' -w tcpdump.pcap
Initial Look at the Data
The `generate-test-traffic.py` script created a log, which I named test3A.csv. Note that the script adds the “csv” extension to the filename. The first 10 rows of the file are shown below:
"Epoch_Time","Date_Time","Source_Port"
"1646334000","2022-03-03 19:00:00 UTC+0000","4144"
"1646334001","2022-03-03 19:00:01 UTC+0000","4145"
"1646334002","2022-03-03 19:00:02 UTC+0000","4146"
"1646334003","2022-03-03 19:00:03 UTC+0000","4147"
"1646334004","2022-03-03 19:00:04 UTC+0000","4148"
"1646334005","2022-03-03 19:00:05 UTC+0000","4149"
"1646334006","2022-03-03 19:00:06 UTC+0000","4150"
"1646334007","2022-03-03 19:00:07 UTC+0000","4151"
"1646334008","2022-03-03 19:00:08 UTC+0000","4152"
We can look at the first 10 packets received by the Target system using the following tcpdump command:
tcpdump -r tcpdump.pcap -nn -c10 'dst port 7777'
And the output is:
19:00:00.013928 IP ec2-34-201-148-40.compute-1.amazonaws.com.4144 > ip-172-31-48-150.ec2.internal.7777: Flags [S], seq 0:13, win 8192, options [TS val 0 ecr 0,eol], length 13
19:00:01.018892 IP ec2-34-201-148-40.compute-1.amazonaws.com.4145 > ip-172-31-48-150.ec2.internal.7777: Flags [S], seq 0:13, win 8192, options [TS val 0 ecr 0,eol], length 13
19:00:02.023309 IP ec2-34-201-148-40.compute-1.amazonaws.com.4146 > ip-172-31-48-150.ec2.internal.7777: Flags [S], seq 0:13, win 8192, options [TS val 0 ecr 0,eol], length 13
19:00:03.027319 IP ec2-34-201-148-40.compute-1.amazonaws.com.4147 > ip-172-31-48-150.ec2.internal.7777: Flags [S], seq 0:13, win 8192, options [TS val 0 ecr 0,eol], length 13
19:00:04.032328 IP ec2-34-201-148-40.compute-1.amazonaws.com.4148 > ip-172-31-48-150.ec2.internal.7777: Flags [S], seq 0:13, win 8192, options [TS val 0 ecr 0,eol], length 13
19:00:05.050076 IP ec2-34-201-148-40.compute-1.amazonaws.com.4149 > ip-172-31-48-150.ec2.internal.7777: Flags [S], seq 0:13, win 8192, options [TS val 0 ecr 0,eol], length 13
19:00:06.061365 IP ec2-34-201-148-40.compute-1.amazonaws.com.4150 > ip-172-31-48-150.ec2.internal.7777: Flags [S], seq 0:13, win 8192, options [TS val 0 ecr 0,eol], length 13
19:00:07.066456 IP ec2-34-201-148-40.compute-1.amazonaws.com.4151 > ip-172-31-48-150.ec2.internal.7777: Flags [S], seq 0:13, win 8192, options [TS val 0 ecr 0,eol], length 13
19:00:08.071536 IP ec2-34-201-148-40.compute-1.amazonaws.com.4152 > ip-172-31-48-150.ec2.internal.7777: Flags [S], seq 0:13, win 8192, options [TS val 0 ecr 0,eol], length 13
19:00:09.077085 IP ec2-34-201-148-40.compute-1.amazonaws.com.4153 > ip-172-31-48-150.ec2.internal.7777: Flags [S], seq 0:13, win 8192, options [TS val 0 ecr 0,eol], length 13
We can extract the two critical data elements from the packet capture using the following command:
tcpdump -r tcpdump.pcap -nn -c10 'dst port 7777' | cut -d" " -f1,3 | cut -d"." -f1,6 | tr "." " "
And this command will output the timestamp and the source port of the packet received by the target system:
19:00:00 4144
19:00:01 4145
19:00:02 4146
19:00:03 4147
19:00:04 4148
19:00:05 4149
19:00:06 4150
19:00:07 4151
19:00:08 4152
19:00:09 4153
Great, everything correlates! Now let’s process the whole packet capture and write the results to a file:
tcpdump -r tcpdump.pcap -nn 'dst port 7777' | cut -d" " -f1,3 | cut -d"." -f1,6 | tr "." " " > test3A.log
We will save test3A.log for later, as it will come in handy.
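As a preview of one way it could come in handy, here is a small, hypothetical cross-check in Python (file names as above) that compares the source ports Scapy logged as sent (test3A.csv) against the source ports TCPdump saw arrive (test3A.log) and flags anything missing:
import csv

# Source ports that generate-test-traffic.py logged as sent
sent_ports = set()
with open("test3A.csv") as f:
    for row in csv.DictReader(f):
        sent_ports.add(int(row["Source_Port"]))

# Source ports that TCPdump saw arrive at the target ("HH:MM:SS PORT" per line)
received_ports = set()
with open("test3A.log") as f:
    for line in f:
        parts = line.split()
        if len(parts) == 2:
            received_ports.add(int(parts[1]))

print("Sent but never received:", sorted(sent_ports - received_ports))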
Looking in CloudWatch Logs
The CloudWatch Logs Insights console is a great tool to slice and dice log data. Paste in the following query:
fields @ingestionTime, @timestamp, ((toMillis(@ingestionTime) - toMillis(@timestamp))/1000) as IngestDelaySecs, ((toMillis(@ingestionTime) - toMillis(@timestamp))/1000/60)
as IngestDelayMins, toMillis(@timestamp)/1000 as TimeStampEpoch, (65536*floor(TimeStampEpoch/65536))+srcPort as SentTimeEpoch, fromMillis(SentTimeEpoch * 1000) as SentTime,
TimeStampEpoch-SentTimeEpoch as TimeStampDelaySec, srcPort
| sort srcPort asc
| limit 10000
| filter srcAddr = "10.0.1.112" and dstAddr = "35.168.8.211" and dstPort != "22" and srcPort != "22"
Be sure to set the time filter at the top of the console to cover the period in which the packets were sent.
Looking at the output, we see that AWS provides two fields, `@ingestionTime` and `@timestamp`, so we can calculate the ingestion delay in seconds (IngestDelaySecs) by converting both time values to epoch time and subtracting one from the other.
But how accurate is the timestamp? Well, we can use the CloudWatch Logs Insights query language to calculate a `SentTimeEpoch` based on the srcPort. We can also convert that to `SentTime` so it is human-readable. If we subtract `SentTimeEpoch` from `TimeStampEpoch`, we get `TimeStampDelaySec`.
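As a quick sanity check of the arithmetic in that query, the same calculation can be reproduced in a few lines of Python, using values from one of the records examined later in this post (srcPort 4242):
# Reproduce the query's SentTimeEpoch and TimeStampDelaySec for a single record
timestamp_epoch = 1646334040   # toMillis(@timestamp)/1000 for that record
src_port = 4242

# SentTimeEpoch = (65536 * floor(TimeStampEpoch / 65536)) + srcPort
sent_time_epoch = 65536 * (timestamp_epoch // 65536) + src_port
print(sent_time_epoch)                      # 1646334098

# TimeStampDelaySec = TimeStampEpoch - SentTimeEpoch
print(timestamp_epoch - sent_time_epoch)    # -58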
Disregard the first record, as it was captured when I demonstrated how to use Scapy from IPython with an arbitrary source port. (I had generated the traffic with the Python script a few hours beforehand, which explains the chronology of this timestamp.) This record will be removed from the downloaded dataset.
Another thing that you may have noticed in the screenshot is that the first srcPort is 4184, which is 40 seconds after we sent the packet with srcPort 4144!
Very interesting. It looks like we have “negative delays,” or, to phrase it more accurately, the `@timestamp` value attached by AWS has an error that fluctuates within a range.
Use the “Export Results” button to save the file as logs-insights-results.csv
Looking at the Data Graphically
Since we see that we have an ingest delay and a timestamp error, it would be good to look at this data graphically. Toward that end, I have created a Python program called `plotObservations.py`, which can be downloaded from my GitHub repo. Copy this program to the EC2 instance that has Scapy installed:
wget https://raw.githubusercontent.com/Resistor52/hitc-benchmark-flow-logs/hitc-ep18/plotObservations.py
Run the program using:
python3 ./plotObservations.py
The program will read `logs-insights-results.csv` and generate the four plots shown below.
PLOT 1 - Packets 1000 to 2000
PLOT 2 - Packets 1020 to 1200
PLOT 3 - The First 10,000 Packets
Delta - Seconds Between Missing Data
I saved the charts generated from data captured in the exact same manner 24 hours prior to this set, and those are shown below for comparison purposes.
PLOT 1b - Packets 1000 to 2000 (Previous Run)
PLOT 3b - The First 10,000 Packets (Previous Run)
Observations
Ingest Delay - We can see that the ingest delay varies from a little over 60 seconds to almost 11 minutes, in a pattern that repeats roughly every 10 minutes. Note that when we configured the flow logs, we used a setting of `max-aggregation-interval 600`. This test should be repeated using `max-aggregation-interval 60` to measure the impact on the ingest delay.
Timestamp Error Seconds - The `TimeStampDelaySec` is calculated by subtracting `SentTimeEpoch` from `TimeStampEpoch`. Chronologically, `SentTimeEpoch` should occur first, as that is the timestamp of when Scapy sent the packet. Similarly, `TimeStampEpoch` is when the VPC Flow Logs system detected the packet on the virtual network. I expected the error to be within milliseconds, not almost a whole minute off! I am not sure how to explain this. At first, I thought that it might be a math error on my part. But we can take a look at a single packet as it was:
- Recorded in the flow log (logs-insights-results.csv),
- Sent by Scapy (test3A.csv), and
- Recorded by TCPdump at the Target System (tcpdump.pcap)
We can use `grep -E ',-58,|srcPort' logs-insights-results.csv | head -n6 | cut -d"," -f5,6-9` to generate the following table:
| TimeStampEpoch | SentTimeEpoch | SentTime | TimeStampDelaySec | srcPort |
| --- | --- | --- | --- | --- |
| 1646334040 | 1646334098 | 2022-03-03 19:01:38.000 | -58 | 4242 |
| 1646334099 | 1646334157 | 2022-03-03 19:02:37.000 | -58 | 4301 |
| 1646334160 | 1646334218 | 2022-03-03 19:03:38.000 | -58 | 4362 |
| 1646334219 | 1646334277 | 2022-03-03 19:04:37.000 | -58 | 4421 |
| 1646334280 | 1646334338 | 2022-03-03 19:05:38.000 | -58 | 4482 |
Selected packets from logs-insights-results.csv
Now that we have identified the first five packets with the highest error (-58), we can look for those packets in the log generated by Scapy using `grep -E '"4242"|"4301"|"4362"|"4421"|"4482"' test3A.csv`:
| Epoch_Time | Date_Time | Source_Port |
| --- | --- | --- |
| 1646334098 | 2022-03-03 19:01:38 UTC+0000 | 4242 |
| 1646334157 | 2022-03-03 19:02:37 UTC+0000 | 4301 |
| 1646334218 | 2022-03-03 19:03:38 UTC+0000 | 4362 |
| 1646334277 | 2022-03-03 19:04:37 UTC+0000 | 4421 |
| 1646334338 | 2022-03-03 19:05:38 UTC+0000 | 4482 |
Selected packets from test3A.csv
We can see that the `SentTimeEpoch` calculated by my CloudWatch query logic is correct, as it matches the `Epoch_Time` logged by Scapy.
For completeness, let’s look at TCPdump:
tcpdump -tt -nn -r tcpdump.pcap 'src port 4242 or src port 4301 or src port 4362 or src port 4421 or src port 4482' | cut -d":" -f1
produces:
1646334098.569321 IP 34.201.148.40.4242 > 172.31.48.150.7777
1646334157.873827 IP 34.201.148.40.4301 > 172.31.48.150.7777
1646334218.182541 IP 34.201.148.40.4362 > 172.31.48.150.7777
1646334277.489424 IP 34.201.148.40.4421 > 172.31.48.150.7777
1646334338.805517 IP 34.201.148.40.4482 > 172.31.48.150.7777
As expected, we can see that the epoch time recorded by TCPdump matches the other observations in the tables above.
Missing Data - The Missing Data plots in the figures above initially raised some concerns. Is the VPC dropping packets? To investigate what is going on, I tweaked `plotObservations.py` to generate the “Delta” plot, which shows the periodicity of the missing data, and to output the srcPorts for which data is missing. That output is shown below:
The following srcPorts are missing:
[4327, 4506, 4694, 4890, 5082, 5276, 5471, 5666, 5855,
6049, 6245, 6439, 6632, 6824, 7018, 7211, 7407, 7600,
7791, 7987, 8174, 8365, 8555, 8750, 8942, 9135, 9330,
9524, 9716, 9911, 10102, 10293, 10485, 10680, 10866,
11058, 11253, 11427, 11620, 11811, 12005, 12198, 12386,
12583, 12778, 12972, 13166, 13361, 13555, 13750, 13945,
14138]
I also had `plotObservations.py` write the ports to a file, `missingSrcPorts.csv`, so that we can grep for them:
grep -f missingSrcPorts.csv test3A.csv
Hmm, none of the missing srcPorts were logged by Scapy. The good news is that the VPC network did not drop any packets, because all sent packets have been accounted for. The bad news is that there is a bug in `generate-test-traffic.py` that occurs about every 195 seconds. I suspect that it is an issue with my `generateTraffic` function, most likely due to inaccuracies in time.sleep() compounded by variations in the amount of time it takes Scapy to write each packet out to the network.
def generateTraffic(stop_time_epoch):
    datetime_TZ = datetime.now(tz_TZ)
    seconds = int(time.time())            # Current Epoch Time
    modulo = seconds % (2 ** 16)          # Calculate the SrcPort
    datetime_time = datetime_TZ.strftime("%Y-%m-%d %H:%M:%S %Z%z")
    print(f"Epoch Time: {seconds} Time: {datetime_time} srcPort: {modulo} {RemainingTimeString(stop_time_epoch)} ", end = "\r")
    write_str = '"'+str(seconds)+'","'+datetime_time+'","'+str(modulo)+'"\n'
    f.write(write_str)
    tcp = TCP(sport=modulo,dport=7777,flags="S",options=[('Timestamp',(0,0))])
    pay = logfile+" - "+str(modulo)
    # Trap the output of scapy send so it does not print to console
    state = sys.stdout
    sys.stdout = open('/dev/null', 'w')
    send(ip/tcp/pay)   # Scapy
    sys.stdout.close()
    sys.stdout = state
    # Wait one second between packets
    time.sleep(1)
    return
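One possible fix, sketched below and not yet tested against the actual script, is to pace the loop to whole-second boundaries rather than using a fixed time.sleep(1), so that the time spent crafting and sending each packet cannot accumulate into a skipped second:
import time

def wait_for_next_second():
    # Sleep only for the remainder of the current second, so per-iteration
    # overhead (the Scapy send, file writes) does not push the loop past a second mark
    now = time.time()
    time.sleep(1.0 - (now % 1.0))

# Hypothetical usage: call wait_for_next_second() in place of time.sleep(1) in generateTraffic()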
OBJECTIVE 4 - Contemplate the implications of our findings
Tests 1 and 2 demonstrate that we need to budget ample time (at least 16 minutes) to allow AWS to provision the flow logs. We observed that flow logs sent to CloudWatch Logs provision faster than logs sent to an S3 bucket.
Test 3 demonstrated that with a `max-aggregation-interval` set to “600,” it is possible to see an ingestion delay of up to 10.42 minutes (`cat logs-insights-results.csv | cut -d"," -f4 | sort -u -n | tail -n1`), and that the delay is a function of how long ago the last aggregation occurred. The minimum ingestion delay was 1.26 minutes (`cat logs-insights-results.csv | cut -d"," -f4 | sort -u -n | head -n2`). Lots of bad stuff could be detected late if you were expecting to use flow logs for real-time detection. I think that AWS would tell you that is not the use case they were designed for; instead, use something like a Web Application Firewall (WAF).
We observed an unaccounted-for error in the `@timestamp` field generated by CloudWatch Logs Insights of +1 to -60 seconds during our test and +60 to -58 seconds during a previous test. This demonstrates the value of benchmarking critical security systems.
In this experiment, we demonstrated a novel technique for timestamping packets that will be reduced to a flow log by using the source port field. We showed how to verify the accuracy of the technique by comparing the epoch time calculated from the source port with the time logged by TCPdump.
We also demonstrated how to identify any missing packets using this technique, and in the process identified a minor bug in `generate-test-traffic.py` that caused a packet not to be sent about every 195 seconds. This demonstrates (again) the importance of understanding the limitations of the security tools that we use. At some point, I will circle back and fix this bug.
Remember that the limited amount of testing performed here is not sufficient to draw firm conclusions. Tests 1 and 2 should be repeated multiple times, at different times of day, over the course of several days, and in different regions to get more precise statistics. Test 3 should be repeated multiple times and should consider the impact of changing the `max-aggregation-interval` setting. The more the testing process is automated, the lower the burden of repeating each test. But I think that I have taken this little experiment far enough for now.
Wrap Up
Wow, we covered quite a bit in this Episode:
- We made extensive use of the AWS CLI
- We used a Bash “while loop” to time AWS CLI calls
- We crafted packets using Scapy and read those packets using TCPdump.
- We also created a fairly sophisticated CloudWatch Logs Insights query, and
- Learned how we can use Python to visualize data.
To learn more about TCPdump, Scapy, and packet crafting, check out SANS SEC503: Intrusion Analysis In-Depth. To learn more about using Python to create security tools like `generate-test-traffic.py`, consider SANS SEC573: Automating Information Security with Python.
Closing Comments
If you have thoughts or comments on today’s episode, feel free to chime in on the comments for this YouTube video.
If you appreciate this video and want to see more like it, be sure to give it a “thumbs up.”
Stay tuned for another installment of “Head in the Clouds” as announcements of new episodes are made on the SANS Cloud Security Twitter feed.
Meanwhile, be sure to check out the other great videos on the SANS Cloud Security YouTube Channel.
Take care.