Ensuring Data Integrity with DynamoDB
DynamoDB is powerful, but you're responsible for ensuring data integrity. In this post, you'll learn several strategies to protect schemas and metadata.
DynamoDB is an incredibly powerful NoSQL database. It's schema-less, which gives you lots of flexibility, but it also means that you are responsible for managing the integrity of your data. This includes ensuring the structure of your data, as well as the ability to preserve metadata throughout your data's lifecycle.
Unfortunately, DynamoDB doesn't currently store any metadata associated with items. If you want to know when a particular item was written to the table, for example, you have to store that information yourself. While it's not particularly difficult to add these attributes to an item, maintaining their integrity can come with some challenges.
In this article, we'll discuss several strategies that can be used to ensure data integrity in your DynamoDB tables.
Data management with PutItem and UpdateItem
Before we discuss how to ensure our data's integrity, let's quickly review the ways in which we can add or modify data in our DynamoDB tables. Similar to an RDBMS, DynamoDB supports both "INSERTS" and "UPDATES"; however, these two operations behave quite differently from their RDBMS counterparts.
DynamoDB uses PutItem to add items to your table. Unlike an RDBMS INSERT, PutItem replaces existing items by default. While you can add overwrite protection to your PutItem API calls, DynamoDB does not support service-side enforcement. This means that you can inadvertently overwrite an entire item, losing any metadata you may have stored with it.
UpdateItem behaves as an "UPSERT" operation: if you try to update an item that doesn't exist, DynamoDB will automatically create it for you. As with PutItem, you can add conditions to your UpdateItem API calls to modify this behavior, but there is no way to enforce it service-side. UpdateItem also merges root attributes, allowing you to perform partial updates without needing to pass in the entire object.
We'll show some examples of these operations throughout this article.
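To make the behavioral difference concrete, here's a minimal sketch, assuming the AWS SDK for JavaScript v3 Document Client and a table named myTable with a pk/sk composite key (the item attributes are just placeholders):

```javascript
// Sketch only: assumes the AWS SDK for JavaScript v3 Document Client
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb')
const { DynamoDBDocumentClient, PutCommand, UpdateCommand } = require('@aws-sdk/lib-dynamodb')

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}))

const demo = async () => {
  // PutItem: replaces the ENTIRE item at this key, wiping any attributes not supplied
  await client.send(new PutCommand({
    TableName: 'myTable',
    Item: { pk: 'somePK', sk: 'someSK', name: 'new value' }
  }))

  // UpdateItem: creates the item if it doesn't exist, otherwise merges the root attribute
  await client.send(new UpdateCommand({
    TableName: 'myTable',
    Key: { pk: 'somePK', sk: 'someSK' },
    UpdateExpression: 'SET #n = :n',
    ExpressionAttributeNames: { '#n': 'name' },
    ExpressionAttributeValues: { ':n': 'new value' }
  }))
}

demo().catch(console.error)
```

The later sketches in this article reuse this client and these imports.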
Protecting data on UPSERTS
UPSERTS are an incredibly powerful capability of DynamoDB, allowing you to essentially fall back to a PutItem operation if the item doesn't exist. This is a great feature, but it also means that you have to provide defaults for items that don't exist, as well as overwrite protection for metadata that already exists. A simple example uses created and modified attributes:
```javascript
{
  TableName: 'myTable',
  Key: { pk: 'somePK', sk: 'someSK' },
  UpdateExpression: 'SET #ct = if_not_exists(#ct, :ct), #md = :md',
  ExpressionAttributeNames: { '#ct': 'created', '#md': 'modified' },
  ExpressionAttributeValues: { ':ct': now, ':md': now }
}
```
In this example, we're using the if_not_exists() function to check whether the created attribute exists and, if not, set a default value. If the item is already in the table, the stored value of created is preserved; if the item doesn't exist, the attribute is set to the value of now. This is a very common pattern that eliminates the need to perform extra calls to check whether an item exists before inserting or updating.
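For completeness, here's a rough sketch of sending those parameters, reusing the Document Client and UpdateCommand import from the earlier sketch and defining now as an ISO timestamp:

```javascript
// Sketch only: 'client' and UpdateCommand come from the earlier sketch
// (run inside an async function)
const now = new Date().toISOString()

await client.send(new UpdateCommand({
  TableName: 'myTable',
  Key: { pk: 'somePK', sk: 'someSK' },
  UpdateExpression: 'SET #ct = if_not_exists(#ct, :ct), #md = :md',
  ExpressionAttributeNames: { '#ct': 'created', '#md': 'modified' },
  ExpressionAttributeValues: { ':ct': now, ':md': now }
}))
```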
As mentioned previously, UPSERTS with UpdateItem merge root attributes. This is extremely useful in real-world applications since it allows you to make changes without needing access to the entire item. On the other hand, there are times when you might want to overwrite the input data completely while still preserving metadata. The first option is to maintain a "schema" for your data on the client side, and then use it to generate an UpdateExpression with the right combination of SET and REMOVE statements (sketched below). This can be very effective for well-defined schemas, but if the structure is more dynamic, other strategies might work better.
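One way this could look (a minimal sketch; the SCHEMA list, the buildFullOverwrite helper, and the attribute names are hypothetical, not part of any library) is to compare the incoming input against the client-side schema, SET the attributes that were supplied, and REMOVE the ones that weren't:

```javascript
// Sketch only: generate SET/REMOVE clauses from a client-side schema
const SCHEMA = ['name', 'email', 'status'] // root attributes managed by the client

const buildFullOverwrite = (input, now = new Date().toISOString()) => {
  const names = { '#ct': 'created', '#md': 'modified' }
  const values = { ':ct': now, ':md': now }
  const sets = ['#ct = if_not_exists(#ct, :ct)', '#md = :md']
  const removes = []

  for (const attr of SCHEMA) {
    names[`#${attr}`] = attr
    if (input[attr] !== undefined) {
      values[`:${attr}`] = input[attr] // supplied: overwrite the stored value
      sets.push(`#${attr} = :${attr}`)
    } else {
      removes.push(`#${attr}`) // missing from the input: remove it from the item
    }
  }

  return {
    UpdateExpression:
      `SET ${sets.join(', ')}` + (removes.length ? ` REMOVE ${removes.join(', ')}` : ''),
    ExpressionAttributeNames: names,
    ExpressionAttributeValues: values
  }
}
```

The generated parameters can then be merged with the TableName and Key and sent with UpdateItem as before.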
Attribute Isolation
Another way to preserve metadata while allowing complete item overwrites is to isolate any input data by adding it to a single map type attribute. This allows the storage of metadata (like creation time and computed indexes) at the root of the item, while still allowing simple overwrites and partial UPSERT support. Take the following example:
```javascript
{
  pk: "somePK",
  sk: "someSK",
  data: {
    someKey: "someValue",
    someOtherKey: "someOtherValue"
  },
  created: "2021-12-15T00:00:00.000Z",
  modified: "2021-12-15T00:00:00.000Z",
  type: "someEntityType",
  otherMetaData: "someValue",
  gsi1pk: "someGSI1PK", // calculated GSI PK
  gsi1sk: "someGSI1SK"  // calculated GSI SK
}
```
Here we've created a data attribute that stores our input data. This format lets us use UpdateItem to either do a complete overwrite (while still preserving metadata) by setting a new data value, or perform partial updates without having to pass in the entire object (see Adding Nested Map Attributes). A sketch of the complete-overwrite case follows.
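Roughly, and reusing the assumed Document Client from the earlier sketch, a complete overwrite of the input data that still preserves created might look like this:

```javascript
// Sketch only: replace the entire 'data' map while keeping 'created' intact
// ('client' and UpdateCommand from the earlier sketch; run inside an async function)
const now = new Date().toISOString()

await client.send(new UpdateCommand({
  TableName: 'myTable',
  Key: { pk: 'somePK', sk: 'someSK' },
  UpdateExpression: 'SET #data = :data, #ct = if_not_exists(#ct, :ct), #md = :md',
  ExpressionAttributeNames: { '#data': 'data', '#ct': 'created', '#md': 'modified' },
  ExpressionAttributeValues: {
    ':data': { someKey: 'newValue', someOtherKey: 'newOtherValue' }, // new input replaces the old map
    ':ct': now,
    ':md': now
  }
}))
```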
This can be a very effective way to isolate input data from metadata, but it comes with limitations and challenges. map type attributes support nested object manipulations such as list_append(), ADD, and REMOVE, as well as if_not_exists() to prevent overwriting existing data. However, map does not support nested set type attributes, and the syntax for updating nested map attributes is a bit cumbersome (see the sketch below). Libraries like DynamoDB Toolbox can make this a lot easier, but they still require constructing complex data structures.
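For example, a partial update of a single key inside the data map uses a document path built from expression attribute names. A rough sketch (the key and attribute names are just placeholders):

```javascript
// Sketch only: partial update of data.someKey via a document path
// ('client' and UpdateCommand from the earlier sketch; run inside an async function)
await client.send(new UpdateCommand({
  TableName: 'myTable',
  Key: { pk: 'somePK', sk: 'someSK' },
  UpdateExpression: 'SET #data.#someKey = :someValue, #md = :md',
  ExpressionAttributeNames: { '#data': 'data', '#someKey': 'someKey', '#md': 'modified' },
  ExpressionAttributeValues: { ':someValue': 'updatedValue', ':md': new Date().toISOString() }
}))
```

Note that a path-based update like this fails with an invalid document path error if the data map doesn't already exist on the item, which is part of what makes the nested syntax cumbersome.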
It's generally preferred to use separate attributes for your GSIs, but if you wanted to map secondary indexes directly to your input data, this method would prevent you from doing so, since GSI keys must be root attributes. In addition, you can't project nested attributes of a map into your GSIs; you would either need to project the entire map attribute, or copy the relevant data to root attributes.
Protecting data on OVERWRITES
If the limitations of the "Attribute Isolation" strategy don't work with your data model, then it's possible to use PutItem to allow a flexible schema of root attributes while still maintaining integrity checks on metadata. This strategy does require that any metadata you wish to preserve be passed in alongside the other input data, so you must have the ability to retrieve this data on the client side prior to making the API call.
A common scenario for this might be a web dashboard that allows users to update a data object or add custom fields that you want to store as root attributes. Because we are completely overwriting the DynamoDB item, we need to supply our metadata, but we also need that data to be immutable, so we must have a way to preserve its integrity. This is important because, depending on the situation, users could simply manipulate the API call and change metadata values, even if the UI doesn't allow it.
There are several ways to address this. Some examples include returning a hash of the metadata to be verified on the server side, maintaining state in a session, or performing additional lookups to rehydrate the attributes. However, there is a much easier way to accomplish this: a simple ConditionExpression in your PutItem API call. Not only is this method stateless (which is great for serverless applications), but you could also use it to preserve data integrity when using API Gateway as a proxy.
Below are example parameters of a PutItem API call:
```javascript
{
  TableName: 'myTable',
  Item: {
    created: 1234567890,
    ...otherInputData
  },
  ConditionExpression: '#ct = :created',
  ExpressionAttributeNames: { '#ct': 'created' },
  ExpressionAttributeValues: { ':created': created }
}
```
Here we want to preserve the integrity of the created attribute. By adding the ConditionExpression #ct = :created, we're telling DynamoDB to only allow an overwrite of this item if the supplied value of created in the Item matches the stored value of created. You can verify as many attributes as you like, all without needing to perform additional complex checks.
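If the supplied metadata doesn't match, the call is rejected with a ConditionalCheckFailedException, which you can surface as a validation error. A minimal sketch of the error handling (the error message and response shape are up to your application):

```javascript
// Sketch only: reject the write if the supplied 'created' value doesn't match the stored one
// ('client' and PutCommand from the earlier sketch; 'params' are the PutItem parameters above)
try {
  await client.send(new PutCommand(params))
} catch (err) {
  if (err.name === 'ConditionalCheckFailedException') {
    // metadata mismatch: treat as a client error rather than retrying
    throw new Error('Metadata integrity check failed')
  }
  throw err // anything else: rethrow
}
```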
Prevent overwrites of existing items
If you want to prevent overwriting items that already exist, you can add a ConditionExpression to your PutItem API calls that checks that a key attribute doesn't exist (pk in our example below).
```javascript
{
  TableName: 'myTable',
  Item: { ... },
  ConditionExpression: 'attribute_not_exists(#pk)',
  ExpressionAttributeNames: { '#pk': 'pk' }
}
```
You could alternatively check that the values of the primary key attributes don't match a stored item:
```javascript
ConditionExpression: '#pk <> :pk AND #sk <> :sk'
```
Implementing domain-specific constraints
If you're building microservices or following domain-driven design principles, you can use ConditionExpressions to restrict state changes to items. For example, if your domain logic requires that an item have a state of pending in order for it to be changed to approved, you can implement this using the following:
```javascript
{
  TableName: 'myTable',
  Key: { pk: 'somePK', sk: 'someSK' },
  UpdateExpression: 'SET #state = :newstate',
  ExpressionAttributeNames: { '#state': 'state' },
  ExpressionAttributeValues: { ':newstate': 'approved', ':existingstate': 'pending' },
  ConditionExpression: '#state = :existingstate'
}
```
These types of conditions are super useful because you don't need to perform extra operations to check the current state of the item.
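A failed condition here simply means the item wasn't in the required state, so one option (a sketch; the approveItem helper and error message are hypothetical) is to wrap the transition in a small domain function that translates the conditional failure into a meaningful domain error:

```javascript
// Sketch only: approve an item only if it is currently in the 'pending' state
// ('client' and UpdateCommand from the earlier sketch)
const approveItem = async (pk, sk) => {
  try {
    await client.send(new UpdateCommand({
      TableName: 'myTable',
      Key: { pk, sk },
      UpdateExpression: 'SET #state = :newstate',
      ExpressionAttributeNames: { '#state': 'state' },
      ExpressionAttributeValues: { ':newstate': 'approved', ':existingstate': 'pending' },
      ConditionExpression: '#state = :existingstate'
    }))
  } catch (err) {
    if (err.name === 'ConditionalCheckFailedException') {
      throw new Error('Item is not in a pending state')
    }
    throw err
  }
}
```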
Adding service level protections
In most cases, any protections or integrity checks you want to perform must be added to the API calls. For anyone writing code that interfaces directly with DynamoDB, this obviously gives them the ability to bypass most of the restrictions we've discussed. While this can be mitigated by implementing things like data abstraction layers that limit access to the raw API calls, you may still want to add additional protections at the service level.
DynamoDB, like all AWS services, requires IAM policies to grant permissions to perform certain actions. The details of IAM are way beyond the scope of this article, but below are a few examples of how IAM policies can help protect your data, even if using raw API calls.
Disable DeleteItem
Perhaps a bit obvious, but you can simply omit (or explicitly Deny) the DeleteItem permission from any IAM policies attached to specific execution environments and/or roles. This prevents those users from deleting items from your table once they are created.
javascript{ "Version": "2012-10-17", "Statement": [ { "Sid": "NoItemDeletes", "Effect": "Allow", "Action": [ "dynamodb:PutItem", "dynamodb:UpdateItem", "..." ], "Resource": "arn:aws:dynamodb:*:*:table/myTable" } ] }
Disable updating specific attributes
While protecting against deletion seems like a reasonable step, it doesn't do much good if all the attributes of an item can still be changed. IAM supports a number of really useful conditions that give you fine-grained access control over your DynamoDB tables. There are a number of additional examples here, but I've included this one, which shows how to disable updates to the created attribute:
javascript{ "Version": "2012-10-17", "Statement": [ { "Sid": "BlockCreatedUpdates", "Effect": "Allow", "Action": [ "dynamodb:UpdateItem" ], "Resource": "arn:aws:dynamodb:*:*:table/myTable", "Condition": { "ForAllValues:StringNotLike": { "dynamodb:Attributes": [ "created" ] } } } ] }
As powerful as IAM permissions are, they are not a silver bullet. With the policy above, you would not be able to perform UPSERTS that provide a default value for the created attribute. This means you'd have to allow either the PutItem permission, or create another policy that allows unrestricted UpdateItem requests. This is possible, and you could limit permissions to specific roles or environments and grant developer access appropriately, but you'd have to weigh the tradeoffs of that complexity.
I do find that fine-grained IAM policies work extremely well for API Gateway proxy integrations, especially as a way to minimize some VTL logic. But again, there are tradeoffs you need to consider.
Conclusion
This post barely scratches the surface of DynamoDB data strategies, but hopefully these examples give you some ideas on how to better protect the integrity of your DynamoDB tables.
If you want to learn more DynamoDB data strategies and modeling techniques, sign up to get information about my new DynamoDB Modeling course at DynamoDBModeling.com.