Access Server 2013 Robustification Project

The Access/License server has been running fine for several years but it has become apparent that the system is vulnerable to some “single point of failure” conditions.

Problems

There are many potential problems with the current Access server architecture. In a rough order of severity they are:

  • Complete region failure (rare, but something close to this happened in 2010 or 2011). Fortunately the S3 buckets are not specific to a region.
  • RDS or EC2 instance failure (has happened about once per year, in 2011 and 2012)
  • AWS fabric failure (at least once per year), e.g. S3, network, virtual host failure
  • Storage and retrieval failures, mostly experienced by the Agtek Access Java Client
  • Lack of Track redundancy, because tracks are stored in instance-specific storage.
  • Potential black hat attacks (mainly on the AccessWeb application)
  • Throughput of operations; the limit appears to be DB related.
  • Client failures (losing keys)

Possible Solutions

AWS Tools

  • RDS snapshots are currently taken and retained for the last 3 days (50 GiB). Can restore from a snapshot.
  • EC2 snapshots are taken once per day, keeping the last two days (100 GiB). Can restore from a snapshot.
  • EBS Snapshots can be moved between regions.
  • AWS console operations (new instance, snapshot, etc) can be automated.
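The retention figures above (3 days for RDS, 2 days for EC2) could be enforced by a small pruning step once snapshot handling is automated. The following is a minimal sketch only: it assumes snapshot metadata has already been fetched into plain records, the `Snapshot` type and `snapshots_to_delete` helper are hypothetical names, and no AWS API is called.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical record for one snapshot; this is not an AWS type."""
    snapshot_id: str
    created: datetime


def snapshots_to_delete(snapshots, keep_days, now):
    """Return snapshots older than keep_days, always sparing the newest one."""
    if not snapshots:
        return []
    cutoff = now - timedelta(days=keep_days)
    # Never delete the most recent snapshot, even if it is past the cutoff,
    # so there is always something to restore from.
    newest = max(snapshots, key=lambda s: s.created)
    return [s for s in snapshots if s.created < cutoff and s is not newest]
```

Calling `snapshots_to_delete(rds_snaps, keep_days=3, now=datetime.utcnow())` would return the candidates for deletion; the caller decides whether to actually remove them. Sparing the newest snapshot is a design choice of this sketch, not something the current setup is known to do.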

Virtual Machine failure recovery strategies

  • Quick reboot - Available via the AWS management console.
  • Quick instance recreation - Either:
    • scripted with command-line tools on dev.agtek.com, or
    • built into the AccessSupport tool (NOT keyforge)
    • Scripting is easy, but the skills are not easy to transfer to another person.
    • Building the operations into the support tool makes them easy to use, but not as easy to adapt to future changes.

Recreate RDS, EC2 constellation in new region

  • Quick instance recreation (as in last section): allow region specifier
  • Migrate EBS / RDS instances: cross-region migration/snapshot.

Security issues:

  • Implement HTTPS for the web application
  • Add a security analyzer to look for anomalies and send alerts
  • Include failed attempts (404, 501, bad login) in the automated security analysis
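A minimal sketch of the failure-counting idea: count failed requests per client and flag any client that reaches a threshold. The event shape (a `(client, outcome)` pair), the outcome labels, and the `suspicious_clients` helper are all assumptions made for illustration; the real log format of the AccessWeb application is not described here.

```python
from collections import Counter

# Outcomes treated as failures; these labels are assumptions for this sketch.
FAILURE_OUTCOMES = {"404", "501", "bad_login"}


def suspicious_clients(events, threshold):
    """Return {client: failure_count} for clients with at least `threshold` failures.

    `events` is an iterable of (client, outcome) pairs, for example
    ("10.0.0.1", "bad_login").
    """
    failures = Counter(
        client for client, outcome in events if outcome in FAILURE_OUTCOMES
    )
    return {client: n for client, n in failures.items() if n >= threshold}
```

An alerting step could then send a notification for each client returned. A fixed threshold is the simplest possible anomaly rule; a time window or per-outcome weighting would be a natural refinement.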

Monitoring:

  • Expand real-time monitoring goals to include:
    • Real-time connection monitoring
    • Operation duration
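One way to collect operation durations is to wrap each server operation in a timing context. The sketch below keeps everything in memory and uses a count of in-flight operations as a stand-in for connection monitoring; the `OperationMonitor` class and the operation name in the usage note are hypothetical and not part of the existing server.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class OperationMonitor:
    """Tracks in-flight operations and per-operation durations in memory."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock          # injectable so the timing can be tested
        self.active = 0              # operations currently in progress
        self.durations = defaultdict(list)

    @contextmanager
    def track(self, name):
        """Time one operation; the duration is recorded even if it raises."""
        self.active += 1
        start = self._clock()
        try:
            yield
        finally:
            self.durations[name].append(self._clock() - start)
            self.active -= 1

    def slowest(self, name):
        """Longest recorded duration for `name`, or None if never recorded."""
        samples = self.durations.get(name)
        return max(samples) if samples else None
```

Usage would look like `with monitor.track("license_checkout"): ...` around the body of an operation. This sketch is not thread-safe; a real server would need a lock around the counters or a per-thread collector.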

Storage problems:

  • Most storage issues appear to be related to the Access Java Client; fix the client.
  • Track storage can be moved to S3, increasing the safety of track storage.
  • TBD: Identify robustness goals, recovery speed, etc. to guide solution selection.
  • Attend April 30 AWS conference to get briefed on more AWS tech.
  • Document modern (2013) system architecture
  • Document failover process (for manual recovery), recovery procedures for failure modes.
  • Modify AccessSupport tool to automate instance creation (from existing AMI-create snap of existing AMI, reattach EBS), recover EBS from snaps, and repopulate DB from backup snap.
  • Modify AccessSupport tool to copy snaps to another region to prep for region failover process.
  • Consider automatically copying snaps to another region for backup.

Possible Track items to consider at the same time

We will likely make these modifications only when we rework a Track product.

  • Move track storage to S3, integrate with Access Files.
  • Integrate track API with regular API?
  • Drop support for firmware loads on devices (old grey boxes).
  • Drop support for TSMAdmin client
  • Drop support for TSMAdmin server
    • SQL customer tables; assetid, association, gps, rtk, rtktrack, track, vehicle
    • SQL customer tables; tsm, tmm, device
  • Remove SupportTool tabs for Devices, Trackwork Modules, Trackwork Servers (associated tables if not already present).

Server Architecture Improvements

The following areas are routine maintenance items and/or feature requests that need to be done. The timing is right to do these at the same time as the other efforts.

  • Upgrade the server infrastructure to the latest Java 7
  • Add wildcard search to admin API for users
  • Routine AMI Linux server updates (security)
  • Possible update of entire Linux AMI (2013-03 variant released).
  • Performance improvements: add indexes to problematic tables (licence, licenseuser, licenselog).
  • Add licenselog pruning.
  • File/Folder level permissions.
  • Complete removal of the Box stuff.
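The indexing and licenselog pruning items could look roughly like the sketch below. It uses an in-memory SQLite database purely so the example is runnable; the production database runs on RDS and its schema is not described here. Only the table name `licenselog` comes from the list above; the columns `user_id` and `logged_at`, the index name, and the helper functions are hypothetical.

```python
import sqlite3


def add_licenselog_index(conn):
    """Index the timestamp column so pruning and date-range queries avoid full scans."""
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_licenselog_logged_at "
        "ON licenselog (logged_at)"
    )


def prune_licenselog(conn, cutoff_iso):
    """Delete rows logged before `cutoff_iso` (ISO-8601 text); return rows removed."""
    cur = conn.execute("DELETE FROM licenselog WHERE logged_at < ?", (cutoff_iso,))
    conn.commit()
    return cur.rowcount


def demo():
    """Build a tiny sample table, prune it, and report (removed, remaining)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE licenselog (user_id INTEGER, logged_at TEXT)")
    conn.executemany(
        "INSERT INTO licenselog VALUES (?, ?)",
        [
            (1, "2011-06-01T00:00:00"),
            (2, "2012-06-01T00:00:00"),
            (3, "2013-04-01T00:00:00"),
        ],
    )
    add_licenselog_index(conn)
    removed = prune_licenselog(conn, "2013-01-01T00:00:00")
    remaining = conn.execute("SELECT COUNT(*) FROM licenselog").fetchone()[0]
    return removed, remaining
```

Storing timestamps as ISO-8601 text keeps string comparison consistent with chronological order in this sketch; a real schema would more likely use a native date/time column, and the retention cutoff would need to be agreed before any pruning runs against production data.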
access/access_server_2013_robustification_project.txt · Last modified: 2013/04/05 19:02 by mjallison