Access Server 2013 Robustification Project
The Access/License server has been running fine for several years, but it has become apparent that the system is vulnerable to some “single point of failure” conditions.
Problems
There are many potential problems with the current Access server architecture. In rough order of severity, they are:
Complete region failure (rare, but something close to this happened in 2010 or 2011). Fortunately the S3 buckets are not specific to a region.
RDS or EC2 instance failure (has happened about once per year, in 2011 and 2012)
AWS fabric failure (at least once per year), e.g. S3, network, virtual host failure
Storage and retrieval failures, mostly experienced by the Agtek Access Java Client
Lack of track redundancy, because tracks are stored in instance-specific storage.
Potential black hat attacks (mainly on the AccessWeb application)
Throughput of operations; the bottleneck appears to be DB related.
Client failures (losing keys)
Possible Solutions
AWS Tools
RDS snapshots are currently taken and retained for the last 3 days (50 GiB). Can restore from a snapshot.
EC2 snapshots are taken once per day, keeping the last two days (100 GiB). Can restore from a snapshot.
EBS Snapshots can be moved between regions.
AWS console operations (new instance, snapshot, etc.) can be automated.
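The retention windows listed above (3 days for RDS, 2 days for EC2) could be enforced by an automated job. The following is only a minimal sketch of the selection logic, assuming snapshots are available as (id, creation time) pairs; the function name, the data shape, and the snapshot IDs are hypothetical and no AWS API is called.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(snapshots, keep_days, now):
    """Return the IDs of snapshots older than the retention window.

    snapshots: iterable of (snapshot_id, created_at) pairs (hypothetical shape).
    keep_days: retention window in days.
    now: reference time, passed in so the logic is easy to test.
    """
    cutoff = now - timedelta(days=keep_days)
    return sorted(sid for sid, created in snapshots if created < cutoff)


# Hypothetical snapshot listing; real IDs and times would come from AWS.
now = datetime(2013, 4, 10, 12, 0)
rds_snaps = [
    ("rds-a", datetime(2013, 4, 10, 3, 0)),
    ("rds-b", datetime(2013, 4, 9, 3, 0)),
    ("rds-c", datetime(2013, 4, 8, 3, 0)),
    ("rds-d", datetime(2013, 4, 6, 3, 0)),
]

expired = snapshots_to_delete(rds_snaps, keep_days=3, now=now)
print(expired)
```

Keeping the selection separate from the actual delete call makes it possible to review or log the list before anything is removed.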
Virtual Machine failure recovery strategies
Recreate the RDS and EC2 constellation in a new region
Security issues:
Implement HTTPS for the web application
Add a security analyzer to look for anomalies and send alerts
Include failed attempts (404, 501, bad login) in the automated security analysis
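One simple form the security analyzer could take is a per-client failure counter. The sketch below is an illustration only: the event tuple layout, the threshold, and the function name are assumptions, and a real analyzer would read from the web application's logs and send alerts rather than return a dictionary.

```python
from collections import Counter

# HTTP statuses treated as failures, taken from the list above.
FAILURE_STATUSES = {404, 501}


def suspicious_clients(events, threshold):
    """Count failures per client and return those at or above threshold.

    events: iterable of (client_ip, http_status, login_ok) tuples, where
    login_ok is True/False for login requests and None otherwise
    (a hypothetical record layout).
    """
    failures = Counter()
    for ip, status, login_ok in events:
        if status in FAILURE_STATUSES or login_ok is False:
            failures[ip] += 1
    return {ip: n for ip, n in failures.items() if n >= threshold}


# Hypothetical sample events.
events = [
    ("10.0.0.5", 404, None),
    ("10.0.0.5", 200, False),  # bad login
    ("10.0.0.5", 501, None),
    ("10.0.0.9", 200, True),   # good login
    ("10.0.0.9", 404, None),
]

flagged = suspicious_clients(events, threshold=3)
print(flagged)
```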
Monitoring:
Storage problems:
Most storage issues appear to be related to the Access Java Client; fix the client.
Track storage can be moved to S3, increasing the safety of track storage.
Recommended Solution
TBD: Identify robustness goals, recovery speed, etc. to guide solution selection.
Attend April 30 AWS conference to get briefed on more AWS tech.
Document modern (2013) system architecture
Document failover process (for manual recovery), recovery procedures for failure modes.
Modify AccessSupport tool to automate instance creation (from an existing AMI: create a snap of the existing AMI, reattach EBS), recover EBS from snaps, and repopulate the DB from a backup snap.
Modify AccessSupport tool to copy snaps to another region to prep for region failover process.
Consider auto copy snaps to another region for backup.
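For the cross-region copy items above, the tool would need to decide which snaps still have to be copied. A minimal sketch of that decision follows, assuming the source snaps and the set of already-copied source IDs are known; the function name, the data shapes, and the snap IDs are hypothetical, and the actual copy call is left out.

```python
def snapshots_needing_copy(source_snapshots, copied_source_ids):
    """Return IDs of source snapshots not yet copied, newest first.

    source_snapshots: list of (snapshot_id, created_at_iso) pairs.
    copied_source_ids: set of source snapshot IDs that already have a
    copy in the backup region (both shapes are hypothetical).
    """
    pending = [s for s in source_snapshots if s[0] not in copied_source_ids]
    # ISO-8601 date strings sort chronologically, so a plain sort works.
    pending.sort(key=lambda s: s[1], reverse=True)
    return [sid for sid, _ in pending]


# Hypothetical snapshot listing.
source = [
    ("snap-1", "2013-04-08"),
    ("snap-2", "2013-04-09"),
    ("snap-3", "2013-04-10"),
]

todo = snapshots_needing_copy(source, copied_source_ids={"snap-1"})
print(todo)
```

Copying the newest snap first means that an interrupted run still leaves the most recent recovery point in the backup region.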
Possible Track items to consider at the same time
Likely only make these modifications when we rework a track product
Move track storage to S3, integrate with Access Files.
Integrate track API with regular API?
Drop support for firmware loads on devices (old grey boxes).
Drop support for TSMAdmin client
Drop support for TSMAdmin server
SQL customer tables: assetid, association, gps, rtk, rtktrack, track, vehicle
SQL customer tables: tsm, tmm, device
Remove SupportTool tabs for Devices, Trackwork Modules, Trackwork Servers (associated tables if not already present).
Server Architecture Improvements
The following areas are routine maintenance items and/or feature requests that need to be done. The timing is right to do these at the same time as the other efforts.
Upgrade the server infrastructure to the latest Java 7
Add wildcard search to the admin API for users
Routine AMI Linux server updates (security)
Possible update of entire Linux AMI (2013-03 variant released).
Performance improvements: add indexes to problematic tables (licence, licenseuser, licenselog).
Add licenselog pruning.
File/Folder level permissions.
Complete removal of the Box stuff.
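The index and licenselog pruning items above can be illustrated together. This is a sketch under stated assumptions, not the actual schema: it uses an in-memory SQLite database as a stand-in for the RDS database, and the column names (license_id, logged_at), the index name, and the cutoff date are hypothetical.

```python
import sqlite3


def prune_licenselog(conn, cutoff):
    """Delete licenselog rows older than cutoff (ISO date string).

    Returns the number of rows removed.
    """
    cur = conn.execute("DELETE FROM licenselog WHERE logged_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount


# In-memory stand-in for the real database (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE licenselog ("
    "id INTEGER PRIMARY KEY, license_id INTEGER, logged_at TEXT)"
)
# An index on the timestamp column keeps the pruning DELETE from scanning
# the whole table.
conn.execute("CREATE INDEX idx_licenselog_logged_at ON licenselog (logged_at)")
conn.executemany(
    "INSERT INTO licenselog (license_id, logged_at) VALUES (?, ?)",
    [(1, "2011-06-01"), (1, "2012-12-15"), (2, "2013-03-20")],
)

removed = prune_licenselog(conn, "2013-01-01")
remaining = conn.execute("SELECT COUNT(*) FROM licenselog").fetchone()[0]
print(removed, remaining)
```

How long log rows should be kept is a policy decision that this sketch leaves open.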