====== Access Server 2013 Robustification Project ======

The Access/License server has been running fine for several years, but it has become apparent that the system is vulnerable to some "single point of failure" conditions.

===== Problems =====

There are many potential problems with the current Access server architecture. In rough order of severity they are:

  * Complete region failure (rare, but something close to this happened in 2010 or 2011). Fortunately the S3 buckets are not specific to a region.
  * RDS or EC2 instance failure (has happened about once per year, in 2011 and 2012).
  * AWS fabric failure (at least once per year), e.g. S3, network, or virtual host failure.
  * Storage and retrieval failures, mostly experienced by the Agtek Access Java Client.
  * Lack of redundancy for Tracks, because they are stored in instance-specific storage.
  * Potential black hat attacks (mainly on the AccessWeb application).
  * Throughput of operations; the bottleneck appears to be DB related.
  * Client failures (losing keys).

===== Possible Solutions =====

AWS Tools

  * RDS snapshots are currently taken and retained for the last 3 days (50 GiB). Can restore from a snapshot.
  * EC2 snapshots are taken once per day, keeping the last two days (100 GiB). Can restore from a snapshot.
  * EBS snapshots can be moved between regions.
  * AWS console operations (new instance, snapshot, etc.) can be automated.

Virtual Machine failure recovery strategies

  * Quick reboot: available via the AWS management console.
  * Quick instance recreation, either:
    * scripted in command-line tools on dev.agtek.com, or
    * built into the AccessSupport tool (NOT keyforge).
    * Scripting is easy, but the skills are not easy to transfer to another person.
    * Building it into the support tool makes it easy to use, but not as easy to adapt for future changes.

Recreate RDS, EC2 constellation in new region

  * Quick instance recreation (as in the last section): allow a region specifier.
  * Migrate EBS / RDS instances: cross-region migration/snapshot.
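The retention windows noted under AWS Tools (3 days for RDS, 2 days for EC2) could be enforced by a small script once snapshot handling is automated. The following is only a sketch of the selection logic: the function name, the ''(snapshot_id, created_at)'' tuple shape, and the rule of never deleting the newest snapshot are assumptions for illustration, and no AWS calls are made.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(snapshots, keep_days, now):
    """Return IDs of snapshots older than keep_days, oldest first.

    snapshots: iterable of (snapshot_id, created_at) tuples (hypothetical shape).
    The newest snapshot is never returned, so at least one restore point
    survives even if the snapshot job has been failing for a while.
    """
    ordered = sorted(snapshots, key=lambda s: s[1])
    if not ordered:
        return []
    cutoff = now - timedelta(days=keep_days)
    newest_id = ordered[-1][0]
    return [sid for sid, created in ordered
            if created < cutoff and sid != newest_id]


if __name__ == "__main__":
    # Made-up snapshot IDs and dates, purely for demonstration.
    demo = [
        ("snap-a", datetime(2013, 4, 5)),
        ("snap-b", datetime(2013, 4, 8)),
        ("snap-c", datetime(2013, 4, 10)),
    ]
    print(snapshots_to_delete(demo, keep_days=3, now=datetime(2013, 4, 10, 12, 0)))
```

Keeping the selection separate from the actual delete call would let the same rule be reused for RDS and EC2 snapshots, and for copies held in a second region.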
Security issues:

  * Implement HTTPS for the web application.
  * Add a security analyzer to look for anomalies and send alerts.
  * Include failed attempts (404, 501, bad login) in the automated security analysis.

Monitoring:

  * Increase real-time monitoring goals to include:
    * Real-time connection monitoring
    * Operation duration

Storage problems:

  * Most storage issues appear to be related to the Access Java Client; fix it.
  * Track storage can be moved to S3, increasing the safety of track storage.

===== Recommended Solution =====

  * TBD: Identify robustness goals, recovery speed, etc. to guide solution selection.
  * Attend the April 30 AWS conference to get briefed on more AWS tech.
  * Document the modern (2013) system architecture.
  * Document the failover process (for manual recovery) and the recovery procedures for each failure mode.
  * Modify the AccessSupport tool to automate instance creation (from an existing AMI: create a snap of the existing AMI, reattach EBS), recover EBS from snaps, and repopulate the DB from a backup snap.
  * Modify the AccessSupport tool to copy snaps to another region to prep for the region failover process.
  * Consider automatically copying snaps to another region for backup.

===== Possible Track items to consider at the same time =====

Likely only make these modifications when we rework a track product.

  * Move track storage to S3, integrate with Access Files.
  * Integrate the track API with the regular API?
  * Drop support for firmware loads on devices (old grey boxes).
  * Drop support for the TSMAdmin client.
  * Drop support for the TSMAdmin server.
  * SQL customer tables: assetid, association, gps, rtk, rtktrack, track, vehicle
  * SQL customer tables: tsm, tmm, device
  * Remove SupportTool tabs for Devices, Trackwork Modules, Trackwork Servers (associated tables if not already present).

===== Server Architecture Improvements =====

The following areas are routine maintenance items and/or feature requests that need to be done. The timing is right to do these at the same time as the other efforts.

  * Upgrade the server infrastructure to the latest Java 7.
  * Add wildcard search for users to the admin API.
  * Routine AMI Linux server updates (security).
  * Possible update of the entire Linux AMI (2013-03 variant released).
  * Performance improvements: add indexes to problematic tables (licence, licenseuser, licenselog).
  * Add licenselog pruning.
  * File/folder level permissions.
  * Complete removal of the Box stuff.
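
The index and licenselog pruning items above might look roughly like the sketch below. It uses Python's built-in ''sqlite3'' as a stand-in for the production RDS database, and the ''logged_at'' column, the index name, and the ISO-8601 timestamp format are all assumptions, since the real licenselog schema is not described here.

```python
import sqlite3


def prune_licenselog(conn, cutoff_iso):
    """Delete licenselog rows older than cutoff_iso; return the count removed.

    Assumes a hypothetical logged_at column holding ISO-8601 text, which
    sorts chronologically under plain string comparison.
    """
    # An index on the timestamp column keeps the range delete from
    # scanning the whole table.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_licenselog_logged_at "
        "ON licenselog (logged_at)"
    )
    cur = conn.execute(
        "DELETE FROM licenselog WHERE logged_at < ?", (cutoff_iso,)
    )
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    # In-memory demo table with made-up rows.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE licenselog "
        "(id INTEGER PRIMARY KEY, logged_at TEXT, message TEXT)"
    )
    conn.executemany(
        "INSERT INTO licenselog (logged_at, message) VALUES (?, ?)",
        [("2012-01-15T00:00:00", "old"), ("2013-03-01T00:00:00", "recent")],
    )
    print(prune_licenselog(conn, "2013-01-01T00:00:00"))
```

On the real database the same two statements would need adapting to its SQL dialect, and the retention cutoff would follow from the robustness goals still marked TBD.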