This shows you the differences between two versions of the page.
| access:access_server_2013_robustification_project [2013/04/03 17:31] mjallison [Problems] | access:access_server_2013_robustification_project [2013/04/05 19:02] (current) mjallison [Possible Solutions] | ||
|---|---|---|---|
| Line 5: | Line 5: | ||
| ===== Problems ===== | ===== Problems ===== | ||
| There are many potential problems with the current Access server architecture. In a rough order of severity they are: | There are many potential problems with the current Access server architecture. In a rough order of severity they are: | ||
| + | * Complete region failure (rare, but something close to this happened in 2010 or 2011). Fortunately the S3 buckets are not specific to a region.  | ||
| * RDS or EC2 instance failure (happened in 2011 and 2012, about once per year) | * RDS or EC2 instance failure (happened in 2011 and 2012, about once per year) | ||
| * AWS fabric failure (at least once per year), e.g. S3, network, virtual host failure | * AWS fabric failure (at least once per year), e.g. S3, network, virtual host failure | ||
| * Storage and retrieval failures, mostly experienced by the Agtek Access Java Client | * Storage and retrieval failures, mostly experienced by the Agtek Access Java Client | ||
| + | * Lack of Track redundancy, because tracks are kept in instance-specific storage. | ||
| + | * Potential black hat attacks (mainly on the AccessWeb application) | ||
| + | * Throughput of operations (appears to be DB related). | ||
| * Client failures (losing keys) | * Client failures (losing keys) | ||
| + | |||
| ===== Possible Solutions ===== | ===== Possible Solutions ===== | ||
| + | AWS Tools | ||
| + | * RDS snapshots are currently taken and retained for the last 3 days (50 GiB). Can restore from a snapshot. | ||
| + | * EC2 snapshots are taken once per day, keeping the last two days (100 GiB). Can restore from a snapshot. | ||
| + | * EBS Snapshots can be moved between regions. | ||
| + | * AWS console operations (new instance, snapshot, etc) can be automated. | ||
| + | Virtual Machine failure recovery strategies | ||
| + | * Quick reboot - Available via the AWS management console.  | ||
| + | * Quick instance recreation - Either: | ||
| + | * scripted with command-line tools on dev.agtek.com, or | ||
| + | * built into the AccessSupport tool (NOT keyforge) | ||
| + | * Scripting is easy, but the skills are not easy to transfer to another person. | ||
| + | * Building into the support tool makes them easy to use, but not as easy to adapt for future changes. | ||
| + | |||
| + | Recreate the RDS/EC2 constellation in a new region | ||
| + | * Quick instance recreation (as in the last section): allow a region specifier | ||
| + | * Migrate EBS/RDS instances: cross-region migration/snapshot. | ||
| + | |||
| + | Security issues: | ||
| + | * Implement https for web application | ||
| + | * Add a security analyzer to look for anomalies and send alerts | ||
| + | * Include failed attempts (404, 501, bad login) in the automated security analysis | ||
| + | |||
| + | Monitoring: | ||
| + | * Expand real-time monitoring goals to include: | ||
| + | * Real-time connection monitoring | ||
| + | * Operation duration | ||
| + | |||
| + | Storage problems: | ||
| + | * Most storage issues appear to be related to the Access Java Client; fix the client. | ||
| + | * Track storage can be moved to S3, increasing the safety of track storage. | ||
| + | |||
| + | ===== Recommended Solution ===== | ||
| + | * TBD: Identify robustness goals, recovery speed, etc. to guide solution selection. | ||
| + | * Attend April 30 AWS conference to get briefed on more AWS tech. | ||
| + | * Document modern (2013) system architecture | ||
| + | * Document failover process (for manual recovery), recovery procedures for failure modes.  | ||
| + | * Modify the AccessSupport tool to automate instance creation (from an existing AMI: create a snap of the existing AMI, reattach EBS), recover EBS from snaps, and repopulate the DB from a backup snap. | ||
| + | * Modify the AccessSupport tool to copy snaps to another region to prepare for the region failover process. | ||
| + | * Consider automatically copying snaps to another region for backup. | ||
| + | ===== Possible Track items to consider at the same time ===== | ||
| + | Likely only make these modifications when we rework a track product | ||
| + | * Move track storage to S3, integrate with Access Files.  | ||
| + | * Integrate track api with regular API? | ||
| + | * Drop support for firmware loads on devices (old grey boxes). | ||
| + | * Drop support for TSMAdmin client | ||
| + | * Drop support for TSMAdmin server | ||
| + | * SQL customer tables: assetid, association, gps, rtk, rtktrack, track, vehicle | ||
| + | * SQL customer tables: tsm, tmm, device | ||
| + | * Remove SupportTool tabs for Devices, Trackwork Modules, Trackwork Servers (associated tables if not already present). | ||
| ===== Server Architecture Improvements ===== | ===== Server Architecture Improvements ===== | ||
| - | The following areas are routine maintenance items and/or feature requests that need to be done. | + | The following areas are routine maintenance items and/or feature requests that need to be done. The timing is right to do these at the same time as the other efforts. | 
| * Upgrade the server infrastructure to the latest Java 7 | * Upgrade the server infrastructure to the latest Java 7 | ||
| * Add wildcard search to admin api for users | * Add wildcard search to admin api for users | ||
| * Routine update of AMI Linux server upgrades (security) | * Routine update of AMI Linux server upgrades (security) | ||
| * Possible update of entire Linux AMI (2013-03 variant released). | * Possible update of entire Linux AMI (2013-03 variant released). | ||
| + | * Performance improvements: add index to problematic tables (licence, licenseuser, licenselog). | ||
| + | * Add licenselog pruning. | ||
| + | * File/Folder level permissions. | ||
| + | * Complete removal of the Box stuff. | ||
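
The snapshot retention mentioned under Possible Solutions (RDS snapshots kept for the last 3 days, EC2 snapshots for the last two) could be expressed as a small selection rule if the snapshot handling is ever scripted or built into the AccessSupport tool. The following is only a sketch under assumptions: `Snapshot` and `snapshots_to_delete` are hypothetical names, the records are plain in-memory values, and nothing here calls AWS. Keeping the newest snapshot regardless of age is an added safety assumption, not something the page specifies.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Snapshot(NamedTuple):
    """Hypothetical record; real AWS snapshot metadata carries more fields."""
    snapshot_id: str
    created: datetime


def snapshots_to_delete(snapshots: List[Snapshot], now: datetime,
                        retain_days: int) -> List[Snapshot]:
    """Return snapshots older than the retention window, oldest first.

    The newest snapshot is never returned, even if it has aged out, so a
    stalled backup job cannot leave the system with nothing to restore.
    """
    if not snapshots:
        return []
    ordered = sorted(snapshots, key=lambda s: s.created)
    cutoff = now - timedelta(days=retain_days)
    # ordered[:-1] excludes the newest snapshot from consideration.
    return [s for s in ordered[:-1] if s.created < cutoff]


if __name__ == "__main__":
    now = datetime(2013, 4, 5, 19, 0)
    # One snapshot per day for the last six days (illustrative data only).
    rds = [Snapshot(f"rds-{d}", now - timedelta(days=d)) for d in range(6)]
    for snap in snapshots_to_delete(rds, now, retain_days=3):
        print("would delete", snap.snapshot_id)
```

Separating the selection rule from the deletion call keeps the policy testable without touching live snapshots; the same function would serve for the two-day EC2 window by passing `retain_days=2`.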