I help engineers: I create tools that make it easier to validate, release, and observe their deployments. I design security-conscious platform services that minimize risk and magnify observability. I bring 33 years of hands-on experience creating, automating, and evolving large-scale Internet services.
I am an author of practical technology books from O'Reilly and Packt:
I contributed content and technical editing for the following books:
My main objective was to apply my experience to focus potential solutions on achieving business directives.
Pinger provides free texting and calling to millions of concurrent interactive app users. Any slowness or failure in the backend service is immediately visible to users in every time zone, so every change to the service had to be evaluated and implemented carefully.
The network used Juniper SRX gateways and EX switches at the edge, Cisco switches internally, and NetScaler load balancers. At the time I arrived there was very little documentation of the network, and everything was done cowboy-style directly on the devices. I restructured the network to isolate the out-of-band management systems and built tools to simplify common operations.
Implemented Puppet management of diverse global sites. Created tools to ease hands-on management and automate common processes. Automated creation of application-specific data points and alerts in Zenoss.
The Pinger service had two production sites for the backend service, with voice media systems spread around the world. I revised Pinger's manual, one-view-of-everything configuration push system to accommodate multiple diverse internal and external configurations around the globe.
All network management was done via Zenoss or provided data to Zenoss. I cleaned up a lot of issues with Zenoss and tracked open bugs with the Zenoss team. I wrote Python code to make changes to the backend store via zendmd, and to automate creation and removal of Zenoss data points and graphs.
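As a rough illustration of that zendmd scripting, here is a minimal sketch of automated datapoint removal. The device, template, and datapoint names are hypothetical, and the datasource/datapoint method names reflect my memory of Zenoss 3.x-era internals, so treat them as assumptions:

```python
# Run inside zendmd, which binds the Zenoss root object `dmd` and a
# commit() helper. All names here are hypothetical examples, and the
# manage_deleteRRDDataPoints call is from memory of Zenoss 3.x-era
# internals -- verify against your Zenoss version before use.

def remove_datapoint(device_name, template_name, datapoint_name):
    """Remove one datapoint from a device's performance template."""
    device = dmd.Devices.findDevice(device_name)
    if device is None:
        print('no such device: %s' % device_name)
        return
    for template in device.getRRDTemplates():
        if template.id != template_name:
            continue
        for ds in template.datasources():
            if datapoint_name in [dp.id for dp in ds.datapoints()]:
                ds.manage_deleteRRDDataPoints([datapoint_name])
    commit()  # persist the change to the ZODB backend store

remove_datapoint('app01.example.com', 'AppMetrics', 'queueDepth')
```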
Designed a completely automated, vendor-agnostic, worldwide provisioning system for the Equinix Carrier Ethernet Exchange. Delivered it 3 months ahead of schedule. Created an automated customer self-help OAM and RFC-2544 testing system.
The initial provisioning deliverable was due by October 2010; I delivered it in July 2010. This service was Equinix's first completely automated provisioning system, so I had to break a lot of new ground, including significant amounts of data exchange between many business units in the company.
The main code had to be vendor-independent: the provisioning system deployed on Alcatel-Lucent ESS-7200 switches at first, but also had to support Juniper switches within a year.
The code acquired operating parameters from Oracle network port records and the Oracle Financials InstallBase module, storing them in local databases kept synchronized with the related sources.
The code base was built to be environment-aware, allowing it to function in NetDev, SwDev, QA, UAT, PILOT, and PROD instances without any code changes.
Designed an integrated code base which allowed automated OAM and RFC-2544 network testing from the customer portal. This code dynamically provisioned new connections on the Alcatel gear to Accedian NIDs and ran test suites for the customer without any human involvement.
Enhanced the existing statistics systems to use new common libraries for reporting errors and notifying the NOC of issues, and built the requirements and a test lab to validate those changes.
Implemented production IPv6 allocation, routing, and name service for customers.
Redesigned the multi-site backbone to provide greater redundancy and flexibility. Upgraded the switching platform to Force10 for ASIC-based forwarding and hitless failover.
When I arrived, the network depended on proxy ARP to the border gateway, which required per-customer configuration on every Extreme Black Diamond switch in the network. I simplified the configuration to use local routing and OSPF, which improved flexibility and redundancy in case of link failures between the switches.
After this was done, I removed some expensive peering points and links which were logically redundant but not physically so. Then I arranged truly redundant peering with multiple tier-1 providers and settlement-free peering on the PAIX public mesh.
The Extreme switches had a number of fairly horrible bugs in their BGP implementation which made them unable to handle full feeds from all providers (I built test cases and proved this to Extreme in their own lab). I engineered a tiered BGP solution which provided full routing tables to customers while presenting a limited view of functional routes to the Extreme Black Diamond switches.
The net effect of these changes was to reduce total cost per megabit from $135 to $32 and to increase the traffic customers passed through the network by an order of magnitude. Reliability improved to nearly 3 years without a single customer-visible network outage; the outage that finally ended that streak was a power failure under the control of facilities staff.
I also acquired an ARIN IPv4 block and migrated all customers from provider-specific IP space to the provider-independent block, which involved many months of wrangling customers.
This was a massive, multi-year project to replace the aging Extreme Black Diamond switches at the core of the network.
I created and continuously updated the business requirements, identified all market solutions capable of handling datacenter core duties, and provided all findings and total-cost-of-ownership details directly to the board of directors. I technically evaluated switches from Cisco, Juniper, Force10, and Foundry, as well as the Extreme Black Diamond 10K and Alpine.
When we selected a potential solution, I built a test lab and evaluated its performance and functionality in extreme detail. In the Force10 test, I identified 6 bugs in the routing protocols of the FTOS software, which compared favorably with the dozens in the Cisco 6500 product at the time.
After acquisition of the Force10 E300 switches, I created and extensively tested a Perl script which converted all customer configuration from Extreme to Force10/Cisco format. This allowed us to simply move cables at migration time.
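The conversion itself was mostly mechanical, table-driven line rewriting. As a rough illustration of the idea (in Python rather than the original Perl, and with syntax for both platforms simplified from memory):

```python
import re

# Table-driven config translation sketch. The Extreme and Force10
# syntax shown is simplified from memory and purely illustrative; the
# real script handled many more statement types and edge cases.
RULES = [
    # Extreme: configure vlan cust42 tag 42 -> Force10-style interface vlan
    (re.compile(r'^configure vlan (\S+) tag (\d+)$'),
     lambda m: 'interface vlan %s\n name %s' % (m.group(2), m.group(1))),
    # Extreme: create vlan "cust42" -> no direct equivalent; note it
    (re.compile(r'^create vlan "(\S+)"$'),
     lambda m: '! vlan %s defined when tagged below' % m.group(1)),
]

def convert(extreme_config):
    out = []
    for line in extreme_config.splitlines():
        line = line.strip()
        for pattern, emit in RULES:
            match = pattern.match(line)
            if match:
                out.append(emit(match))
                break
        else:
            out.append('! unhandled: %s' % line)  # flag for manual review
    return '\n'.join(out)

print(convert('create vlan "cust42"\nconfigure vlan cust42 tag 42'))
```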
I replicated all standard customer functions in the new environment using before/after Wiki documentation. As Force10 did not have a function to export its configuration on a schedule, I wrote a small shell script suitable for nightly cron invocation which caused the unit to TFTP its configuration to a collection host.
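The original was a small shell script, but the approach was simply to log in and issue a copy command. A minimal sketch of the same idea in Python; the hostname, credentials, prompts, and exact FTOS copy syntax are assumptions:

```python
#!/usr/bin/env python
# Nightly cron job: log into the switch and trigger a TFTP upload of
# the running configuration. Hostname, credentials, prompt strings,
# and the FTOS copy syntax below are illustrative assumptions.
import telnetlib
import time

SWITCH = 'core-sw1.example.net'   # hypothetical switch
TFTP = '192.0.2.10'               # hypothetical TFTP collection host

tn = telnetlib.Telnet(SWITCH, 23, timeout=30)
tn.read_until(b'Login: ')
tn.write(b'backup\n')
tn.read_until(b'Password: ')
tn.write(b'secret\n')  # a real script would read this from a protected file
tn.read_until(b'#')
filename = 'core-sw1-%s.cfg' % time.strftime('%Y%m%d')
tn.write(('copy running-config tftp://%s/%s\n' % (TFTP, filename)).encode())
tn.read_until(b'#', timeout=60)   # wait for the copy to complete
tn.write(b'exit\n')
tn.close()
```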
Implemented a distributed Nagios network monitoring system. Wrote custom plug-ins for specialized tests of old-world facilities management gear. Unified all systems to a cfengine-controlled FreeBSD standard.
Built a set of Nagios monitoring systems with instances in each datacenter. Configured all Nagios instances to report via NSCA to a central monitoring system used by NOC staff.
Created custom Nagios event handlers to only report problems visible from multiple instances (where appropriate).
Created a Nagios check utility to alarm based on a variety of data stored in pre-existing RRD files. This utility does not require Nagios to gather the RRD data, and deals appropriately with both kibibytes and kilobytes.
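A minimal sketch of that utility, assuming the rrdtool Python bindings and standard Nagios plugin exit codes; the thresholds, flags, and unit handling shown here are illustrative:

```python
#!/usr/bin/env python
# check_rrd: alarm on the most recent value in a pre-existing RRD file,
# without Nagios doing the data collection. Thresholds and the unit
# flag are illustrative; exit codes follow the Nagios plugin standard.
import sys
import rrdtool

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check(rrd_file, warn, crit, binary_prefix=False):
    # Fetch the last 10 minutes of averaged data.
    (start, end, step), names, rows = rrdtool.fetch(
        rrd_file, 'AVERAGE', '--start', '-600')
    values = [row[0] for row in rows if row[0] is not None]
    if not values:
        print('UNKNOWN: no recent data in %s' % rrd_file)
        return UNKNOWN
    # Normalize to base units: some sources store kibi (1024-based)
    # values, others kilo (1000-based) values.
    value = values[-1] * (1024 if binary_prefix else 1000)
    if value >= crit:
        print('CRITICAL: %s = %.1f' % (names[0], value))
        return CRITICAL
    if value >= warn:
        print('WARNING: %s = %.1f' % (names[0], value))
        return WARNING
    print('OK: %s = %.1f' % (names[0], value))
    return OK

if __name__ == '__main__':
    sys.exit(check(sys.argv[1], float(sys.argv[2]), float(sys.argv[3])))
```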
Created Nagios check utilities to gather useful alarm data from facilities systems, including large APC units, old and new MGE power systems, IT Watchdogs (and assorted other) environmental sensors, and a variety of units speaking telnet or custom protocols.
Built another Nagios environment for customer monitoring. Built an easy-to-use monitoring setup and management UI inside the Colo Control Panel so that customers could add and remove monitoring of hosts with zero NOC staff involvement. This involved reading in and writing out Nagios configurations, as well as careful validation and on-the-fly loading of the new configurations.
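The critical requirement was never loading a broken configuration into the running daemon. A minimal sketch of that write-validate-reload cycle, with hypothetical paths, template, and reload command:

```python
import os
import subprocess

# Write-validate-reload cycle for customer-managed monitoring. Paths,
# the host template, and the reload command are hypothetical; the real
# UI also parsed existing configs and managed services, not just hosts.
HOST_TEMPLATE = """define host {
    use        customer-host
    host_name  %(name)s
    address    %(address)s
}
"""

def add_customer_host(name, address):
    conf = '/etc/nagios/customers/%s.cfg' % name
    tmp = conf + '.new'
    with open(tmp, 'w') as f:
        f.write(HOST_TEMPLATE % {'name': name, 'address': address})
    os.rename(tmp, conf)
    # Validate the complete configuration before touching the daemon.
    result = subprocess.run(['nagios', '-v', '/etc/nagios/nagios.cfg'],
                            capture_output=True)
    if result.returncode != 0:
        os.unlink(conf)  # roll back rather than break monitoring
        raise ValueError('config rejected:\n' + result.stdout.decode())
    subprocess.run(['service', 'nagios', 'reload'], check=True)
```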
The SVcolo environment had dozens of systems with few common functions. These units had all been built with the same script, but had diverged significantly over time. I built a test lab and evaluated cfengine2, Puppet, and bcfg2 as possible management solutions.
After selecting cfengine, I rebuilt the entire multi-hundred-line build script as a cfengine ruleset to maintain the systems over time. Each system's unique functions became documented and described within the policy language, providing a centralized repository of configuration information tracked in a source code tree.
As cfengine did not have FreeBSD package management at the time, I engineered code to properly support FreeBSD and got it integrated into the main source tree. I followed up by enhancing cfengine's package management to include removing and upgrading packages for all platforms, including a significant amount of optimizing for Yum environments.
Some other tasks: upgraded a simple spreadsheet of customer data into a decentralized MySQL database with dozens of tables containing all customer resources and assignments.
Built a LAMP (Perl) server setup to handle both internal/staff and customer requests. Designed a set of OO modules around each set of data, and another set to handle interactive/Ajax requests. Created a UI for the staff to review and maintain this data. Upgraded the customer UI from a 1993-style 3-page web UI to a 100-plus-page modern Ajax-enhanced customer portal. Continuously upgraded both environments to provide more customer-management functions to the NOC staff and self-help functions for customers, including:
Set up an rwhois server answering all queries with customer IP allocation data from the database.
Created various hourly/nightly/weekly/monthly reports based on customer power and network usage allowing NOC staff to proactively prevent power circuit overload, system compromise/abuse issues, etc.
Developed a set of tools to pause the HTTP servers, push out new code, refresh template caches, and restart HTTP service, providing no visible downtime to users.
This required custom enhancements to rdist, and later the evolution of a wrapper system around rsync (sketched below).
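A minimal sketch of the later rsync-based push; host names, paths, and the graceful-restart command are hypothetical stand-ins:

```python
import subprocess

# Minimal sketch of the rsync-based code push. Hosts, paths, and the
# restart command are hypothetical; the original wrapper also paused
# health checks and refreshed template caches between the two steps.
WEB_HOSTS = ['web1.example.net', 'web2.example.net']
SRC = '/srv/release/current/'
DEST = '/var/www/app/'

def push_release():
    for host in WEB_HOSTS:
        # --delay-updates stages all files, then swaps them into place
        # near-atomically, so a request never sees a half-copied tree.
        subprocess.run(
            ['rsync', '-az', '--delete', '--delay-updates',
             SRC, '%s:%s' % (host, DEST)],
            check=True)
        # Graceful restart: finish in-flight requests, then pick up new code.
        subprocess.run(['ssh', host, 'apachectl', 'graceful'], check=True)

push_release()
```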
Lots to say here, but the short and sweet is that I managed the business objectives with the board of directors, wrote all network- and security-related code for the e-commerce products, and performed project management and QA work on the other products. I didn't know these terms at the time, but in Agile terminology we worked in what is now known as Scrum sprint mode, and I was Scrum Master for all projects, including ones in which I didn't write a single line of code.
Implemented and managed a high-uptime managed co-location service. Enhanced Mon and Cricket to create a unified network management infrastructure. Automated tools to reduce manual effort.
Built out a colocation environment from scratch, starting from pencil and paper to a finished cage environment. I did this three separate times, in 1999, 2001, and 2003, due to massive business growth and the ever-changing colocation provider landscape in that period. Each instance was completely different.
Short and sweet: built enhancements to request, bind, mon, cricket, majordomo, apache, cgiwrap and a wide variety of other tools we used to manage the environment. Submitted all of these changes back to the original developers or published for the community to use.
As Isite ran with a minimal staff and sold only through web developers, I automated every customer management process to avoid hiring low-skilled workers.
Managed Cisco, Juniper, Extreme, Nortel, Foundry, HP, Wellfleet, 3Com, Proteon and other routers.
I've touched a lot of routers over the last 20 years. Thankfully, the user interface has simplified around IOS lookalikes. Unfortunately the use of ASIC-based forwarding makes many issues more complicated.
I was brought into this project by Cisco when the best of the CCIEs couldn't figure out how to make Cisco's design work. The simple part of the project was properly documenting the actual usage of the network (instead of the executive theory) and redesigning the packet flow to support it. This involved thousands of lines of traffic-engineering and packet-shaping configuration at each node, but that was the easy part.
The harder part was fixing all the bugs in the T-train (not yet production) Cisco VoIP software. I ended up creating so many bug reports that I was given direct source code access, and a new line of development code (Q-train? I forget...) was created to track the issues we identified in production. In a very real and practical sense, every site using Cisco's VoIP solution with Cisco routers has benefited from the work I did on this project.
I've been doing network security work for 18 years now. In addition to the common Internet/IP firewalls, I've done internal/corporate firewalls for IP, IPX, XNS, HP-NS, AppleTalk, and NetBIOS protocols.
For Internet security, I've come to the conclusion that simple packet filtering is basically useless if you actually want to protect internal networks from damage (which isn't always a break-in). Many of the commercial firewall packages are neat products, but they sometimes make too many assumptions about the network structure (or are simply too permissive).
I have built back-to-back proxy systems with three layers of packet-filtering routers. It sounds paranoid, but the protected zone cannot be directly attacked, even by denial-of-service attacks. Having the source code lets me review the security of the code itself and make changes as necessary. If nothing else, I often change what is logged and when.
Tested performance and reliability of technologies in laboratory and site-specific configurations.
I love lab work. Knowing what is really involved in network transactions is essential to having a clue when there are problems in the implementation.
Textbook (SAGE) definition of a Senior System Administrator, with experience in FreeBSD, NetBSD, Solaris, HP-UX, SCO Unix, and UnixWare, plus the Linux distributions Red Hat, CentOS, Debian, and Gentoo, and some small-image distros I've used on flash-drive systems.
I rarely just use an operating system. I often end up being actively involved in the development of the OS while supporting an environment. You'll find my name in patches to everything from Solaris boot code to FreeBSD package management tools.
I was supporting LANs back when NetWare 2.11 was common and 3+/Open was still a viable(?) platform. I continued supporting both through NetWare's NDS and Microsoft's ADS implementations, and through their growth into the scalable solutions they are today.
Most networks grow by attrition rather than design or plan. Besides the obvious network topology issues, many issues regarding data access present themselves: a business finds crucial business data scattered across a variety of PC networks, Unix workstations, minicomputers, and legacy systems. What I'm really good at is working with multiple platforms and providing transparent access to information. Cross-platform data access can be handled through various connectivity products, but a good solution depends on the needs of the users trying to access the data -- and these are rarely technophiles.
My home environment is probably a good example of interconnectivity. We have a FreeBSD server, a Solaris server, two desktops running different Linux distributions, two Windows XP/Vista desktops, two Macs, and both Mac and Windows laptops. There is a single logon to all network resources, and they are available on all platforms. Since I use my home as a test lab, the network is configured to be flexible for expansion as needed.
Network monitoring is the one place where a single decent application that does what everyone needs seems to be an impossible hurdle.
Things that fail to alarm are bad. Things that alarm too often get ignored. Very few tools seem to do this correctly out of the box, and what is correct changes based on the organization, the team, even the resource in question.
I spend a lot of time documenting people's ideas of what they want checked and how they want to be alarmed about it, and writing plug-ins for the various monitoring tools to give them what they need.
We linked together pretty much every Navy office in the continental United States, each of which had its own separate ccMail installation. The Navy had given up on dial-up synchronization since it simply couldn't keep up, so the network upgrade made synchronization possible again.
This pushed the boundaries farther than ccMail (prior to its acquisition by Lotus) had imagined. We were getting custom patches from them daily for several months to address issues we identified and documented. I also did a lot of redesign work on the timers and settings for large synchronizations, which I documented; I was later told that this document floated around Lotus for a while and was used without significant change for the default Notes synchronization settings.
I was supporting LANs back when NetWare 2.11 was common and 3+/Open was still a viable(?) platform. I continued supporting LANs as 10base2 evolved into 10baseT and network cabling finally started being done by the facilities folks. Thankfully, I haven't seen ARCnet, G-Net, or thick ethernet in over 16 years.
Anyway, LAN support in the early '90s tended to be one-stop shopping; we did it all. We ran the network wire, configured the servers, installed the applications, and supported the users. Things have diversified since then, and I've focused on network/application/voice issues, and Internet/Intranet services.
On all projects since May of 1992 I have been the project lead or solely responsible for my portion of the project.
I work well with either independent goals or as part of a team effort.
Presenter at Bay Area Large Scale Production Engineering (LSPE).
Presenter and published author at the USENIX LISA Conference.
Presenter at BayLISA.
Founding member of the League of Professional System Administrators (LOPSA).
Participant in NANOG (North American Network Operators Group).
Participant in FIRST (Forum of Incident Response and Security Teams).
Satwant Jakher, Director, SRE
Palo Alto Networks
Marc Kodama, DevOps Manager
Quixey
Phil Clark, CEO
Paxio
Mark Izillo, Site Operations Manager
StubHub!
More references and contact information available on request.