Jo Rhett

San Jose · California
github.com/jorhett/ · www.linkedin.com/in/jorhett

Summary:

I help engineers. I create tools that make it easier to validate, release, and observe their deployment. I design security-conscious platform services that minimize risk and magnify observability. I bring 33 years of hands-on experience creating, automating, and evolving large-scale Internet services.

Writing:

I am an author of practical technology books from O'Reilly and Packt, including Puppet Best Practices and Learning Puppet 4. I have also contributed content and technical editing for other books.


Experience:

10/2020 - 05/2024
Tubi: Staff Security and Infrastructure Engineer
02/2004 - 09/2020
Consulting & Short-term Engagements
Box: Staff Site Reliability Engineer
Palo Alto Networks: Principal Site Reliability Engineer
Humana: Machine Learning CI/CD in Kubernetes
23andMe: GDPR-compliance Security Automation
Nuance Enterprise Services: Senior Manager / Principal Automation Engineer
Quixey: Principal Engineer
Ooyala: Site Reliability Engineer
Chegg: Senior DevOps Engineer
StubHub! (eBay): Senior Operations Architect
Smaller Engagements
Customized Training
Many companies spend thousands of dollars on generic training programs that leave their staff unable to apply the knowledge gained in their working environment. I spend time learning the operations environment and working with the team, then create customized training programs tailored specifically to their day-to-day needs. These programs have delivered immediate positive change for the customer every time.
2012 - 2013
Pinger: Principal Operations Engineer

Managed deployment 24x7 for the Textfree and Pinger applications, which serve millions of concurrent users around the world. Restructured the network to improve availability and security.

Pinger provides free texting and calling to millions of concurrent interactive app users. Any slowness or failure in the backend service is immediately visible to users in every timezone. Every change to the service had to be evaluated and implemented carefully.

The network used Juniper SRX and EX devices as external gateways, Cisco internal switches, and NetScaler load balancers. When I arrived there was very little documentation of the network, and everything was done cowboy-style directly on the network devices. I restructured the network to isolate the out-of-band management systems and built tools to simplify common operations.

Implemented Puppet management of diverse global sites. Created tools to ease hands-on management and automate common processes. Automated creation of application-specific data points and alerts in Zenoss.

The Pinger service had two production sites for the backend, with voice media systems spread around the world. I revised Pinger's manual, one-view-of-everything configuration push system to accommodate multiple diverse internal and external configurations around the globe.

All network management was done via Zenoss or provided data to Zenoss. I cleaned up a lot of issues with Zenoss and tracked open bugs with the Zenoss team. I wrote Python code, run through zendmd, to make changes to the Zenoss backend store and to automate creation and removal of Zenoss data points and graphs.

2010 - 2011
Equinix: Senior Network OSS Tools Developer

Designed a completely automated, vendor-agnostic, worldwide provisioning system for the Equinix Carrier Ethernet Exchange. Delivered it 3 months ahead of schedule. Created an automated customer self-help OAM and RFC-2544 testing system.

The initial provisioning deliverable was due in October 2010; I delivered it in July 2010. This service was Equinix's first completely automated provisioning system, so I had to break a lot of new ground here. It involved significant amounts of data exchange between many business units in the company.

The main code had to be vendor-independent, since the provisioning system had to deploy on Alcatel-Lucent ESS-7200 switches at first but also support Juniper switches within a year.

The code acquired operating parameters from Oracle network port information and the Oracle Financial InstallBase module, storing them in local databases kept synchronized with the related sources.

The code base was built to be environment-aware, allowing it to function in the NetDev, SwDev, QA, UAT, PILOT, and PROD instances without any code changes.
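In outline, that environment sensing looked something like the sketch below: derive the environment from the local hostname and pick the matching backend resources from a table. The hostname suffixes, environment map, and backend endpoints shown here are hypothetical stand-ins rather than the actual Equinix conventions.

#!/usr/bin/perl
# Minimal sketch of environment auto-detection. The hostname suffixes and
# backend endpoints are hypothetical examples, not the real naming scheme.
use strict;
use warnings;
use Sys::Hostname qw(hostname);

# Map a hostname pattern to one of the deployment environments.
my @env_patterns = (
    [ qr/-netdev\d*\./ => 'NETDEV' ],
    [ qr/-swdev\d*\./  => 'SWDEV'  ],
    [ qr/-qa\d*\./     => 'QA'     ],
    [ qr/-uat\d*\./    => 'UAT'    ],
    [ qr/-pilot\d*\./  => 'PILOT'  ],
);

sub detect_environment {
    my $host = lc( shift // hostname() );
    for my $pair (@env_patterns) {
        my ( $regex, $env ) = @$pair;
        return $env if $host =~ $regex;
    }
    return 'PROD';    # no non-production suffix means production
}

# Each environment gets its own backend resources; the code never changes.
my %backends = (
    NETDEV => { db_dsn => 'dbi:Oracle:provnetdev', remedy => 'remedy-dev.example.com' },
    QA     => { db_dsn => 'dbi:Oracle:provqa',     remedy => 'remedy-qa.example.com'  },
    PROD   => { db_dsn => 'dbi:Oracle:prov',       remedy => 'remedy.example.com'     },
    # SWDEV, UAT, and PILOT entries omitted for brevity
);

my $env = detect_environment();
my $cfg = $backends{$env} or die "No backend resources defined for $env\n";
print "Running in $env, using $cfg->{db_dsn}\n";

The value of this approach is that promoting the same code from SwDev through PROD required no edits at all, only a different host.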

Designed an integrated code base which allowed automated OAM and RFC-2544 network testing from the customer portal. This code dynamically provisioned new connections on the Alcatel gear to Accedian NIDs and ran test suites for the customer without any human involvement.

Developed standard code libraries for the Network Tools team. Enhanced a diverse variety of statistics systems using Memcache. Tested other NoSQL implementations for a distributed SNMP poller.

Built new common libraries for use by the entire Network Tools development team:
  1. Consistent interface for creating and updating Remedy tickets using SOAP protocols.
  2. Simple perl interface for creating Monolith events using SNMP traps.
  3. Simple perl interface for using SOAP-provided services -- wrapped around SOAP::Lite.
  4. Clean database constructor which provides an enhanced DBI object for any of hundreds of internal Oracle, MySQL, and Postgres databases (see the sketch after this list).
  5. Environment detection library for auto-sensing which backend resources to use.
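As an example of item 4 above, here is a minimal sketch of such a constructor: callers ask for a database by logical name and get back a DBI handle with sane defaults. The registry contents, DSNs, and credential handling are hypothetical, not the production library.

package NetTools::DB;
# Sketch of a database constructor that hides per-database connection details
# behind a single name-based call. Registry contents are hypothetical; in
# practice credentials would come from a protected file, not inline strings.
use strict;
use warnings;
use DBI;

my %registry = (
    'provisioning' => { dsn => 'dbi:Oracle:prov',                          user => 'prov_ro', pass => 'CHANGEME' },
    'portstats'    => { dsn => 'dbi:mysql:database=stats;host=statsdb',    user => 'stats',   pass => 'CHANGEME' },
    'tickets'      => { dsn => 'dbi:Pg:dbname=tickets;host=ticketdb',      user => 'tix',     pass => 'CHANGEME' },
);

sub connect {
    my ( $class, $name, %opts ) = @_;
    my $entry = $registry{$name}
        or die "Unknown database '$name'";

    # Sensible defaults that every caller gets for free.
    my $dbh = DBI->connect(
        $entry->{dsn}, $entry->{user}, $entry->{pass},
        { RaiseError => 1, AutoCommit => 1, PrintError => 0, %opts },
    ) or die "Cannot connect to $name: $DBI::errstr";

    return $dbh;
}

1;

# Caller code only needs the logical name:
#   my $dbh  = NetTools::DB->connect('provisioning');
#   my $rows = $dbh->selectall_arrayref('SELECT port_id, status FROM ports');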

Enhanced existing statistics systems to use the new common libraries for reporting errors and notifying the NOC of issues, and built requirements and a test lab to validate those changes.

2004 - 2009
Silicon Valley Colocation: Network Engineer

Implemented production IPv6 allocation, routing, and name service for customers.

  1. Acquired a /32 allocation from ARIN for routing.
  2. Updated the RADB route set to list IPv6 testbed routes.
  3. Set up internal routing infrastructure using a Quagga host for a discrete IPv6 testbed network.
  4. Set up IPv6 BGP peering with all providers and peers able to do IPv6 routing.
  5. Set up a nameserver that held all production domains but answered queries only over IPv6.
  6. Mirrored the production website to an IPv6-only site.
  7. Wrote new code for the colo control panel to allow customers to add/modify AAAA and IP6.INT records.
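The core of item 7 is computing the forward AAAA record and the matching nibble-reversed reverse name. Here is a minimal Perl sketch of that computation; the hostname and address are examples only, and both the ip6.arpa and the historical ip6.int suffixes are shown.

#!/usr/bin/perl
# Sketch of generating forward and reverse DNS records for a customer IPv6
# address, as a control panel might do. Names and zones are hypothetical.
use strict;
use warnings;
use Socket qw(inet_pton AF_INET6);

sub ipv6_reverse_name {
    my ( $addr, $suffix ) = @_;
    $suffix ||= 'ip6.arpa';    # ip6.int was also published historically

    my $packed = inet_pton( AF_INET6, $addr )
        or die "Invalid IPv6 address: $addr";

    # 32 hex nibbles, reversed and dot-separated.
    my @nibbles = split //, unpack( 'H32', $packed );
    return join( '.', reverse @nibbles ) . ".$suffix";
}

my $host = 'www.example.com';
my $addr = '2001:db8::42';

printf "%s. IN AAAA %s\n", $host, $addr;
printf "%s. IN PTR %s.\n", ipv6_reverse_name($addr), $host;
printf "%s. IN PTR %s.\n", ipv6_reverse_name( $addr, 'ip6.int' ), $host;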

Redesigned the multi-site backbone to provide greater redundancy and flexibility. Upgraded the switching platform to Force10 for ASIC-based forwarding and hitless failover.

When I arrived, the network depended on proxy ARP to the border gateway, which required configuration on every Extreme Black Diamond switch in the network for each customer. I simplified the configuration to use local routing and OSPF, which let me improve flexibility and redundancy in case of link failures between the switches.

After this was done, I removed some expensive peering points and links which were logically redundant but not physically so. Then I arranged truly redundant peering with multiple tier-1 providers and settlement-free peering on the PAIX public mesh.

The Extreme switches had a number of fairly horrible bugs in BGP routing which made them unable to handle full feeds from all providers (which I built test cases for and proved to Extreme in their own lab). I engineered a tiered BGP solution which was able to provide full routing tables to customers while providing a limited view of functional routes to the Extreme Black Diamond switches.

The net effect of these changes was to reduce total cost-per-megabit from $135 to $32, and to increase by an order of magnitude the amount of traffic customers passed through the network. Reliability improved to nearly 3 years without a single customer-visible network outage; the outage that eventually ended that streak was a power failure under the control of facilities staff.

I also acquired an ARIN IPv4 block and migrated all customers from provider-specific IP space to the PI block, which involved many months of wrangling customers.

This was a massive, multi-year project to replace the aging Extreme Black Diamond switches at the core of the network.

I created and continuously updated business requirements. I identified all market solutions capable of handling datacenter core duties. I provided all findings and total cost-of-ownership details directly to the board of directors. I technically evaluated solutions from Cisco, Juniper, Force10, Extreme (Black Diamond 10K and Alpine), and Foundry.

When we selected a potential solution, I built a test lab and evaluated the performance and functionality of the solution in extreme detail. In the Force10 test, I identified 6 bugs in the routing protocols of the FTOS software, which compared favorably with the dozens in the Cisco 6500 product at the time.

After acquisition of the Force10 E300 switches, I created and extensively tested a Perl script which converted all customer configuration from Extreme to Force10/Cisco format. This allowed us to simply move cables at migration time.

I replicated all standard customer functions in the new environment, with before/after Wiki documentation. As Force10 did not have a function to download its configuration at a specified time, I wrote a small shell script, suitable for nightly cron invocation, which caused the unit to TFTP its configuration.
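A minimal sketch of that nightly backup idea, written here in Perl rather than the original shell script: the switch name, credentials, and the Cisco-style copy command are illustrative assumptions only.

#!/usr/bin/perl
# Nightly cron job: tell a switch to push its running configuration to a TFTP
# server. Hostname, credentials, and the exact copy syntax are assumptions.
use strict;
use warnings;
use Net::Telnet;
use POSIX qw(strftime);

my $switch = 'core-e300.example.net';
my $tftp   = '192.0.2.10';
my $stamp  = strftime( '%Y%m%d', localtime );
my $target = "tftp://$tftp/configs/$switch-$stamp.cfg";

my $session = Net::Telnet->new( Timeout => 30, Errmode => 'die' );
$session->open($switch);
$session->login( 'backup', 'CHANGEME' );

# Cisco-style copy command with a URL destination. If the OS prompts for
# confirmation, additional print/waitfor handling would be needed here.
my @output = $session->cmd( String => "copy running-config $target", Timeout => 120 );
print @output;

$session->close;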

Implemented a distributed Nagios network monitoring system. Wrote custom plug-ins for specialized tests of old-world facilities management gear. Unified all systems to a cfengine-controlled FreeBSD standard.

Built a set of Nagios monitoring systems with instances in each datacenter. Configured all Nagios instances to report via NSCA to a central monitoring system used by NOC staff.

Created custom Nagios event handlers to only report problems visible from multiple instances (where appropriate).

Created a Nagios check utility to alarm based on a variety of data stored in pre-existing RRD files. The utility does not require Nagios to gather the RRD data itself, and deals appropriately with both kibibytes and kilobytes.
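A stripped-down sketch of such a check, using the RRDs Perl bindings: it reads the newest value from an existing RRD file and exits with the standard Nagios plugin codes. The option names, data-source name, and default thresholds are hypothetical.

#!/usr/bin/perl
# check_rrd_value: Nagios plugin sketch that alarms on the most recent value
# stored in an existing RRD file. Paths, DS name, and thresholds are examples.
use strict;
use warnings;
use RRDs;
use Getopt::Long;

my ( $file, $ds, $warn, $crit ) = ( undef, 'ds0', 1_000_000, 5_000_000 );
GetOptions(
    'file=s' => \$file,
    'ds=s'   => \$ds,
    'warn=f' => \$warn,
    'crit=f' => \$crit,
) or exit 3;
defined $file or do { print "UNKNOWN: --file required\n"; exit 3 };

# Fetch the last few averaged samples without asking Nagios to poll anything.
my ( $start, $step, $names, $data ) =
    RRDs::fetch( $file, 'AVERAGE', '--start', '-600' );
if ( my $err = RRDs::error() ) { print "UNKNOWN: $err\n"; exit 3 }

# Locate our data source column and grab the newest defined value.
my ($col) = grep { $names->[$_] eq $ds } 0 .. $#{$names};
if ( !defined $col ) { print "UNKNOWN: no DS '$ds' in $file\n"; exit 3 }

my $value;
for my $row ( reverse @$data ) {
    if ( defined $row->[$col] ) { $value = $row->[$col]; last }
}
if ( !defined $value ) { print "UNKNOWN: no recent data\n"; exit 3 }

# RRD values are decimal (kilo = 1000); convert consistently before comparing
# rather than mixing kibibytes (1024) and kilobytes (1000).
if    ( $value >= $crit ) { printf "CRITICAL: %s=%.0f\n", $ds, $value; exit 2 }
elsif ( $value >= $warn ) { printf "WARNING: %s=%.0f\n",  $ds, $value; exit 1 }
else                      { printf "OK: %s=%.0f\n",       $ds, $value; exit 0 }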

Created Nagios check utilities to gather useful alarm data from facilities systems, including large APC units, old and new MGE power systems, IT Watchdogs (and assorted other) environment sensors, and a variety of telnet or custom-protocol units.

Built another Nagios environment for customer monitoring. Built an easy-to-use monitoring setup and management UI inside the Colo Control Panel so that customers could add or remove monitoring of hosts with zero NOC staff involvement. This involved reading in and writing out Nagios configurations, as well as carefully testing and loading the new configurations on the fly.
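In outline, adding a host from the portal looks something like the sketch below: write a definition file, verify the complete configuration, and only then reload. The paths, template names, check command, and reload mechanism are illustrative assumptions rather than the actual portal code.

#!/usr/bin/perl
# Sketch of how a customer-portal backend might add a host to Nagios:
# write a definition file, verify the full config, then reload.
# Paths, template names, and the reload command are assumptions.
use strict;
use warnings;

sub add_customer_host {
    my ( $customer, $hostname, $address ) = @_;
    my $file = "/etc/nagios/customers/$customer/$hostname.cfg";

    open my $fh, '>', $file or die "Cannot write $file: $!";
    print {$fh} <<"EOF";
define host {
    use             customer-host    ; assumed template with sane defaults
    host_name       $hostname
    alias           $customer $hostname
    address         $address
}

define service {
    use                  customer-ping
    host_name            $hostname
    service_description  PING
    check_command        check_ping!200.0,20%!500.0,60%
}
EOF
    close $fh;

    # Never load an unverified config: preflight first, then reload.
    system( 'nagios', '-v', '/etc/nagios/nagios.cfg' ) == 0
        or do { unlink $file; die "Config verification failed; change rolled back\n" };
    system( 'service', 'nagios', 'reload' ) == 0
        or die "Nagios reload failed\n";
    return 1;
}

add_customer_host( 'acme', 'acme-web1', '192.0.2.25' );

Rolling the file back when verification fails keeps one bad customer entry from ever taking down monitoring for everyone else.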

The SVcolo environment had dozens of systems with few common functions. These units were all built with the same script, but had diverged significantly over time. I built a test lab and evaluated cfengine2, Puppet, and Bcfg2 as possible management solutions.

After selecting cfengine, I rebuilt the entire multi-hundred line build script as a cfengine ruleset to maintain the systems over time. Each system's unique functions became documented and described within the policy language, providing a centralized repository of configuration information tracked in a source code tree.

As cfengine did not have FreeBSD package management at the time, I engineered code to properly support FreeBSD and got it integrated into the main source tree. I followed up by enhancing cfengine's package management to include removing and upgrading packages for all platforms, including a significant amount of optimizing for Yum environments.

Some other tasks:
  1. Created a common hardware standard and a standard pool of spares for easy swap by non-technical staff.
  2. Upgraded all systems to a common FreeBSD standard.
  3. Unified logging to a centralized server.
  4. Set up log analysis using Simple Event Correlator to create NOC tickets, activate Nagios alarms, or start tracking other events as appropriate.

Designed and implemented a LAMP solution to track all internal systems, customers and resources. Created a LAMP-based Control Panel to track customers, ports, cross connects, and power usage.

Upgraded a simple spreadsheet of customer data into a decentralized MySQL database with dozens of tables containing all customer resources and assignments.

Built a LAMP (Perl) server setup to handle both internal/staff and customer requests. Designed a set of OO modules around each set of data, and another set to handle interactive/Ajax requests. Created a UI for the staff to review and maintain this data. Upgraded the customer UI from a 1993-style 3-page web UI to a 100-plus page modern Ajax-enhanced customer portal. Continuously upgraded both environments to provide more customer-management functions to the NOC staff and self-help functions for customers, including:

Set up an rwhois server answering all queries with customer IP allocation data from the database.

Created various hourly/nightly/weekly/monthly reports based on customer power and network usage, allowing NOC staff to proactively prevent power circuit overloads, system compromise/abuse issues, etc.
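As a flavor of those reports, here is a sketch of a nightly circuit-load check; the table and column names are hypothetical, and the 80% warning threshold is an assumption based on common continuous-load practice.

#!/usr/bin/perl
# Nightly report sketch: flag power circuits whose measured draw is approaching
# the breaker rating. Schema and threshold are illustrative assumptions.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:mysql:database=colo;host=localhost',
    'report', 'CHANGEME', { RaiseError => 1 } );

my $rows = $dbh->selectall_arrayref( q{
    SELECT c.customer_name, p.circuit_id, p.breaker_amps, p.measured_amps
      FROM power_circuits p
      JOIN customers c ON c.customer_id = p.customer_id
}, { Slice => {} } );

for my $row (@$rows) {
    next unless $row->{breaker_amps};
    my $pct = 100 * $row->{measured_amps} / $row->{breaker_amps};
    next if $pct < 80;    # only report circuits running hot
    printf "%-20s circuit %-8s %5.1fA of %dA (%.0f%%)\n",
        $row->{customer_name}, $row->{circuit_id},
        $row->{measured_amps}, $row->{breaker_amps}, $pct;
}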

1999 - 2004
Isite Services: Chief Technology Officer

Conceived and created an integrated e-commerce application suite in C++ and Perl/LAMP. Designed tools used for zero-downtime upgrades.

Developed a set of tools to pause the HTTP servers, push out new code, refresh template caches and restart HTTP service so as to provide no visible downtime to the user.

This required custom enhancements to rdist, and later evolved into a wrapper system around rsync.
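A minimal sketch of the push sequence, with hypothetical hosts and paths; Apache's graceful restart stands in here for the pause/restart step, so in-flight requests finish before the new code takes over.

#!/usr/bin/perl
# Sketch of a zero-downtime code push: sync new code into place, refresh
# template caches, then gracefully restart Apache so current requests finish.
# Host lists, paths, and the cache-refresh step are illustrative assumptions.
use strict;
use warnings;

my @webservers = qw(web1.example.com web2.example.com);
my $source     = '/build/release/current/';
my $target     = '/var/www/app';

for my $host (@webservers) {
    print "Deploying to $host\n";

    # Push the new code; --delete keeps the target an exact mirror.
    run( 'rsync', '-a', '--delete', $source, "$host:$target/" );

    # Rebuild template caches before traffic hits the new code.
    run( 'ssh', $host, "$target/bin/rebuild_template_cache" );

    # Graceful restart: children finish current requests, then reload.
    run( 'ssh', $host, 'apachectl graceful' );
}

sub run {
    my @cmd = @_;
    system(@cmd) == 0 or die "Command failed (@cmd): $?\n";
}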

Consulted with the investors on all technology and business choices. Managed the teams implementing all technology directives.

Lots to say here, but the short and sweet version is that I managed the business objectives with the board of directors, wrote all network- and security-related code for the e-commerce products, and performed project management and QA work on the other products. I didn't know these terms at the time, but in Agile terminology we worked in what is now known as Scrum sprint mode, and I was ScrumMaster for all projects, including ones in which I didn't write a single line of code.

Implemented and managed a high-uptime managed colocation service. Enhanced Mon and Cricket to create a unified network management infrastructure. Built automated tools to reduce manual effort.

Built out a colocation environment from scratch, starting from pencil and paper to a finished cage environment. I actually did this three separate times, once in 1999, again in 2001 and again in 2003 due to massive business growth and the ever-changing evolution of colocation providers in that time period. Each instance was completely different.

Short and sweet: built enhancements to request, bind, mon, cricket, majordomo, apache, cgiwrap and a wide variety of other tools we used to manage the environment. Submitted all of these changes back to the original developers or published for the community to use.

As Isite ran with a minimal staff and sold only through web developers, I automated every customer management process to avoid hiring low-skilled workers.

1993 - 1999
Visionary Management Technologies: Senior Network Engineer (consultant)

Managed Cisco, Juniper, Extreme, Nortel, Foundry, HP, Wellfleet, 3Com, Proteon and other routers.

I've touched a lot of routers over the last 20 years. Thankfully, the user interface has simplified around IOS lookalikes. Unfortunately the use of ASIC-based forwarding makes many issues more complicated.

Redesigned an IPSec VPN network to use policy routing for best performance. Implemented with no downtime in a 24-hour production network, including 3 European and 4 Asia-Pacific sites. Stabilized a Cisco VoIP call manager and implemented media gateways to improve voice quality in 24 offices worldwide.

I was brought into this project by Cisco when the best of the CCIEs couldn't figure out how to make Cisco's design work. The simple part of the project was properly documenting the actual usage of the network (instead of the executive theory) and redesigning the packet flow to support this. This involved thousands of lines of traffic engineering and packet shaping configuration on each node point, but that was the easy part.

The harder part was fixing all the bugs in the T-train (not yet production) Cisco VoIP software. I ended up creating so many bug reports that I was given direct source code access, and a new line of development code (Q-train? I forget...) was created to track the issues we identified in production. In a very real and practical sense, every site using Cisco's VoIP solution with Cisco routers has benefited from the work I did on this project.

Designed and implemented firewalls using context-based filtering routers and transparent proxies. Maintained a multi-organization firewall segregating distinct internal security domains.

I've been doing network security work for 18 years now. In addition to the common Internet/IP firewalls, I've done internal/corporate firewalls for IP, IPX, XNS, HP-NS, AppleTalk, and NetBIOS protocols.

For Internet security, I've come to the conclusion that simple packet filtering is basically useless if you actually want to protect the internal networks from damage (which isn't always a break-in). Many of the commercial firewall packages are neat products, but they sometimes make too many assumptions about the network structure (or are simply too permissive).

I have built back-to-back proxy systems with three layers of packet-filtering routers. It sounds paranoid, but the protected zone cannot be directly attacked, even by denial-of-service attacks. Having the source code lets me review the security of the code itself and make changes as necessary. If nothing else, I often change what is logged and when.

Tested performance and reliability of technologies in laboratory and site-specific configurations.

I love lab work. Knowing what is really involved in network transactions is essential to having a clue when there are problems in the implementation.

Provided advanced system administration, network configuration, and host security management for FreeBSD, Linux, Solaris, HP/UX, Irix, SCO Unix V and UnixWare systems.

Textbook (SAGE) definition of a Senior System Administrator, with experience in FreeBSD, NetBSD, Solaris, HP/UX, SCO Unix, and UnixWare, plus the Linux distributions Red Hat, CentOS, Debian, and Gentoo, and some small-image distros I've used on flash-drive systems.

I rarely just use an operating system. I often end up being actively involved in the development of the OS while supporting an environment. You'll find my name in patches to everything from Solaris boot code to FreeBSD package management tools.

Installed and maintained Novell and Windows servers and clients. Designed the network to provide single logon and seamless operation between Unix, Windows, and NetWare environments.

I was supporting LANs back when NetWare 2.11 was common and 3+/Open was still a viable(?) platform. I continued supporting both through NetWare's NDS and Microsoft ADS platform implementations and growth into the scalable solutions they are today.

Most networks grow by attrition rather than by design or plan. Besides the obvious network topology issues, many issues regarding data access present themselves. A business finds crucial business data on a variety of PC networks, Unix workstations, minicomputers, and legacy systems. What I'm really good at is working with multiple platforms and providing transparent access to information. Cross-platform data access can be handled through various connectivity products, but a good solution is dependent on the needs of the users trying to access the data -- and these are rarely technophiles.

My home environment is probably a good example of interconnectivity. We have a FreeBSD server, Solaris server, 2 different Linux distribution desktops, two Windows XP/Vista desktops, 2 Macs and both Mac and Windows laptops. There is a single logon to all network resources, and they are available on all platforms. Since I use my home as a test lab, the network is configured to be flexible for expansion as needed.

Designed custom plug-ins for SunNet Manager and HP OpenView to test unique resources and implement custom alert notifications.

Network monitoring is the one place where a single decent application that does what everyone needs seems to be impossible to find.

Things that fail to alarm are bad. Things that alarm too often get ignored. Very few tools seem to do this correctly out of the box. And what is correct changes based on the organization, the team, even the resource in question.

I spend a lot of time documenting people's ideas of what they want checked and how they want to be alarmed about it, and writing plug-ins for the various monitoring tools to give them what they need.

1992 - 1993
Technology, Management & Analysis Corp: Systems Engineer

Created Secret security-level WAN links between the Pentagon City SEA05 Metropolitan Area Network and shipyard offices throughout the US. Implemented mission-critical military systems with Sun and HP Unix. Created advanced IP and IPX protocol filters to control link utilization on large multivendor networks.

I can't remember what parts I'm allowed to discuss here, so I'll be brief. 16 towering office buildings just south of the Pentagon, legally known as Crystal City but affectionately known as Pentagon City. All of these buildings were connected in a single large fiber network known as NAVSEA SEA04/SEA05. Each department and/or contractor on this network had their own lowest-cost-bidder contracts for network help, so implementing both MILNET and interior/classified routing was often precarious because you couldn't trust that the department next door wouldn't just re-use your unique IDs.

Now multiply that by an order of magnitude when you connect shipyards all around the US and bring them into the same network. Network clue was fairly low, so we spent more time firewalling against mistakes than against intruders.

Implemented mail transfer and directory synchronization on all Navy ccMail systems, which created the largest unified ccMail system in the world at that time.

We linked together pretty much every Navy office in the continental United States, each of which had its own separate ccMail installation. The Navy had given up on synchronization over phone calls since it simply couldn't keep up, so the network upgrade made synchronization possible again.

This pushed the boundaries farther than ccMail (prior to acquisition by Lotus) had imagined them. We were getting custom patches from them daily for several months to address issues we identified and documented for them. I also did a lot of redesign work on timers and settings for large synchronizations, and documented it. I was later told that this document floated around Lotus for a while and was used without significant change for the default Notes synchronization settings.

1991
Network Alternatives: Network Administrator

Installed and maintained SCO Unix, Novell NetWare, LAN Manager, 3+/Open, and Lantastic environments using Ethernet, ARCNet, TCNS, and G-Net networks. Performed server installations and migrations as technology changed.

I was supporting LANs back when NetWare 2.11 was common and 3+/Open was still a viable(?) platform. I continued supporting LANs as 10base2 evolved into 10baseT and network cabling finally started being done by the facilities folks. Thankfully, I haven't seen ARCnet, G-Net, or thick ethernet in over 16 years.

Anyway, LAN support in the early '90s tended to be one-stop shopping; we did it all. We ran the network wire, configured the servers, installed the applications, and supported the users. Things have diversified since then, and I've focused on network/application/voice issues, and Internet/Intranet services.

On all projects since May of 1992 I have been the project lead or solely responsible for my portion of the project.
I work well with either independent goals or as part of a team effort.

Other Information:

Presenter at Bay Area Large Scale Production Engineering (LSPE)
Presenter at and published by the USENIX LISA Conference
Presenter at BayLISA.
Founding member of the League of Professional System Administrators (LOPSA)
Participation in NANOG (North American Network Operators Group)
Participation in FIRST (Forum of Incident Response and Security Teams)


References:

Satwant Jakher, Director, SRE
Palo Alto Networks

Marc Kodama, DevOps Manager
Quixey

Phil Clark, CEO
Paxio

Mark Izillo, Site Operations Manager
StubHub!

More references and contact information available on request.
