Hudl Product Team Blog 52000613c0d7cd018801b62d 2016-09-16T11:08:03-05:00 Caching Hudl’s news feed with ElastiCache for Redis 57dc1363d4c961246d003bae 2016-09-16T11:08:03-05:00 2016-09-16T10:00:00-05:00 <p>Every coach and athlete that logs into Hudl immediately lands on their news feed. Each feed is tailored to each user and consists of content from teams they are in as well as accounts they choose to follow. This page is the first impression for our users and performance is critical. Our solution: <strong>ElastiCache for Redis</strong>.</p> <p><img src="" alt="News Feed" /></p> <p>Before we talk about caching though, it’s important to understand at a high level the data model used in the feed. There are 6 main collections used by the feed:</p> <p><img src="" alt="Feed DB" /></p> <p>To help illustrate this, let’s look at two users: Sally and Pete. Sally decides to follow Pete. We now say that Sally is a follower of Pete and that Pete is a friend of Sally. When Pete posts something, that post (aka content) gets added to his user timeline as well as the home timeline for Sally. When Sally logs into Hudl to view her feed, she sees her home timeline presented in reverse chronological order. If she then clicks on Pete, she views his user timeline and can view all the posts he’s created.</p> <p>Let’s take a look now at what gets loaded every time a user hits their feed. We first grab a batch of post IDs from their home timeline, fetch those posts, and load all users referenced by those posts. Since the feed was created back in April 2015, the DB has grown rapidly and the total size is up to 120 GB. We currently have <strong>18 million</strong> follower relationships and <strong>30 million</strong> pieces of content. So where does caching come into play?</p> <h1 id="redis">Redis</h1> <p>In the world of caching, there are primarily two options: <a href="">Memcached</a> and <a href="">Redis</a>. For the longest time at Hudl, the default option was Memcached. 
It’s a proven technology and had previously served the vast majority of needs across our services. However, with the introduction of the news feed, we decided to dig a little deeper into the data structures Redis had to offer and we’re really excited by what we found:</p> <h2 id="lists">Lists</h2> <p>This alone would’ve been reason enough to use Redis. Timelines are naturally stored as lists so being able to represent them that way in cache is amazing. As posts are added to timelines, we simply do a <a href="">LPUSH</a> (add to the front) followed by a <a href="">LTRIM</a> (used to cap the list at a max size). The best part: we don’t have to invalidate the cache as posts are added because it’s always being kept in sync with the DB.</p> <h2 id="hashes">Hashes</h2> <p>Displaying the number of followers and friends for a given user is a critical component for any feed. By storing these as fields on a hash for each user, we can quickly call <a href="">HINCRBY</a> to keep the values in sync with the DB without the need to invalidate the cache every time a follow or unfollow happens.</p> <h2 id="sets">Sets</h2> <p>We love to use <a href="">RabbitMQ</a> to retry failed operations. Sets are the perfect way for us to guarantee we don’t accidentally insert the same post on a user’s timeline more than once without having an extra DB call. We use the post ID as the cache key, each user ID as the member, and then call <a href="">SISMEMBER</a> and <a href="">SADD</a>.</p> <h1 id="elasticache">ElastiCache</h1> <p>Once we decided on Redis, the next question was how to get a server spun up and configured so we could start testing with it. We love <a href="">AWS</a> and had heard about <a href="">Amazon ElastiCache for Redis</a> as an option so we decided to give it a try. 
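</p>

<p>Before going further, here is a minimal, in-memory Python sketch of the three command patterns described above. The key layouts, the 800-item cap, and all function names are illustrative assumptions, not Hudl’s actual code (the real service issues these commands against Redis through StackExchange.Redis):</p>

```python
from collections import defaultdict

# Toy in-memory stand-ins for the Redis structures described above.
# The 800-item cap and the key shapes are illustrative assumptions.
TIMELINE_CAP = 800
timelines = defaultdict(list)                      # LIST per user: home timeline of post IDs
counters = defaultdict(lambda: defaultdict(int))   # HASH per user: follower/friend counts
fanout_done = defaultdict(set)                     # SET per post: user IDs already fanned out to

def add_post_to_timeline(user_id, post_id):
    """LPUSH + LTRIM: push the newest post to the front, cap the list length."""
    tl = timelines[user_id]
    tl.insert(0, post_id)        # roughly: LPUSH home:{user_id} {post_id}
    del tl[TIMELINE_CAP:]        # roughly: LTRIM home:{user_id} 0 {TIMELINE_CAP - 1}

def record_follow(friend_id, follower_id):
    """HINCRBY: keep counts in sync with the DB without invalidating the cache."""
    counters[friend_id]["followers"] += 1    # HINCRBY user:{friend_id} followers 1
    counters[follower_id]["friends"] += 1    # HINCRBY user:{follower_id} friends 1

def fan_out_once(post_id, user_id):
    """SISMEMBER + SADD: make retried fan-out deliveries idempotent."""
    if user_id in fanout_done[post_id]:      # SISMEMBER post:{post_id} {user_id}
        return False                         # already delivered; skip
    fanout_done[post_id].add(user_id)        # SADD post:{post_id} {user_id}
    add_post_to_timeline(user_id, post_id)
    return True
```

<p>Because <code>fan_out_once</code> checks membership before inserting, a RabbitMQ redelivery of the same message becomes a no-op rather than a duplicate timeline entry.</p>

<p>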
Within minutes, we had our first test node spun up running Redis and were connecting to it through the <a href="">StackExchange.Redis</a> C# driver.</p> <p>With ElastiCache, we were easily able to configure our Redis deployment, node size, and security groups, and use <a href="">Amazon CloudWatch</a> to monitor all key metrics. We were able to create separate test and production clusters, all without waiting for an infrastructure engineer to set up and configure the servers manually. Here’s what we used for our production cluster:</p> <ul> <li>Node type: cache.r3.4xlarge (118 GB)</li> <li>Replication Enabled</li> <li>Multi-AZ</li> <li>2 Read Replicas</li> <li><a href="">Launched in VPC</a> </li> </ul> <p>The final step in completing our deployment was configuring alerts through <a href="">Stackdriver</a>. It seamlessly supports integration with the ElastiCache service, and within a few minutes we had our alerts configured. We were most interested in three key metrics:</p> <ul> <li><strong>Current Connections:</strong> if these drop to 0, our web servers are no longer able to access the cache and the situation requires immediate attention.</li> <li><strong>Average Bytes Used for Cache Percentage:</strong> if this reaches 95% or higher, it’s a good signal that we may need to consider moving to a larger node type or lowering our expiration times.</li> <li><strong>Swap Usage:</strong> if this gets to 1 GB or higher, the Redis server is in a bad state and requires immediate attention.</li> </ul> <h1 id="results">Results</h1> <p>The feed launched back in April 2015, and since then we couldn’t be happier with its performance. Hudl’s traffic is highly seasonal, and football season is our prime time. Starting around August, coaches and athletes from all over the country get back into football mode and log into Hudl daily. During the week of September 5th–11th, there were 1.2 million unique users accessing their feeds. 
The feed service averaged 300 requests per second, with a peak of 800. Here are some quick stats from ElastiCache during that same week:</p> <ul> <li><strong>Total Cached Items:</strong> 21 million</li> <li><strong>Cache hits:</strong> 175K/min (average), 350K/min (peak)</li> <li><strong>Network in:</strong> 43 MB/min (average), 101 MB/min (peak)</li> <li><strong>Network out:</strong> 600 MB/min (average), 1.25 GB/min (peak)</li> </ul> <p><img src="" alt="Cache Hits" /></p> <p>Let’s take a closer look at two calls in the feed service: getting the timeline and hydrating the timeline. The first call covers just the operation of fetching the timeline list from Redis, with no other dependencies. The second call takes the post IDs in the timeline and loads all referenced users and posts. It’s important to note that this includes time spent loading records from the database if they are not cached and then caching them. This is the primary call used when loading the feed on the web as well as on our iOS and Android apps. </p> <p><img src="" alt="Timeline Stats" /></p> <p>Based on the success of the feed, ElastiCache for Redis is quickly becoming our default option for caching. In the last year, five other key services at Hudl have made the switch from Memcached. It’s easy to set up, offers blazing fast performance, and gives users all the benefits that Redis has to offer. If you haven’t tried it out yet, I would strongly recommend giving it a shot; let us know how it works out for you.</p> Joel Hensley How we stay sane with a large AWS infrastructure 2016-03-16 <p>We’ve been running in AWS since 2009 and have grown to running hundreds, at times even thousands of servers. 
As our business grew, we developed a few standards that help us make sense of our large AWS infrastructure.</p> <p><img src="" alt="EC2 Instances List" /></p> <h2 id="names-and-tags">Names and Tags</h2> <p>We use three custom tags for our instances, EBS volumes, RDS and Redshift databases, and anything else that supports tagging. They are extremely useful for cost analysis but are also useful for running commands like describeInstances.</p> <ul> <li><strong>Environment</strong> - we use one AWS account for all of our environments, so this tag helps us differentiate resources. We only use four values for it: test, internal, stage, or prod.</li> <li><strong>Group</strong> - this is ad hoc, and typically denotes a single microservice, team, or project. Because there are many projects ongoing at any given time, we discourage abbreviations to improve clarity. Examples at Hudl: monolith, cms, users, teamcity</li> <li><strong>Role</strong> - within a group, this denotes the role the instance plays, like RoleNginx, RoleRedis, or RoleRedshift.</li> </ul> <p>We also name our instances. It makes talking about instances easier, which can help when firefighting. We use <a href="">Sumo Logic</a> for log aggregation, and our _sourceName (i.e. host name) values match up with our EC2 instance names. That makes comparing logs and CloudWatch metrics easier. We pack a lot of information into the name:</p> <p><img src="" alt="p-monolith-rabbit" /></p> <p>At a glance I can tell this is a production instance that supports our monolith. It’s a RabbitMQ server in the ‘D’ availability zone of the ‘us-east-1’ region. To account for multiple instances of the same type, we tack on the ‘id’ value; in this case it’s the first of its kind. For servers which are provisioned via Auto-Scaling Groups, instead of a two-digit number we use a six-digit hash. 
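</p>

<p>Because the names are so regular, tooling can take them apart mechanically. Here is a hypothetical parser; the exact field layout (e.g. <code>p-monolith-rabbit-d-01</code>) and the region default are assumptions for illustration, not our actual tooling:</p>

```python
# Hypothetical parser for instance names shaped like "p-monolith-rabbit-d-01".
# The field layout and the us-east-1 region default are assumed for illustration.
ENVIRONMENTS = {"t": "test", "i": "internal", "s": "stage", "p": "prod"}

def parse_instance_name(name):
    env, group, role, az, instance_id = name.split("-", 4)
    return {
        "environment": ENVIRONMENTS[env],
        "group": group,                          # e.g. monolith, cms, users, teamcity
        "role": role,                            # e.g. rabbit, nginx, redis
        "availability_zone": "us-east-1" + az,   # region is not encoded in the name
        "id": instance_id,                       # two digits, or a six-char hash for ASGs
    }

print(parse_instance_name("p-monolith-rabbit-d-01"))
```

<p>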
The hash is short enough that humans can keep it in short-term memory and long enough to provide uniqueness.</p> <h2 id="security-groups--iam-roles">Security Groups &amp; IAM Roles</h2> <p><em>Note: If you are familiar with Security Groups and IAM Roles, skip this paragraph.</em> Security groups are simple firewalls for EC2 instances. We can open ports to specific IPs or IP ranges, or we can reference other security groups. For example, we might open port 22 to our office network. IAM Roles are how instances are granted permissions to call other AWS web services. These are useful in a number of ways. Our database instances all run regular backup scripts. Part of each script uploads the backups to S3. IAM Roles allow us to grant S3 upload ability, but only to our backups S3 bucket, and the instances can only upload, not read or delete.</p> <p>We have a few helper security groups like ‘management’ and ‘chef’. When new instances are provisioned, we create a security group that matches the <code>{environment}-{group}-{role}</code> naming convention. This is how we keep our security groups minimally exposed. The naming makes them easier to reason about and audit. If we see an “s-*” security group referenced from a “p-*” security group, we know there’s a problem.</p> <p>We keep the <code>{environment}-{group}-{role}</code> convention for our IAM Role names. Again, this lets us grant minimal AWS privileges to each instance while making it easy for us humans to be sure we are viewing/editing the correct roles.</p> <h2 id="wrap-it-up">Wrap it up</h2> <p>We’ve adopted these naming conventions and made them just part of how folks provision AWS resources at Hudl. They make it easier to understand how our servers are related to each other and which servers can communicate with which on what ports, and we can precisely filter via the API or from the management console. For very small infrastructures, this level of detail is probably unnecessary. 
However, as you grow beyond tens and definitely past hundreds of servers, standards like these will keep your engineering teams sane.</p> Jon Dokulil Measuring Availability: Instead of Nines, Let’s Count Minutes 2016-01-24 <p>Running sites with high availability is a foregone conclusion for most businesses. Availability is pretty easy to define abstractly, but rarely explained with real examples. Most likely, the first half-dozen Google results you come across when searching for it will spend many words musing about “nines,” equating them to precise minute values, and telling you just how many of these “nines” you might need. It’s harder to find more detailed explanations about how companies go about computing and tracking their availability, particularly for complex SaaS websites. Here’s how we do it for our primary web application, <a href=""></a>.</p> <h2 id="calculating-overall-server-side-availability">Calculating Overall Server-Side Availability</h2> <p>We measure our server-side availability per minute by aggregating access logs from our NGINX servers, which sit at the top of our application stack.</p> <p><img src="/assets/56a45a20d4c961258d02f605/nginx_logs.png" alt="Availability at the NGINX layer" /></p> <p>Our NGINX logs are similar to the <a href="">default format</a>, and for availability tracking we’re interested in the status code and elapsed time. We tack on some service information, which I’ll talk about in a bit.</p> <p><img src="/assets/56a4dd70d4c961258d033468/nginx_log_example.png" alt="NGINX access log" /></p> <p>For each minute, we count the number of successful and failed responses. 
A request is considered unsuccessful if we respond with a <a href="">5XX HTTP status</a> or if it takes longer than five seconds to complete.</p> <p>Each individual minute is then categorized by its percentage of successful requests:</p> <ul> <li>If less than 90% of requests succeed, the minute is considered <strong>down</strong>.</li> <li>If greater than 90%, but less than 99%, the minute is <strong>degraded</strong>.</li> <li>Otherwise (&gt;= 99%), the minute is <strong>up</strong>.</li> </ul> <p>The down/degraded/up buckets help reflect seasonality by weighting the more heavily-accessed features, and also help separate site-wide critical downtime from individual feature or service outages.</p> <p>Here’s a three hour period with a brief incident — web servers in our highlights service spiked to 100% CPU for a couple minutes:</p> <p><img src="/assets/56a4d9f7d4c961259600519d/degraded_service.png" alt="Degraded service" /></p> <p>For 2016 we’ve set a goal of no more than 120 individual minutes of downtime and 360 minutes of degraded service. We arrived at these thresholds by looking at previous years and forecasting while also pushing ourselves to improve. Admittedly, the target could be quantified as a “nines” percentage. But counting minutes is more straightforward and easier to track than a target uptime of 99.977%.</p> <h2 id="by-the-microservice">By the Microservice</h2> <p>Hudl is split into smaller microservices, each serving a few pages and API endpoints. Along with the overall availability described above, we track it for each of these services independently to help identify contributors to availability issues. 
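</p>

<p>The per-minute classification described above is simple enough to express directly. Here is a sketch; the input shape and edge-case handling are assumptions, not our production code:</p>

```python
def request_ok(status_code, elapsed_seconds):
    """A request fails on a 5XX response or if it takes longer than five seconds."""
    return status_code < 500 and elapsed_seconds <= 5.0

def classify_minute(requests):
    """Bucket one minute of (status, elapsed) pairs as up/degraded/down."""
    if not requests:
        return "up"  # a minute with no traffic: treated as up here (a judgment call)
    ok = sum(1 for status, elapsed in requests if request_ok(status, elapsed))
    success_rate = ok / len(requests)
    if success_rate < 0.90:
        return "down"
    if success_rate < 0.99:
        return "degraded"
    return "up"
```

<p>Note that a minute at exactly 90% success lands in the degraded bucket here; the prose rules above leave that boundary unspecified.</p>

<p>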
Regularly looking into how each performs lets us know where to focus optimization or maintenance efforts.</p> <p>Here’s an example of a service that performs inconsistently and needs some work:</p> <p><img src="/assets/56a4de4eedb2f35f95007d51/struggling_service.png" alt="Struggling service" /></p> <p>Stacking these metrics up side by side makes it easy to see which services are higher volume and which are performing relatively better or worse:</p> <p><img src="/assets/56a4de88d4c96125960051b3/per_service_availability2.bmp" alt="Service comparison" /></p> <h2 id="shortcomings">Shortcomings</h2> <p>There are a few things our algorithm doesn’t cover that are worth noting.</p> <p>The argument could be made that <a href="">4XX HTTP responses</a> should also be included as failed requests. We go back and forth on this - we’ve had server-side problems manifest as 404s before, but it’s tough to tell the difference between “good” and “bad” 404s. We’ve opted to exclude them for now.</p> <p>It doesn’t cover front-end issues (e.g. bad JavaScript) since it’s solely a server-side measurement. We have some error tracking and monitoring for client code, but it’s a part of our system where our visibility is more limited.</p> <p>The server-side nature of the monitoring also doesn’t cover regional issues like ISP troubles and errors with CDN POPs, which come up from time to time. These issues still affect our customers, and we do what we can to identify problems and help route around them, but it’s another visibility gap for us.</p> <h2 id="never-perfect-always-iterating">Never perfect, always iterating</h2> <p>We’ve iterated on the algorithm several times over the years to make sure we hold ourselves accountable to our users. We have alerts on availability loss and investigate incidents prudently. If users are having a rough time and it’s not reflected in our availability, we change how we run the numbers. 
This is what works well for us today, but I expect it to change as we move forward with our systems and our business.</p> Rob Hruska The Low-Hanging Fruit of Redshift Performance 5650e6d8a0b5dd060e3353ff 2015-11-23T08:02:17-06:00 2015-11-22T13:20:00-06:00 <p><em>This post is part three of our blog series about Fulla, Hudl’s data warehouse. <a href="">Part one</a> talked about moving from Hive to Redshift and <a href="">part two</a> talked about loading in our application logging data. Part three was supposed to be about tailing the MongoDB oplog, but I decided to take a brief detour and tell you about some performance issues we started to run into after we hashed out our ETL cycle for logging data.</em></p> <p>To begin, let’s rewind about a month. We had about 30 people using our Redshift cluster regularly, a few of whom were “power users” who used it constantly. Redshift was proving itself to be a great tool to learn about our users and how they use Hudl. However, with increased usage, bottlenecks became a lot more noticeable and increasingly painful to deal with. Queries that used to take 30 seconds were taking minutes. Some longer queries were taking more than an hour. While this might be reasonable for large analytical batch-jobs, it’s hardly acceptable for interactive querying.</p> <p>We quickly learned that the biggest source of latency for us was our <code>logs</code> table. Some brief background on the logs table: it contains all of our application logs with common fields parsed out. Most of our application logs contain key-value pairs that look something like: </p> <p><code> ... Operation=View, Function=Highlight, App=Android, User=123456... </code></p> <p>We parse out about 200 of the most common key-value pairs into their own special column in the logs table. I won’t go any further into how we populate the logs table since that’s already been covered in <a href="">part two of this series</a>. 
Instead, I’m going to talk about some problems with the table.</p> <h2 id="problems-with-the-logs-table">Problems With The Logs Table</h2> <h3 id="its-a-super-useful-table">It’s a super useful table!</h3> <p>It contains information about how our users are actually using our products. Everyone wants to use it in almost every query. This is actually a really good problem to have. It means we’re providing people with something useful. However, it exacerbates some of our other problems.</p> <h3 id="its-huge">It’s huge</h3> <p>It has more than 30 billion rows. This is a problem for several reasons. It requires queries to sift through much more data than they need to. It also makes it prohibitively slow to vacuum, which brings me to the next problem…</p> <h3 id="it-was-completely-un-vacuumed">It was completely un-vacuumed</h3> <p>In Redshift, <a href="">vacuuming</a> does two things: 1) sorts the table based on a predefined sortkey and 2) reclaims unused disk space left after delete or update operations (<a href="">here’s a good SO answer</a> on why this is necessary). The logs table is generally only appended to, so we don’t really need to worry about reclaiming deleted space; however, the sorting aspect is critically important for us. The logs table had a sortkey, but the table was 0% sorted, so most queries against it had to do large table scans. We were also worried that vacuuming the table would take weeks and would use all the cluster’s resources in the process.</p> <h3 id="none-of-the-columns-had-encodings">None of the columns had encodings</h3> <p>This was problematic because it meant we weren’t using our disk space and memory efficiently. As a result, the table took up 20 TB of our 32 TB cluster, which meant we couldn’t ever make a copy of the table in place to fix our problems.</p> <h3 id="it-had-a-really-wide-column">It had a really wide column</h3> <p>We store the raw log message in a column alongside all the parsed out fields. 
If someone wanted a field from a log that we didn’t parse out, they had to extract it from the <code>message</code> field using regex or SQL wildcards. It was made even worse by the fact that the <code>message</code> column was of type <code>VARCHAR(MAX)</code>, which meant that 64K of memory had to be allocated for <strong>every single record</strong> that was read into memory. Even if the actual log message was only 1K, it still had to allocate 64K of memory for it. As a result, all of the queries that touched the message field had to write to disk a lot when they ran out of memory. It was just… just awful.</p> <h2 id="lets-fix-it">Let’s Fix It!</h2> <p>As I mentioned above, our humble two-node Redshift cluster only has 32 TB of space, so there’s no way we could copy the 20 TB table within the same cluster. Additionally, we wanted to fix this problem without hindering performance and with minimal downtime since some people use it for their full-time job. With all that in mind, here’s the plan we came up with:</p> <ol> <li>Spin up a new Redshift cluster from one of the snapshots that are taken regularly. The AWS Redshift console makes this super easy to do in about three clicks: <img src="/assets/565210f2edb2f305ef330521/snapshotrestore.gif" alt="Restoring a Redshift cluster from a snapshot." /></li> <li>Drop the <code>logs</code> table on the new cluster. Re-create the table with <code>messagetime</code> as the sortkey and with the message column truncated to <code>VARCHAR(1024)</code>. We chose <code>VARCHAR(1024)</code> because we realized &gt;99% of the logs we were storing were shorter than that. What’s more, the logs that were longer were usually long stack traces which were typically not useful in the kind of offline analysis we use Fulla for.</li> <li>Use the <a href="">copy command</a> to load the most recent month’s logs into the table from S3. 
This will automatically set sensible column encodings for us.</li> <li>Use the copy command to load the rest of the logs from S3. This took about 18 hours total. It’s somewhat interesting how disk usage increases during a COPY and decreases afterwards (shown below). I assume this is because the files that were pulled from S3 get cleaned up after they’re loaded into the table. <img src="/assets/56521327d4c961061033c7a6/diskspace.png" alt="Redshift disk space usage during a large COPY operation" /></li> <li>Run <code>vacuum logs;</code>. This took about 4 days total. After we do all that, the new cluster is ready to use! We just need to switch over from the old cluster. Luckily, Redshift makes it possible to do this with minimal downtime and without needing to update the connection string on all of your clients! We renamed the old cluster from “fulla” to “fulla-old”, then renamed the new cluster from “fulla-new” to “fulla”. Overall, the switchover took about 10 minutes total and people were ready to start using the shiny new logs table.</li> </ol> <h2 id="after">After</h2> <p>Some brief stats from after the migration. Between the column encodings, the truncated message column, and having the table vacuumed the total table size went from 20 TB to 5 TB. The number of “slow queries” we were detecting per day dropped significantly as can be seen in my fancy excel graph below:</p> <p><img src="/assets/5650e6d8a0b5dd060e335408/slowqueries.png" alt="Number of slow queries before and after the switch" /></p> <p>During this time, overall usage has been steadily increasing as we’ve been educating more people on how to use Fulla:</p> <p><img src="/assets/565213d0edb2f305ef33070e/Screen_Shot_2015_11_22_at_2_12_41_PM.png" alt="Total queries by day before and after the switch" /></p> <p>Despite this increase in usage, cluster performance has improved greatly!</p> <h2 id="next-steps">Next Steps</h2> <p>Our improvements are not yet complete. 
The logs table is still the biggest bottleneck when working with our Redshift cluster. We’re putting a process in place to extract product-specific and team-specific tables from the log data rather than having everyone use one giant logolith (seewhatididthere.jpg). Another impending problem is that the logs table takes about 12 hours to vacuum. We kick off a vacuum job each night after we load in the logs from the previous day. To get that time down, we’re planning on breaking off logs from previous years and creating a view over multiple tables.</p> <h2 id="main-takeaways">Main Takeaways</h2> <ol> <li>Vacuum your Redshift tables. Vacuum early and vacuum often.</li> <li>Make sure you have proper <a href="">column encodings</a>. The best way to do this is to let the copy command automatically set them for you if possible.</li> <li>Avoid <code>VARCHAR(MAX)</code>. When you have <code>VARCHAR</code> fields, be really conservative with their size. Don’t just default to <code>VARCHAR(MAX)</code> to avoid thinking about your data. There are real disadvantages to having wide fields.</li> </ol> <p>If you are working with Redshift and this article helped you, please let me know! We’re learning more about it every day and I’d love to swap tips and tricks with you. Stay tuned for part four of this series, where I tell you all about tailing the MongoDB oplog.</p> Josh Cox Populating Fulla with SQL Data and Application Logs 2015-11-09 <p>This is the second post about Hudl’s internal data warehouse, Fulla. The <a href="">first post</a> introduced Fulla, discussed our desires in a data warehouse, and detailed our path to Fulla v2. 
This post will cover our ETL (extract-transform-load) pipeline to get Hudl’s production SQL and application log data into Fulla on a regular schedule.</p> <p><img src="/assets/563a36efd4c9610610207396/HudlETL.png" alt="Hudl ETL diagram" /></p> <h2 id="production-sql-data">Production SQL Data</h2> <p>We have three major sources of data that go into Fulla. First, we have our production relational databases. We have two relational databases, one SQL Server and one MySQL. They are each about 100 GB, which makes this source easily the smallest of our three sources.</p> <p>To keep Fulla in sync with our relational databases, we execute nightly batch jobs to refresh the data. For each database, we spin up an Elastic MapReduce (EMR) cluster to run a Sqoop job against a read replica. The data is saved in Avro format on S3. Once a table has been exported to S3, we truncate the table in Fulla and upload the newest data. Given the similarity between tables in SQL Server and MySQL and tables in Redshift, as well as the comparably small size of our relational databases, this has been the easiest part of our ETL pipeline.</p> <h2 id="application-logs">Application Logs</h2> <p>Our second source of data is our logs. Hudl made a decision a while back to <a href="">Log All the Things</a>. As a result, we generate a few hundred million application log messages per day during our busy seasons. We recently moved to Sumo Logic (from Splunk) to collect and review our log data. Sumo Logic is great for real-time queries, but it slows down when doing multi-day queries over a broad set of logs. Further, Sumo Logic only stores data for the last 60 days, making it impossible to do year-over-year comparisons.</p> <p>Sumo Logic sends rolling archives of our logs to an S3 bucket, so getting the raw data isn’t difficult. However, we don’t have strictly structured logs, so mapping an individual log onto a columnar database like Redshift is non-trivial. 
We decided to perform a few ETL steps before loading. First, we made the decision to keep only application logs and drop other logs (e.g. Nginx) for which the volume outweighed the utility. Second, we didn’t want to just load the raw log message to Redshift, as this would require significant regex parsing and substring extraction by Fulla users. To make it easier for Hudlies to query log messages, we decided to handle some initial parsing before loading into Redshift.</p> <p>For example, a typical log message looks like the following:</p> <pre><code>2015-10-11 11:22:33,444 [INFO] [Audit] [request_id=c523a1e3] App=Hudl,Func=View,Op=Clip,Ip=,AuthUser=12345,Team=555,Attributes=[Clip=4944337215,TimeMs=10] </code></pre> <p>It includes a timestamp, a hostname, log level, and a bunch of key-value pairs. Many of these key-value pairs are used in almost all log messages, such as App, Func, Op, and AuthUser. We analyzed a set of log messages to find the most commonly-used 100 key-value pairs. We then wrote a job using regex patterns for each of those key-value pairs to extract the keys and values from the message. We’ve also allowed users to request additional key-value pairs to parse out. Now, in addition to the common log values of time, host, and even the raw message, our logs table in Fulla has over 140 other columns that Hudlies can query. Wide tables aren’t a concern in Redshift. It’s a columnar database, so it only fetches the requested fields on reads rather than the entire row.</p> <p>As an added benefit, it has been relatively easy to transition a query from Sumo Logic to Fulla. Hudlies are Sumo Logic power users, whether it’s someone from our awesome coach support team looking for an accidentally deleted playbook or a Product Manager checking the impact of a new feature. Rather than forcing them to memorize two ways of accessing data, they can use most of the same logic. 
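</p>

<p>A stripped-down sketch of that extraction step might look like the following; the real job covers on the order of a hundred keys and more message shapes, and this whitelist and regex are illustrative assumptions:</p>

```python
import re

# Keys promoted to their own columns; the real job handles ~100 of these.
COMMON_KEYS = ["App", "Func", "Op", "AuthUser", "Team"]

def extract_fields(message):
    """Pull whitelisted Key=Value pairs out of a raw log message."""
    fields = {}
    for key in COMMON_KEYS:
        # Match "Key=value" where the value runs until the next comma, bracket, or space.
        m = re.search(r"\b" + re.escape(key) + r"=([^,\]\s]*)", message)
        if m:
            fields[key] = m.group(1)
    return fields

line = ("2015-10-11 11:22:33,444 [INFO] [Audit] [request_id=c523a1e3] "
        "App=Hudl,Func=View,Op=Clip,Ip=,AuthUser=12345,Team=555,"
        "Attributes=[Clip=4944337215,TimeMs=10]")
print(extract_fields(line))
```

<p>Each extracted key then maps onto its own column in the <code>logs</code> table, so a query can filter on <code>authuser</code> or <code>op</code> instead of wildcard-matching the raw message.</p>

<p>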
That ease of transition has been a strong factor in our quick adoption rate: over 180 Hudlies have signed up for Fulla, representing nearly half of all employees.</p> <h2 id="lessons-learned">Lessons Learned</h2> <p>The transition to Fulla v2 has gone pretty well, and we’ve learned a lot along the way. Below are a few of our more helpful tips.</p> <ul> <li>Automate, Automate, Automate.</li> </ul> <p>The data in your data warehouse should update automatically, without someone needing to push a button. Your pipelines will need tweaking and require a fair bit of babysitting at first, but they will stabilize eventually. Think of the pieces that are shared across pipelines, and see how you can abstract them to be used in new pipelines.</p> <p>For example, we have a script to launch an EMR cluster. You can specify a number of nodes, utilize spot pricing, and run any number of arbitrary jobs. To create a new job, you just need to use a defined format – have a script that will prepare your environment and run the job, then zip up the directory and post it to S3. Our scheduler spins up new clusters every night to run the jobs, and they terminate when they’re finished. This has made it really easy to add new components to our ETL pipeline and has saved us a lot of money by avoiding long-running EMR clusters waiting for a task.</p> <ul> <li>Use a Workflow Management tool</li> </ul> <p>We make heavy use of Spotify’s <a href="">Luigi</a> in our pipelines. It is great for managing dependencies when executing long chains of tasks. Make sure you take full advantage of it by composing your jobs into the smallest units that make sense. For example, our Sqoop job on our SQL Server exports data from 80 different tables. Rather than doing a single <code>sqoop import-all</code> task in Luigi, we do a <code>sqoop import</code> task for each of the 80 tables. 
If our spot instances get terminated after exporting 79 of 80 tables, we can quickly pick up where we left off rather than starting over at the beginning. Luigi sees that we’re only missing data in S3 for one of the tables, so it does a <code>sqoop import</code> on that table, then proceeds to cleaning and loading to Redshift.</p> <p>There are a number of other workflow management tools out there, like Airbnb’s Airflow. Find one that works for you (or build your own, if you’re adventurous).</p> <ul> <li>Know your data</li> </ul> <p>As you’re putting together your data warehouse, it’s important to know how your data is stored and how your users will access it. This will come into play not just on SORTKEYS and DISTKEYS for Redshift, but in how your ETL pipeline is structured.</p> <p>As one example, I noted that we truncate all the tables from our relational databases and load entirely fresh data in every morning. It’s possible to update tables from a relational database in Redshift without doing this, by updating only those rows that changed. We briefly discussed streaming the binary log from MySQL and the transaction log from SQL Server, saving the updates to a staging table, and merging on a more frequent basis. However, we ultimately decided this wasn’t worth the engineering time. Our relational databases are not that big, and they don’t contain our most frequently requested data. Running the Sqoop job and copying the data into Redshift takes about 2 hours with a 5 node EMR cluster. This ends up costing us less than $3 per day, as we use spot instances with ephemeral EMR clusters.</p> <p>Our third post will discuss our initial performance issues with our design and the steps we took to improve the Fulla experience for our users.</p> Alex DeBrie Migrating Millions of Users in Broad Daylight 562a935bd4c96106101431ae 2016-01-24T08:19:29-06:00 2015-10-23T15:00:00-05:00 <p>In August we migrated our core user data (around 5.5MM user records) from SQL Server to MongoDB. 
The migration was part of an ongoing effort to reduce dependency on our monolithic SQL Server instance (which is a single point of failure for our web application) and to isolate data and operations within our system <a href="">into smaller microservices</a>.</p> <p>The migration was a daunting task – our user data sees around 800 operations/second and is critical to most requests to our site. We moved the data during the daytime while still taking full production traffic, maintaining nearly 100% availability for reads and writes during the course of the migration. Our CPO fittingly described it as akin to “swapping out a couple of the plane’s engines while it’s flying at 10,000 feet.” I’d like to share our approach to the migration and some of the code we used to do it.</p> <h2 id="the-big-picture">The Big Picture</h2> <p>Our user records have IDs that are numeric and sequential. Starting at ID 1, users are migrated in ranges of 1000 IDs at a time, moving sequentially upward through all user IDs (from 1 to about 5.5MM). The critical state during the migration is the “point” user ID. All reads and writes for user IDs below the point (i.e. migrated users) go to MongoDB; operations to unmigrated IDs above the point continue going to SQL.</p> <p><img src="/assets/562a9473d4c96106310078aa/overall_timeline.png" alt="Migration timeline" /></p> <p>For the range being migrated (the point user ID through point + 999), write operations are locked; code along all write paths throws an exception if the write is for a user ID in the locked range. This guarantees that writes are never “lost” by being sent to SQL for a user who has already been migrated. Reads for the migrating range still go to SQL and return results, and all users in the migration range are considered “pending” until every record is completely migrated and the point user ID advances to the beginning of the next range.
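</p>

<p>As a rough sketch, the point-based routing rules might look like the following (illustrative Python, not our actual C# code; the names are hypothetical):</p>

```python
# Illustrative sketch of point-based routing: reads and writes below the
# point go to MongoDB, writes inside the locked range are rejected, and
# everything above the point stays on SQL Server.

BATCH_SIZE = 1000

class MigrationState:
    def __init__(self, point, locked=False):
        self.point = point      # first unmigrated user ID
        self.locked = locked    # is [point, point + 999] currently migrating?

    def route_read(self, user_id):
        return "mongo" if user_id < self.point else "sql"

    def route_write(self, user_id):
        if user_id < self.point:
            return "mongo"
        if self.locked and user_id < self.point + BATCH_SIZE:
            # Writes to the in-flight batch throw rather than risk being lost.
            raise RuntimeError(f"user {user_id} is locked for migration")
        return "sql"
```

<p>Note that reads in the locked range still route to SQL and return results; only writes there are rejected.</p>

<p>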
A batch of 1000 users takes about 500ms to migrate, so the probability of a write request colliding with a user being migrated is low.</p> <h2 id="code-refactoring-conditional-routing">Code Refactoring, Conditional Routing</h2> <p>A lot of prep work went into refactoring application code to introduce conditional behavior that checks user IDs to see if they’ve been migrated or not, invoking SQL or MongoDB where appropriate.</p> <p>Hudl’s web application code is multi-layer; it has a domain layer (with interfaces like <em>IUserDomain</em> and <em>IUserUpdateDomain</em>) that encapsulates business logic, validation, etc., and the domain implementations call a data layer of DAO classes that work directly with the database(s). Web/MVC components (e.g. controllers and service endpoints) sit above the domain layer. Refactoring this meant:</p> <ul> <li>Introducing the new MongoDB data layer and domain code to support the same behavior as the existing SQL domains.</li> <li>Adding a “Migration Resolver” (<em><a href="">MigrationResolver.cs</a></em>) that decides which domain implementation to return for the requested user ID(s), and re-routing all higher-layer code (e.g. controllers, APIs, etc.) through the resolver instead of (the previous behavior) directly interfacing with the domain layer.</li> <li>Introducing a “Migration State” class (<em><a href="">UserMigrationState.cs</a></em>) as the source of truth about where the migration is at (i.e. the point user, and whether a batch is locked for migration).</li> </ul> <p><img src="/assets/562a9465a0b5dd060e13a96c/code_refactoring.png" alt="Code refactoring" /></p> <p>Post-refactor, here are a few examples of how different user operations behaved:</p> <p><strong>GetUserById(123)</strong> - <em><a href="">MigrationResolver.cs</a> (line 118)</em></p> <p><em>The resolver checks the state singleton for the provided user ID. 
If the user is migrated, the resolver returns the MongoDB domain; otherwise it returns the SQL domain.</em></p> <p><strong>GetUsersByIds({ 123, 456, 1200, 13457 })</strong> - <em><a href="">HybridUserLookupDomain.cs</a> (line 60)</em></p> <p><em>The resolver returns a hybrid domain that uses the state singleton to split requested IDs into migrated and unmigrated subsets, then queries the MongoDB domain for the former and the SQL domain for the latter. Results from both are combined, sorted, and returned.</em></p> <p><strong>GetUserByEmail(“”)</strong> - <em><a href="">HybridUserLookupDomain.cs</a> (line 40)</em></p> <p><em>The resolver returns a hybrid domain that queries MongoDB for the email first; if found, it’s returned; otherwise SQL is queried for the same email.</em></p> <p>Consistent layering and vertical separation between SQL and MongoDB code really helped us. By guaranteeing that everything went through the resolver and that all of our SQL and MongoDB operations were isolated and separate, we could be confident that we didn’t have stray SQL calls scattered throughout the code, reducing the potential that we missed or forgot one as we refactored and introduced our conditional routing. It also let us introduce migration-specific code, like our write-lock behavior, in just a few places, giving us confidence that we had everything covered. If you’re tackling a similar effort, consider some up-front work (independent of the migration effort itself) to improve your code layering if you don’t have it already.</p> <h2 id="the-migration-job">The Migration Job</h2> <p>A custom, interactive console application (<em><a href="">UserMigrationJob.cs</a></em>) serves as the control process for the entire job. It provides commands to migrate single users or manual ranges for testing and gives clear output about migration state and success.
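</p>

<p>In rough Python pseudocode (the real job is a C# console app; <code>set_state</code> and <code>send_batch</code> are hypothetical helpers standing in for the HTTP calls), the job's control loop looks something like this:</p>

```python
# Hypothetical sketch of the migration job's control loop: lock the range
# on every webserver, fan the batch out, abort on anything unexpected, then
# advance the point and unlock.

def migrate_all(servers, set_state, send_batch, max_user_id, batch_size=1000):
    point = 1
    while point <= max_user_id:
        batch = list(range(point, point + batch_size))
        # Lock the range on every webserver before touching any data.
        set_state(servers, point=point, locked=True)
        # Split the batch into one sub-batch per webserver and fan out.
        chunk = -(-len(batch) // len(servers))  # ceiling division
        results = [send_batch(s, batch[i * chunk:(i + 1) * chunk])
                   for i, s in enumerate(servers)]
        if any(r != "ok" for r in results):
            # Unknown or failed state: unlock and stop for manual intervention.
            set_state(servers, point=point, locked=False)
            raise SystemExit(f"aborting at point {point}: {results}")
        # Advance the point past the migrated range and unlock.
        point += batch_size
        set_state(servers, point=point, locked=False)
    return point
```

<p>
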
It’s written to be cautious rather than assume success – if it ever encounters a response or state that it’s unfamiliar with, it aborts the migration and leaves the system in a known, stable state, printing out diagnostic information to help fix the situation and resume after making corrections.</p> <p>Here’s an early run of the job in our staging environment (the batches take a bit longer to migrate since our stage hardware is less powerful):</p> <p><img src="/assets/562a946ba0b5dd060e13a96e/console_app.png" alt="Migration job console application" /></p> <p>Our primary production webservers (about 20 c4.xlarge EC2 instances) serve as the workhorses for the migration; user IDs are split into batches and parallelized across them. Using the webservers is convenient because they already have all of the code needed to interact with both SQL and MongoDB, so there’s no need to copy or duplicate data layer code to the job itself. It also makes it easy to parallelize the job, since it just means splitting a range of user IDs across all the servers.</p> <p><img src="/assets/562a9a7ba0b5dd060e13ac57/job_server_parallelization.png" alt="Parallelization across webservers" /></p> <p>It does mean throwing some extra load at apps taking production traffic, but that’s mitigated by our systems monitoring and logging, and by our ability to throttle or stop the job if necessary.</p> <p>The migration job itself starts at user ID 1 and works upward in batches of 1000. For an iteration migrating a range of users, it:</p> <ul> <li>Splits the batch of 1000 IDs into N sub-batches, one per webserver.</li> <li>Makes an HTTP request to set the migration state on each webserver, which sets the point user ID and locks writes for the entire range about to be migrated.</li> <li>Sends one sub-batch to each webserver. The webservers query the batch of users from SQL, insert them into MongoDB, and heavily verify the migration by re-querying both databases and comparing results.
The webserver responds with a status and some result data.</li> <li>Makes an HTTP request to each webserver to move the point user ID and unlock the batch. The point is moved up to the next unmigrated user ID.</li> <li>Repeats.</li> </ul> <p>If at any point a step fails or the servers don’t report a consistent state, the job stops itself, abandons the current batch, unlocks writes, and echoes the current state for manual intervention.</p> <p>Having the migration state mutable and stored in memory on the webservers makes the system a bit vulnerable. The state is critical for routing, and if it’s ever out of sync it means we’re potentially corrupting data by sending it to the wrong database. To mitigate this, the job persists the state to a database after each batch is migrated, and the webapps are coded to read the state on startup, or to prevent startup completely if they can’t. The job is also adamant about keeping the state consistent, preferring to halt the migration for manual correction rather than continue if things aren’t as they should be.</p> <h2 id="go-time">Go Time</h2> <p>When everything was ready to go, we came in on a Monday morning and turned on the configuration toggle to start sending new user creation to MongoDB and let that simmer for a little while, keeping an eye on metrics and logs for anything unexpected. During this and the remainder of the migration, we were in close contact with our awesome support team’s tech leads – they were plugged in on where we were at, and were looking out for any odd support calls or behavior that trended with the ranges of IDs for new users or users that had been migrated.</p> <p>We started the actual migration by moving users 1-9999 on Monday afternoon. Again moving cautiously, we monitored the systems for the rest of the day and overnight before committing to a larger-scale run. </p> <p>On Tuesday morning we ran the migration on users 10000-99999.
This longer, sustained run let us understand how the load of the migration would impact the webservers and databases we were working with. After pushing the migration up to user 499999, we stopped again to monitor and observe. We did uncover a couple minor bugs in our domain layer at this point, so we spent the remainder of the day coding, testing, and deploying those fixes.</p> <p>Wednesday after lunch, we pulled the trigger and ran the migration on the remaining users, which took around an hour to complete. Nobody really even noticed we were migrating some of Hudl’s most critical data right out from under them; it was a typical Wednesday afternoon – and that’s exactly how we wanted it.</p> <h2 id="wrapping-up-and-a-few-of-many-lessons-learned">Wrapping Up, and a Few (of many) Lessons Learned</h2> <p>After completing the migration, we saw our SQL Server steady state CPU utilization drop from 25% to 15%. That was a pretty big win for us. Additionally, moving the data and code to its own microservice gave us a great bounded context to work within as we go forward, keeping it isolated and making changes lower-risk and easier to deploy.</p> <p>One thing that’s <em>really</em> easy to do during migrations is let “little changes” creep in. You see some code that could use a little cleanup, or find a data type or method signature that could be improved a bit, and it’s easy to say, “oh hey, I’ll just fix that now while I’m in there”. My advice: don’t. Every little change adds risk and additional testing to something that’s already inherently full of risk. Make a note of those things and change them after you’re done with the migration. Trust me, it’ll keep you sane.</p> <p>We started the project without dedicated QA, and that really hurt us in the long run; we crammed a bunch of testing late in the effort. Try to get QA involved early and all the way through the process. 
Beyond that, good communication with our support team during the process helped us stay on top of any customer-facing issues that we didn’t notice in our metrics. Bottom line: don’t go too far off the grid. Being heads-down is important, but stay connected to keep the feedback coming.</p> <p>Finally, make sure you’re monitoring everything you can. For us, that included things like:</p> <ul> <li>MongoDB operations</li> <li>CPU (and other system metrics) on webservers, SQL Server, and the new MongoDB systems</li> <li>User logins, accesses, updates, creates, etc. (for error rates, volume, and performance)</li> <li>Application logs (we use SumoLogic to aggregate), including many that we’d specifically added for the migration</li> </ul> <p>It was an exciting, challenging (and at times grueling) project. I’m really proud of the team and all the time and effort put in to make it so successful. There’s so much more I could cover here, so if you’d like more detail or insight into something, hit us up on Twitter at <a href="">@HudlEngineering</a>.</p> Rob Hruska Hello Fulla 560a871bedb2f335500880b7 2015-11-22T13:19:48-06:00 2015-09-29T08:00:00-05:00 <p>Over the last year, the Data Engineering squad has been building a data warehouse called Fulla. Recently, the squad rethought our entire data warehouse stack. We’ve now released Fulla v2, and Hudlies are querying data like never before, giving us a better understanding of our customers and our product.</p> <p>Every night, Fulla gets a fresh copy of most of our production data, which comes from SQL Server, MySQL, and a handful of MongoDB clusters. We also parse all the logs from our web servers and append them to a logs table. When we say “Fulla” at Hudl, most people think of <a href="">re:dash</a>, an open source query execution app.
However, Fulla is our ETL pipeline, Redshift, <em>and</em> re:dash.</p> <p>There are two big challenges that make exporting data tricky at Hudl:</p> <ol> <li>A few years ago we <a href="">bet the farm on MongoDB</a>. Many old data models continued living in SQL Server (users, teams, schools, etc.), but new models were sent to MongoDB.</li> <li>More recently, we started moving to a <a href="">microservices architecture</a>.</li> </ol> <p>Both moves have been great for development at Hudl. But for serious statistical analyses, we need all the data in one place. In the early days of data science at Hudl, exporting data was a highly manual (and fairly janky) process. It involved finding the router or primary node for the Mongo collection we cared about, running <code>mongoexport</code> to an external drive attached to the server and copying the data to <a href="">S3</a>. Then, we would write a SQL query to get the rest of the business data we cared about and ship that to S3. If we wanted log data, we had to use the Splunk API to write a query, which felt a lot like draining the Atlantic with a coffee stir. Needless to say, we spent a large majority of our time moving data around, and not much time doing the more interesting things data scientists love to do.</p> <p>We quickly realized we needed a data warehouse. We use S3 to feed Spark batch jobs so our initial thought was to build a Hive warehouse on top of S3 instead of HDFS. We thought, “If Netflix is doing it, how hard could it be?” As it turns out, very hard. Because we really wanted to use S3, we picked EMR as our Hadoop implementation. I won’t go in depth about this part of our journey, but here are a few problems we never found a good solution to: </p> <ul> <li>Serialization/Deserialization</li> <li>Latency</li> <li>Multi-tenancy</li> <li>Cluster maintenance</li> </ul> <p>EMR shines as an engine for batch jobs. It’s extremely easy to use and we <em>love</em> Amazon’s Spark on YARN implementation. 
But as a persistent Hive warehouse, it gave us mixed results at best. Could we have gotten it to work? Possibly. But it was a big headache and we were eager to move away from it.</p> <p>Enter Redshift. In June, we spent a few days with our AWS Solutions Architect learning about Redshift and spinning up a proof-of-concept cluster. It was love at first sight. Switching to Redshift solved all of the above problems we faced with Hive. The one tradeoff is that Redshift is stricter about the schema, but after using it for a few months I’m no longer convinced that this constraint is a negative.</p> <p>This is the first in a three-part series on Fulla. The switch from Hive to Redshift has taken us from 4-5 diehard users to more than 80 Hudlies querying our data. This gives us a better view of the company, and we believe it’s going to promote a data-driven culture at Hudl, so we want to share how we built it. The next post will give an overview of our ETL pipeline and describe how we process our logs so they can be queried in Redshift. After that, we’ll post about how we tail Mongo Oplogs to keep our copies of production data fresh and clean.</p> Ben Cook Data Science on Firesquads: Classifying Emails with Naive Bayes 55d39d71edb2f325e100fdfb 2015-08-18T18:11:50-05:00 2015-08-18T16:00:00-05:00 <p>At Hudl, we take great pride in our Coach Relations team and the world-class support they provide for our customers. To help them out and to foster communication between the product team and Coach Relations, we have an ongoing rotation known as Firesquad. Each squad on the product team takes a two-week Firesquad rotation, during which we build tools and fix bugs that will help Coach Relations provide support more efficiently and more painlessly. </p> <h1 id="introduction">Introduction</h1> <p>This year, for our Firesquad rotation, we on the Data Science squad wanted to help automate the classification of support emails.
The short-term goal was to reduce the time Coach Relations needs to spend when answering emails. Longer term, this tool could allow us to automatically detect patterns and raise alarms when specific support requests are occurring at an abnormal rate. </p> <h1 id="road-mapping">Road-mapping</h1> <p>At a practical level, we had two challenges to solve:</p> <ol> <li>Train an effective classifier. </li> <li>Build an infrastructure that reads emails from Zendesk, classifies them, and writes those classifications back to Zendesk.</li> </ol> <p>To solve these challenges, we used the following technologies. </p> <h4 id="modeling-apache-sparks-mllib">Modeling: Apache Spark’s MLlib</h4> <p>Although there are many machine learning implementations that could be used to classify emails, few of them can train on large datasets as efficiently as <a href="">Apache Spark’s MLlib</a>. Given the large number of emails in our training sample and the even larger number of features that we anticipate using, the choice of MLlib was quite natural. </p> <h4 id="data-pipeline-amazons-kinesis">Data Pipeline: Amazon’s Kinesis</h4> <p><a href="">Amazon Kinesis</a> is a cloud-based service for processing and streaming data at a large scale. Although we do not currently have a large influx of support emails that would necessitate such a solution, we decided to use Amazon’s Kinesis because of its scalability and ease of use. In addition, learning to use Kinesis would level up our team for processing large-scale data in real time. </p> <h1 id="the-classifier">The Classifier</h1> <p>The task of classifying emails is not a new one. Spam filters are a classic example of this task. Rather than reinvent the wheel, we decided to use a tried-and-true approach: a Naive Bayes classifier using word n-grams as features.
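</p>

<p>As a toy illustration of the approach (the production model was trained with Spark MLlib; this plain-Python version just shows the idea, and all names are ours), a word n-gram Naive Bayes classifier can be sketched as:</p>

```python
# Minimal word n-gram Naive Bayes sketch in plain Python, for illustration
# only -- the real classifier ran on Spark MLlib over far more data.
from collections import Counter, defaultdict
import math

def ngrams(text, n_max=2):
    """Unigrams through n_max-grams of a whitespace-tokenized text."""
    words = text.lower().split()
    feats = []
    for n in range(1, n_max + 1):
        feats += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return feats

class NaiveBayes:
    def fit(self, emails, labels):
        self.priors = Counter(labels)            # class frequencies
        self.counts = defaultdict(Counter)       # label -> token counts
        self.vocab = set()
        for text, label in zip(emails, labels):
            toks = ngrams(text)
            self.counts[label].update(toks)
            self.vocab.update(toks)
        return self

    def predict(self, text):
        def log_prob(label):
            total = sum(self.counts[label].values()) + len(self.vocab)
            lp = math.log(self.priors[label])
            for tok in ngrams(text):
                # Laplace smoothing so unseen tokens don't zero the product.
                lp += math.log((self.counts[label][tok] + 1) / total)
            return lp
        return max(self.priors, key=log_prob)
```

<p>
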
</p> <h2 id="naive-bayes">Naive Bayes</h2> <p>Bayes’ theorem (shown below) indicates that the probability that a certain email is of class <code>C_k</code>, given that it has a certain set of features <code>x_1, ..., x_n</code>, is proportional to the prior probability of that class (how often that class occurs in all the emails) times the probability of those features occurring in an email given that the email is of class <code>C_k</code>, divided by the probability of those features occurring. </p> <p><img width="300" alt="Bayes Theorem" src="" /></p> <p>Using the chain rule allows us to rewrite Bayes’ theorem as:</p> <p><img width="600" alt="Chain Rule" src="" /></p> <p>We’re unlikely to have any way to estimate these complex conditional probabilities, so we make the naive assumption that features are conditionally independent:</p> <p><img width="200" alt="Conditional Independence" src="" /></p> <p>This makes the problem much more tractable and allows us to simplify the initial classification probability to the product of simple single-feature conditional probabilities, as shown below:</p> <p><img width="300" alt="Naive Bayes" src="" /></p> <h2 id="data">Data</h2> <p>To start, we exported all email data from Zendesk from 2012 to 2015. After removing unlabeled emails and emails with out-of-date labels, we had 150,000 emails to use in building and evaluating the model. 80% of these emails are used to train the model, while 20% are reserved as a test set for performance evaluation. </p> <h2 id="model">Model</h2> <p><img src="" alt="Flowchart" /></p> <p>A flowchart showing the steps we took to build the classifier is shown above. We first tokenize each email by creating n-grams of one, two, three, four, and five words. After this, we remove all “stop words” such as “the” or “and.” To find out which tokens are most important for differentiating between categories, we went through each email category and calculated the signal-to-background (S/B) ratio.
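</p>

<p>A minimal sketch of that per-token filter (illustrative only; the function names are ours, not the production code's, and the substring match stands in for real tokenization):</p>

```python
# Illustrative S/B token filter: for each (category, token),
# S/B = (# emails in the category containing the token)
#     / (# emails outside the category containing the token).

def sb_ratio(emails, labels, token, category):
    signal = sum(1 for text, lab in zip(emails, labels)
                 if lab == category and token in text)
    background = sum(1 for text, lab in zip(emails, labels)
                     if lab != category and token in text)
    # Avoid division by zero when the token never appears outside the category.
    return signal / background if background else float("inf")

def discriminating_tokens(emails, labels, tokens, category, threshold=4):
    """Keep only tokens whose S/B ratio clears the threshold."""
    return [t for t in tokens
            if sb_ratio(emails, labels, t, category) > threshold]
```

<p>
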
The S/B ratio for a given category and token is defined as the number of emails containing that token that are in the category divided by the number of emails containing that token that are not in that category. For a specific category, call it <code>A</code>, this can be written: <code>P(A | token)/(1 - P(A | token))</code>. We want the S/B ratio to be fairly high so that we only use tokens with strong discriminating power; in our case we require <code>S/B &gt; 4</code>. </p> <h2 id="test-set-performance">Test Set Performance</h2> <p>The classifier was trained with Apache Spark MLlib’s implementation of a Naive Bayes classifier. The training sample consisted of 80% of the emails. The overall accuracy of the classifier was evaluated using the remaining 20% of emails that had been reserved as a test set, and was found to be 87.9%. The confusion matrix for the entire test set is displayed below. </p> <p><img src="" alt="confusion matrix" /></p> <p>As you can see, the classifier performs very well on the top labels but very poorly on any labels that do not have many examples. This is largely because there is not enough data to discriminate between the features of these low-occupancy emails. The existence of so many low-occupancy email categories is largely because the email labeling used in Zendesk has recently been updated for certain labels. With future data, the classifier can be retrained and its performance on many labels should improve dramatically. </p> <h2 id="systematic-uncertainties">Systematic Uncertainties</h2> <p>Although our classifier performs well on the 20% test set, this test set is not, in fact, representative of the current label distribution.
The distribution of the top five labels over time is shown below:</p> <p><img src="" alt="label probabilities" /></p> <p>To see how this changing label distribution would affect the accuracy, we calculate the accuracy for a given month by multiplying the precision for each label by the number of emails with that label and dividing by the total number of emails in that month. Mathematically, this is represented by the following equation:</p> <p><img width="200" alt="Monthly Accuracy" src="" /></p> <p>where <code>a_m</code> is the accuracy for month <code>m</code>, <code>p_i</code> is the precision for label <code>i</code>, and <code>n_{i,m}</code> is the number of emails in month <code>m</code> for label <code>i</code>. The figure below shows the distribution of calculated accuracies for each month in 2014 and 2015. </p> <p><img src="" alt="Monthly Accuracies" /></p> <p>We expect that the mean of these accuracies will be similar to the mean we will see in the future when this classifier is put into production. In addition, we can calculate our systematic uncertainty on this mean by finding the difference between this mean and the 34.1st and 65.9th percentiles. This gives us an expected accuracy of <strong>75.6% +5.8%/-6.8%</strong>.</p> <h1 id="deployment">Deployment</h1> <p>As mentioned previously, we chose to implement this classifier with the combination of <a href="">Apache Spark’s MLlib</a> and <a href="">Amazon’s Kinesis</a>.
The use of both of these tools in tandem allows us to effortlessly scale the pipeline to handle widely varying loads.</p> <p><img src="" alt="kinesisclassifierdeployment" /> The data pipeline consists of six principal steps, shown in the above flowchart:</p> <ol> <li> <p>Emails collected by Zendesk are batched and JSON-formatted by our <em>Zendesk/Kinesis Interface</em>, written in Google’s <em>Go</em> language.</p> </li> <li> <p>The <em>Zendesk/Kinesis Interface</em> then implements a Kinesis Producer and publishes the email records to the input Kinesis shard.</p> </li> <li> <p>The <em>Email Classifier</em>, having loaded the latest model from Amazon’s S3 [A.], connects to the input shard and reads the latest records. It then formats the emails into a Spark RDD and classifies them in parallel.</p> </li> <li> <p>With the emails classified, the Spark job formats JSON records containing the email ID and the predicted label. It batches these into batches of 500 records and publishes them to a different, <em>output</em> Kinesis shard.</p> </li> <li> <p>The <em>Zendesk/Kinesis Interface</em> then receives these output records.</p> </li> <li> <p>With the labeled emails, the <em>Zendesk/Kinesis Interface</em> then modifies the webmail form by pre-populating the category selection with the predicted label.</p> </li> </ol> <p>We can now scale this infrastructure by simply adding additional Kinesis shards when I/O-limited, or by adding Spark executors if processing becomes a bottleneck.</p> <p>Finally, as time progresses and we receive more emails and feedback from Coach Relations, we can retrain the existing model, or create new models altogether, and simply upload them to Amazon’s S3.
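</p>

<p>"Newest model wins" can be as simple as a sortable timestamp in the object key; here is a hypothetical sketch (the key-naming scheme is an assumption for illustration, not our actual S3 layout):</p>

```python
# Hypothetical "newest model wins" selection: model artifacts carry a
# sortable ISO-8601 timestamp in the key, and the classifier loads whichever
# key sorts last. The prefix and key format are illustrative assumptions.

def latest_model_key(keys, prefix="models/email-classifier/"):
    candidates = [k for k in keys if k.startswith(prefix)]
    if not candidates:
        raise ValueError(f"no model found under {prefix}")
    # ISO-8601 timestamps sort lexicographically, so max() is the newest.
    return max(candidates)
```

<p>
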
The newest model is then selected and implemented automatically, allowing us to continually optimize and improve.</p> <h1 id="conclusions-and-next-steps">Conclusions and Next Steps</h1> <p>Moving forward, we would like to make improvements to the email classification performance so that the classifier can perform better on categories outside the top six. One way for us to do this is to gather more labeled email data and use it in training. We will gradually accumulate more labeled emails as time goes on and as the Coach Relations team answers more emails, so this one will occur naturally. A second way is to use a more advanced classification scheme that does not make the naive conditional independence assumption used by Naive Bayes. To this end, we have begun testing out some recurrent neural networks. </p> <p><em>Stay tuned and if you want to help us build recurrent neural networks or other awesome classifiers, <a href="">contact us</a>!</em></p> <script language="javascript"> var images = $('.blog-post img'); if (images) { images.css('border', 'none'); } images = $('img[title="Find with or without shard key"]'); if (images) { images.css('border', ''); } </script> William Spearman Faster and Cheaper: How Hudl Saved 50% and Doubled Performance 55cd3a5cc0d6714189000ea3 2015-08-14T12:50:12-05:00 2015-08-14T12:00:00-05:00 <p>Hudl has been running on Amazon Web Services (AWS) for years and we rarely take the opportunity to optimize our instance types. Recently we began moving to Virtual Private Cloud (VPC), which caused us to re-examine each instance type we use. Choosing the right instance type from the start is challenging. It’s tough to choose the optimal instance type until you have customers and established traffic patterns. AWS innovates rapidly, so today there are a lot of choices. Balancing compute and memory, durability and performance of storage, and the right type of networking attributes is important. 
Choose wrongly, and you’ve wasted money, incurred downtime, and/or hurt performance. Spend too much time optimizing and you may have traded time better spent on product improvements and revenue in exchange for a relatively small amount of money. </p> <p>In this blog post I describe how we shaved 50% off our AWS spend for web servers and doubled performance. We also learned about how different instance types perform relative to each other.</p> <h1 id="goals">Goals</h1> <ol> <li>We wanted to understand how much traffic one server could handle.</li> <li>Once we understood max load, we could make a better apples-to-apples comparison of different EC2 instance types and figure out the optimal one for our usage.</li> <li>Thinking about auto-scaling, we wanted to understand an appropriate metric (CPU, requests per second, something else?) to trigger scaling events.</li> </ol> <h1 id="approach">Approach</h1> <p>A common challenge for conducting load tests is coming up with accurate test data. It can be time-consuming to generate the test data and you still only get an approximation of reality. You don’t want to optimize for a traffic pattern that won’t actually occur in production. Rather than simulate production traffic, our tooling allowed us to safely use production traffic for these tests. In addition to more accurate results, we saved a lot of time. This entire effort was done with just one day of work.</p> <p>To use production traffic for this test, we relied on a characteristic of the Elastic Load Balancer (ELB) service and some of our own internal tooling. All of our web traffic initially flows into our ELB. The ELB divvies traffic up evenly across our nginx instances. 
Nginx is aware of our various services and will choose the appropriate app instance for that request.</p> <p><img src="" alt="ELB to Nginx to many services" title="ELB to Nginx to many services" /></p> <p>Because we run in AWS and we care about high availability, we run servers in triplicate by running in multiple availability zones (AZ). Side note: if you aren’t familiar with the idea behind availability zones, <a href="">watch this</a> (~6 min); it’s pretty cool stuff. To maximize performance and isolate problems, we like to keep traffic within the same AZ. We think of an AZ as a separate data center, so it makes sense not to hop back and forth between data centers while servicing each request. </p> <p><img src="" alt="Availability Zone architecture" title="Availability Zone architecture" /></p> <p>We reduced the number of servers available in one AZ, and that AZ became our Test set. Because the ELB would continue to divvy up traffic evenly across all three zones, the other two were completely unaffected and became our Control set. By reducing the number of servers in the Test set, we could gradually increase the amount of traffic handled by each server. We monitored performance for signs of degradation. Once we began to observe degraded service, bingo, we knew the maximum load. </p> <p><img src="" alt="Performance throughout the scaling events" title="Performance throughout the scaling events" /> <sup>The top dashboard shows average response times and the bottom shows 90th percentile (p90) response times. Yellow is the Control set and purple is Test. The red lines show two separate scaling events. Performance didn’t seem to deviate too much, though it does show several high peaks after the second downscaling event.</sup></p> <p>As we continued to shed servers in the Test set, we also kept an eye on CPU utilization. By incrementally ratcheting up traffic, we could observe the CPU characteristics at maximum load.
</p> <p><img src="" alt="CloudWatch CPU Utilization" title="CloudWatch CPU Utilization" /> <sup>You can see the impact to CPU after we increased the amount of traffic (the two green lines) vs the CPU of a server in the control AZ. Instance IDs blurred to protect the innocent.</sup></p> <p>We repeated this same test with a few different instance types to find the sweet spot for us. The service under test was the oldest in our infrastructure and was running on m1.large instances. We finally landed on c4.xlarge and found that not only could we cut our hourly spend in half, but performance actually improved by 2x! The performance improvement was an unexpected bonus. </p> <h1 id="takeaways">Takeaways</h1> <ol> <li>After testing a few different instance types and finding the maximum load a server could handle, we were able to run a quarter as many app servers. Our hourly (non-reserved) spend dropped by 50%.</li> <li>Despite the huge cost savings, we also saw a 2x improvement in response times! This came about by getting onto the newer instance family. In our case, this was a move from the m1 to the c4 family.</li> <li>Something we observed (and it would be sweet if Amazon made it clearer) is that compute, or cores, are not apples-to-apples across instance families. Within a family, the 2x, 4x, 8x instances are apples-to-apples. The m4.4xlarge is pretty much twice as fast as the m4.2xlarge. But the two cores on the m1.large are much slower than the two cores on a c4.large. Some of these instance families are pretty old (the M1 family was released in <a href="">2007</a>), so a good default is to always choose the most recent generation.</li> <li>Amazon has excellent <a href="">details about instance types</a> online, but they make it nearly impossible to easily compare them.
Luckily, there are a number of sites available for just this purpose. I enjoy <a href=""></a>.</li> <li>While testing another service, we observed a single c4.xlarge instance (16 ECU) handle the same load as 21 m3.medium instances (3 ECU each). And it was running around 12% CPU utilization vs the 25-30% on the m3.mediums! Oh, and response times went down, once again, by half!</li> <li>We found that performance began to suffer at around 40-50% average CPU utilization. At Hudl, we want to be able to lose an entire AZ (one third of our capacity) at any time without degrading performance. Assuming 35% utilization is our max, we need to actually aim for 35% * ⅔, or around 23%. That way, in the event of an entire AZ failure, we can absorb that traffic into the other two and still maintain performance.</li> <li>Having the tooling and infrastructure in place to quickly route traffic made it easy to conduct this experiment with minimal risk to our users. We invest a lot of time and effort in our foundation. This is one of the many ways that investment pays off.</li> </ol> <p><em>Interested in working on problems like this? <a href="">We should talk</a>.</em></p> Jon Dokulil The Ultimate Test 55ad2206edb2f361830154e2 2015-07-20T13:58:11-05:00 2015-07-20T11:00:00-05:00 <p>It was a hot, sunny Thursday morning — not a cloud in sight. The smell of sunscreen hung in the air. The music pounded, shaking the ground. Hundreds of parents, friends, and fans filtered onto the sideline, anxiously waiting for the players to take the field.</p> <p>It wasn’t a normal week in Beaverton, Oregon. The Nike Headquarters campus was packed with out-of-towners for The Opening. In its fifth year, Nike hosted a four-day event of competitions for the top 166 high school football players across America, including Nike Elite 7on7, Nike Elite Linemen, and the Nike Football Rating National Championship. The prize? 
Bragging rights.</p> <p><img src="/assets/55ad2206edb2f361830154e6/IMG_4941.jpg" alt="Nike's The Opening" /></p> <p>Hudl’s football squad had the opportunity to spend the week testing our newest product — Hudl Sideline. Our main testing event was the Nike Elite 7on7. There were six teams competing for the title — Alpha Pro, Fly Rush, Hyper Cool, Lunarbeast, Mach Speed, and Superbad — each led by several football greats. The experience was surreal. It was hard to believe we were surrounded by so much talent. Even more exciting, our team saw first-hand how valuable the sideline replay product that we built can be.</p> <p>At first, some teams seemed hesitant to give it a try, but quickly changed their tune when they saw just how much they could learn from this exceptional product. </p> <blockquote> <p>“Give me the Hudl Pad,” one coach called out. “I want to see that last play.” </p> </blockquote> <p>Coaches sat with their players, meticulously pointing out a misstep or a blown coverage. Athletes gathered together and critiqued each other’s form and positioning. By the end of the first day, almost every coach had their athletes gathered around the iPad to go over plays from the last drive. </p> <p><img src="/assets/55ad27caedb2f361ae015539/IMG_4966.jpg" alt="Mach Speed's defense surrounding the iPad" /></p> <p>We had all chosen one team to help throughout the competition. I ended up with the neon yellow team, Mach Speed (which just so happened to match my sneakers perfectly). I spent the entire competition on the sideline at their games, emphatically handing them the iPad after every drive. As minutes passed, I found myself growing more and more committed to their success. When they were excited, I was excited. When they were upset, I was upset. I wanted my boys in yellow to win it all. </p> <p>I’ve watched users test my products before. Why was this time different? This was more than an average testing opportunity; it was the ultimate testing opportunity. 
There were no organized user tests or stuffy conference rooms. We were right there with the team — play by play, in the center of the action. In those two days, I became part of their team. The coaches and athletes recognized me and knew where to go if they needed to watch a replay. The players sought us out to see what they had missed on that last play. It was more than an hour-long meeting of questions and answers. It was apparent that our product had become a necessary part of the coaches’ teaching technique. </p> <p>At 4 p.m. on the last day of competition, my team had made it to the finals. I stood by their side — iPad in hand — prepared to pass it off to anyone ready to watch. The team started off strong, but couldn’t keep up with their opponent. Mach Speed couldn’t clinch a victory, but I knew that what we had built helped make them better players. As the competition came to an end, I had such a sense of pride in the product that our team had built. We provided value. We helped them succeed.</p> <hr /> <p>Originally posted on <a href="">Medium</a>.</p> Leigh-Ann Bartsch Understanding User Interactions on Mobile 55c121e0c0d67134a5004b22 2015-08-04T16:29:40-05:00 2015-06-29T15:35:00-05:00 <p>Understanding how users interact with your product is key to lean software development. Seeing people trip over your navigation or completely turn around your expected use case may be cringe-inducing, but it’s critical to understanding if what you built delivers. We’ll take a look at software we’ve used at Hudl to see exactly how a person uses an app in a non-intrusive manner. It’s not only saved us loads of time but also greatly increased our sample size.</p> <p>There are several well-known methodologies to see and understand actual use of desktop and web software. They range from low-tech methods like sitting behind someone and watching them use your software to high-tech tools like <a href=""></a> and <a href="">Verify</a>. It isn’t as easy with mobile. 
For me, hovering over a person while they tap on a five-inch screen has proven to be a tough experience. It doesn’t feel like a user’s true usage environment (for them and us) and small screens can be difficult to crowd around.</p> <p>One tool that helps me better understand Hudl’s mobile users is <a href="">AppSee</a>. It’s a mobile analytics SaaS that records videos of people using our mobile apps. We use it to sample a small percentage of random sessions to see how users are truly using our apps. We’ll search for videos based on an action such as “Edited Breakdown Data” to see Hudl in action. We’ve used the content to help find flaws in our design assumptions that user interviews didn’t uncover.</p> <h3 id="a-surprising-yet-obvious-discovery">A surprising, yet obvious, discovery</h3> <p>After releasing a redesigned 4.0 version of the Hudl iOS App, we noticed people love to tap their profile pictures. A lot. When users viewed our main menu, many would tap on their profile picture, presumably expecting it to take them to a profile view. This was on our roadmap, but it wasn’t prioritized with the initial release.</p> <p>This wasn’t an action tracked by any of our standard analytics because it wasn’t something we expected. After watching several sample sessions, we recognized the problem, reprioritized the profile, and linked the profile picture to an editable profile page.</p> <p><img src="/assets/55c12327edb2f36a04003bec/center/ebd_post_1.png" alt="It seems obvious in hindsight, but people were tapping their profile photos without an expected resulting action." /></p> <h3 id="its-just-too-slow">It’s just too slow!</h3> <p>My favorite example of watching users to see how they interact with the app was with a highly-requested feature we built for iPads called “Edit Breakdown Data.”</p> <p>In our initial prototype we modeled much of the user experience around our web implementation of the feature. 
It was, more or less, a spreadsheet that let our coaches enter data about a clip they’re watching (e.g. offensive formation, play, yards gained).</p> <p>We gathered feedback on wireframes and hi-res mocks in user interviews and everything seemed to line up. We developed a beta of the feature over the course of a couple weeks and released it to 20 beta users.</p> <h3 id="we-got-in-our-users-way">We got in our users’ way</h3> <p>We pride ourselves on developing software that helps users get their job done with minimal friction. After watching people use Edit Breakdown Data, we realized we’d done the opposite. Users were entering the same information for multiple plays but were slowed down by having to scroll through a large list and select the same data each time. It was taking users 20 minutes to edit the data columns on fewer than 20 video clips. This is something that would take two or three minutes on the web. Qualitative feedback from verbal user interviews indicated the same.</p> <iframe width="420" height="315" src="" frameborder="0" allowfullscreen=""></iframe> <h3 id="if-at-first-you-dont-succeed">If at first you don’t succeed</h3> <p>With this in mind we went back to the drawing board for version two. Based on feedback from users and what we saw in AppSee, we wanted to make it easy to select data that is used repeatedly. We added a “Recent Data” column in addition to the full list of data when the user tapped on a cell to edit it. We display values chosen recently, as well as the most-used. We also sped up certain animations to create a feeling of speed. While the app didn’t process data any faster, small changes like this can improve the feel of the overall user experience.</p> <p>We’ll be releasing this update to the same beta group in a couple of weeks and going through the same methodology for qualitative and quantitative feedback. This feature will be released to millions of users shortly thereafter. 
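</p> <p>The “recent and most-used” idea is easy to sketch. The class below is hypothetical (our actual implementation lives in the iPad app, not Python), but it shows the ranking we wanted: values the coach just picked come first, followed by their all-time most-used values:</p>

```python
from collections import Counter

class RecentDataSuggestions:
    """Hypothetical sketch: surface recently chosen and most-used
    values so coaches aren't scrolling the full list every play."""

    def __init__(self, max_suggestions=5):
        self.max_suggestions = max_suggestions
        self.recents = []        # most recent first, no duplicates
        self.counts = Counter()  # all-time usage counts

    def record(self, value):
        """Call whenever a coach picks a value for a column."""
        self.counts[value] += 1
        if value in self.recents:
            self.recents.remove(value)
        self.recents.insert(0, value)

    def suggestions(self):
        """Recently chosen values first, then the overall most-used."""
        ranked = self.recents + [v for v, _ in self.counts.most_common()
                                 if v not in self.recents]
        return ranked[:self.max_suggestions]

picker = RecentDataSuggestions(max_suggestions=3)
for formation in ["Trips Right", "I-Form", "Trips Right", "Shotgun"]:
    picker.record(formation)
print(picker.suggestions())  # ['Shotgun', 'Trips Right', 'I-Form']
```

<p>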
As coaches add more data to their video, this will both save them time and give data-based insight on how to improve their game.</p> <p><img src="/assets/55c121e0c0d67134a5004b26/center/ebd_v2.jpeg" alt="Edit Breakdown Data Version 2" /></p> <p>Originally posted on <a href="">Medium</a>.</p> Jimmy Winter Using Deep Learning to Find Basketball Highlights 55720b5fc0d67171f404336e 2015-06-08T22:57:27-05:00 2015-06-05T15:00:00-05:00 <h1 id="overview">Overview</h1> <p>At Hudl, we would love to be able to watch every video uploaded to our servers and highlight the most impressive plays. Unfortunately, time constraints make this an impossible dream. There is, however, a group of people who have watched every game: the fans. Rather than polling these fans to find the best plays from every game, we decided to use their responses to identify highlight-worthy plays. </p> <p><img src="" alt="ProjectOverview" /></p> <p>More specifically, we will train classifiers that can recognize the difference between highlight-worthy (signal) and non-highlight-worthy (background) clips. The input to the classifiers will be the audio and video data and the output will be a score that represents the probability that the clip is highlight-worthy. </p> <h2 id="training-sample">Training Sample</h2> <p>To select a sample of events to train our classifier on, we created a sample of 4153 clips, each 10 seconds long, from basketball games. No more than two clips come from the same basketball game and most are from different games played by different teams. This is to prevent the classifier from overfitting for a specific audience or arena. About half of these are clips during which a successful 3-point shot occurs. The other half are a semi-random selection of footage from basketball games. </p> <p><img src="" alt="Flowchat_Mturk" /></p> <p>We used <a href="">Amazon Mechanical Turk (mTurk)</a> to separate the plays with the most cheering from those with no cheering or no successful shot. 
Each clip was sent to two or three separate Turkers. To separate highlight-worthy clips from non-highlight-worthy clips, we gave those Turkers the following instructions: </p> <p><img src="" alt="Mturk Question" /></p> <p>The distribution of average scores for the 4153 clips is shown below:</p> <p><img src="" alt="scoredistribution" /></p> <p>Clips that were unanimously scored as “3” were selected as our “cheering signal” while clips that were unanimously scored as “0” were considered to be background. This choice was made to provide maximum separation between signal and background. Moving forward, using a multi-class classifier that incorporates clips with a “1” or a “2” could improve the performance of the classifier when it is used on entire games. For the time being, however, we use a sample of 887 signal clips and 1320 background clips. </p> <h2 id="pre-processing-audio">Pre-processing Audio</h2> <p>Before sending our audio data to a deep learning algorithm, we wanted to process it to make it more intelligible than a series of raw audio amplitudes. We decided to convert our audio into an audio image that would reduce the data present in a 44,100 Hz wav file without losing the features that make it possible to distinguish cheering. To create an audio image, we went through the following steps:</p> <p><img src="" alt="Audio_FlowChat" /></p> <ol> <li>Convert the stereo audio to mono by dropping one of the two audio channels. </li> <li>Use a fast Fourier transform to convert the audio from the time-domain to the frequency-domain. </li> <li>Use 1/6 octave bands to slice the data into different frequency bins. </li> <li>Convert each frequency bin back to the time-domain. </li> <li>Create a 2D-map using the frequency bins as the Y-axis, the time as the X-axis and the amplitude as the Z-axis. 
</li> </ol> <h2 id="video-classifiers">Video Classifiers</h2> <p>Although audio cheering seems like an obvious way to identify highlights, it is possible that the video could also be used to separate highlights. We decided to train three visual classifiers using raw frames from the video. The first classifier is trained on video frames taken from the 2-second mark of each 10-second clip, the second is trained on frames taken from the 5-second mark of each clip, and the third is trained on frames taken from the 8-second mark of each clip. Shown below are representative frames from an example clip at 2, 5, and 8 seconds (left to right). </p> <p><img src="" alt="video_clips_2_5_8_seconds" /></p> <h2 id="deep-learning-framework">Deep Learning Framework</h2> <p>Because our 2D audio maps can be visualized as images, we decided to use <a href="">Metamind</a> as our deep learning engine. Metamind provides an easy-to-use Python API that lets the user train accurate image classifiers. Each classifier accepts an image as input and outputs a score that represents the probability that the prediction is correct.</p> <p><img src="" alt="MetamindFlowchat" /></p> <h1 id="results">Results</h1> <p>To train our classifiers we split our 887 signal clips and 1320 background clips into train and test samples. 85% of the clips are used to train the classifiers while 15% of the clips are reserved to test the classifiers. In total, we trained four classifiers:</p> <ol> <li>Audio Image</li> <li>Video Frame (2 seconds)</li> <li>Video Frame (5 seconds)</li> <li>Video Frame (8 seconds)</li> </ol> <h2 id="signal-background-separation">Signal Background Separation</h2> <p>To test how well each classifier fared, we consider the predictions of the classifiers on the reserved test set. Because the classifiers were not trained on these clips, overfitting cannot be causing the observed performance on the test set. 
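</p> <p>The 85/15 hold-out split is simple to sketch (an illustration only; the post doesn’t show the actual pipeline code):</p>

```python
import random

def train_test_split(clips, train_fraction=0.85, seed=42):
    """Shuffle the clips, keep 85% for training, and hold out the
    rest so classifiers are evaluated on clips they never saw."""
    rng = random.Random(seed)
    shuffled = clips[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

signal = [("sig", i) for i in range(887)]
background = [("bkg", i) for i in range(1320)]
train, test = train_test_split(signal + background)
print(len(train), len(test))  # 1875 332
```

<p>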
The predictions for signal and background for each of the four classifiers are shown in the plots below. The X-axis is the predicted probability of being signal (i.e. the output variable of the classifier) and the Y-axis is the number of clips that were predicted to have that probability. The red histogram indicates background clips and the blue histogram indicates signal clips. </p> <p><img src="" alt="montage_sigbkg_dist" /></p> <h2 id="receiver-operating-characteristic">Receiver Operating Characteristic</h2> <p>The receiver operating characteristic (ROC) curve is a graphical way to illustrate the performance of a binary classifier as the discrimination threshold is changed. In our case, the discrimination threshold is the value of the output of our classifier above which a clip is determined to be signal. We can change this value to improve our true positive rate (the fraction of signal clips we correctly classify as signal) or reduce our false positive rate (the fraction of background clips we incorrectly classify as signal). For example, by setting our threshold to 1, we would classify no clips as signal and thereby have a 0% false positive rate (at the expense of a 0% true positive rate). Alternatively, we could set our threshold to 0 and classify all clips as signal, thereby giving us a 100% true positive rate (at the expense of a 100% false positive rate). </p> <p>The ROC curve for each of the four classifiers is shown below. A single number that represents the strength of a classifier is known as the ROC area under the curve (AUC). This integral represents how well a classifier is able to differentiate between signal and background across all working points. The curves shown are the average of bootstrapped samples and the fuzzy band around each curve represents the possible ways in which the ROC curve could reasonably fluctuate. 
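</p> <p>The threshold sweep is easy to make concrete. Here is a minimal, self-contained sketch of building an ROC curve and taking its trapezoidal area (the scores below are toy values; the post’s curves came from the real classifiers):</p>

```python
def roc_curve(signal_scores, background_scores):
    """Sweep the discrimination threshold over every observed score
    and record (false positive rate, true positive rate) pairs."""
    thresholds = sorted(set(signal_scores + background_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in signal_scores) / len(signal_scores)
        fpr = sum(s >= t for s in background_scores) / len(background_scores)
        points.append((fpr, tpr))
    points.append((1.0, 1.0))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

signal_scores = [0.9, 0.8, 0.7, 0.6]        # outputs for signal clips
background_scores = [0.5, 0.45, 0.4, 0.3]   # outputs for background clips
print(auc(roc_curve(signal_scores, background_scores)))  # 1.0
```

<p>Perfectly separated scores give an AUC of 1.0, while a classifier that guesses at random hovers around 0.5.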
</p> <p><img src="" alt="montage_roc" /></p> <h1 id="combining-classifiers">Combining Classifiers</h1> <p>Because each of these classifiers provides different information, it’s possible that their combination could perform better than any single classifier alone. To combine classifiers we must train a third classifier that takes, as features, the probabilities from the original classifiers and returns a single probability. </p> <p><img src="" alt="combined classifiers" /></p> <p>To visualize the performance of these combined classifiers we make a 2D plot with each axis representing an input probability. Each test clip is plotted as a point in this 2D-space and is colored blue if signal or red if background. The prediction of the combined classifier is plotted in the background as a 2D color map. The color represents the combined classifier’s predicted probability of being signal or background. </p> <p><img src="" alt="montage_sigbkg_2d" /></p> <h2 id="combined-classifier-performance">Combined Classifier Performance</h2> <p>We create the ROC curves as before in order to evaluate the performance of these combined classifiers. As expected, the combined classifiers that include the audio classifier perform the best, and the improvement in ROC AUC from audio alone to audio plus video is 0.96 to 0.97. This is not a dramatic improvement, but it demonstrates that there are gains to be had from adding visual information. When two visual classifiers are added together, the ROC AUC increases from ~0.79 to ~0.83. This increase indicates that there is additional information to be gained from utilizing different times in the video. </p> <p><img src="" alt="montage_roc_2d" /></p> <h2 id="final-combination">Final Combination</h2> <p>A final combination of all four classifiers was performed, but this ultimate combination was no better than the pairwise combination of audio and video. 
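</p> <p>The post doesn’t show the combiner itself, but the idea can be sketched with a tiny hand-rolled logistic regression over the two input probabilities (a toy stand-in for whatever model was actually used; the data and hyperparameters below are made up):</p>

```python
import math, random

def train_combiner(features, labels, epochs=2000, lr=0.5):
    """Fit a tiny logistic regression mapping each clip's
    (audio prob, video prob) pair to one combined probability."""
    rng = random.Random(0)
    w = [rng.uniform(-0.1, 0.1) for _ in features[0]]
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1 / (1 + math.exp(-z))  # sigmoid
            err = p - y                 # gradient of the log-loss
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return lambda x: 1 / (1 + math.exp(-(b + sum(wi * xi for wi, xi in zip(w, x)))))

# Toy (audio prob, video prob) pairs: label 1 = signal clip, 0 = background.
X = [(0.9, 0.7), (0.8, 0.4), (0.85, 0.6), (0.2, 0.3), (0.1, 0.5), (0.3, 0.2)]
y = [1, 1, 1, 0, 0, 0]
combined = train_combiner(X, y)
print(combined((0.9, 0.6)), combined((0.15, 0.4)))
```

<p>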
This indicates that further improvements to our classification would need to come from tweaks to the pre-processing of the data or to the classifiers themselves rather than from simply adding additional video classifiers to the mix. </p> <h1 id="full-game-testing">Full Game Testing</h1> <p>Although we have evaluated our classifiers on test data, this testing has been performed in a very controlled setting. This is because the backgrounds we have used are not necessarily representative of the clips present across an entire game. Furthermore, our ability to separate signal from background is useless if our top predictions in a specific game are not, in fact, among the top plays in that game. </p> <p>To evaluate our classifier <em>in the wild</em> we will split four games into overlapping 10-second clips. Overlapping clips means that we make clips for 0 seconds to 10 seconds, 5 seconds to 15 seconds, 10 seconds to 20 seconds, etc… These clips are then passed through the audio classifier. Our goal in doing this is to answer the following three questions:</p> <ol> <li>What is the distribution of probabilities for clips in a whole game?</li> <li>How many of our “top picks” are highlight-worthy?</li> <li>Does our signal probability rating represent the true probability of a clip being signal?</li> </ol> <h2 id="probability-distributions">Probability Distributions</h2> <p>The probability distribution of clips for the four test games is found below. </p> <p><img src="" alt="montage_gameprobdist" /></p> <p>As is seen in the above images, the majority of clips are classified as background. </p> <h2 id="top-picks">Top Picks</h2> <p>An animated gif for the top clip from each of the four test games is shown below. In addition, the top five clips from each game and their probabilities are shown and the content of each clip is discussed. 
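</p> <p>Slicing a game into overlapping clips is a one-function sketch (times are in seconds; the 10-second length and 5-second stride come straight from the description above):</p>

```python
def overlapping_clips(game_length, clip_length=10, stride=5):
    """Yield (start, end) windows: 0-10, 5-15, 10-20, ... so a cheer
    that straddles one clip boundary lands fully inside a neighbor."""
    start = 0
    while start + clip_length <= game_length:
        yield (start, start + clip_length)
        start += stride

print(list(overlapping_clips(30)))
# [(0, 10), (5, 15), (10, 20), (15, 25), (20, 30)]
```

<p>Each window would then be scored by the audio classifier.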
</p> <h3 id="game-1">Game 1</h3> <p><img src="" alt="146111_54ea64da64a33e34b4d00110_3490000_3500000_871" /></p> <ol> <li>Score = 0.87: Made 3-pt shot.</li> <li>Score = 0.82: Made 3-pt shot.</li> <li>Score = 0.80: Made 2-pt layup.</li> <li>Score = 0.78: Made 3-pt shot.</li> <li>Score = 0.71: Missed 3-pt shot and missed 2-pt layup (crowd was pleased).</li> </ol> <h3 id="game-2">Game 2</h3> <p><img src="" alt="140350_54e7da72918f6c3a08f96870_2010000_2020000_823" /></p> <ol> <li>Score = 0.82: Made 3-pt shot.</li> <li>Score = 0.65: Made 2-pt shot.</li> <li>Score = 0.62: Made 2-pt layup. </li> <li>Score = 0.61: Made shot (shot was out of clip). </li> <li>Score = 0.58: Blocked 2-pt shot.</li> </ol> <h3 id="game-3">Game 3</h3> <p><img src="" alt="5974_54de9a684727513458fce866_425000_435000_819" /></p> <ol> <li>Score = 0.82: Made 2-pt layup.</li> <li>Score = 0.68: Blocked 2-pt shot. </li> <li>Score = 0.65: Made 3-pt shot.</li> <li>Score = 0.65: Missed free throw. </li> <li>Score = 0.61: Made 2-pt layup.</li> </ol> <h3 id="game-4">Game 4</h3> <p><img src="" alt="43172_54e7e6ccef1934113ce915a6_1380000_1390000_683" /></p> <ol> <li>Score = 0.69: Made 2-pt layup.</li> <li>Score = 0.68: Made 2-pt layup.</li> <li>Score = 0.66: Made 3-pt shot.</li> <li>Score = 0.64: Transition between two segments of video.</li> <li>Score = 0.64: 2-pt shot blocked. </li> </ol> <p>Of these, we would consider the made shots to be signal which gives us 14 signal out of 20 total clips. Additionally, the top play of each game is signal. </p> <p>To understand our expectations, we use the Poisson Binomial distribution. The mean is the sum of all 20 probabilities and the standard deviation is the square root of the sum of <code>probability*(1-probability)</code> for each of the 20 probabilities. This indicates that we should expect 14.7 +/- 2.1 signal events. Our 14 observed signal events are consistent with this expectation as seen in the distribution below. 
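</p> <p>That expectation is easy to check from the scores listed above (using the rounded values shown here gives roughly 14.0 ± 2.0; the post’s 14.7 ± 2.1 was presumably computed from the unrounded probabilities):</p>

```python
import math

# Top-five scores for each of the four test games, as listed above.
top_scores = [
    0.87, 0.82, 0.80, 0.78, 0.71,  # Game 1
    0.82, 0.65, 0.62, 0.61, 0.58,  # Game 2
    0.82, 0.68, 0.65, 0.65, 0.61,  # Game 3
    0.69, 0.68, 0.66, 0.64, 0.64,  # Game 4
]

# Poisson binomial distribution: mean = sum(p), var = sum(p * (1 - p)).
mean = sum(top_scores)
std = math.sqrt(sum(p * (1 - p) for p in top_scores))
print(round(mean, 1), "+/-", round(std, 1))
```

<p>With 14 observed signal clips sitting well within one standard deviation of the expectation, the classifier’s scores really do behave like probabilities.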
</p> <p><img src="" alt="PoissonBinomial" /></p> <h1 id="next-steps">Next Steps</h1> <p>There are many additional steps that could be taken to improve the performance of the highlight classifier, and there are a number of challenges to be solved before the classifier is practical to use on a large scale. </p> <p>Some of these are performance-related: fast processing and creation of the audio images. </p> <p>Others are more practical: we need to be able to determine which team a highlight is for so we don’t suggest that a player tag a highlight of them getting dunked on. </p> <p>Additionally, there are improvements to the classifiers themselves: these can include an increase in the size of the training sample or performing more preprocessing of the data to make signal/background discrimination easier. </p> <p>The last and perhaps most important step is the optimization of the video classifiers: right now these video classifiers provide minimal value when combined with the audio classifiers, but this value could be increased substantially if we were to standardize the location of the “cheering” within each clip. This would help us to distinguish between impressive successful shots, free throws, and plays that occur away from the basket. </p> <p>It’s an exciting time for the product team at Hudl and we’re constantly coming up with innovative new projects to tackle. If you are interested in working with us to solve the next set of problems, check out our <a href="">job postings</a>! </p> William Spearman How Our Product Team Works 552d3780c0d671524e014c01 2015-04-14T21:46:28-05:00 2015-04-14T11:00:00-05:00 <p><em><strong>Disclaimer</strong>: If you’re familiar with Spotify’s team dynamic and structure, you’ll hear a lot of borrowed terminology and concepts throughout this post, and our diagrams might look eerily similar. That’s because we’re huge fans of how Spotify does things, and we believe in a lot of the same product development values they do. 
If you’re not familiar, check it out <a href="">here</a>.</em></p> <p>One part of Hudl I frequently have to explain to people outside the company is the structure of our product team. Fellow developers at other companies, friends I graduated with, and plenty of people in between want to know how Hudl works—and as it turns out, there’s a lot to talk about. We’re constantly evolving and learning more about how to keep our heads on straight, and as we do, we want to get the lessons learned on the table.</p> <h2 id="growing-pains">Growing Pains</h2> <p>In January of 2011, a few years after being founded, Hudl’s product team was at around 20 people. We didn’t have to care about things like who was responsible for what or duplication of effort—all that mattered was that coaches and athletes loved our software and that we were able to move quickly on that software. We had a few distinct areas of development, but the responsibilities of those areas were great enough that important features like recruiting, highlights, and signups weren’t given enough attention. We were able to move fast, but it couldn’t last if we still wanted to deliver quality product while scaling up.</p> <p>Hudl’s product team has grown significantly since then. We’ve gone from the small, 20-person team we were at the beginning of 2011 to a team of around 120 today, and we’ve experienced plenty of growing pains along the way. People began to step on each other’s toes more and more often. Deploying became a process people dreaded, because they had to wait in our deploy queue for hours. Nobody had a clear idea of who was supposed to work on what. 
Eventually, these pains became unmanageable and we set out to correct them with a new team structure, keeping the following questions in mind:</p> <ol> <li>How do we ensure our team’s structure scales with our rapid growth?</li> <li>How do we keep bureaucracy at a minimum?</li> <li>How do we keep ourselves <em>fast</em> (planning, developing, testing, and releasing included)?</li> <li>How do we afford all our features and products the attention and development time they deserve?</li> </ol> <p>We transitioned to a structure that was a take on Spotify’s around two years ago, keeping autonomy and speed in mind, and we think that it’s afforded us both of those traits with no end in sight.</p> <h2 id="squads">Squads</h2> <p><img src="/assets/552d3abed4c96156ac0140b1/squads.png" alt="Squads" /></p> <p>The most atomic unit of our product team is the squad - a cross-functional, autonomous team of five to eight. Most squads focus their attention and development time on one of the many aspects of Hudl’s product, planning new features and releases at their own pace. The location of the squad doesn’t matter—we have squads with people from four different remote locations, while others work entirely in one office. Instead of requiring them to be in one place, we use tools like Slack and Google Hangouts to keep lines of communication open between team members. We also don’t designate a product owner (like Spotify does) because we think the entire squad should act as the owner of their features and participate in user interviews, development of success metrics, and other responsibilities you might associate with that role. Each product manager is a blend of multiple roles in the agile methodology, such as a product owner and agile coach—but only to the degree that the squad needs those roles.</p> <p>However, one of the concerns with having so many small squads working independently is duplication of labor. 
It’s easy for two squads to write similar code, and do things in similar ways, but we want to avoid a constant reinventing of the wheel. To remedy this problem, we also introduced some foundational squads whose focus is to work on internal, infrastructural projects that allow other parts of the team to use common tools to move quickly and efficiently.</p> <p>Just like Spotify, we try to avoid the concept of “ownership” when it comes to a squad’s responsibilities. You can think of it as an open-source project—each one has a set of core maintainers, and external contributors that have their work approved by those maintainers. We generally center each squad around an individual microservice with its code available to everyone on the product team, which we’ve explained in a bit more detail in a <a href="">separate post</a>. This model keeps squads from blocking each other; they’re generally responsible for adding features to the app they maintain, but if they don’t have the time just yet, or you’d like to see it introduced sooner, you’re free to add the feature yourself and ask them to look it over.</p> <p>Our architecture and deployment strategy allows squads to release on their own schedule with minimal impact on others. Some squads will do two-week sprints, others will do one (a few even roll with Kanban instead of sprints!)—and most of the time, releases occur during these sprints, instead of on a schedule. By the latest averages, we do roughly <strong>90 releases a week</strong>, with many more releases per week being perfectly within reach.</p> <p>Squads do things in whatever way works best for them. We encourage agile mantras around shipping and iterating on features quickly, but don’t mandate any specific practices. Nothing is sacred. If you want to try a new way of issue tracking, go for it - some squads use pure JIRA, others pure GitHub Issues, and still others use a combination. If story pointing seems appropriate for your iterations, use it. 
If it doesn’t, don’t! When a tool works well for a squad, we encourage them to tell the rest of the team through guilds and other updates (which will be touched on later). Other squads that take interest in one of those new tools will generally adopt it, and eventually it may become standard throughout our team.</p> <h2 id="tribes">Tribes</h2> <p><img src="/assets/552d3abeedb2f303e60158b2/tribe.png" alt="Tribe" /></p> <p>We’ve got around 25 squads and that number is only increasing. At some point, adding squads without any sort of common goals to rally around becomes unsustainable. That’s where tribes come in.</p> <p>Tribes at Hudl align closely with distinct sections of the product: “Team Sports”, “Individual Performance”, “Community”, and “Foundation”, and can be mapped pretty closely to an organization within a company. Each tribe has a particular goal and contains a number of squads that work with that goal in mind. For instance, the Team Sports tribe works to bring value to the different team sports across the globe with squads like Football, Basketball, and Soccer delivering features to give value to each of those sports. The Foundation tribe focuses on enabling people throughout the company to work more efficiently, with squads like Infrastructure, Dev Tools, and Platform providing different tools and frameworks to do just that. These tribes can scale to many, many squads, and ensure that no matter how granular we decide to be on each one’s responsibilities, they can all still work toward common goals.</p> <p>Like squads, tribes operate fairly autonomously, but their goals are closely tied to our overall company goals. We have a product director and general manager for each to make sure the goals for product and business development within the tribe are rock solid. Each tribe can also have business development, marketing, sales, and support chapters to work specifically on its part of the product—something we’ve deviated from Spotify on. 
We feel that, since each tribe at Hudl is so analogous to a business unit, it makes sense to include business roles in each one to make communication between technical and non-technical roles as simple as possible.</p> <h2 id="chapters">Chapters</h2> <p><img src="/assets/552d3abec0d671525e014c8d/chapter.png" alt="Chapter" /></p> <p>Within a tribe (and sometimes across tribes), we find it important to keep people in the same role communicating with each other outside the context of their own squad. We use chapters to give people that line of communication, and let each chapter define how that line should operate, keeping things as unbureaucratic as possible while still delivering value to everyone involved. For instance, quality analysts meet once every week or two, while developers meet on more of an as-requested basis.</p> <p>The content of meetings varies as well: some primarily discuss risks and ways to improve the chapter, while others share knowledge by talking about a new technology being used on a squad or a development practice we want to start trying. The main goal of a chapter discussion is to keep the chapter learning and improving, whether through retrospectives, tech talks, or discussions about upcoming features and their risks. It’s also a great place for members of the team to voice concerns with role-specific processes and challenge our process as it exists today.</p> <h2 id="guilds">Guilds</h2> <p><img src="/assets/552d3abeedb2f303dc015f1c/guild.png" alt="Guild" /></p> <p>Last, but certainly not least, we needed to account for interests that span roles, squads, and tribes. Guilds provide a way for people passionate about a certain aspect of their work to get together, discuss new developments in the field, and bounce ideas off each other.
People can attend public speaking guild to hone their craft at conference presentations, sit in on security guild to learn more about how we can improve our password security, or watch a talk in mobile guild about how to write great unit tests in Objective-C.</p> <p>A guild can be formed by <em>anyone</em>, at <em>any time</em>. All you need to form a guild is enough interested people—most of the time, “if you build it, they will come” rings pretty true and you’ll find people passionate about the same thing you are. That’s how the current ones came to be, and it’s likely how future guilds will spin up, too.</p> <h2 id="its-not-all-perfect">It’s Not All Perfect</h2> <p>We like to think our structure solves the majority of our scaling problems, but it has its shortcomings. A few of the most recent problems we’re trying to solve are:</p> <ol> <li>Making sure information and expertise isn’t siloed between squads</li> <li>Making sure squads are confident enough to develop features in other squads’ apps</li> <li>Making sure our product team structure is completely cohesive with other teams within the company, like support and sales</li> </ol> <p>These problems are, of course, difficult. They rely heavily on squads and team members themselves to communicate openly, challenge themselves by staying uncomfortable, and share knowledge with other parts of the company when it’s needed. As easy as that sounds, it can be even easier to stay in the comfort of your own expertise and keep that expertise to yourself.</p> <p>We’re confident we can solve these problems with our current structure, and we’ll be excited to report back when we do. Until then, feel free to ask us questions, suggest new approaches, and challenge us on our model—we’re always open to new ideas! If this kind of team sounds fun to you, we’d love for you to <a href="">join us</a>.</p> <p><em>Edit: This article originally mentioned having eight people on Hudl’s product team at the start of 2011. 
It’s since been revised with a more accurate headcount.</em></p> Jordan Degner Exploring the Skunkworks Genius 2015-03-18T15:00:00-05:00 <p>Ever heard of the <a href="">SR-71 Blackbird</a>? Since 1976, it has held the world record as the fastest air-breathing manned aircraft.</p> <p><img src="" alt="Skunk Works SR-71" /></p> <p>What makes it so interesting is that it was developed in the first ever <a href="">Skunk Works</a> division at Lockheed. What’s more, these Skunk Works projects were developed before funding was even provided! The idea of letting creativity drive progress without restrictions is what we have set out to replicate at Hudl.</p> <p>Several times a year we schedule a Skunk Works event. These are times when everyone on our product team is encouraged to create something new, investigate new technologies and tools, and wow us with their creativity as they pursue a project they’re passionate about. Some choose Hudl-focused projects, but many scratch a more personal itch. Our only requirement is that you create something. From there, it is entirely up to the team to decide the direction, creation, and potential adoption of their project. We encourage diverse, cross-discipline teams. Teams form naturally as people post their ideas and recruit others to work alongside them.
</p> <p>As a company, we have been doing Skunk Works for four years, and some of our production features have actually grown out of Skunk Works projects.</p> <h2 id="best-in-show">Best in Show</h2> <p>We had great results at our most recent event and want to showcase our top projects!</p> <p><a href=";index=1&amp;list=PLGPJjOu_Ky1cdZYEFVRUA0PLaBNArVT1i">Candygram</a></p> <p><a href=";list=PLGPJjOu_Ky1cdZYEFVRUA0PLaBNArVT1i&amp;index=2">Vidja Goodness</a></p> <p><a href=";list=PLGPJjOu_Ky1cdZYEFVRUA0PLaBNArVT1i&amp;index=4">Plan The Week</a></p> <p><a href=";index=5&amp;list=PLGPJjOu_Ky1cdZYEFVRUA0PLaBNArVT1i">Sparta</a></p> <h2 id="closing-thoughts">Closing Thoughts</h2> <p>The benefits of allowing our team to explore their creative genius are too good to pass up, and we get some pretty cool features and tools out of the process. We would encourage other companies and developers to give it a try. In the meantime, be sure to keep checking our blog for future Skunk Works projects!</p> Casey Bateman Deploying in the Multiverse 2014-12-03T11:00:00-06:00 <p><em>This is the second in a two-part series about deploying code at Hudl. <a href="">Part 1</a> was about deployment of our primary application, “The Monolith”. This part covers some of the lessons we’ve learned and how we’ve applied them to our multiple-application framework, “The Multiverse”.</em></p> <p>At Hudl, we like to move quickly. We are constantly fixing issues, building new features, and improving the experience for our coaches and athletes. So we put a lot of thought into how we work and dedicate a lot of time to making sure we are working as efficiently as we can. Our product team is broken up into cross-functional squads consisting of one or two developers, a product manager, a quality analyst, and a designer. Each squad independently plans, prioritizes, develops, and ships features.
We modeled this structure after <a href="">Spotify’s product team (pdf)</a> and adapted it to fit our culture. This structure has enabled us to improve our product and respond to user feedback very quickly.</p> <h2 id="life-before-the-multiverse">Life before the Multiverse</h2> <p>Up until January of 2014, all development took place in one repository, which is now lovingly referred to as “<a href="">The Monolith</a>”. This worked fine for us when we had a dozen or so squads. However, about a year and a half ago, we began to hit some bottlenecks.</p> <p>At that point, the Monolith codebase was growing to the point of being unmanageable. Currently, it consists of over 50K commits and takes up more than 4GB of space on disk. Each deploy takes about 30 minutes if everything goes smoothly. When a squad wants to deploy, they have to wait in the deploy queue, which in some cases means their code won’t be pushed to prod for another few hours, or maybe even the next day. This kind of bottleneck is unacceptable and will only get worse as we grow our team. To enable our squads to continue to move quickly, we decided that we needed a change.</p> <h2 id="enter-the-multiverse---a-40000-foot-view">Enter the Multiverse - a 40,000 foot view</h2> <h3 id="the-code">The Code</h3> <p>The Multiverse is Hudl’s microservices framework. While a lot of development still happens in the Monolith, we have begun to split our codebase into smaller services. Each service has its own repo and can be built, tested, and deployed completely independently of the other services. Keeping with <a href="">Conway’s Law</a>, each squad typically has its own service that it is responsible for, though any person can commit code to any repo.</p> <p>It is worth noting that our microservices might more aptly be called microapplications. I say this because each service is capable of serving up and handling page requests alongside its inter-service communication.
So any service can be the entry point for a web request. For example, if a user navigates to <code></code>, that request will get routed to the basketball service. The basketball service may read from its database and serve the resulting HTML, or it may call other services to get the data needed for the request.</p> <h3 id="communication">Communication</h3> <p>Each Multiverse service exposes a client that other services can use to access its data and functionality. Services discover and locate each other through <a href="">Eureka</a>, an open-source service registry made by Netflix. All inter-service communication is wrapped in <a href="">Mjolnir</a> commands, which help isolate service failures when they inevitably happen.</p> <h2 id="deployment-process">Deployment Process</h2> <p>Deployments in the Multiverse require coordination between many moving parts. The program responsible for all of that coordination is called Alyx3 (the third in our line of deployment-coordinating programs). From here on, I’ll refer to Alyx3 simply as Alyx. Alyx’s job is to make sure a branch gets from its first commit to serving its first production request safely and efficiently. To do this, it coordinates with Teamcity, Eureka, Github, Amazon SNS, Route53, and Outpost (more on that one later).</p> <h3 id="branch-lifecycle">Branch Lifecycle</h3> <p>Branches first appear in Alyx when they are pushed up to Github; Alyx learns about each new branch via a Github webhook. After that point, the developer can freely add commits and deploy the branch to our testing or staging environment. As branches are developed and tested, Alyx shows their progress.</p> <p><img src="" alt="" /></p> <p>When a branch is merged into master, Alyx knows it is ready to deploy to production.
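</p> <p>The lifecycle described here reads naturally as a small state machine. Below is a minimal Python sketch of that idea; the state and event names are our own illustration of the flow, not Alyx’s actual implementation, and Alyx’s real states may well differ.</p>

```python
# Hypothetical states and events for a branch moving through an Alyx-like
# coordinator. The transition table is illustrative, not Alyx's real one.
TRANSITIONS = {
    ("pushed", "deploy_test"): "testing",   # webhook saw the branch; squad tests it
    ("pushed", "merge"): "merged",
    ("testing", "merge"): "merged",         # merged into master: ready for prod
    ("merged", "deploy_prod"): "deploying",
    ("deploying", "succeed"): "archived",   # kept around so we can roll back later
    ("deploying", "fail"): "merged",        # the deploy can be retried
}

def step(state, event):
    """Advance a branch one transition; reject anything not in the table."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"cannot {event!r} from state {state!r}")

# A branch going all the way to production:
state = "pushed"
for event in ("deploy_test", "merge", "deploy_prod", "succeed"):
    state = step(state, event)
print(state)  # archived
```

<p>The explicit transition table is the useful part: it is what stops a branch from, say, being deployed to production before it has been merged.</p> <p>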
The actual deploy begins when a squad member, typically the Quality Analyst, hits the “Deploy” button in Alyx.</p> <p><img src="" alt="" /></p> <p>If the deploy succeeds, the branch is archived so we can refer to it later and roll back if needed. This graph shows the entire lifecycle of a branch as it moves through Alyx:</p> <p><img src="" alt="" /></p> <h3 id="deploy-lifecycle">Deploy Lifecycle</h3> <p><em>Here is a somewhat simplified picture of what the process might look like for a typical deploy:</em></p> <p><img src="" alt="" /></p> <p>When the deploy is kicked off, Alyx checks with Github to make sure it knows what the latest commit on this branch is (which at this point is usually a merge into master). Once that is confirmed, Alyx asks Teamcity, “Do you have a build of branch ‘master’ at commit ‘efd32de’? If not, kick one off and let me know when it’s done.”</p> <p>Once Alyx knows where the built payload can be downloaded from, it needs to figure out where the new bits need to be deployed to. If it’s being deployed to a test environment, Alyx will find a test machine that isn’t being used yet. If the branch is being deployed to production or stage, it will be going to all of the production/stage instances in the service’s cluster.</p> <p>After figuring out what instances the deploy will target, Alyx sends a message to those machines through <a href="">Amazon SNS</a>. The message is something along the lines of “All prod servers in the basketball cluster, please deploy the payload from this URL. Register with Eureka after and let me know when you’re done.”</p> <p>The program that reads and interprets these messages is called Outpost. Outpost is a very simple .NET app that runs on all our application servers. It has the sole purpose of receiving/interpreting SNS messages and then running the appropriate deploy scripts. 
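</p> <p>To make that receive/interpret/run loop concrete, here is a hedged Python sketch of what an Outpost-like handler does. The real Outpost is a .NET app, and the message fields shown here are hypothetical, loosely modeled on the message quoted above:</p>

```python
import json

def handle_sns_message(raw, my_cluster, my_env, run_deploy_script):
    """Decode a deploy message, ignore it if it targets other servers,
    otherwise run the deploy script and report the outcome."""
    msg = json.loads(raw)
    if msg["cluster"] != my_cluster or msg["env"] != my_env:
        return "ignored"                       # meant for some other set of servers
    try:
        run_deploy_script(msg["payload_url"])  # download, unzip, start, warm up
        return "deployed"                      # reported back to Alyx over SNS
    except Exception as exc:
        return f"failed: {exc}"                # surface the error, don't swallow it

# A hypothetical message aimed at the prod basketball servers:
raw = json.dumps({"cluster": "basketball", "env": "prod", "payload_url": "..."})
print(handle_sns_message(raw, "basketball", "prod", lambda url: None))  # deployed
```

<p>Everything stateful (what to deploy, where, and when) stays with the coordinator; the on-box agent only reacts to messages addressed to it.</p> <p>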
While it is functionally very simple, Outpost is flexible enough that it is even capable of deploying new versions of itself.</p> <p>The deployment scripts are PowerShell scripts that download the specified payload, unzip it, create/start the app in IIS, warm up the app, and instruct the app to register with Eureka. This entire process usually takes between one and two minutes. Once the script is complete, the app is ready to start taking traffic!</p> <p>Meanwhile, Alyx sits and waits for the deploy to complete. It constantly monitors Eureka to make sure the necessary servers get registered within a certain time frame. Additionally, each server regularly publishes its status to an SNS topic that Alyx is subscribed to. This lets Alyx give granular status updates to whoever is waiting for the deploy to complete.</p> <h2 id="results">Results</h2> <p>We are still actively improving and developing our deployment system, but our preliminary results have been very promising. Here’s a snapshot from one of our dashboards that monitors the number of deploys we do:</p> <p><img src="" alt="" /></p> <p>That just depicts Multiverse deploys. Because deploy times are shorter and deploys can happen in parallel (each service can deploy independently), squads are able to deploy much more frequently.</p> <h3 id="sns-takeaways">SNS Takeaways</h3> <p>We use SNS fairly extensively in this process. One of the key takeaways we’ve learned is to never expect messages to arrive in order or within a certain time frame. Usually messages arrive within a few seconds, but their ordering can be arbitrary if they were sent shortly after one another.</p> <p>Another takeaway for working with SNS is to split your communication up into multiple topics when possible. This makes it easier to ignore irrelevant messages and reduces the amount of application logic you need to write just to tell different types of messages apart.
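</p> <p>The ordering takeaway can be handled mechanically: stamp every status message with a sender-side sequence number and drop anything older than the latest one already seen. This is a generic illustration in Python, not necessarily what Alyx does internally:</p>

```python
def make_status_tracker():
    """Track per-server status updates that may arrive out of order."""
    latest = {}  # server -> (seq, status)

    def update(server, seq, status):
        if server in latest and latest[server][0] >= seq:
            return latest[server][1]      # stale or duplicate message: drop it
        latest[server] = (seq, status)
        return status

    return update

update = make_status_tracker()
update("web-1", 1, "downloading payload")
update("web-1", 3, "serving traffic")
print(update("web-1", 2, "warming up"))  # serving traffic  (seq 2 arrived late)
```

<p>A sender-side counter is more reliable here than wall-clock timestamps, since clocks on different servers can drift.</p> <p>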
We are currently using a single SNS topic for all messages that Alyx receives, but would ideally split it into a different topic for each message type.</p> <h3 id="errors-are-normal">Errors are normal</h3> <p>When deploying new code many times a day, it is inevitable that errors will happen. So it is important to handle errors well and not treat them as an exceptional case. With a program like Alyx, errors can occur in many different ways: messages from Github may not get delivered, builds could get backed up in Teamcity, servers could unexpectedly stop their heartbeat to Eureka, and so on. One thing we have learned about dealing with these failures is that transparency is key. The most important part of handling a failure is communicating it to your users. It is bad when errors happen, but it is even worse when they go unnoticed.</p> <p>Our deployment system does a fairly good job of recovering when issues arise. Retry logic and fallback scenarios are set up for the pieces that are known to fail from time to time. However, when automatic recovery fails, we try to get as much information to the users as possible. For example, if a deploy script spits out an error message, Outpost sends that message to Alyx so that it can be shown to the person who kicked off the deploy.</p> <h2 id="looking-forward">Looking Forward</h2> <p>Alyx (and our whole Multiverse deployment system) is still a work in progress. We are constantly working to improve the experience and performance for our squads. Thus far it has been a huge success for our Product Team and we’re excited for the next steps. If you are interested in working with us to solve the next set of hard problems, you should visit <a href="">our jobs page</a>. We’d love to hear from you.</p> Josh Cox