Octopart dev/ops pre-AWS, at an undisclosed co-location facility.
Octopart’s goal of helping engineers quickly find component data requires innovative, robust solutions to hard technical problems. Luckily, we all love tackling these sorts of problems. As our database has grown from 1 million to 30 million parts and our users have become more sensitive to data errors and daily pricing changes, our infrastructure has evolved to handle large amounts of data robustly and quickly. The bulk of our data processing is on the product offers attached to these parts.

The life of a product offer begins offsite, on the servers of the distributors we partner with. Every day we receive and pull client feeds via email, FTP, and web submissions; on an average day these feeds contain a total of 20 million product offers. We have defined our own feed format in the lingua franca of the distributor world, CSV, but we occasionally need to massage submitted feeds into our format. Ensuring that every feed is in our format allows us to optimize our feed processing code. Each feed, and every offer in it, is subject to a variety of sanity checks to make sure we aren’t receiving corrupt data.
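To make the sanity checks concrete, here is a minimal sketch in Python of the kind of per-row validation a feed row might go through. The column names and checks are assumptions for illustration only, not our actual feed format or validation code.

```python
import csv
from decimal import Decimal, InvalidOperation

# Hypothetical column names, for illustration only -- not our actual feed format.
REQUIRED_FIELDS = ("mpn", "manufacturer", "quantity", "price", "currency")

def validate_offer(row):
    """Return a list of problems with one raw offer row (empty if it looks sane)."""
    problems = [f"missing {field}" for field in REQUIRED_FIELDS if not (row.get(field) or "").strip()]
    try:
        if Decimal(row.get("price") or "") <= 0:
            problems.append("non-positive price")
    except InvalidOperation:
        problems.append("unparseable price")
    if not (row.get("quantity") or "").isdigit():
        problems.append("non-integer quantity")
    return problems

def read_feed(path):
    """Yield only the offer rows that pass the basic sanity checks."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if not validate_offer(row):
                yield row
```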
Our feed processing scripts run in parallel to maximize throughput -- this also allows us to work through gluts of data simply by running more processes. At the core of the processing script is a matching function that matches raw product offers to parts in our database. Different data sources refer to brand names, attribute names, and categories using their own lingo, so we leverage a combination of manual techniques and machine intelligence to match these variants to canonical names. Normalization is at the heart of product comparison, whether that means comparing offers from different distributors for the same product or comparing technical attributes across different products. Once a product offer has been normalized, it is inserted into our core database.
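As a rough illustration of what normalization looks like, here is a small Python sketch that maps a distributor’s brand spelling to a canonical name using a hand-maintained alias table with a fuzzy-matching fallback. The brand names, aliases, and cutoff are invented for the example and are not our actual matching code.

```python
import difflib

# Hypothetical canonical names and aliases, for illustration only.
CANONICAL_BRANDS = ["Texas Instruments", "STMicroelectronics", "Murata Manufacturing"]
BRAND_ALIASES = {
    "ti": "Texas Instruments",
    "texas instruments inc.": "Texas Instruments",
    "st micro": "STMicroelectronics",
}

def normalize_brand(raw_name, cutoff=0.85):
    """Map a raw brand string to a canonical name, or None if there is no confident match."""
    key = raw_name.strip().lower()
    if key in BRAND_ALIASES:                      # manual overrides win
        return BRAND_ALIASES[key]
    # Fuzzy fallback: accept the closest canonical name only above the cutoff.
    matches = difflib.get_close_matches(raw_name.strip(), CANONICAL_BRANDS, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```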
We keep our core database fully relational to optimize for write throughput and guard against data corruption. However, the joins necessary to build a product and its associated metadata are too slow to serve a responsive frontend. To solve this problem we built out a separate product search infrastructure, ThriftDB; the initial version of Hacker News Search used this infrastructure.
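To give a feel for why this gets slow, here is a simplified, hypothetical version of the kind of multi-way join needed to flatten one product with its offers. The schema is invented for the example and is far simpler than our real one.

```python
import sqlite3

# Invented, simplified schema for illustration; the real core database has many more tables.
DENORMALIZE_ONE_PRODUCT = """
SELECT p.id, p.mpn, b.name AS brand,
       d.name AS distributor, o.stock, o.price
FROM   parts p
JOIN   brands b       ON b.id = p.brand_id
JOIN   offers o       ON o.part_id = p.id
JOIN   distributors d ON d.id = o.distributor_id
WHERE  p.id = ?
"""

def fetch_product_rows(conn: sqlite3.Connection, part_id: int):
    """Run the joins that flatten one product and its offers into rows."""
    return conn.execute(DENORMALIZE_ONE_PRODUCT, (part_id,)).fetchall()
```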
ThriftDB is a key/value datastore with search built in. It unifies the schema for search and data retrieval while optimizing response times for both. To bridge the gap between our core database and our ThriftDB database, we maintain a queue of products updated by processed feeds. Each update is handled by a script that builds an entire ‘denormalized’ product from the core database using the ThriftDB schema and sends it to ThriftDB. We again leverage parallelism to maximize throughput. Once a product update hits ThriftDB, it is accessible to our frontend via ThriftDB’s search API -- at this point the product offer is distributed to the world via our website and API.
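A rough sketch of the kind of parallel update worker involved might look like the following. The queue usage, document shape, endpoint URL, and HTTP interface are all assumptions made for illustration, not ThriftDB’s actual API or our production code.

```python
import json
import multiprocessing
import urllib.request

THRIFTDB_ENDPOINT = "http://thriftdb.internal/parts"   # hypothetical URL

def build_document(product_id):
    """Stand-in for the joins sketched earlier: assemble one denormalized product as a dict."""
    return {"id": product_id, "brand": "...", "offers": []}   # illustrative shape only

def push_to_search(document):
    """Send one denormalized product to the search store (assumed HTTP PUT interface)."""
    req = urllib.request.Request(
        f"{THRIFTDB_ENDPOINT}/{document['id']}",
        data=json.dumps(document).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    urllib.request.urlopen(req)

def worker(queue):
    """Drain updated product IDs from the queue, denormalize each, and index it."""
    for product_id in iter(queue.get, None):    # None is the shutdown sentinel
        push_to_search(build_document(product_id))

if __name__ == "__main__":
    updates = multiprocessing.Queue()
    # The feed processing scripts would put updated product IDs here, then one None per worker.
    workers = [multiprocessing.Process(target=worker, args=(updates,)) for _ in range(8)]
    for w in workers:
        w.start()
```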
By splitting our data processing infrastructure into dedicated components with orthogonal responsibilities, we are able to isolate problems and fix them without bringing down the entire site. Each of these components could be the subject of a much more detailed blog post and requires a lot of love, attention, and imagination -- if any of them sound like something you’d like to work on, email us!