How We Used Machine Learning to Categorize Parts

Created: March 22, 2018
Updated: July 1, 2024

At Octopart, we believe that electronic component information should be easily available to help engineers bring ideas to life. To achieve this, we are constantly working to improve both the quality and the quantity of our data. In the last 18 months, we have released a completely new category taxonomy, added key specifications for these categories, and launched a recommendation engine for finding similar parts. Recently, we have been working hard to increase the number of categorized parts on Octopart, so that you can find parts faster and discover more parts. We are excited to announce that we are launching a machine-learning-based system that categorizes more parts using their descriptions, which will expand the number of categorized parts on Octopart. In this blog, we will talk about how we built this system and how it will impact Octopart users.

Let us first understand how electronic parts have been categorized at Octopart. Let’s use Passive Components as an example.


*Three level deep category taxonomy tree at Octopart*

As can be seen above, the Octopart taxonomy has three levels. The L1 (Level 1) category is the topmost level of the tree, made up of Passive Components, Discrete Semiconductors, Integrated Circuits (ICs), Sensors, Power Products, and so on. The L2 and L3 categories are the deeper, more specific parts of the taxonomy tree. Given a tree like this, we can formulate the problem as hierarchical multi-class classification.

We wanted to use part descriptions to classify parts into the right category. For example, given the description “32 Bit MCU 64KB Flash 8KB RAM 25MHz MCCP/SCCP CLC”, we want to classify the part as Integrated Circuits (ICs) > Embedded Processors and Controllers > Microcontrollers. Or given “Circ. HD Recept,Socket contacts,Jam Nut”, we want to classify it as Connectors > Circular Connectors.

Descriptions are quite challenging to process, both because of their short length and because of their abbreviations and word structure. There are two approaches to classifying parts from their descriptions:

1) Rules-based approach: In this approach, we use our electrical engineering knowledge to encode rules. If the word “conn” or “connector” appears, it’s a Connector; within Connectors, if “cylindrical” appears, it’s a Circular Connector; if “DIN 41612” appears, it’s a Backplane Connector; and so on. However, with this approach we need to write every rule by hand, and it’s hard to scale if categories change. Also, there might be parts that are actually Cables but have the word “connector” in their descriptions; how do we handle that?

2) Machine learning based approach: In this approach, we train a computer to learn the rules instead. Machine learning models are good at learning rules given enough training data. For example, given 100k examples of connector descriptions, a model can learn all the abbreviations and package types expected for a connector. Such an approach is easy to scale.

We tried the rules-based approach first but quickly realized that it is unlikely to scale. Given the complexity and variations in part data, we decided to use the machine learning based approach to classify parts using their descriptions. Let’s go through the steps of how we used the machine learning based approach:

1) Preprocessing of part descriptions: We hand-coded a few regular expressions to clean up the description data. We lowercased the text, removed punctuation, and converted some units to their full form. We also concatenated all the descriptions into a single string and split the result into tokens that could be used to train the classifier. Next, we started evaluating different models.
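A minimal sketch of this kind of cleanup step (the regexes and expansion table here are illustrative stand-ins, not our production rules):

```python
import re

def preprocess(description):
    """Clean one part description into tokens: lowercase, strip
    punctuation, and expand a few abbreviations/units.
    A hypothetical subset of the real rules."""
    text = description.lower()                 # drop all capitalization
    text = re.sub(r"[^\w]+", " ", text)        # punctuation -> spaces
    expansions = {"circ": "circular", "recept": "receptacle"}
    return [expansions.get(tok, tok) for tok in text.split()]

print(preprocess("Circ. HD Recept,Socket contacts,Jam Nut"))
```

The output of this step is a flat list of tokens, which is what a bag-of-words classifier consumes.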

2) Learning curves for model selection: A learning curve shows how a model's performance (y-axis) improves with greater experience, i.e., more training examples (x-axis). To evaluate different models, we used learning curves to estimate each model's bias and variance and to check whether we had enough training examples. If the model is too simplistic, then no matter the number of training examples, we will still get errors. But if the model is expressive and still learning, then more training examples will help. This analysis allowed us to choose the right model.
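The idea can be sketched with synthetic data standing in for our real description features; here an ordinary least-squares fit is used as a simple linear classifier just to keep the example self-contained (this is not our actual model):

```python
import numpy as np

# Synthetic, linearly separable data as a stand-in for real features.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
y = (X @ rng.normal(size=10) > 0).astype(float)
X_tr, y_tr, X_val, y_val = X[:500], y[:500], X[500:], y[500:]

def accuracy(w, Xs, ys):
    return float(np.mean(((Xs @ w) > 0) == ys.astype(bool)))

curve = []
for n in (25, 50, 100, 200, 500):              # growing training sets
    w, *_ = np.linalg.lstsq(X_tr[:n], y_tr[:n] - 0.5, rcond=None)
    curve.append((n, accuracy(w, X_tr[:n], y_tr[:n]), accuracy(w, X_val, y_val)))

for n, train_acc, val_acc in curve:
    print(n, round(train_acc, 3), round(val_acc, 3))
```

A persistent gap between the training and validation columns suggests high variance (more data helps); two low, converged scores suggest high bias (a richer model is needed).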

*Learning curves for the models we evaluated*

We chose logistic regression trained with stochastic gradient descent (SGD) for a number of reasons. First, linear models work well in situations like this one, where there are far more features (total words) than training examples. Second, logistic regression gives a probabilistic estimate for each classification, so we know whether the classifier is 99% confident or only 10% confident. Finally, SGD is a technique commonly used with large training sets.
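A toy sketch of this kind of model, written from scratch in numpy: logistic regression trained with SGD on bag-of-words counts. The descriptions and labels here are made up for illustration, and the production system is trained on far more data:

```python
import numpy as np

# Hypothetical toy training set: 0 = Connectors, 1 = ICs.
docs = [
    "conn circular socket jam nut",
    "din 41612 backplane conn",
    "32 bit mcu 64kb flash 8kb ram",
    "8 bit mcu 2kb flash",
]
labels = np.array([0.0, 0.0, 1.0, 1.0])

vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[d.split().count(w) for w in vocab] for d in docs], float)

rng = np.random.default_rng(0)
w, b = np.zeros(len(vocab)), 0.0
for _ in range(200):                            # epochs
    for i in rng.permutation(len(docs)):        # SGD: one example at a time
        p = 1.0 / (1.0 + np.exp(-(X[i] @ w + b)))   # sigmoid
        w -= 0.1 * (p - labels[i]) * X[i]       # gradient of the log loss
        b -= 0.1 * (p - labels[i])

def confidence(text):
    """P(part is an IC) for a new description."""
    x = np.array([text.split().count(v) for v in vocab], float)
    return float(1.0 / (1.0 + np.exp(-(x @ w + b))))

print(confidence("16 bit mcu 128kb flash"))
```

The probabilistic output is exactly what makes the confidence-threshold step later on possible.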

3) Model evaluation using confusion matrices, precision, recall, and F-scores: In classification, accuracy is not a sufficient metric to judge the quality of a classifier, especially when there is class skew (one class has many more examples than another). At Octopart, we have 6.9 million connectors but only 100k sensors, and we want to make sure we classify both well. We used precision and recall, calculated from confusion matrices, for a more accurate evaluation of the model. Below is a visual explanation of precision and recall which we found quite useful:


*Precision and recall overview [[source](https://en.wikipedia.org/wiki/Precision_and_recall)]*
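The class-skew problem can be made concrete with a tiny hypothetical example: a degenerate classifier that labels everything with the majority class looks great on accuracy but has zero recall on the minority class:

```python
# Hypothetical skewed data: 95 connectors, 5 sensors, and a classifier
# that predicts "Connectors" for everything.
y_true = ["Connectors"] * 95 + ["Sensors"] * 5
y_pred = ["Connectors"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(t == p == "Sensors" for t, p in zip(y_true, y_pred))
n_pred = sum(p == "Sensors" for p in y_pred)
n_true = sum(t == "Sensors" for t in y_true)
precision = tp / n_pred if n_pred else 0.0   # of predicted Sensors, how many are right
recall = tp / n_true if n_true else 0.0      # of actual Sensors, how many we found

print(accuracy, precision, recall)           # 0.95 accuracy, yet 0.0 Sensors recall
```

This is why we evaluated per-category precision and recall rather than overall accuracy.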

Precision and recall are usually combined into an F-score; the F1 score weighs them equally. For this classification task, we wanted to weigh precision more heavily than recall, so we used beta = 0.25 when calculating the F-score from the following formula:

F_β = (1 + β²) · precision · recall / (β² · precision + recall)

*F-score that weighs precision and recall differently*
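The F-beta score is straightforward to compute directly; with beta = 0.25, a precision-heavy classifier scores higher than a recall-heavy one:

```python
def f_beta(precision, recall, beta=0.25):
    """F-score weighting precision vs. recall; beta < 1 favors precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# swapping precision and recall changes the score: with beta = 0.25,
# the high-precision case wins
print(f_beta(0.9, 0.5), f_beta(0.5, 0.9))
```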

4) Prediction using confidence thresholds that maximize F-scores: Logistic regression gives a confidence estimate for every prediction, but how do we decide which predictions to accept? Is 70% confidence good enough? Is 50%? A higher threshold gives more accurate predictions, but fewer parts get classified, so we need a balance. For every category, we ran a script that found the threshold maximizing the F-beta score, and we rejected every prediction below that threshold. This method allowed us to increase the quality of the outputs even more. In the graph below, we can see how increasing the threshold increases precision but decreases recall; a good balance is at a threshold of 0.75, where the F-beta score is maximized.


*Choosing thresholds that maximize F-score*
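The threshold search can be sketched as a simple sweep; this is a hypothetical toy version of the per-category script described above, using made-up confidences:

```python
def best_threshold(y_true, confidences, beta=0.25):
    """Sweep thresholds 0.05..0.95 and return (threshold, f_beta)
    for the one maximizing the F-beta score. Predictions below the
    winning threshold would be rejected."""
    b2 = beta * beta
    best = (0.0, -1.0)
    for i in range(1, 20):
        t = i * 0.05
        pred = [c >= t for c in confidences]           # accept above threshold
        tp = sum(p and y for p, y in zip(pred, y_true))
        precision = tp / sum(pred) if any(pred) else 0.0
        recall = tp / sum(y_true) if any(y_true) else 0.0
        denom = b2 * precision + recall
        f = (1 + b2) * precision * recall / denom if denom else 0.0
        if f > best[1]:
            best = (t, f)
    return best

# three correct predictions with high confidence, two wrong ones with low
t, f = best_threshold([1, 1, 1, 0, 0], [0.9, 0.8, 0.62, 0.55, 0.2])
print(t, f)
```

Raising the threshold trades recall for precision; the sweep simply picks the point where the F-beta curve peaks.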

5) Making predictions and storing them in a database: Having designed the system to predict a single category, we scaled it to predict all three levels of the category tree. We stored these predictions in a database so they could be read by the front-end code and served on Octopart. In total, we have classified over 2 million parts, including 1.3 million connectors and 600k tools and supplies, as well as sensors, ICs, and some power products. Now, while browsing on Octopart, you will get even better results, like the following PIC32 microcontroller from Microchip:


*A newly categorized part using the machine learning system*

In this blog we summarized how we used machine learning to classify more parts on Octopart using their description text. Some of the unique challenges we faced were the hierarchical nature of our taxonomy, class skew between categories, and unique preprocessing and evaluation requirements. If you have any questions or comments, do drop a note below!

