UNDERGRADUATE THESIS IN NLP
Automated Occupational Encoding using an Ensemble Classifier from TF-IDF and Doc2Vec Embeddings
For my undergraduate honors thesis, I developed a prototype system for classifying job titles under the Canadian NOC. ENENOC (the ENsemble Encoder for the National Occupational Classification) comprised a series of steps: data cleaning, exact-match search, multi-classifier ensembling, hierarchical classification, and multiple-output selection. When a job title did not exactly match any NOC category description, the input was embedded using both TF-IDF and Doc2Vec. The embeddings were fed into a hierarchical ensemble classifier, which combined an output based on the Doc2Vec embeddings with the TF-IDF-based outputs of Random Forest, SVM, and KNN classifiers.
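
As a rough illustration of the control flow described above, and not the thesis code itself, the sketch below cleans a job title, tries an exact match, and otherwise lets several classifiers vote before returning more than one candidate code. Everything named here is an assumption made for the example: the `EXACT_TITLES` table, the placeholder codes (which are not real NOC codes), the stub classifiers standing in for trained models, and the use of simple majority voting to combine them. The hierarchical step is left out for brevity.

```python
import re
from collections import Counter

# Hypothetical lookup table: cleaned category title -> placeholder code.
# These codes are illustrative placeholders, not real NOC codes.
EXACT_TITLES = {
    "software engineer": "C-01",
    "registered nurse": "H-02",
}


def clean(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    title = re.sub(r"[^a-z0-9\s]", " ", title.lower())
    return " ".join(title.split())


def encode(title, classifiers, top_k=2):
    """Return up to top_k candidate codes for a job title.

    An exact match short-circuits the classifiers. Otherwise each
    classifier casts one vote and the most common codes are returned
    (majority voting is an assumption of this sketch).
    """
    cleaned = clean(title)
    if cleaned in EXACT_TITLES:
        return [EXACT_TITLES[cleaned]]
    votes = Counter(clf(cleaned) for clf in classifiers)
    return [code for code, _ in votes.most_common(top_k)]


# Stand-in classifiers (assumptions): in a real system these would be
# trained models over TF-IDF or Doc2Vec embeddings of the cleaned title.
def rf_stub(text):
    return "C-01"


def svm_stub(text):
    return "C-01"


def knn_stub(text):
    return "H-02"


def doc2vec_stub(text):
    return "C-01"


STUBS = [rf_stub, svm_stub, knn_stub, doc2vec_stub]
```

Calling `encode("Software Engineer!", STUBS)` takes the exact-match path, while a title such as `"Sr. Developer"` falls through to the vote and returns the top candidates in order of support.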