Hacker News Clone

Classifying aviation-related posts on Hacker News with SLMs

by sethkim on 4/16/2025, 9:05:16 PM with 2 comments

by minimaxir on 4/16/2025, 9:26:28 PM
A few misc notes:
1. The better way to get all Hacker News data instead of blasting the API is to download the data from the official BigQuery dataset, which can do the task in a single query: https://news.ycombinator.com/item?id=40644563
2. For labeling the posts, instead of label-then-explanation, it may be better to do explanation-then-label to give the model a chance to reason though the edge cases.
3. Following up from #2, for prompt engineering the system prompt, it would likely be better to give a list of multiple valid examples and invalid examples (as noted after the fact) to guide reasoning.
4. Since the target label is a binary objective, it may be more practical/faster/cheaper to create a normal logistic regression model (e.g. tf-idf/BoW) from a large representative sample, then use that to predict the rest of the labels.
The more advanced way to do #4 would be to encode the posts as text embeddings first then use them as the input for a small MLP model...which I may or may not have a project in the pipeline based around that approach.