• by abe94 on 5/9/2024, 3:15:30 PM

    This is impressive work, especially for a one man show!

    One thing that stood out to me was the graph of sentiment analysis over time; I hadn't seen something like that before, and it was interesting to see it for Rust. What were the most positive topics over time? And were there topics that saw very sudden drops?

    I also found this sentence interesting, as it rings true to me about social media: "there seems to be a lot of negative sentiment on HN in general." It would be cool to see a comparison of sentiment across social media platforms and across time!

  • by CuriouslyC on 5/9/2024, 2:13:37 PM

    Good example of data engineering/MLops for people who aren't familiar.

    I'd suggest using HDBSCAN to generate hierarchical clusters for the points, then using a model to generate names for the interior clusters. That'll make it easy to explore topics out to the leaves, as you can just pop up refinements based on the connectivity to the current node, using the summary names.
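
    As a sketch of what I mean (inputs are hypothetical, and name_cluster stands in for whatever model writes the labels):

        import hdbscan
        import numpy as np

        points = np.load("points.npy")                   # hypothetical (N, 2) map coords
        titles = open("titles.txt").read().splitlines()  # one post title per point

        clusterer = hdbscan.HDBSCAN(min_cluster_size=50)
        labels = clusterer.fit_predict(points)           # -1 marks noise points

        # The condensed tree encodes the whole hierarchy, so interior (parent)
        # clusters can be named as well as the leaves.
        hierarchy = clusterer.condensed_tree_.to_pandas()

        for leaf in sorted(set(labels) - {-1}):
            members = [titles[i] for i in np.where(labels == leaf)[0]]
            # label = name_cluster(members)   # hypothetical LLM labeling call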

    The groups need more distinct coloring, which I think having clusters could help with. The individual article text size should depend on how important or relevant the article is, either in general or based on the current search. If you had more interior cluster summaries, that'd also help cut down on some of the text clutter, as you could replace multiple posts with a group summary until the user zooms in further.

  • by ComputerGuru on 5/10/2024, 3:09:19 AM

    Amazing work, I'm impressed by the scope of your project!

    I must say though: is it jina or bge-3/flag? The embeddings (and tokenizer?) do not do a good job on tech topics. They're fine for natural words, but searching for tech concepts like "xaml", "simd", etc. causes it to fall back to tokenizing the inputs and grabbing similar-sounding words.

    Also, just some constructive feedback: it would be nice if there were some way to stop it from showing the same "HN leaderboard" of results when a topic is too niche to have real matches. I get a lot of "Stephen Hawking has died" when searching for words the embeddings aren't familiar with.

    Edit: I'm not so sure how well the sentiment analysis is working. I had the feeling that there was too much negative sentiment to match reality, so I tried looking up things HN would feel overwhelmingly positive about, like "Mr Rogers" (I mean, who could feel negatively about him?). The results show some serious negative spikes. Look up "Carter" and there's a massive negative peak associated with the passing of Rosalynn Carter, even though the HN submission was about all the wonderful things the Carters did.

    Also, I think the "popularity over time" needs to be scaled by the median number of votes a story got that month/year, because the trend lines just go up and up if you plot strictly the number of posts. Look at the popularity of "diesel" and you'll see what I mean: this is a term that peaked ten years ago! Or perhaps it should be some sort of keyword incidence rate, or the number of items within some cosine-similarity threshold of the query, rather than post score?
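
    For example, the incidence-rate version in pandas (a sketch; file and column names are made up):

        import pandas as pd

        matches = pd.read_parquet("matches.parquet")      # stories matching the query
        all_posts = pd.read_parquet("all_posts.parquet")  # every story, with a "month" column

        # Share of that month's stories that match, instead of a raw count
        # that inevitably rises with the site's overall growth.
        share = (matches.groupby("month").size()
                 / all_posts.groupby("month").size()).fillna(0)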

    Edit2: The dynamic "click a post to remove and recalculate similarity threshold" is awesome.

  • by oersted on 5/9/2024, 2:44:31 PM

    Here's a great tool that does almost exactly the same thing for any dataset: https://github.com/enjalot/latent-scope

    Obviously the scale of OP's project adds a lot of interesting complexity that this tool can't handle, but it's great for medium-sized datasets.

  • by rantymcrant on 5/10/2024, 3:46:02 AM

    I'd like to see an analysis of the rise of self promotion on HN.

    I define self promotion on HN as a "Show HN: I ..." post vs "Show HN: Something ..."

    Examples from the top 100 right now

    * "Show HN: Exploring HN by mapping and analyzing 40M posts and comments for fun"

    * "Show HN: Browser-based knitting (pattern) software"

    These are not self promotional titles. The subjects are the exploration and the software respectively.

    * "Show HN: I built a non-linear UI for ChatGPT"

    * "Show HN: I created 3,800+ Open Source React Icons"

    These are self promotional titles. The subject of each is "I"

    My own simple check, via Algolia search results for titles that start with "Show HN: I", gave these results for twelve-month periods starting April 1st, graphed as a share of the total number of results for that year (a rough script to reproduce the query follows the chart):

        2023 ****************************************
        2022 ***********************************
        2021 ***************************
        2020 **************************************
        2019 *************************
        2018 *************
        2017 *******
        2016 **********
        2015 ********
        2014 ************
        2013 *********************
        2012 *****************
        2011 *********
        2010 ***
    
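    A rough script to reproduce the check via the Algolia API (approximate, since the API's word-based matching isn't identical to a strict title-prefix check):

        import datetime
        import requests

        API = "https://hn.algolia.com/api/v1/search"

        def count_hits(query: str, year: int) -> int:
            # Twelve-month window starting April 1st of the given year.
            start = int(datetime.datetime(year, 4, 1).timestamp())
            end = int(datetime.datetime(year + 1, 4, 1).timestamp())
            params = {
                "query": query,
                "tags": "show_hn",
                "numericFilters": f"created_at_i>={start},created_at_i<{end}",
                "hitsPerPage": 0,  # only the total count is needed
            }
            return requests.get(API, params=params).json()["nbHits"]

        for year in range(2010, 2024):
            share = count_hits('"Show HN: I"', year) / max(count_hits("", year), 1)
            print(year, "*" * round(share * 400))  # 400 is an arbitrary scale
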
    I feel like maybe I grew up in a time when, generally, self promotion was considered a bad character trait. Your actions are supposed to be what promotes you; calling attention to them yourself is not. But I feel that culture is changing.

    I wonder if the rise in self promotion (assuming there is a rise) has to do with social media etc...

    I perceive a similar rise on YouTube, but I have no data, just a feeling from the number of YouTube recommendations for videos titled "I..."

  • by replete on 5/9/2024, 4:14:42 PM

    I think this is easily the coolest post I've seen on HN this year

  • by seanlinehan on 5/9/2024, 1:53:56 PM

    It was not obvious at first glance to me, but the actual app is here: https://hn.wilsonl.in/

  • by minimaxir on 5/9/2024, 4:42:27 PM

    A modern recommendation for UMAP is Parametric UMAP (https://umap-learn.readthedocs.io/en/latest/parametric_umap....), which instead trains a small Keras MLP to perform the dimensionality reduction down to 2D by minimizing the UMAP loss. The advantage is that this model is small and can be saved and reused to predict on unseen new data (a traditionally trained UMAP model is large), and training is theoretically much faster because GPUs are GPUs.

    The downside is that the implementation in the Python UMAP package isn't great and creates/pushes the whole expanded node/edge dataset to the GPU, which means you can only train it on about 100k embeddings before going OOM.

    The UMAP -> HDBSCAN -> AI cluster labeling pipeline that's all unsupervised is so useful that I'm tempted to figure out a more scalable implementation of Parametric UMAP.
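
    A sketch of that pipeline with the current package, training on a subsample to dodge the OOM issue and reusing the encoder for the rest:

        import numpy as np
        import hdbscan
        from umap.parametric_umap import ParametricUMAP

        embeddings = np.load("embeddings.npy")      # hypothetical (N, D) array

        # Fit the MLP encoder on a subsample that fits in GPU memory...
        umap_model = ParametricUMAP(n_components=2)
        umap_model.fit(embeddings[:100_000])

        # ...then project everything, since the trained encoder generalizes
        # to data it hasn't seen.
        coords = umap_model.transform(embeddings)

        labels = hdbscan.HDBSCAN(min_cluster_size=100).fit_predict(coords)
        umap_model.save("parametric_umap_model")    # small enough to keep around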

  • by oersted on 5/9/2024, 2:23:12 PM

    This is a surprisingly big endeavour for what looks like an exploratory hobby project. Not to minimize the achievement, very cool, I'm just surprised by how much was invested into it.

    They used 150 GPUs and developed two custom systems (db-rpc and queued) for inter-server communication, and this was just to compute the embeddings; there's a lot of other work and computation surrounding it.

    I'm curious about the context of the project, and how someone gets this kind of funding and time for such research.

    PS: Having done a lot of similar work professionally (mapping academic paper and patent landscapes), I'm not sure if 150 GPUs were really needed. If you end up just projecting to 2D and clustering, I think that traditional methods like bag-of-words and/or topic modelling would be much easier and cheaper, and the difference in quality would be unnoticeable. You can also use author and comment-thread graphs for similar results.
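
    For illustration, the sort of traditional pipeline I mean (a toy sketch; a real corpus would need tuning):

        from sklearn.decomposition import TruncatedSVD
        from sklearn.feature_extraction.text import TfidfVectorizer

        titles = [                      # toy stand-ins for post titles
            "Show HN: a Rust HTML parser",
            "Ask HN: best laptop for ML?",
            "Show HN: mapping 40M HN posts",
            "Why SQLite is so reliable",
        ]

        tfidf = TfidfVectorizer(stop_words="english")
        X = tfidf.fit_transform(titles)                         # sparse bag-of-words
        coords = TruncatedSVD(n_components=2).fit_transform(X)  # cheap 2D projection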

  • by jxy on 5/9/2024, 7:08:16 PM

    > We can see that in this case, where perhaps the X axis represents "more cat" and Y axis "more dog", using the euclidean distance (i.e. physical distance length), a pitbull is somehow more similar to a Siamese cat than a "dog", whereas intuitively we'd expect the opposite. The fact that a pitbull is "very dog" somehow makes it closer to a "very cat". Instead, if we take the angle distance between lines (i.e. cosine distance, or 1 minus angle), the world makes sense again.

    Typically the vectors are normalized, unlike what's shown in this demonstration.

    When using normalized vectors, the euclidean distance measures the distance between the endpoints of the two vectors, while the cosine similarity measures the length of one vector projected onto the other (cosine distance is one minus that).
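
    In fact, for normalized vectors the two orderings agree, since ||u - v||^2 = 2 * (1 - cos(theta)). A quick check:

        import numpy as np

        rng = np.random.default_rng(0)
        u, v = rng.normal(size=384), rng.normal(size=384)
        u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)  # unit length

        cosine_distance = 1.0 - u @ v
        squared_euclidean = np.sum((u - v) ** 2)
        assert np.isclose(squared_euclidean, 2.0 * cosine_distance)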

  • by ed_db on 5/9/2024, 1:46:06 PM

    This is amazing, the amount of skill and knowledge involved is very impressive.

  • by pudiklubi on 5/10/2024, 12:57:47 PM

    This is wild. I've been creating my own dataset of trending articles and ironically this is how I came across your post. I'm doing a similar project for my uni thesis.

    I set out with similar hypotheses and goals like you (on a slightly different scale though, haha) but I've been completely stuck on the interactive map part. Definitely getting a lot of pointers from how you handled this!

    Maybe one key difference in approach is that I've put more emphasis on trying to extract key topics as keywords.

    For ex:

    article (title): "Useful Uses of cat"

    keywords: ['Software design', 'Contraction', 'Code changes', 'Modularity', 'Ease of extension']

    My hypothesis is that this will be a faster search solution than using the embeddings, but potentially not as accurate. I'm not far enough along to really prove this yet, though.
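
    The intuition behind the speed hypothesis, as a toy sketch (an inverted index turns a query into a couple of dict lookups instead of a scan over dense vectors):

        from collections import defaultdict

        # Hypothetical extracted data: {post_id: [keywords]}.
        keywords_by_post = {
            1: ["software design", "modularity"],
            2: ["modularity", "code changes"],
        }

        index = defaultdict(set)
        for post_id, keywords in keywords_by_post.items():
            for kw in keywords:
                index[kw.lower()].add(post_id)

        def search(query: str) -> set[int]:
            return index.get(query.lower(), set())

        print(search("modularity"))  # {1, 2}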

    Would love to hear what you think! Any other cool ideas on what could be done with the keywords? I explain my process a bit more here if interested: https://hackernews-demo.streamlit.app/#data-aggregation-meth...

  • by stavros on 5/12/2024, 10:45:03 AM

    This search engine is amazing. I was looking for an old story about curing acid reflux with some exercise; Google/DDG/Kagi/HN's Algolia were completely useless, but this found it on the first hit. Well done, this is the HN search engine I've always wanted.

    Is it possible to keep it up to date?

  • by thyrox on 5/9/2024, 1:41:48 PM

    Very nice. Since HN data spawns so many fun projects, there should be a monthly or weekly updated zip file or torrent with this data, which hackers could just download instead of writing a scraper and starting from scratch every time.

  • by coolspot on 5/9/2024, 8:22:33 PM

    Absolutely wonderful project and even more so the writeup!

    Feedback: on my iOS phone, once you select a dot on the map, there is no way to unselect it. The preview card of some articles takes the full screen, so I can't even click another dot. Maybe add a "cross" icon to the preview card, or make it so that tapping outside a card hides the whole card strip?

  • by swozey on 5/9/2024, 3:53:04 PM

    I'm.. shocked there's been 40 million posts. Wow.

    Really neat work

    edit: Also had no idea HN went back to 2006. https://news.ycombinator.com/item?id=1

    edit2: PG wrote this? https://news.ycombinator.com/item?id=487171

  • by chossenger on 5/9/2024, 3:08:37 PM

    Awesome visualisation, and great write-up. On mobile (in portrait), a lot of longer titles get culled as their origin scrolls off-screen, even with half of the text still on the other side of the screen; I wonder if it'd be worth rendering them until the entire text field is off screen (especially since you've already got a bounding box for them).

    Using it, I stumbled upon [1], which reflects your comments on comment sentiment.

    This also reminded me of [2] (for which the site itself had rotted away, incidentally) - analysing HN users' similarity by writing style.

    [1] https://minimaxir.com/2014/10/hn-comments-about-comments/ [2] https://news.ycombinator.com/item?id=33755016

  • by datguyfromAT on 5/9/2024, 6:15:51 PM

    What a great read! Thanks for taking the time and effort to provide the insight into your process.

  • by kriro on 5/10/2024, 3:16:01 AM

    Very nice project and documented really well. I learned a lot reading the post. The examples of the improved HN search are pretty awesome.

    Any idea why "password reuse" is so far away from "security"? That was the only oddity of the map for me.

  • by chatman on 5/10/2024, 12:48:18 AM

    Worth trying Cagra (Raft)/CuVS and Lucene-CuVS for the vector search. (https://github.com/SearchScale/lucene-cuvs)

  • by NeroVanbierv on 5/9/2024, 2:15:36 PM

    Really love the island map! But the automatic zooming on the map doesn't seem very relevant. E.g. try typing "openai" - I can't see anything related to that query in that part of the map

  • by celltalk on 5/10/2024, 6:34:43 AM

    It would be cool to see yearly changes of the UMAP: different years side by side, or the overall evolution in pseudotime on the embedding. Such a cool side project!

  • by graiz on 5/9/2024, 1:44:42 PM

    Would be cool to see member similarity. Finding like-minded commenters/posters may help discover content that would be of interest.

  • by Lerc on 5/9/2024, 5:54:08 PM

    A suggestion for analysis:

    Compare topics/sentiment etc. by number of users and by number of posts.

    Are some topics dominated by a few prolific posters? Positively or negatively.

    Also, how does one separate negative/positive sentiment from criticism/advocacy?

    How hard is it to detect positive criticism, or enthusiastic endorsement of an acknowledged bad thing?

  • by paddycap on 5/9/2024, 2:21:22 PM

    Adding a subscribe feature to get an email with the most recent posts in a topic/community would be really cool. One of my favorite parts of HN is the weekly digest I get in my inbox; it would be awesome if that were tailored to me.

    What you've built is really impressive. I'm excited to see where this goes!

  • by tomthe on 5/10/2024, 5:54:00 AM

    I made something very similar a few weeks ago. I also included usernames with the average of their comments: https://tomthe.github.io/hackmap/
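
    The username placement is roughly mean-pooling (a simplified sketch, not my actual code):

        import numpy as np

        comment_coords = np.load("coords.npy")  # (N, 2) projected comment positions
        comment_user = np.load("users.npy")     # (N,) username per comment

        user_coords = {
            user: comment_coords[comment_user == user].mean(axis=0)
            for user in np.unique(comment_user)
        }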

  • by xnx on 5/9/2024, 2:50:20 PM

    As a novice, is there a benefit to using a custom Node.js downloader? When I did my download of the 40 million Hacker News API items, I used "curl --parallel".

    What I would like to figure out is the easiest way to go from the API straight into a parquet file.
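
    Something like this might be the simplest route (an untested sketch with requests and pyarrow; a real run would batch and parallelize the fetches):

        import requests
        import pyarrow as pa
        import pyarrow.parquet as pq

        BASE = "https://hacker-news.firebaseio.com/v0"

        def fetch_item(item_id: int) -> dict:
            # Deleted items come back as null, hence the `or {}`.
            return requests.get(f"{BASE}/item/{item_id}.json").json() or {}

        max_id = requests.get(f"{BASE}/maxitem.json").json()
        rows = [fetch_item(i) for i in range(max_id - 99, max_id + 1)]  # last 100

        table = pa.Table.from_pylist(rows)   # schema inferred from the dict keys
        pq.write_table(table, "hn_items.parquet")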

  • by gaauch on 5/9/2024, 3:43:01 PM

    A long term side project of mine is to try to build a recommendation algorithm trained on HN data.

    I trained a model to predict whether a given post will reach the front page, get flagged, etc. I collected over 1,000 RSS feeds and rank the RSS entries with my ranking models.
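
    A toy sketch of the general shape of such a classifier (not my actual model, just an illustration):

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        # Toy training data: (title, reached front page?) pairs.
        titles = ["Show HN: my tiny database", "Ask HN: how do you test?"]
        reached_front_page = [1, 0]

        model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                              LogisticRegression())
        model.fit(titles, reached_front_page)

        print(model.predict_proba(["Show HN: a new search engine"])[:, 1])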

    I submit the high-ranking entries on HN to test out my models, and I can reach the front page consistently, sometimes having multiple entries on the front page at a given time.

    I also experiment with user->content recommendation; for that I use comment data to model interactions between users and entries, which seems to work fine.

    The only problem I have is that I get a lot of 'out of distribution' content in my RSS feeds, which causes my ranking models to get 'confused'; for this I trained models to predict whether a given entry belongs on HN or not. On top of that I have some tagging models trained on data I scraped from lobste.rs and hand-annotated.

    I've been working on this on and off for the last 2 years or so. This account is not my main, just one I created for testing.

    AMA

  • by ashu1461 on 5/9/2024, 2:37:40 PM

    This is pretty great.

    Feature request: is it possible to show in the graph how popular the topic / subtopic / article is?

    So that we can do an educated exploration of the graph around what was upvoted and what was not?

  • by cyclecount on 5/9/2024, 8:10:08 PM

    I can’t tell from the documentation on GitHub: does the API expose the flagged/dead posts? It would be interesting to see statistics on what’s been censored lately.

  • by aeonik on 5/10/2024, 12:15:32 AM

    I couldn't help but notice that Hy is on the map but Clojure isn't.

    Am I out of touch?

    https://hylang.org

  • by fancy_pantser on 5/9/2024, 3:56:41 PM

    HN submissions and comments are very different on weekends (and US holidays). Your data could explore and quantify this in some very interesting ways!

  • by gsuuon on 5/9/2024, 6:29:47 PM

    This is super cool! Both the writeup and the app. It'd be great if the search results linked to the HN story so we can check out the comments.

  • by nojvek on 5/9/2024, 6:02:57 PM

    I'm impressed with the map component in canvas. It's very smooth, with dynamic zoom, and Google Maps-like.

    Gonna dig more into it.

    Exemplary Show HN! We need more of this.

  • by sourcepluck on 5/9/2024, 9:54:16 PM

    Where is lisp?! I thought it was a verifiable (urban) legend around these parts that this forum is obsessed with lisp...?

  • by gitgud on 5/9/2024, 10:42:09 PM

    Very cool! I was hoping to be able to navigate to the HN post from the map though? Is that possible?

  • by gardenhedge on 5/10/2024, 10:58:11 AM

    AI is the most popular topic (by far) that I could find. Is there anything more popular?

  • by dfworks on 5/9/2024, 5:08:08 PM

    If anybody found this interesting and would like some further reading, the paper below employed a similar strategy to analyse inauthentic content/disinformation on Twitter.

    https://files.casmconsulting.co.uk/message-based-community-d...

    If you would like to read about my largely unsuccessful recreation of the paper, you can do so here - https://dfworks.xyz/blog/partygate/

  • by carte_blanche on 5/11/2024, 6:23:06 AM

    Getting "Argo tunnel error" on the page

  • by Venkatesh10 on 5/10/2024, 3:16:05 PM

    This is the type of content I'm here for.

  • by racosa on 5/10/2024, 2:23:40 PM

    Very cool project. Thanks for sharing it!

  • by freediver on 5/9/2024, 2:01:45 PM

    If you have a blog, add an RSS feed :)

  • by redbell on 5/10/2024, 8:58:42 AM

    Truly amazing work! Not only because of the final results, but also because of the whole process it took the author to bring this to life. If I could upvote this by giving points from my karma, I wouldn't hesitate to give a hundred points. Without a doubt, I would classify this on par with "40k HN comments mentioning books, extracted using deep learning" (https://news.ycombinator.com/item?id=28595967), which, at 1359 points, is the highest-voted "Show HN" project related to Hacker News so far.

    I'm not in the ML/AI arena yet, so I couldn't fully understand the second half of the article beyond having a general idea of embeddings and their potential, but the first part is what interests me as a software engineer.

    Following are some of the challenges the author came across and overcame, with the full source code published.

    Downloading HN database

    > There's also a maxitem.json API, which gives the largest ID. As of this writing, the max item ID is over 40 million. Even with a very nice and low 10 ms mean response time, this would take over 4 days to crawl, so we need some parallelism.

    > I've exported the HN crawler [1] (in TypeScript) to its own project, if you're ever in need to fetch HN items.
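
    The parallel crawl boils down to something like this (my own hedged Python sketch; the author's actual crawler is the TypeScript project in [1]):

        import asyncio
        import aiohttp

        BASE = "https://hacker-news.firebaseio.com/v0"

        async def fetch(session, sem, item_id):
            async with sem:  # cap the number of in-flight requests
                async with session.get(f"{BASE}/item/{item_id}.json") as resp:
                    return await resp.json()

        async def crawl(first_id, last_id, concurrency=256):
            sem = asyncio.Semaphore(concurrency)
            async with aiohttp.ClientSession() as session:
                tasks = [fetch(session, sem, i)
                         for i in range(first_id, last_id + 1)]
                return await asyncio.gather(*tasks)

        items = asyncio.run(crawl(1, 10_000))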

    Fetching and parsing linked URLs' HTML for metadata and text

    > For text posts and comments, the answer is simple. However, for the vast majority of link posts, this would mean crawling those pages being linked to. So I wrote up a quick Rust service [2] to fetch the URLs linked to and parse the HTML for metadata (title, picture, author, etc.) and text. This was CPU-intensive so an initial Node.js-based version was 10x slower and a Rust rewrite was worthwhile. Fortunately, other than that, it was mostly smooth and painless, likely because HN links are pretty good (responsive servers, non-pathological HTML, etc.).

    Recovering missing/dead links

    > A lot of content even on Hacker News suffers from the well-known link rot: around 200K resulted in a 404, DNS lookup failure, or connection timeout, which is a sizable "hole" in the dataset that would be nice to mend. Fortunately, the Internet Archive has an API that we can use to programmatically fetch archived copies of these pages. So, as a final push for a more "complete" dataset, I used the Wayback API to fetch the last few thousands of articles, some dating back years, which was very annoying because IA has very, very low rate limits (around 5 per minute).
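
    And the Wayback mending step is conceptually just this (a hedged sketch against IA's availability API, throttled for its low rate limit):

        import time
        import requests

        AVAILABILITY_API = "https://archive.org/wayback/available"

        def closest_snapshot(url: str):
            resp = requests.get(AVAILABILITY_API, params={"url": url}).json()
            closest = resp.get("archived_snapshots", {}).get("closest")
            return closest["url"] if closest and closest.get("available") else None

        dead_urls = ["http://example.com/gone"]  # hypothetical 404/NXDOMAIN leftovers
        for url in dead_urls:
            snapshot = closest_snapshot(url)
            time.sleep(12)  # stay under ~5 requests/minute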

    Finding a cost-effective cloud provider for GPUs

    > Fortunately, I discovered RunPod, a provider of machines with GPUs that you can deploy your containers onto, at a cost far cheaper than major cloud providers. They also have more cost-effective GPUs like RTX 4090, while still running in datacenters with fast Internet connections. This made scaling up a price-accessible option to mitigate the inference time required.

    This is the type of content that makes HN stand out from the crowd.

    _____________________________

    1. https://github.com/wilsonzlin/crawler-toolkit-hn/

    2. https://github.com/wilsonzlin/hackerverse/tree/master/crawle...

  • by Igor_Wiwi on 5/9/2024, 8:32:28 PM

    How much did you pay to generate those embeddings?

  • by dangoodmanUT on 5/9/2024, 11:12:41 PM

    excellent work

  • by password4321 on 5/9/2024, 2:51:53 PM

    Related a month ago:

    A Peek inside HN: Analyzing ~40M stories and comments

    https://news.ycombinator.com/item?id=39910600

  • by callalex on 5/9/2024, 4:02:07 PM

    “Cloud Computing” “us-east-1 down”

    This gave me a belly laugh.