Filter the web by Hacker News

A: I want to reuse the past content of Hacker News, not only the current events.

I’ve read Hacker News for several years, and most of the time I read it for recent events. The core value of HN is that it always gives me quality thinking.
HN, with roughly ten years of history, has grown along with the internet and keeps growing. Ten years is a very long time on the internet, and there must be some golden thoughts buried in HN.
So packing the whole history of Hacker News into a pocket, and enhancing my web reading experience whenever I need it, seemed like a cool and worthwhile thing to try.

But how? One day I read an article about Bloom filters and wanted to try the algorithm on some project. That is when the idea of using a Bloom filter to pack Hacker News came up.
After some investigation (the Hacker News API, Algolia HN Search), mainly into the story density of the whole HN dataset, I concluded the idea was possible but needed more work to make it real. Since I have some free time now (I just left a startup), I kicked it off. After several weeks of development, Hacker News Filter came true.
For now this is an early-stage product, which means it should have some core value and key features that solve real problems. It also still needs to prove its long-term value: maybe local filtering, maybe a social bookmarklet, maybe automatic knowledge enhancement, or maybe none of these make sense and it is a fake problem after all.
What do you think? Feel free to send me your thoughts!

A: In short, every web link is tested against a local "hash" file.
In detail:
1) On the server, a crawler watches the Hacker News API for item changes. When a new story pops up, it is added to a Bloom filter (you can find the hash algorithm here), where
the number of hash functions is 8, and
the number of bits to allocate is 16 × (link count).
These factors keep the probability of a false positive low enough. The Bloom filter size is then (distinct link count) × 2 bytes, which is small enough for the current Hacker News total (~2M links), so packaging it into a local app works out to about 4 MB.
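A quick back-of-envelope check of those parameters (a sketch with helper names of my own, not the project's actual code):

```javascript
// False positive probability of a Bloom filter with k hash functions
// and bitsPerElement = m/n bits per stored element:
// p ≈ (1 - e^(-k·n/m))^k
function falsePositiveRate(k, bitsPerElement) {
  return Math.pow(1 - Math.exp(-k / bitsPerElement), k);
}

// Filter size in megabytes for n distinct links at bitsPerElement bits each.
function filterSizeMB(n, bitsPerElement) {
  return (n * bitsPerElement) / 8 / 1024 / 1024;
}

console.log(falsePositiveRate(8, 16)); // ≈ 0.00057, i.e. under 0.06%
console.log(filterSizeMB(2000000, 16)); // ≈ 3.8 MB for ~2M links
```

So with k = 8 and 16 bits (2 bytes) per link, false positives are rare enough and the whole filter stays around 4 MB, as the numbers above suggest.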
2) The Bloom filter data is incrementally synced to the client (the browser extension); you can also sync manually with a click.
3) When the end user opens a link in the browser (currently only Chrome), the extension formats the link address and tests it against the Bloom filter.
4) If the test result is positive, the extension makes an HTTP call to fetch the link's score. The score is shown in the extension badge.
5) If the user opens the extension popup, the extension loads the Hacker News item details.
Basically, the assumption is that Hacker News links are a very small part of all the links you visit.
Compared with the naive method of checking every visited link on the server side, this approach needs far less network communication and is much faster. In practice you will find the extension changes its badge color very quickly (usually faster than the page loads), which lets you know what HN is saying about the page you are visiting.
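Steps 3 and 4 can be sketched as follows (names are illustrative, not the extension's real API; `formatUrl` here is just a placeholder for the formatter pipeline described later in this FAQ):

```javascript
// Sketch of the client-side check: the Bloom filter answers locally,
// and only positives trigger a network call for the score.
function formatUrl(url) {
  return url.replace(/#.*$/, ""); // placeholder normalizer
}

function checkVisitedLink(url, filter, fetchScore) {
  const key = formatUrl(url);
  if (!filter.test(key)) return null; // fast local negative: no HTTP call
  return fetchScore(key);             // positive: ask the server for the score
}
```

Since most visited links are not on HN, most checks end at the local `filter.test` and never touch the network, which is why the badge updates so quickly.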

A: For distribution purposes, the whole dataset is separated into three parts:
part 1 (base type): Links that are old enough are distributed with the package installation and updated by version upgrades. This part contains more than 90% of everything.
part 2 (rt type): The newest links are built on the server in near real time (every minute), periodically pulled into the extension, and stored in RAM. (<1k links)
part 3 (archive type): The links in between are built into block files and stored in localStorage.
Go to the Bloom filter status page to see the difference.
Notice 1: Because of the localStorage volume limit (5 MB max), users should update to the latest extension yearly to avoid too much fragmentation (generated by part 3).
Notice 2: Parts 2 and 3 support MD5 checksums.
Notice 3: You can find the total link count in the extension footer bar.

BTW, I have not yet investigated other storage types, e.g. IndexedDB, which may be more promising than localStorage.
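As a sketch, a combined lookup just tests the three parts in order (function and parameter names here are illustrative, not the extension's actual code):

```javascript
// Illustrative combined membership test over the three data parts:
// the packaged base filter, the in-RAM realtime filter, and the
// archive block filters loaded from localStorage.
function testAllParts(key, baseFilter, rtFilter, archiveFilters) {
  return baseFilter.test(key)                   // part 1: shipped with the package
      || rtFilter.test(key)                     // part 2: newest links, in RAM
      || archiveFilters.some(f => f.test(key)); // part 3: archive blocks
}
```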

A: No, the link count is pretty small, and there are also caches to avoid duplicated calculation.
If you run into this issue, report a bug to me.

A: You can; download it from the Bloom filter status page.
Thanks to HN for keeping its data open to everyone; you are expected to follow the same policies.
Roughly speaking, the data can be used to:
  • check web links offline, even on mobile.
  • pick up and mark the HN links within web page content, not only in the address bar.

A: The following pipeline formats a URL:
1. remove the hashtag (fragment)
2. remove well-known noise parameters (utm_source etc.)
3. remove trailing noise (index.html, index.jsp, etc.)
You can test this formatter model on the formatter page.
BTW, this model is neither perfect nor standard: some rules are pretty tricky, and some follow conventions analyzed from the Hacker News dataset. You are welcome to file a bug to report any exceptions.
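A minimal sketch of the three steps (the noise-parameter and end-noise lists below are simplified placeholders; the real model has more rules):

```javascript
// Hypothetical formatter following the three-step pipeline above.
const NOISE_PARAMS = ["utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"];

function formatUrl(raw) {
  const u = new URL(raw);
  u.hash = "";                                         // 1. remove hashtag
  NOISE_PARAMS.forEach(p => u.searchParams.delete(p)); // 2. remove noise parameters
  u.pathname = u.pathname.replace(/\/index\.(html?|jsp|php)$/, "/"); // 3. remove end noise
  return u.toString();
}

console.log(formatUrl("https://example.com/blog/index.html?utm_source=hn#top"));
// → "https://example.com/blog/"
```

Non-noise query parameters (e.g. an article `id`) are kept, since dropping them could merge distinct pages into one key.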

A: In short, not yet.
After some sampling of the crawled HN data, I found the redirected-link rate is pretty high (1%~5%; correct me if you have more detailed data). The biggest redirect case is that many sites have moved from http to https in recent years, so the client tests both http and https (see the formatter part for more).
This project is still at an early stage, and handling redirects would mean setting up and maintaining a HEAD crawler (many sites are not friendly to HEAD or GET requests, e.g. nytimes).
You can file a request if you care a lot about such cases (please attach the link address).
Dead links (status codes 40x, 50x) have the same issue.
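The http/https workaround can be sketched like this (a hypothetical helper, not the extension's actual code):

```javascript
// Since many sites moved from http to https, a link is tested under
// both schemes; other schemes are left untouched.
function schemeVariants(url) {
  if (url.startsWith("https://")) return [url, "http://" + url.slice(8)];
  if (url.startsWith("http://")) return [url, "https://" + url.slice(7)];
  return [url];
}
```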

A: I am not good at beautiful UI design. I hope to get some help with this for a later version.
You are welcome to file a request if you have better design suggestions.

A: Step 1: Download the crx: click here.
Step 2: Open the Chrome extensions page: enter chrome://extensions in a new Chrome tab.
Step 3: Drop the crx file onto the page, and click Install in the popup window.
The extension is now installed successfully!

A: Server cost: one Linode VPS, $20/month.
I am not very sure how to make money yet; I am waiting a while to learn how users use it and how popular it becomes.
But earning money is a must-have for the long run; passion alone is not enough for a long walk.
At this stage it is a personal project. If it proves promising enough, I will adopt a business model; maybe a small business is good enough.
If you think it is helpful, please support me in the HN community or star it in the web store.

A: This project is still at an early stage; I am waiting for HN community feedback.

A: Full-stack developer, learner, father of a very lovely daughter, living in Shanghai, China. I like programming, reading, travelling, swimming, and FREEDOM. I just got free from a startup, and for now I am enjoying life very much :)
My HN account is huan9huan, my Gmail is hhhust, and my Twitter is @huan_huang.

Want to ask more? Leave your question here.
Official Twitter: @filtertheweb

Hacker News Filter is discussed on HN now; special thanks to Yen and folz.

This page is last updated at 2016-12-06.