feat: add data for archive 2024_07_01 #228
Conversation
thanks so much for the hard work here!
I like the high-level approach of including all observed domains from entities that collectively exceed the 50-observation threshold, but I have two primary concerns with the specific implementation here.
- It creates a circular dependency problem for observed domains relying on entity mappings.
- The data is now too big to continue with our "checked into the repo as JSON" approach.
I have a few ideas on these but would like to hear what you think.
What if we preserved the existing observed-domains step with no dependency on entities (which helps the scripts in the repo that aid in identifying new domains requiring classification), and added a new step for your inclusion of the lower-volume domains that are currently missed, with its output living in a BigQuery table instead of being checked in?
Thank you @patrickhulce. OK, I hadn't noticed that use case of classifying newly observed domains. About file size: yes, that's a real issue I faced (blocked by GitHub's file size limit; I had to transform the flow into a stream for scalability in #224). So we can't remove the occurrence limitation from the query that generates the uploaded observed-domains file.

I like what I understand of your suggestion, but could you please confirm I'm understanding it well? You suggest keeping the initial query to get observed domains (without any mapping to entities, but with a minimum of 50 occurrences) and storing the result in a file uploaded to the repo (basically this stream), right? So, just getting back the initial behavior; the problems are that the generated file … Then we'd add an extra step which runs the new query (with entity mapping and the limit applied at the entity level) and saves the mapped observed domains into a dedicated table (it could be the existing …). Finally, the last step would not change and would keep using the …
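For concreteness, a rough sketch of what that extra step could look like, assuming BigQuery as the target and using hypothetical dataset/table names (not the repo's actual names):

```sql
-- Rough sketch only: materialise the entity-mapped observed domains into a
-- dedicated BigQuery table instead of committing them to the repo as JSON.
-- The `project.dataset.*` names below are hypothetical placeholders.
CREATE OR REPLACE TABLE `project.dataset.entity_mapped_observed_domains` AS
SELECT
  domain,
  entity,
  observed_pages
FROM `project.dataset.observed_domains_with_entities`  -- output of the new entity-aware query
WHERE observed_pages < 50;  -- keep only the low-volume domains the file-based step misses
```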
I've updated the script according to the previous discussion. Let me know if it's what you expected, @patrickhulce.
Hello @patrickhulce
Thanks @Nigui! It will need a manual rebase, I think, though.
Force-pushed from 3a3e145 to bf405d4 with the following commits:
- …50 different pages and generate 2024_07 data
- feat: keep saving in file all observed domains with minimum observations
- fix: tpw table exists
- feat: automated script splitting into 3 steps, add logs and clean table if needed
- feat: compute data for 2024_07_01
Sorry, I merged master instead of rebasing. But it's OK now :)
🎉 This PR is included in version 0.26.0 🎉
The release is available on: …
Your semantic-release bot 📦🚀
This PR exposes, and suggests a fix for, inaccurate computed data when a third party serves its script from a dynamic subdomain.
The problem is that some third parties serve their main script from a dynamic subdomain on each website (e.g. a unique subdomain of domain.com). Some of these subdomains end up in the observed-domains JSON file produced by the sql/all-observed-domains-query.sql query, but analyzing the HTTP Archive database we found that many are ignored because of their number of occurrences (fewer than 50, per the SQL query's HAVING clause).
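For reference, the shape of that extraction is roughly the following. This is a simplified sketch, not the repo's exact sql/all-observed-domains-query.sql; the source table and column names are assumptions:

```sql
-- Simplified sketch of the observed-domains extraction: count the pages each
-- third-party host appears on and keep only high-volume hosts. A dynamic,
-- per-site subdomain rarely reaches the threshold, so it gets dropped.
SELECT
  NET.HOST(url) AS domain,
  COUNT(DISTINCT page) AS observed_pages
FROM `project.dataset.requests`   -- placeholder for the HTTP Archive requests data
GROUP BY domain
HAVING observed_pages >= 50       -- the HAVING clause that filters out low-volume subdomains
ORDER BY observed_pages DESC;
```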
In this PR we've rewritten all-observed-domains-query.sql to fix this issue.
To sum up the change: we keep observed domains with fewer than 50 occurrences only if their mapped entity (based on entity.js) has a total occurrence count, across all of its declared domains, greater than 50.
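A hedged sketch of how that entity-level rule could be expressed in BigQuery. The entity mapping here is a hypothetical table exported from entity.js, and the join is simplified to exact domain matches; the real query also has to handle subdomain patterns:

```sql
-- Illustrative sketch of the rule described above: keep a domain with fewer
-- than 50 observations only if its mapped entity has at least 50 observations
-- across all of its declared domains. Table names are placeholders.
WITH per_domain AS (
  SELECT NET.HOST(url) AS domain, COUNT(DISTINCT page) AS observed_pages
  FROM `project.dataset.requests`
  GROUP BY domain
),
mapped AS (
  SELECT d.domain, d.observed_pages, e.entity
  FROM per_domain AS d
  LEFT JOIN `project.dataset.entity_domains` AS e  -- assumed export of the entity.js mappings
    ON d.domain = e.domain                         -- simplified: real matching covers subdomains
),
per_entity AS (
  SELECT entity, SUM(observed_pages) AS entity_pages
  FROM mapped
  WHERE entity IS NOT NULL
  GROUP BY entity
)
SELECT m.domain, m.observed_pages
FROM mapped AS m
LEFT JOIN per_entity AS p USING (entity)
WHERE m.observed_pages >= 50
   OR COALESCE(p.entity_pages, 0) >= 50;
```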
The rest of the flow remains the same. It may be optimized in the future.
We don't rewrite existing data; instead, we compute fresh data from the July 2024 HTTP Archive crawl with the new query.
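As a final hedged illustration, scoping the computation to the 2024_07_01 crawl might look like this against the HTTP Archive's public BigQuery dataset; the exact table and filters used by the repo's scripts may differ:

```sql
-- Hypothetical example of pointing the same aggregation at the July 2024 crawl.
-- `httparchive.all.requests` and its `date`/`client` columns are assumptions
-- about the public dataset, not something this PR pins down.
SELECT
  NET.HOST(url) AS domain,
  COUNT(DISTINCT page) AS observed_pages
FROM `httparchive.all.requests`
WHERE date = DATE '2024-07-01'
  AND client = 'mobile'   -- assumed client filter, for illustration only
GROUP BY domain;
```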