github archive format changed

noticed that my previous sql results don't include tensorflow. weird, so i looked into it and found:

the old githubarchive:github.timeline table seems to be deprecated.

and the new dataset is very large, probably about 2 TB.

ran a simple query:

SELECT * FROM [githubarchive:year.2015] WHERE type="WatchEvent" LIMIT 1

bytes processed: 488 GB

oh my!

this is better:

SELECT repo_name, count(*) FROM [githubarchive:month.201601] WHERE type="WatchEvent" group by 1 order by 2 desc;

that is only 1 month of data, but it still processes about 1 GB. very expensive.
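why the first query is so heavy: bigquery stores tables by column and bills on-demand queries by the bytes in every column the query reads, so SELECT * over a whole year reads everything and LIMIT 1 does not shrink the scan. to put the two "bytes processed" numbers side by side, here is a rough sketch in python. the price per TiB is an assumption (a placeholder, not a quoted rate), and the function name is made up for this note; plug in whatever the current pricing page says.

```python
# Rough on-demand cost estimate from BigQuery's "bytes processed" figure.
# ASSUMPTION: the default price below is a placeholder, not an official rate.

GIB = 1024 ** 3  # bytes in one GiB
TIB = 1024 ** 4  # bytes in one TiB


def estimate_cost_usd(bytes_processed: int, price_per_tib_usd: float = 5.0) -> float:
    """Return the approximate cost in USD for scanning this many bytes."""
    return bytes_processed / TIB * price_per_tib_usd


if __name__ == "__main__":
    # The two scans reported above: a full year with SELECT *, and one month
    # with only the columns needed for the star count.
    for label, size in [("year, SELECT *", 488 * GIB), ("month, 2 columns", 1 * GIB)]:
        print(f"{label}: ~${estimate_cost_usd(size):.4f}")
```

under that assumed price the year-wide SELECT * comes out a few hundred times more costly than the monthly aggregate, which matches the 488 GB vs 1 GB gap.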

