spark

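I've heard about this thing on at least two occasions recently.

I had actually heard of it before, but never had a chance to use it, so I didn't pay attention.

I still don't have a chance to use it now, because g has its own stack. But now I'm curious.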

http://spark.apache.org/

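My first impression was that it's a map reduce engine or library.

But right after that I saw that on top of core spark there are also SQL, Streaming, ML, and Graph libraries. That makes it far more usable than plain map reduce. It's a bit like the difference between a database and a file system, or between an OS and the hardware.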

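No wonder someone from my hometown told me the other day that several thousand people attended the spark conference. This is no small-time affair.

The open source movement really is a boon for all software companies and developers.

And it feels like many of g's technical advantages have basically been matched or even surpassed by open source.

These days you can use tools this good even without working at a company like g or facebook. What a happy time to be a developer.

The problem now, though, is that small developers probably still can't afford the hardware that big data computation requires. You have to be at a company that has money and is willing to spend it to play with this stuff.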

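Odds and ends

Last night I saw Teacher Ye for the first time in ages. It has been about seven years, and nothing has changed.

I thought it would just be dinner with Teacher Ye, but a lot of other people turned up too.

I met a group of dd's senior engineering executives, including the CTO. They talked about a lot of real-world challenges, which was really fascinating.

Do fortunes really take turns? From google to facebook, from facebook to uber, dd, airbnb. Smartphones really have spawned a bunch of new industries.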

github archive format changed

noticed that my previous sql results don't include tensorflow, which was weird, and found that

the old githubarchive:github.timeline seems to be deprecated.

and the new dataset is very large, probably about 2TB.

ran a simple command

SELECT * FROM [githubarchive:year.2015] WHERE type="WatchEvent" LIMIT 1

bytes processed: 488 GB

oh my!

this is better:

SELECT repo_name, COUNT(*) FROM [githubarchive:month.201601] WHERE type="WatchEvent" GROUP BY 1 ORDER BY 2 DESC;

only one month of data, but it still processes about 1GB. very expensive.

c++ assignment operator and temporary variable

what happens here?

const string& a = SomeStringFunc(…);

http://www.cplusplus.com/articles/y8hv0pDG/

http://stackoverflow.com/questions/7035793/c-reference-pointing-to-a-temporary-variable

There is a small clause in the C++ standard that says that non-const references cannot bind to temporary objects. A temporary object is an instance of an object that does not have a variable name.

A temporary cannot be bound to a non-const reference, but it can be bound to const reference.
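
To answer the question above for myself: when a const reference is bound directly to a temporary returned by value, the temporary's lifetime is extended to that of the reference, so it is safe to use inside that scope. Here is a minimal sketch to check this, assuming the function returns by value. Tracked, MakeTracked, LiveWhileBound and LiveAfterScope are made-up names for illustration, with MakeTracked standing in for SomeStringFunc.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper type: counts how many instances are currently alive.
struct Tracked {
    static int live;
    std::string value;
    explicit Tracked(const std::string& v) : value(v) { ++live; }
    Tracked(const Tracked& other) : value(other.value) { ++live; }
    ~Tracked() { --live; }
};
int Tracked::live = 0;

// Hypothetical stand-in for SomeStringFunc: returns a temporary by value.
Tracked MakeTracked() { return Tracked("hello"); }

// Number of live objects while a const reference is bound to the temporary.
int LiveWhileBound() {
    const Tracked& ref = MakeTracked();  // temporary lives as long as ref does
    // Tracked& bad = MakeTracked();     // ill-formed: non-const reference to a temporary
    return ref.value == "hello" ? Tracked::live : -1;
}

// Number of live objects once the reference has gone out of scope.
int LiveAfterScope() {
    {
        const Tracked& ref = MakeTracked();
        assert(ref.value == "hello");    // still valid inside the scope
    }
    return Tracked::live;                // temporary was destroyed with ref
}
```

Calling LiveWhileBound() should report one live object and LiveAfterScope() should report none. The extension only applies to a reference bound directly to the temporary; returning that reference out of the function would still dangle.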