Elasticsearch/Logstash/Kibana for visualizing github watch events data


References:

1. USING LOGSTASH, ELASTICSEARCH AND KIBANA TO MONITOR YOUR VIDEO CARD – A TUTORIAL

2. USING LOGSTASH TO IMPORT CSV FILES INTO ELASTICSEARCH

3. logstash date reference

4. elasticsearch mappings

5. elasticsearch templates

6. logstash introduction

7. google bigquery

8. elasticsearch aggregation and analysis

Basic Ideas:

1. use elasticsearch for data storage and restful api. use elastichq.org to see the status and manage indexes.

2. use kibana for data dashboard. kibana is purely html/css/js.

3. use logstash for importing csv data to elasticsearch // another option is fluentd

4. use google bigquery and google storage to query and export github watch events data

Key Steps:

1. get data. query github watch events data (for a particular month) from bigquery using the code in this gist, then export the results (very painful). remember to choose GZIP instead of csv. the data used here is from Aug 2014.

2. create the elasticsearch mappings: run create_mappings.sh (see gist).

3. create a logstash config file and import the data with logstash (also see gist for the config file). open the csv file and save it with Ctrl+S to trigger the data import.

4. in elastichq.org, check elasticsearch data updates

5. in kibana, click the logstash template, then click the settings icon at the top. in the Index tab, set the pattern to [github-watch-]YYYY.MM.DD, and set the time range (at the top) from Aug 01 2014 to now. the results will then show up.

6. in kibana, add a table, select terms, and count with field repo_url.raw. this table shows the top 10 most popular repositories for the selected time range, queries and other filters.
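
the top 10 table in step 6 asks a question that can also be put to elasticsearch's search api directly. below is a minimal sketch of such a request body as a terms aggregation, assuming an elasticsearch 1.x-era cluster (the `filtered` query used here was replaced in later versions by `bool` with a `filter` clause). the function name `top_repos_query` and the aggregation name `top_repos` are made up for illustration, and this is not necessarily the request kibana itself sends.

```python
import json
from datetime import date


def top_repos_query(start, end, size=10, field="repo_url.raw"):
    """Build a search body that counts documents per `field` value in [start, end)."""
    return {
        "size": 0,  # only the aggregation buckets are wanted, not the matching hits
        "query": {
            "filtered": {  # assumption: elasticsearch 1.x query syntax
                "filter": {
                    "range": {
                        "@timestamp": {
                            "gte": start.isoformat(),
                            "lt": end.isoformat(),
                        }
                    }
                }
            }
        },
        "aggs": {
            # "top_repos" is a hypothetical name; any label works here
            "top_repos": {"terms": {"field": field, "size": size}}
        },
    }


# Aug 2014, matching the data set described above
body = top_repos_query(date(2014, 8, 1), date(2014, 9, 1))
print(json.dumps(body, indent=2))
```

the printed json could be sent as the body of a search request against the daily indexes (for example an index pattern like github-watch-*). counting on repo_url.raw rather than repo_url matters because only the raw field keeps the full url as a single term.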

Caveats/Learnings:

1. elasticsearch tokenizes strings by default, so jquery-mobile will be split into jquery and mobile. add a raw field with a not_analyzed index to the mappings so you can count by full url instead of by words.

2. the date filter in the logstash config file is important. it uses the date in the csv file for the @timestamp field of each item.

3. in the logstash config file, index => "github-watch-%{+YYYY.MM.dd}" is good practice although it’s not mandatory. it means the data is split into a separate index per day, and the split is based on the @timestamp field in the data, not the system date.

4. i still don’t know how to skip the first row in the csv file.

5. remember to select time range and index in kibana.

6. filters are faster than queries, according to the elasticsearch docs. filters are also powerful for specifying complex conditions. they work like where in sql.

7. you can add multiple query terms and the counts will stack up in the histogram plot.
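
caveats 1 and 3 are easier to see with a toy example. the sketch below is plain python and does not talk to elasticsearch. `crude_tokens` is only a rough stand-in for an analyzer (not elasticsearch's actual tokenizer), `daily_index` mimics what the %{+YYYY.MM.dd} pattern does with each event's own timestamp, and the sample rows are invented.

```python
import re
from collections import Counter
from datetime import datetime


def crude_tokens(text):
    """Rough stand-in for an analyzer: lowercase, then split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def daily_index(event_time, prefix="github-watch-"):
    """Index name derived from the event's timestamp, not from the system clock."""
    return prefix + event_time.strftime("%Y.%m.%d")


# hypothetical sample rows: (repository name, time of the watch event)
rows = [
    ("jquery-mobile", datetime(2014, 8, 1, 9, 30)),
    ("jquery-mobile", datetime(2014, 8, 2, 14, 0)),
    ("jquery", datetime(2014, 8, 2, 16, 45)),
]

# counting on an analyzed field: the hyphenated name is counted as two words
analyzed = Counter(tok for name, _ in rows for tok in crude_tokens(name))

# counting on a raw (not_analyzed) field: each full value is one term
raw = Counter(name for name, _ in rows)

# each row lands in the index for its own day
indexes = sorted({daily_index(ts) for _, ts in rows})

print(analyzed.most_common())
print(raw.most_common())
print(indexes)
```

in the analyzed counts, jquery gets credit for events that belong to jquery-mobile, which is exactly the distortion the raw field avoids. the index list shows the three rows spread over two daily indexes based on their event dates, regardless of when the import runs.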
