
OrientDB and Scala - Getting started

In this article series, I’m going to introduce you to the OrientDB database, show its strengths and weaknesses, and explain how to use it with Scala.

OrientDB

What is OrientDB?

OrientDB is a multi-model database. What does that mean? It means you can use OrientDB in a few different modes: as a document database (like MongoDB), a graph database (like Neo4j or Titan) or an object database (like Db4o or ObjectDB). I will discuss these modes in more detail later. Here I just want to point out that this breadth of features makes OrientDB a strong and versatile storage choice.

What does OrientDB have to offer?
- it’s fully written in Java, so you can run it almost everywhere,
- it’s fully transactional and supports ACID transactions,
- it’s free, even for commercial use,
- you can run it in distributed mode (full support for multi-master replication, including geographically distributed clusters),
- it’s fast: almost 10k GET operations per second,
- it’s small: a full server has a footprint of about 1 MB,
- you can run it in embedded mode or in memory (which is useful for development, testing or small standalone apps),
- you can buy support or additional tools for profiling or analysis,
- it can be schema-full, schema-less or even schema-mixed,
- and more…

That was the big picture. Time for more details: let’s install it.

Installation

As mentioned above, OrientDB is written in Java, so installing it shouldn’t be a problem. The website offers prepared packages for the most popular operating systems, along with a universal one.

OrientDB Download Icons

Download the version prepared for your OS or the universal package. As you can see, there are JDBC drivers as well as drivers for languages other than Java.

I decided to download and unpack the universal package. Let’s take a look at the contents of the bundle.

Orient directory content

  • benchmarks - if you don’t trust official statistics you can measure the performance on your own. All the tools you need are inside this folder.
  • bin - here you can find the run scripts; we will come back here later,
  • config - the place for config files,
  • databases - the database files are stored in this folder. You can change the directory in the config,
  • lib - all dependencies required by the database,
  • log - the default logs folder,
  • plugins - yep, that’s true. You can extend this database,
  • www - OrientDB Studio; more about this in a moment.

The bin folder contains scripts for *nix-based systems along with Windows ones:
- server - the most important script. You use it to start the OrientDB server. If you run it for the first time, it asks you for the root password unless you have already defined it manually by editing the config file.
- dserver - runs the server in distributed mode (as one node of the group).
- orientdb and shutdown - allow you to install the DB as a *nix service. Just edit the first of them and put in the path to the install directory.
- oetl - a tool for ETL processing: moving data from or to OrientDB.
- backup - a simple script for backing up your database files.
- console - you can use it to connect to the server and manipulate the data.
- gremlin - similar to the above: you can connect to the server and manipulate data, but in this case you use the Gremlin language.

The config directory contains all configuration files. The most important one is orientdb-server-config.xml. Yep, it’s an XML file; don’t forget OrientDB lives in the Java world ;). I don’t like working on the root account, so I want to add another one, and editing the config file is the best way to do it smoothly. Let’s find the <users> section (close to the end of the file) and add one entry:

config file - users section

As you can see, I created a mario account with the password pass (use whatever credentials you want). I also granted access to all resources (resources="*"), which is typical for a root user. If you want to limit access, just list the allowed resource types separated by commas. The available resources are:
- info-server, to obtain statistics about the server
- database.create, to create a new database
- database.exists, to check if a database exists
- database.delete, to delete an existing database
- database.share, to share a database with another OrientDB Server node
- database.passthrough, to access the hosted databases without the database’s authentication
- server.config.get, to retrieve a configuration setting value
- server.config.set, to set a configuration setting value

Another way to add a new user is to use the console.

Just before the <users> section you can find <storages>. There you can define your own database, as in the example from the official documentation:

storage config

In fact, you don’t have to define a database here; by default the server allows access to every database in its databases folder. Later we will use OrientDB Studio to create our own.

One more thing worth mentioning here is that you can decide which storage engine is used. You can choose between the memory engine and plocal, which stores the database in the filesystem. The first one is good for development, but you will lose all data after shutdown.

Time to fire it up!

Run the server.

We have already added our user, so we can run the server by executing server.sh (or server.bat if you’re a Windows user). The server welcomes us with nice ASCII art of the database logo. By default, all information is printed to the console. To stop the server, press Ctrl-C, as in most console programs.

Bootserver

The boot logs show the active listeners and plugins. OrientDB uses two standard listeners:
- binary - used by the console application and OrientDB clients; it listens on port 2424 by default,
- HTTP - used by OrientDB Studio or any RESTful HTTP client; it listens on port 2480 by default.

The server picks the first available port from the range defined in the configuration, so check the logs to be sure which one is used.
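The "first available port from a range" rule can be sketched in a few lines of Python. This is only an illustration of the idea, not the server’s implementation: the helper names are made up, the "low-high" range format and the value 2424-2430 are illustrative, and the is_free parameter exists only so the logic can be tried without touching real sockets.

```python
import socket
from typing import Callable, Optional


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Probe a port by trying to bind a throwaway socket to it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(port_range: str,
                    is_free: Callable[[int], bool] = port_is_free) -> Optional[int]:
    """Return the first free port in a 'low-high' range, or None if all are taken."""
    low, high = (int(part) for part in port_range.split("-"))
    for port in range(low, high + 1):
        if is_free(port):
            return port
    return None


# Pretend 2424 is already occupied by another server instance.
print(first_free_port("2424-2430", is_free=lambda port: port != 2424))  # 2425
```

This is also why a second server started on the same machine ends up on a different port than the default one.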

The server is running and we can connect to it. We have a few convenient ways to do it:

  • console
  • gremlin console
  • API
  • OrientDB Studio

I will try to describe all of them (in shorter or longer form).

OrientDB Studio - webconsole

After OrientDB has launched, you can use OrientDB Studio, which is available by default at http://localhost:2480

Webconsole login

There is only one database available at this point: “GratefulDeadConcerts”, a sample database with relations between musicians and their concerts. There is also the possibility to import another one from the public databases available online, e.g. one about the relations between beers and breweries. We won’t use prepared databases. Instead, we will create a new one and fill it with simple data. To do so, click (unsurprisingly) the “NewDB” button. All we need to fill in is the database name and the database user with password (the ones we defined in the configuration file). So I will use ourDB, mario and pass respectively.

Create database

Clicking “Create database” will automatically log us in to the new database. OrientDB Studio is a very powerful tool, especially for users who aren’t comfortable with console applications. It lets us manage the data in the database and its schema, and it provides a lot of other information. There are also two tools enabled only after paying for enterprise support: “Profiler” and “Auditing”. The rest is free to use.

In short, here is what’s going on there. “Browse” is a console itself, a tool similar to the console application that we can run from the terminal. In the console we use an SQL-like syntax, so Orient is easy to switch to from relational databases. We’ll come back to the console later.

The next tab is “Schema”. Yep, we can define a schema for our data and shape every model we want to store in the database. In fact we don’t need to, or we can work in mixed mode: we can define a set of required fields while still allowing some objects to be enriched with additional data. On the “Schema” page there are already a couple of classes defined. All of them are system classes used internally, except V and E, which are the root classes for storing data. V means Vertex and E is for Edge. You’ll find more about this in the Simple Graph Theory section. What you can see here is that every class can have a superclass. You can also find a lot of information about clustering, which deserves a few more words now.

By default, when you create a class, one cluster is also created for it. It’s a place for storing data, a bit like a table in an RDBMS, but here you can have more than one cluster per class. For example, if you have a class User, you can define a separate cluster for users from the US, another one for users from the EU and another for the rest. This way you are able to fetch users only from a selected cluster, or all of them if you wish.
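If the class/cluster relation sounds abstract, the following Python sketch models it with plain dictionaries. It is a toy model only: ClassStorage and the cluster names user_us, user_eu and user_other are hypothetical and have nothing to do with how OrientDB stores data internally.

```python
from typing import Dict, Iterable, List, Optional


class ClassStorage:
    """A class whose records are spread across several named clusters."""

    def __init__(self, name: str, clusters: Iterable[str]) -> None:
        self.name = name
        self.clusters: Dict[str, List[dict]] = {c: [] for c in clusters}

    def insert(self, record: dict, cluster: str) -> None:
        """Store a record in the chosen cluster of this class."""
        self.clusters[cluster].append(record)

    def select(self, cluster: Optional[str] = None) -> List[dict]:
        """Read one cluster, or every cluster of the class when none is given."""
        if cluster is not None:
            return list(self.clusters[cluster])
        return [r for records in self.clusters.values() for r in records]


users = ClassStorage("User", ["user_us", "user_eu", "user_other"])
users.insert({"name": "John"}, "user_us")
users.insert({"name": "Alice"}, "user_eu")

print(users.select("user_eu"))   # only the EU cluster: [{'name': 'Alice'}]
print(len(users.select()))       # the whole class, all clusters together: 2
```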

Let’s go back to our “Schema” section. If you click on any class, you can see the properties and indexes for the class. Here you’re able to define your own if you want. In fact there are a few properties you won’t see here: system properties. Every class has embedded @rid, @version and @class properties, which are hidden in this view. The RID is an identity built from the cluster number and the position in this cluster (like #1:12). As you can see, you can shape your data without a single line of code and expose it through a RESTful service to the frontend. It’s a powerful feature.
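Since RIDs show up everywhere later on, here is a small Python sketch that splits a RID such as #1:12 into its two parts. The names RecordId and parse_rid are mine, and the sketch only handles the plain #cluster:position form shown above.

```python
import re
from typing import NamedTuple


class RecordId(NamedTuple):
    cluster: int    # number of the cluster the record lives in
    position: int   # position of the record inside that cluster

    def __str__(self) -> str:
        return f"#{self.cluster}:{self.position}"


_RID_PATTERN = re.compile(r"#(\d+):(\d+)")


def parse_rid(text: str) -> RecordId:
    """Parse '#<cluster>:<position>' into its two numeric parts."""
    match = _RID_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a record id: {text!r}")
    return RecordId(int(match.group(1)), int(match.group(2)))


rid = parse_rid("#1:12")
print(rid.cluster, rid.position)   # 1 12
print(rid)                         # #1:12
```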

The next sections of Studio in short: “Security” is self-explanatory; you can manage privileges for this database. It’s worth mentioning that Orient’s permission system is fine-grained: you can manage permissions for clusters or classes in general, and you can also set permissions for a query, command, function or even for a particular record. If you want to learn more about this, you can look into the documentation.

In the “Graph” section you can find a visualization of your data. If you’re still in our empty database, you will see a blank page, though. In “Functions” you can define your own functions, but for better performance I recommend using the native API for that. The last navigation tab is “DB”, where you can find information about the database itself.

Now let’s go back to the V and E classes.

Simple Graph Theory

If you have at least basic knowledge about graphs you can skip this section.

If you think you can store only images or graphics inside a graph database, you’re wrong :) A graph database is a database that uses graph structures to represent and store data. In relational databases you store the data inside tables. Every row in such a table is a record (entity) and every cell in the row is a property. Even relations are nothing more than properties of one of the entities. If you define a more complicated relation model, you have to store it inside another table. A table is also rigidly structured: you cannot add anything beyond what is defined. On the other hand, empty properties are still kept in the database and take up space on disk. Graph databases don’t have those limitations.

Graph databases operate on 3 structures: Vertex (sometimes called Node), Edge (or Arc) and Property (sometimes called Attribute).

  • Vertex - it’s used to represent your data: User, Post or Invoice are good examples. In general, graph databases don’t care what type of data they’re representing.

  • Edge - a physical relation between vertices. Each edge connects exactly two vertices, no more, no less. Additionally, an edge has a label and a direction, so if you label your edge as likes, you know that Peter likes pizza, not that pizza likes Peter. From a technical perspective, the direction of a relationship is either outgoing or incoming. One more thing about edges: these relations are physical, so you can fetch them filtered by their properties, the same as vertices.

  • Property - a value related to a Vertex or an Edge. You can add properties to both previous structures: a name to a User, or a mileage to a highway edge.
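The three structures fit into a few lines of Python. This is just a minimal in-memory sketch to show how they relate to each other; the class and method names are my own, and the since property on the edge is a made-up example value.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Vertex:
    label: str
    properties: Dict[str, object] = field(default_factory=dict)


@dataclass
class Edge:
    label: str
    source: Vertex   # the vertex the edge goes out of
    target: Vertex   # the vertex the edge comes into
    properties: Dict[str, object] = field(default_factory=dict)


class Graph:
    def __init__(self) -> None:
        self.edges: List[Edge] = []

    def connect(self, source: Vertex, label: str, target: Vertex, **properties) -> Edge:
        """Create a directed, labelled edge with optional properties."""
        edge = Edge(label, source, target, dict(properties))
        self.edges.append(edge)
        return edge

    def out(self, vertex: Vertex, label: str) -> List[Vertex]:
        """Vertices reached by following outgoing edges with the given label."""
        return [e.target for e in self.edges if e.source is vertex and e.label == label]

    def incoming(self, vertex: Vertex, label: str) -> List[Vertex]:
        """Vertices that point at `vertex` through edges with the given label."""
        return [e.source for e in self.edges if e.target is vertex and e.label == label]


graph = Graph()
peter = Vertex("User", {"name": "Peter"})
pizza = Vertex("Food", {"name": "Pizza"})
graph.connect(peter, "likes", pizza, since=2015)

print([v.properties["name"] for v in graph.out(peter, "likes")])       # ['Pizza']
print(graph.out(pizza, "likes"))                                        # []
print([v.properties["name"] for v in graph.incoming(pizza, "likes")])  # ['Peter']
```

The direction matters: following likes out of Peter leads to pizza, while following it out of pizza leads nowhere.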

Graph example

What can a Graph DB be used for?

Because relations in graphs are stored on disk, every use case that relies heavily on relations is a perfect fit for this kind of DB. Social networks and movie/music/people recommendations are just a few good examples. The power of graphs lies in the efficient retrieval of related data. For a relational DB it means lots of slow joins, whereas graphs just traverse through connected structures. It could be a nightmare if somebody asked you to prepare a query which finds all friends of your friends of your friends and lists all the restaurants they like. In graphs, traversing data in this way is simple and fast.
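To show why this kind of question is natural for a graph, here is a rough Python sketch of the "friends of friends of friends and the restaurants they like" query over plain adjacency sets. All names and data are invented for the example, and a real graph database does this on its stored structures rather than on in-memory dictionaries.

```python
from typing import Dict, Set


def within_hops(friends: Dict[str, Set[str]], start: str, hops: int) -> Set[str]:
    """People reachable from `start` in at most `hops` friendship steps."""
    seen = {start}
    frontier = {start}
    for _ in range(hops):
        # Step to the neighbours of the current frontier, skipping known people.
        frontier = {n for person in frontier for n in friends.get(person, set())} - seen
        seen |= frontier
    return seen - {start}


def liked_restaurants(friends: Dict[str, Set[str]], likes: Dict[str, Set[str]],
                      start: str, hops: int = 3) -> Set[str]:
    """Restaurants liked by anyone within `hops` steps of `start`."""
    return {r for person in within_hops(friends, start, hops)
            for r in likes.get(person, set())}


friends = {
    "me": {"ann"},
    "ann": {"me", "ben"},
    "ben": {"ann", "cid"},
    "cid": {"ben", "dan"},
    "dan": {"cid"},
}
likes = {"ann": {"Trattoria"}, "cid": {"Sushi Bar"}, "dan": {"Grill"}}

print(sorted(within_hops(friends, "me", 3)))               # ['ann', 'ben', 'cid']
print(sorted(liked_restaurants(friends, likes, "me", 3)))  # ['Sushi Bar', 'Trattoria']
```

Each hop only touches the neighbours of the people found in the previous hop, which is the graph equivalent of what a relational database would express as one more self-join.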

Additionally, designing a data model is easier and more intuitive. If you take all nouns as vertices, verbs as edges and adverbs as properties, you can easily convert case study scenarios into a ready-to-use data model.

More Details about OrientDB

At its core, OrientDB is a document database. This engine is the base of the whole structure. The graph database is built on top of it, which costs a little efficiency (about 10%). It’s a small price for such a big advantage. The object database mentioned at the beginning of the article is also built on top of the document one. But serializing/deserializing every object into a document isn’t that efficient, and it works about half as fast as the main engine. In fact, the object database is less popular, so I won’t mention it anymore.

Some of the basic concepts of OrientDB:

Document - the main unit that you can load and store in the database. You can define a schema for documents, but you aren’t forced to do so. Documents are very flexible and easily convertible to/from other formats like JSON.
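The schema-mixed idea for documents can be illustrated with a tiny Python validator: declared fields are checked, everything else is accepted as-is. This is a sketch under my own assumptions (the function name validate, and the rule that every declared field is mandatory), not a description of how OrientDB validates documents.

```python
from typing import Dict, List


def validate(document: dict, required: Dict[str, type]) -> List[str]:
    """Return a list of problems; an empty list means the document is accepted."""
    problems = []
    for name, expected in required.items():
        if name not in document:
            problems.append(f"missing required field '{name}'")
        elif not isinstance(document[name], expected):
            problems.append(f"field '{name}' should be {expected.__name__}")
    # Fields outside `required` are deliberately not checked (schema-mixed mode).
    return problems


user_schema = {"name": str}

print(validate({"name": "Bob", "alias": "Big nose"}, user_schema))  # []
print(validate({"alias": "Big nose"}, user_schema))                 # ["missing required field 'name'"]
```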

Vertex - built on top of a document. Even from the vertex model you have access to the underlying document. A vertex stores information for the database. It represents the entities of your data model.

Edge - connects two vertices. It represents a relationship between data. Edges are stored in the database as documents, but OrientDB also provides Lightweight Edges, which are not. The price is that lightweight edges cannot hold labels or properties; such an edge is like an anonymous link between two vertices.

ORID - the RecordId consists of a cluster ID and a position inside this cluster. Every unit stored in the database has such an ID.

Version - every record has a version property. It’s read-only from the user’s perspective and is used instead of locking, so editing one record by multiple clients in a concurrent environment is easy to achieve.
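The version-instead-of-locking idea is usually called optimistic concurrency control. The Python sketch below shows the principle on an in-memory store; VersionedStore and ConcurrentModificationError are names I made up for the illustration, and the real database of course does much more than this.

```python
class ConcurrentModificationError(Exception):
    """Raised when a record changed since the caller loaded it."""


class VersionedStore:
    def __init__(self):
        self._records = {}  # rid -> (version, fields)

    def create(self, rid, fields):
        self._records[rid] = (1, dict(fields))

    def load(self, rid):
        version, fields = self._records[rid]
        return version, dict(fields)

    def update(self, rid, loaded_version, fields):
        """Write only if nobody else has updated the record in the meantime."""
        current_version, _ = self._records[rid]
        if loaded_version != current_version:
            raise ConcurrentModificationError(
                f"{rid}: loaded v{loaded_version}, database has v{current_version}")
        self._records[rid] = (current_version + 1, dict(fields))
        return current_version + 1


store = VersionedStore()
store.create("#12:1", {"name": "Bob"})

version_a, doc_a = store.load("#12:1")   # client A reads v1
version_b, doc_b = store.load("#12:1")   # client B reads v1 as well

store.update("#12:1", version_a, {**doc_a, "alias": "Big nose"})   # succeeds, record is now v2
try:
    store.update("#12:1", version_b, {**doc_b, "name": "Robert"})  # still holds stale v1
except ConcurrentModificationError as error:
    print("retry needed:", error)
```

Nobody waits for a lock here; the client that lost the race simply reloads the record and tries again.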

Class - defines your own data model. With classes you can look at the data model from an object-oriented perspective. OrientDB provides class inheritance, and you will probably use it to extend the V or E classes which are already in the DB.

Every class has its own cluster (at least one), a place to store its data.

Embedded relationships - although there is the concept of an edge for keeping relationships, you can also embed one record inside another. In such a case the embedded record doesn’t have its own ORID and cannot be fetched by queries directly.

Playing with data.

That’s enough theory, let’s play a little with data.

I will use the console, but you can do the same in OrientDB Studio.

$ORIENT_HOME/bin/console.sh
orientdb >

To connect, use the connect command. Before we do that, a few words of explanation. If you want to use a URL to access the database, you have to use one of the three available prefixes: memory, remote or plocal. What does that mean? The prefix tells where the database is located. memory:test will create an in-memory database named test, and all data will be lost after the connection is closed. remote:localhost/test will try to connect to the server and find a database named test; where the data is stored is managed by the server. The last option, plocal:/databases/test, will try to connect to the database directly, which succeeds if you have already created the database in that directory.
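The structure of these URLs is simple enough to take apart by hand. The Python sketch below splits a URL into the engine prefix and the location and derives the database name from the last path segment. The helper names are mine, and treating the last segment as the database name is an assumption that matches the three examples above.

```python
from typing import NamedTuple

ENGINES = ("memory", "remote", "plocal")


class DatabaseUrl(NamedTuple):
    engine: str     # memory, remote or plocal
    location: str   # everything after the first colon


def parse_database_url(url: str) -> DatabaseUrl:
    """Split '<engine>:<location>' and reject unknown engine prefixes."""
    engine, separator, location = url.partition(":")
    if not separator or engine not in ENGINES or not location:
        raise ValueError(f"expected <memory|remote|plocal>:<location>, got {url!r}")
    return DatabaseUrl(engine, location)


def database_name(url: str) -> str:
    """The last path segment of the location is taken as the database name."""
    return parse_database_url(url).location.rstrip("/").rsplit("/", 1)[-1]


for url in ("memory:test", "remote:localhost/test", "plocal:/databases/test"):
    parsed = parse_database_url(url)
    print(parsed.engine, parsed.location, database_name(url))
```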

Because we have already created the database and it’s managed by the server, let’s connect to it:

> connect remote:localhost/ourDB mario pass
orientdb {db=ourDB}>

Don’t forget that mario and pass are the credentials of the user we created. We are ready to work.
Let’s start by creating vertices:

> create class User extends V
Class created successfully. Total classes in database now: 12

and define that every User should have a name:

> create Property User.name string
Property created successfully with id=1

now we can add users

> create vertex User set name='John'
Created vertex 'User#12:0{name:John} v1' in 0,015000 sec(s).

and easily add properties to them:

> create vertex User set name='Bob', alias='Big nose'
Created vertex 'User#12:1{name:Bob,alias:Big nose} v1' in 0,002000 sec(s).

Our Big nose works for John, so we’ll create a works_for relation. An edge type is created in the same way as a vertex type.

> create class works_for extends E
Class created successfully. Total classes in database now: 13

time to set this relation:

> create edge works_for from #12:1 to #12:0
Created edge '[works_for#13:0{out:#12:1,in:#12:0} v1]' in 0,012000 sec(s).

Let’s check which users are in the database:

> select from User
----+-----+------+----+------------+--------+-------------
#   |@RID |@CLASS|name|in_works_for|alias   |out_works_for
----+-----+------+----+------------+--------+-------------
0   |#12:0|User  |John|[size=1]    |null    |null
1   |#12:1|User  |Bob |null        |Big nose|[size=1]
----+-----+------+----+------------+--------+-------------

Let’s find our Big nose, but suppose I forgot whether that was his name, surname or nick. In such a case we can use the any() filter:

> select from user where any() like 'Big nose'
----+-----+------+----+--------+-------------
#   |@RID |@CLASS|name|alias   |out_works_for
----+-----+------+----+--------+-------------
0   |#12:1|User  |Bob |Big nose|[size=1]
----+-----+------+----+--------+-------------
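In other words, any() checks the condition against every field of the record and keeps the record if at least one field matches. A rough Python equivalent is sketched below; the helper names are mine, and I only model the % wildcard of like and assume case-sensitive matching.

```python
import re


def like(value, pattern: str) -> bool:
    """SQL-style LIKE: '%' matches any run of characters, the rest is literal."""
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, str(value), flags=re.DOTALL) is not None


def any_like(document: dict, pattern: str) -> bool:
    """True if at least one field value of the document matches the pattern."""
    return any(like(value, pattern) for value in document.values())


users = [{"name": "John"}, {"name": "Bob", "alias": "Big nose"}]

print([u["name"] for u in users if any_like(u, "Big nose")])  # ['Bob']
print([u["name"] for u in users if any_like(u, "Jo%")])       # ['John']
```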

now find who he works for:

> select expand(out('works_for')) from user where any() like 'Big nose'
----+-----+------+----+------------
#   |@RID |@CLASS|name|in_works_for
----+-----+------+----+------------
0   |#12:0|User  |John|[size=1]
----+-----+------+----+------------

More complex example:

Let’s add 3 more people

> create vertex User set name='Alice'
Created vertex 'User#12:2{name:Alice} v1' in 0,002000 sec(s).
> create vertex User set name='Daniel'
Created vertex 'User#12:3{name:Daniel} v1' in 0,002000 sec(s).
> create vertex User set name='Kate'
Created vertex 'User#12:4{name:Kate} v1' in 0,001000 sec(s).

add a new relation ‘knows’

> create class knows extends E
Class created successfully. Total classes in database now: 13

and relate friends

> create edge knows from #12:0 to #12:1
> create edge knows from #12:0 to #12:2
> create edge knows from #12:0 to #12:4
> create edge knows from #12:3 to #12:1
> create edge knows from #12:3 to #12:2

Let’s create some food data

> create class Food extends V
> create Property Food.name string
> create class likes extends E
> create vertex Food set name='Pizza'
> create vertex Food set name='Sushi'
> create vertex Food set name='Burger'

if it’s hard to set relations by RID, you can use subqueries for that

> create edge likes from ( select from User where name='John') to (select from Food where name='Pizza')
> create edge likes from ( select from User where name='Bob') to (select from Food where name='Pizza')
> create edge likes from ( select from User where name='Alice') to (select from Food where name='Sushi')
> create edge likes from ( select from User where name='Alice') to (select from Food where name='Burger')
> create edge likes from ( select from User where name='Kate') to (select from Food where name='Burger')
> create edge likes from ( select from User where name='Daniel') to (select from Food where name='Sushi')

Now our graph should look similar to the following image.

Current Graph

This screenshot was made using the OrientDB Studio visualization tool.

Now queries: what is the food most liked by John’s friends?

> select expand( both('knows').out('likes')) from User where name = 'John'
----+-----+------+------+--------
#   |@RID |@CLASS|name  |in_likes
----+-----+------+------+--------
0   |#14:0|Food  |Pizza |[size=2]
1   |#14:0|Food  |Pizza |[size=2]
2   |#14:1|Food  |Sushi |[size=2]
3   |#14:2|Food  |Burger|[size=2]
4   |#14:2|Food  |Burger|[size=2]
----+-----+------+------+--------

or even aggregated?

> select name, count(*)  from (select expand( both('knows').out('likes')) from User where name = 'John') group by name order by count desc, name asc

----+------+------+-----
#   |@CLASS|name  |count
----+------+------+-----
0   |null  |Burger|2
1   |null  |Pizza |2
2   |null  |Sushi |1
----+------+------+-----

Now examples of traversing data:
Find Kate’s friends up to the 3rd level of relation, without displaying her direct friends on the list:

> select name, $depth, $path from (traverse both('knows') from #12:4 while $depth <= 3) where $depth > 1
----+------+------+------+----------------------------------------------------
#   |@CLASS|name  |$depth|$path
----+------+------+------+----------------------------------------------------
0   |null  |Bob   |2     |(#12:4).both[0](#12:0).both[0](#12:1)
1   |null  |Daniel|3     |(#12:4).both[0](#12:0).both[0](#12:1).both[2](#12:3)
2   |null  |Alice |2     |(#12:4).both[0](#12:0).both[2](#12:2)
----+------+------+------+----------------------------------------------------

In the result you can see the names of the friends of Kate’s friends, how long the relation path is and the entire path itself. I know that the examples above are easy and you may think a relational database would handle them just as fast, but trust me, it wouldn’t. If you prepare a very big database with relations as above and fetch a 10-friends-long path, it won’t be much slower, but a similar query can kill almost every RDBMS.
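If you wonder what the traverse above does step by step, the following Python sketch walks the same knows edges depth-first and records the depth at which each person is reached. It is a simplified model, not OrientDB’s algorithm: visiting outgoing edges before incoming ones and the depth-first order are assumptions I chose because they reproduce the row order shown above.

```python
from typing import Iterator, List, Tuple

# The 'knows' edges created earlier, written as (from, to) pairs.
KNOWS = [("John", "Bob"), ("John", "Alice"), ("John", "Kate"),
         ("Daniel", "Bob"), ("Daniel", "Alice")]


def both(edges: List[Tuple[str, str]], vertex: str) -> List[str]:
    """Neighbours over outgoing edges first, then over incoming ones."""
    outgoing = [target for source, target in edges if source == vertex]
    incoming = [source for source, target in edges if target == vertex]
    return outgoing + incoming


def traverse(edges: List[Tuple[str, str]], start: str,
             max_depth: int) -> Iterator[Tuple[str, int]]:
    """Depth-first walk that yields each vertex once, with the depth it was reached at."""
    visited = set()

    def visit(vertex: str, depth: int) -> Iterator[Tuple[str, int]]:
        visited.add(vertex)
        yield vertex, depth
        if depth < max_depth:
            for neighbour in both(edges, vertex):
                if neighbour not in visited:
                    yield from visit(neighbour, depth + 1)

    yield from visit(start, 0)


# Same filter as in the query: keep only depth > 1, i.e. skip Kate and her direct friends.
result = [(name, depth) for name, depth in traverse(KNOWS, "Kate", 3) if depth > 1]
print(result)  # [('Bob', 2), ('Daniel', 3), ('Alice', 2)]
```

Note that with a depth-first walk the reported depth is the depth of the path that reached the vertex first, which is not necessarily the shortest one.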

I know it’s only the tip of the iceberg when it comes to OrientDB, but I hope I have raised your curiosity and you will find some time to learn more about this wonderful database.

There are topics not covered in this first article which I plan to describe later:
- using the Java API from Scala code,
- TinkerPop Blueprints,
- TinkerPop Gremlin (including the Gremlin console and directly from Scala code),
- maybe something more… any advice in the comments is welcome :)

In the next article I will show you how to use OrientDB from Scala code.

Hungry to learn more?
- Official Documentation
- Udemy VideoCourse
- TinkerPop - the most known Graph Tools
- Solving Problems with Graphs - wonderful presentation


The picture in the Simple Graph Theory section is from http://www.w3.org/TR/rdf11-primer
