The English and Arabic corpus of the Holy Quran is a rich source for statistical analysis. For instance, the entire test corpus has half a million words and many thousands of distinct words. A rich dataset such as the Holy Quran therefore makes for an exciting journey of data exploration.

# Data Science

# Data mining 1.5 million tweets for Twitter sentiment analysis

The contents of this blog post are inherited from a short research project by Group 10 of the Information Retrieval and Data Mining module at University College London. Instead of letting it rot in my Dropbox, I decided to free the knowledge in the hope that someone finds it useful.

# How to open large text files (>5 GB) on a Mac?

A month ago, I downloaded a large dataset from Twitter. The .txt file consisted of around 1.5 million tweets in JSON and weighed in at 5.5 GB. I was working on a Sentiment Analysis project at the time, and I wanted to look at the structure of the JSON in order to design a parser for processing the tweets. When I attempted to open the file in Sublime Text 2, my powerful Mac just gave up.
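One way to inspect the structure of a file this size without opening it in an editor is to read only its first few characters or lines. The sketch below is an illustration, not the method from the full post: the function names `peek_chars` and `peek_lines` are hypothetical, and `peek_lines` assumes the file stores one JSON object per line, which may not hold for every dataset.

```python
from itertools import islice


def peek_chars(path, n_chars=2000):
    """Return the first n_chars characters of a text file.

    Only the requested prefix is read, so this also works when the
    whole file is one enormous line.
    """
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return handle.read(n_chars)


def peek_lines(path, n_lines=5):
    """Return the first n_lines lines of a text file, newlines stripped.

    Assumption: the file is line-delimited (for example, one tweet per
    line). If it is a single giant line, use peek_chars instead.
    """
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, n_lines)]
```

Calling `peek_chars("tweets.txt", 500)` on a hypothetical `tweets.txt` would show enough of the first record to see its field names while the rest of the file stays on disk.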

# Graph Theory 101: Directed and Undirected Graphs

This is a very short introduction to graph theory. We will be talking about directed and undirected graphs, the formulas for the maximum possible number of edges in each, and the mathematical proofs of why those formulas work. This is my first use of LaTeX on Mr. Geek.
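As a quick illustration of the kind of formula involved, the sketch below uses the standard results for simple graphs (no self-loops, no parallel edges): an undirected graph on n vertices has at most n(n-1)/2 edges and a directed one at most n(n-1). These assumptions and the function names are mine, not taken from the full post, which may define its graphs differently. The enumeration function checks the closed forms by counting vertex pairs directly.

```python
from itertools import combinations, permutations


def max_edges_undirected(n):
    """Maximum edges in a simple undirected graph on n vertices."""
    return n * (n - 1) // 2


def max_edges_directed(n):
    """Maximum edges in a simple directed graph on n vertices."""
    return n * (n - 1)


def count_by_enumeration(n, directed):
    """Count every possible edge by listing pairs of distinct vertices.

    Ordered pairs are used for directed graphs, unordered pairs for
    undirected ones.
    """
    vertices = range(n)
    pairs = permutations(vertices, 2) if directed else combinations(vertices, 2)
    return sum(1 for _ in pairs)
```

For example, with 5 vertices there are 10 unordered pairs and 20 ordered pairs, so the directed maximum is exactly twice the undirected one.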

# Measuring influence in a group using social network analysis

I have decided to publish the contents of my Complex Networks and Web coursework project here on Mr. Geek. The information in this post might be complex for some, but I assure you that it will be a good long read. I have included lots of pictures to make sure you don’t get bored by the swathes of text.