Hacky Hour 1.0

Thank you to everyone who came to our first ever Hacky Hour at UTS. It was great to see so many people come along to get help with their problems, lend a hand to others or just catch up with other researchers or people on the eResearch team.

Here are just a few of the things people talked about:

  • Peter Sefton talked to people from the Microbial Imaging Facility about managing microscope image data using an application such as Omero.
  • Michael Lake talked to a researcher from the C3 Remote Sensing group about code execution performance (R vs Python), using profiling tools to optimise code, writing test suites and the benefits of using a version control system.
  • Sharyn Wise talked to A/Prof Cynthia Whitchurch and Dr Lynne Turnbull about IN Cell Miner.
  • Chris Bell, our current Server Administrator, caught up with researchers to go over outstanding ServiceConnect issues.
  • Michael Lynch helped a researcher from FEIT who's looking at measuring social media engagement. He wanted some help finding datasets and understanding the IP issues around collecting posts.
  • Darren Lee and Jared Berghold helped researcher Mingming Cheng with his data wrangling problem (see below). In the process, they introduced him to Python, the popular Python distribution called Anaconda and the excellent interactive programming tool called Jupyter (formerly known as IPython).

Problem Solved!

Mingming Cheng, a PhD researcher from the UTS Business School, came along with a problem he encountered while assisting his supervisor, Dr Deborah Edwards, on her project that examines the spatial behavior of tourists in NSW.

Mingming and Deborah are analysing public converations that have been extracted from TripAdvisor by the Advanced Analytics Institute at UTS and generating visualisations showing the interactions between forum participants.

Mingming needed the spreadsheet in the appropriate format for importing into the visualisation tool, Gephi.

The data was in this format:

But it needed to be in this format :

As you can see, the usernames in the author column had to be transposed onto a single row for each conversation thread. Conversation threads were identified using a unique index number. Mingming had over 65,000 rows in his spreadsheet, so doing this by hand was not a viable option.

Darren Lee from the UTS Data Arena came to the rescue, hacking up a Python script in the time it takes to drink a beer (or maybe two).

Here's what the script looked like:

import sys
import csv

with open('/Users/jared/Desktop/trial_mingming.csv') as csvfile:
    reader = csv.reader(csvfile)
    prevThreadNum = None

    with open ('/Users/jared/Desktop/trial_mingming_out.csv', 'w') as outfile:
        rowBuffer = []
        for row in reader:
            threadNum = row[0]
            if prevThreadNum == threadNum:
                if len(rowBuffer):
                    #print rowBuffer[0] + "," + ";".join(rowBuffer[1:])
                    outfile.write(rowBuffer[0] + "," + ";".join(rowBuffer[1:]) + "\n")
                rowBuffer = []
                prevThreadNum = threadNum

The script takes a CSV (comma separated values) file as its input. The .csv file was generated by performing a File > Save As... in Excel and choosing .csv as the file format.

After reading in the CSV file, the Python script steps through the data, one row at a time. The first thing it does is stores the value from the first column (i.e. column '0') in a variable called threadNum: threadNum = row[0]. As you can see in the screenshot above, this first column is the index or conversation thread number.

The script then checks the index number to see if it has encountered that number before: prevThreadNum == threadNum. If it has, it continues to build up the list of usernames from the author column (i.e. the 6th column or column number '5' if you start counting from 0 like computers do).

If the script hasn't encountered that number before, then the script has come across a new conversation thread. It writes out the index number and a semi-colon separated list of all the usernames it has collected for that thread to the output .csv file: outfile.write(rowBuffer[0] + "," + ";".join(rowBuffer[1:]) + "\n").

As you can see in the screenshot above, the row in the new spreadsheet now has the index number in the first column, followed by a list of usernames separated by semi-colons (';') in the second column. It's this semi-colon separated list that is understood by Gephi.

We ran the script over Mingming's entire dataset and in less than a minute, he had the .csv file he needed to load into Gephi for visualising the data. Mingming was very grateful to have a solution to a problem he'd been grappling with for the last two months.

If you would like to know more about the script above, any of the tools mentioned (Python, Jupyter, Gephi) or any other digital research tool, then come along next week.

Thanks for reading and we look forward to seeing you at the next Hacky Hour!

Hacking away: Mingming Cheng, Darren Lee, Jared Berghold and Claire Hardgrove