Centre for Open Software Innovation
COSI at the University of Waikato fosters open development in computer science, supporting projects in cyber security, machine learning, and digital libraries.
The Centre for Open Software Innovation (COSI) was established in 2009, as the University of Waikato's leading research centre on computer science theory and practice. It is the purpose of COSI to:
- Inspire and extend open development practice in computer science;
- Innovate (open) systems, theories and tools to improve processes and products;
- Excel at core computer science theory and practice as the foundation for innovation;
- Be community leaders at the local, national and international levels through effective communication and openness.
Projects
Cyber Security
Progger (Provenance Logger) is a kernel-space logger designed to track data activity in cloud systems. It has the potential to empower cloud stakeholders (users) by allowing them to trace what has happened to their data in the cloud. It can also be used by security analysts to collect provenance data from the lowest possible atomic data actions, and enables several higher-level tools to be built for effective end-to-end tracking of data provenance.
Progger has been implemented to be tamper-evident, accurately synchronise timestamps across several machines, efficiently log the root usage of the system and reduce clutter in the log files.
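The text above does not say how Progger achieves tamper evidence. One common way to make a log tamper-evident is a hash chain, where each entry's digest also covers the previous digest. The following is a minimal sketch of that general idea only; the function names `append_entry` and `verify` are hypothetical and are not part of Progger.

```python
import hashlib

GENESIS = "0" * 64  # placeholder digest that precedes the first entry


def append_entry(log, message):
    """Append (message, digest); the digest covers the previous digest too."""
    prev = log[-1][1] if log else GENESIS
    digest = hashlib.sha256((prev + message).encode("utf-8")).hexdigest()
    log.append((message, digest))


def verify(log):
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev = GENESIS
    for message, digest in log:
        expected = hashlib.sha256((prev + message).encode("utf-8")).hexdigest()
        if expected != digest:
            return False
        prev = digest
    return True
```

Changing the message of any earlier entry without recomputing every later digest makes `verify` return `False`, which is what makes tampering evident.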
Digital Libraries
The Digital Image Resizer Toy is an implementation of the seam carving algorithm, sometimes referred to as "content-aware image resizing," developed by Shai Avidan and Ariel Shamir. Seam carving seeks to avoid the drawbacks of other approaches to image resizing, such as cropping, where parts of the image are cut away to reduce its size, and image scaling, which distorts the image contents unless it is applied horizontally and vertically by the same factor. The technique was made famous by a video labelled "Advanced Image Resizing," which is worth watching to see what it does.
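As a rough illustration of seam carving, the sketch below finds the connected top-to-bottom path of pixels with the lowest total "energy" by dynamic programming and removes it, narrowing the image by one column. It is not the Digital Image Resizer Toy's code: the images are plain nested lists of grey values, and the gradient-based energy function is an assumption chosen for brevity.

```python
def gradient_energy(gray):
    """Assumed energy: absolute horizontal plus vertical differences."""
    rows, cols = len(gray), len(gray[0])

    def px(r, c):
        # Clamp coordinates so that border pixels reuse their nearest neighbour.
        return gray[min(max(r, 0), rows - 1)][min(max(c, 0), cols - 1)]

    return [[abs(px(r, c + 1) - px(r, c - 1)) + abs(px(r + 1, c) - px(r - 1, c))
             for c in range(cols)] for r in range(rows)]


def find_vertical_seam(energy):
    """Return, for each row, the column of the minimum-energy vertical seam."""
    rows, cols = len(energy), len(energy[0])
    # cost[r][c] = minimum total energy of any seam ending at (r, c).
    cost = [list(energy[0])]
    for r in range(1, rows):
        prev = cost[-1]
        row = []
        for c in range(cols):
            lo, hi = max(c - 1, 0), min(c + 1, cols - 1)
            row.append(energy[r][c] + min(prev[lo:hi + 1]))
        cost.append(row)
    # Backtrack upwards from the cheapest cell in the bottom row.
    c = min(range(cols), key=lambda j: cost[-1][j])
    seam = [c]
    for r in range(rows - 1, 0, -1):
        lo, hi = max(c - 1, 0), min(c + 1, cols - 1)
        c = min(range(lo, hi + 1), key=lambda j: cost[r - 1][j])
        seam.append(c)
    seam.reverse()
    return seam


def remove_vertical_seam(image, seam):
    """Drop one pixel per row, making the image one column narrower."""
    return [row[:c] + row[c + 1:] for row, c in zip(image, seam)]
```

Calling `find_vertical_seam(gradient_energy(img))` followed by `remove_vertical_seam(img, seam)` repeatedly narrows an image while tending to preserve high-contrast regions.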
The FLAX project aims to automate the production and delivery of practice exercises for overseas students who are learning English. Our strategy is to deploy digital library software to allow teachers and students to capitalise on top-quality prose and multimedia resources already present in the world's libraries. This yields an unprecedented supply of linguistic material for students to practise on. The exercises involve students in a virtually endless supply of collaborative and competitive language activities that are interesting, compelling, and rewarding. They are presented in a web-based social setting, matching in real time students in different locations who opt for a particular type of exercise, allowing them to discuss and negotiate its parameters using chat, and undertake activities that are competitive or collaborative.
Greenstone is a suite of software for building and distributing digital library collections. Greenstone provides a new way of organizing information and publishing it on the Internet or on CD-ROM. Greenstone is produced by the New Zealand Digital Library Project at the University of Waikato, and developed and distributed in cooperation with UNESCO and the Human Info NGO. It is open-source, multilingual software, issued under the terms of the GNU General Public License. Read the Greenstone Factsheet for more information.
Katoa is a Java-based toolkit for concept-based text processing. Katoa makes use of external knowledge bases and provides 1) methods for representing natural-language text by the concepts it mentions (instead of words); 2) similarity measures that take the semantic relatedness among concepts into account; and 3) enhanced clustering methods that utilize the semantic concept relatedness information. Katoa now supports two knowledge bases: Wikipedia and WordNet.
Keywords and keyphrases (multi-word units) are widely used in large document collections. They describe the content of single documents and provide a kind of semantic metadata that is useful for a wide variety of purposes. The task of assigning keyphrases to a document is called keyphrase indexing.
For example, academic papers are often accompanied by a set of keyphrases freely chosen by the author. In libraries, professional indexers select keyphrases from a controlled vocabulary (also called Subject Headings) according to defined cataloguing rules. On the Internet, digital libraries and other repositories of data (flickr, del.icio.us, blog articles etc.) also use keyphrases (there called content tags or content labels) to organise their data and provide thematic access to it. KEA is an algorithm for extracting keyphrases from text documents.
It can be used either for free indexing or for indexing with a controlled vocabulary. KEA is implemented in Java and is platform independent. It is open-source software distributed under the GNU General Public License.
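To make the difference between free indexing and indexing with a controlled vocabulary concrete, here is a deliberately simplified sketch that ranks single words by a TF-IDF score. This is a stand-in for illustration, not KEA's model or API; the function name `rank_keyphrases` and its parameters are hypothetical.

```python
import math
from collections import Counter


def rank_keyphrases(doc, corpus, top_k=3, vocabulary=None):
    """Rank the words of `doc` by TF-IDF against a background `corpus`.

    With `vocabulary=None` any word may be returned (free indexing);
    otherwise only words from the controlled vocabulary are candidates.
    """
    words = doc.lower().split()
    tf = Counter(words)
    scores = {}
    for word, count in tf.items():
        if vocabulary is not None and word not in vocabulary:
            continue
        # Document frequency: number of corpus documents containing the word.
        df = sum(1 for d in corpus if word in d.lower().split())
        idf = math.log((1 + len(corpus)) / (1 + df))
        scores[word] = (count / len(words)) * idf
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
```

Words that occur in every background document receive a score of zero, so frequent function words drop to the bottom of the ranking.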
Maui automatically identifies main topics in text documents. Depending on the task, topics are tags, keywords, keyphrases, vocabulary terms, descriptors, index terms or titles of Wikipedia articles. Maui builds on the keyphrase extraction algorithm KEA, but provides additional functionality: it allows the assignment of topics to documents based on terms from Wikipedia using Wikipedia Miner.
Maui also has many new features that help identify topics more accurately. Maui performs the following tasks: keyphrase extraction, automatic tagging, term assignment with a controlled vocabulary, subject indexing, and topic indexing with terms from Wikipedia. It can also be used for terminology extraction and semi-automatic topic indexing.
Formal Methods
The Community Z Tools (CZT) project is building a set of tools for editing, typechecking and animating formal specifications written in the Z specification language, with some support for Z extensions such as Object-Z, Circus, and TCOZ. These tools are all built using the CZT Java framework for Z tools.
Jumble is a class-level mutation testing tool that works in conjunction with JUnit. The purpose of mutation testing is to provide a measure of the effectiveness of test cases. A single mutation is performed on the code to be tested, and the corresponding test cases are then executed. If the modified code fails the tests, this increases confidence in the tests. Conversely, if the modified code passes the tests, this indicates a testing deficiency.
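The idea can be shown in a few lines. Jumble itself works on Java classes with JUnit; the Python sketch below only illustrates the principle, with a hand-written mutant and hypothetical names (`original`, `mutant`, `mutant_killed`).

```python
def original(age):
    """Code under test: is the person an adult?"""
    return age >= 18


def mutant(age):
    """Single mutation: the operator '>=' has been replaced by '>'."""
    return age > 18


def passes(func, cases):
    """A test suite passes if every (input, expected) pair matches."""
    return all(func(arg) == expected for arg, expected in cases)


def mutant_killed(cases):
    """True if the suite passes on the original but fails on the mutant."""
    assert passes(original, cases), "suite must pass on unmodified code"
    return not passes(mutant, cases)


weak_tests = [(30, True), (5, False)]     # never probes the boundary
strong_tests = weak_tests + [(18, True)]  # adds the boundary value
```

Here `mutant_killed(weak_tests)` is `False`, which points to a testing deficiency, while `mutant_killed(strong_tests)` is `True`, because the boundary case distinguishes the mutant from the original.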
ModelJUnit is a Java library that extends JUnit to support model-based testing. Models are extended finite state machines (EFSMs) that are written in a familiar and expressive language: Java. ModelJUnit is an open source tool, released under the GNU GPL.
This software enables users to create finite-state machine models in a graphical user interface, to simulate their execution, and to apply model checking algorithms to them. Its main features are/will be:
- graphical modelling of complex finite-state machine models, using the framework of discrete-event systems;
- animation of several finite-state machines running in parallel;
- push-button verification of finite-state machine models, i.e., checking whether a model satisfies certain properties and displaying diagnostic information if it does not.
Waters is intended to support research and teaching in the area of model checking and discrete-event systems. It includes features to create large-scale and parameterised extended finite-state machine models, and to experiment with different model checking algorithms. The tool may also be used by software developers who write safety-critical code and want to check whether it is correct. WATERS is part of Supremica.
Machine Learning
The Advanced Data mining And Machine learning System (ADAMS) is a flexible workflow engine aimed at quickly building and maintaining data-driven, reactive workflows, easily integrated into business processes.
Instead of placing operators on a canvas and manually connecting them, a tree structure and flow control operators determine how data is processed (sequentially/parallel). This allows rapid development and easy maintenance of large workflows, with hundreds or thousands of operators.
ADAMS offers operators for machine learning (WEKA, MOA, MEKA, deeplearning4j) and image processing (ImageJ, JAI, BoofCV, OpenImaJ, LIRE, ImageMagick and Gnuplot). R is available through Rserve. A WEKA web service allows other frameworks to use WEKA models. Fast prototyping is possible with Groovy and Jython. GIS support includes OpenStreetMap integration. There is read/write support for various databases and spreadsheet applications.
The Digital Invisible Ink Toolkit is a Java steganography tool that can hide any sort of file inside a digital image (provided that the message fits and the image is 24-bit colour). It works on Windows, Linux and Mac OS because it is written in Java and is thus platform independent. There are four highly customisable algorithms in the tool, as well as an open-source implementation of RS Analysis (an extremely good steganalysis method). The tool has the additional advantage of being able to simulate hiding, so you can get an accurate map of where the information is hidden.
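The description above does not name the toolkit's four algorithms. As a generic illustration of hiding data in a 24-bit image, the sketch below uses plain least-significant-bit (LSB) embedding over a flat list of 8-bit colour channel values; this technique and the function names `embed` and `extract` are assumptions for illustration, not the toolkit's own methods.

```python
def embed(channels, message):
    """Hide `message` (bytes) in the lowest bit of successive channel values."""
    # Most significant bit of each byte first.
    bits = [(byte >> i) & 1 for byte in message for i in range(7, -1, -1)]
    if len(bits) > len(channels):
        raise ValueError("message does not fit in the image")
    out = list(channels)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & ~1) | bit  # clear the low bit, then set it
    return out


def extract(channels, n_bytes):
    """Read back `n_bytes` bytes from the lowest bits of the channel values."""
    data = bytearray()
    for b in range(n_bytes):
        byte = 0
        for k in range(8):
            byte = (byte << 1) | (channels[b * 8 + k] & 1)
        data.append(byte)
    return bytes(data)
```

Each channel value changes by at most one, which is why the altered image looks the same to the eye. The capacity check mirrors the condition that the message must fit into the image.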
MOA is a framework for learning from a data stream, that is, a continuous supply of examples. It includes tools for evaluation and a collection of machine learning algorithms. It is related to the WEKA project and also written in Java, while scaling to more demanding problems.
The MEKA project is a Multi-label Extension to the WEKA machine learning framework. It provides an open source implementation of methods for multi-label classification, including pruned sets and classifier chains, several benchmark methods, as well as a wrapper to the MULAN framework.
This Java software implements Profile Hidden Markov Models (PHMMs) for binary protein classification for the WEKA workbench. A PHMM is a Hidden Markov Model especially designed to represent multiple sequence alignments of amino acid sequences. This software learns the alignment for unaligned sequences that are represented as a string attribute in an arff file. The learning algorithm is the Baum-Welch algorithm. It trains the PHMM until a user-defined threshold or a specified number of training iterations is reached.
Unlike in WEKA, the training process can be evaluated after each training iteration of the Baum-Welch algorithm. This software introduces binary PHMMs. They consist of two PHMMs, one trained on the positive instances and the other on the negative instances.
The software allows sampling of the negative class. Additionally the software creates attribute-value representations from PHMMs. These representations can be used in combination with any other WEKA classifier (depending on the individual classifier's capabilities).
WEKA is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. WEKA contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
WEKA Project homepage
Mathematics
The Sobol' sequence is an example of a low-discrepancy sequence. Such sequences are often called quasi-random sequences as they are commonly used in place of uniformly distributed random numbers. One application of Sobol' sequences is in numerical multiple integration, that is, in the approximation of integrals of functions which may depend on hundreds or even thousands of variables. These multiple integrals arise in areas such as quantum physics, probability, and mathematical finance.
How well a Sobol' sequence performs in numerical multiple integration is determined by the choice of direction numbers.
In some applications it is believed that it is the low-dimensional projections of the sequence that are important. As a result, this project attempts to produce direction numbers that result in Sobol' sequences with good two-dimensional projections. Files containing direction numbers and a C++ program to generate the Sobol' sequences from the direction numbers are provided. This allows the approximation of integrals in up to 21201 variables.
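To show how direction numbers drive the sequence, the following sketch generates Sobol' points with the usual Gray-code construction. It is not the project's C++ program. The first dimension uses the standard choice (every m_i = 1); the parameters passed for the second dimension (degree s = 1, coefficient a = 0, initial value m_1 = 1) are quoted from memory as an assumption and should be replaced by values from the provided direction-number files.

```python
BITS = 32  # number of binary digits kept per coordinate


def direction_numbers(s, a, m_init):
    """Expand initial values m_1..m_s into scaled direction numbers v[1..BITS].

    s is the degree of the primitive polynomial, a encodes its inner
    coefficients, and v[i] stores m_i shifted left by (BITS - i) bits.
    """
    v = [0] * (BITS + 1)
    for i in range(1, s + 1):
        v[i] = m_init[i - 1] << (BITS - i)
    for i in range(s + 1, BITS + 1):
        v[i] = v[i - s] ^ (v[i - s] >> s)
        for k in range(1, s):
            v[i] ^= ((a >> (s - 1 - k)) & 1) * v[i - k]
    return v


def first_dimension():
    """Direction numbers with every m_i = 1 (the van der Corput sequence)."""
    return [0] + [1 << (BITS - i) for i in range(1, BITS + 1)]


def sobol_points(n, dims):
    """Return the first n points; dims holds one direction-number list per axis."""
    x = [0] * len(dims)
    points = []
    for i in range(n):
        points.append([xi / 2 ** BITS for xi in x])
        # c = 1-based position of the rightmost zero bit of i.
        c, j = 1, i
        while j & 1:
            j >>= 1
            c += 1
        x = [xi ^ v[c] for xi, v in zip(x, dims)]
    return points
```

An integral over the unit square can then be approximated by averaging the integrand over `sobol_points(n, [first_dimension(), direction_numbers(1, 0, [1])])`, in the same way as with pseudo-random points.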
WAND Network Research
AMP (Active Measurement Project) is a project originating at NLANR designed to constantly perform active measurements between a mesh of specialist AMP monitors. These measurements are used to both provide a view of long-term network performance and to detect notable network events. The AMP system consists of several components: the measurement software which runs on the monitor machines and conducts the AMP tests; the reporting software which reliably sends the test results back to a central server; and the graphing software which plots the results and presents them in an accessible fashion.
BSOD generates a real-time 3D visualisation of network traffic data. It uses libtrace to read from any supported input format, such as a live network interface or a saved trace file, and displays the flow of network data between hosts. BSOD can provide at-a-glance information that would otherwise be difficult to observe by examining the traffic data using traditional statistical methods.
Libtrace is a library for capturing and processing network packet traces. It supports many common input methods, including device capture and trace files, and multiple formats, including pcap, DAG and OS-native sockets. The libtrace API allows programmers to easily access packet header information without needing to concern themselves with the other headers that might be present in the packet.
Libtrace also supports writing captured packets to network trace files using the pcap and ERF formats and comes bundled with a series of tools that perform most common trace manipulation tasks.
Maji is an implementation of an IPFIX meter that is heavily based on the libtrace library. IPFIX is an IETF-standardised method for exporting flow-level measurements of network traffic, similar to Cisco Netflow. Maji differs from most IPFIX implementations in that it allows users to create their own custom measurement templates rather than relying on pre-defined templates.
A distributed network measurement tool that passively monitors a user's connection to the Internet. It collects simple statistics, such as throughput and latency, that are reported back to a central server for aggregation and analysis. Users specify their ISP and location when they first start using the tool to enable comparisons between different providers and cities.
scamper is a program that is able to conduct Internet measurement tasks to large numbers of IPv4 and IPv6 addresses, in parallel, to fill a specified packets-per-second rate. Currently, it supports the well-known ping and traceroute techniques, as well as three methods of alias resolution, sting, and neighbour discovery. It is useful in both research and in large-scale elementary network performance evaluation.
How to start a new project
Do you have an idea for a software project? Do you want to release it as an open source project?
This information will help you make the right decisions regarding license selection, project hosting, etc. Perhaps the most important step is choosing the right license for your project. There is no silver bullet, as there are many options to choose from. However, knowing what you are trying to achieve with your project will help you make an informed decision.
The success of a project is dependent on its accessibility and availability. You must select an appropriate host to make your project publicly available. Again, it is important that you choose a suitable hosting facility to give your project the public exposure it needs.
But before you start reading below, have a look at the freely available book Open Advice - FOSS: What We Wish We Had Known When We Started. The authors of this book are all involved in the software freedom community. Their experiences and advice are worth reading, as they might save you from falling into common pitfalls.
In the sections below you will find links to a variety of resources with more information around open source.
What license should I choose?
It is very important to take some time and think about what you are trying to achieve with your project.
As the Linux.com article on Choosing an open source license points out, the impact on derivative works has to be considered when choosing the license. For example, if you choose the GNU General Public License, all derivative works must be licensed under the GPL as well. A good overview of the options is provided in Appendix A of the paper The Need for Open Source Software in Machine Learning by Sören Sonnenburg et al.
For the impatient, you can use the ChooseALicense.com website (run by GitHub) to quickly determine an appropriate license for your project.
A complete list of approved open source licenses is available on the Open Source Initiative website. Wikipedia lists open source licenses as well.
Hosting your project
To make your project publicly available, you do not need to pay lots of money: there are plenty of free hosting facilities available.
Which hoster you choose in the end depends on the functionality that you need:
- The type of revision control system you want to use - for keeping track of changes in the code. You might also want to check out the article Making Sense of Revision-control Systems by Bryan O'Sullivan, to give you an idea of which system is best suited to you and your team.
- A bug tracking system - useful for prioritising bugs and assigning them to different people on your team.
- A wiki for documentation purposes - a project is only ever as good as its documentation.
- An automated build system for nightly builds.
A good overview of which hoster offers which functionality can be found in the Wikipedia article Comparison of open source software hosting facilities.
Submit a project
If you are involved in an open source software project here at the University of Waikato that uses an approved open source license, you can submit the initial project details for review by sending the COSI webmaster an email. Please note, the project submission is not an automated process. A submitted project must be approved first by the COSI webmaster before it can be listed.
Useful links
The Free Software Foundation, founded in 1985, is dedicated to promoting computer users' right to use, study, copy, modify, and redistribute computer programs. The FSF promotes the development and use of free (as in freedom) software -- particularly the GNU operating system and its GNU/Linux variants -- and free documentation for free software. The FSF also helps to spread awareness of the ethical and political issues of freedom in the use of software, and its Web sites, located at fsf.org and gnu.org, are an important source of information about GNU/Linux.
-- taken from the FSF homepage
The OSI are the stewards of the Open Source Definition (OSD) and the community-recognized body for reviewing and approving licenses as OSD-conformant.
The OSI is actively involved in Open Source community-building, education, and public advocacy to promote awareness and the importance of non-proprietary software. OSI Board members frequently travel the world to attend Open Source conferences and events, meet with open source developers and users, and to discuss with executives from the public and private sectors about how Open Source technologies, licenses, and models of development can provide economic and strategic advantages.
-- taken from the OSI homepage
- Overview of Open Source licenses (OSI webpage)
- The Need for Open Source Software in Machine Learning by Sören Sonnenburg et al. (see Appendix A)
- GNU General Public License (GPL)
- License
- Overview of compatible and incompatible licenses
- Violations - if you catch somebody violating the GPL
To ensure success, FOSS projects need more than a well–chosen license. The processes by which FOSS is produced must ensure that the rights of the FOSS community are preserved and promoted. We must avoid third-party intellectual property claims, which — as the SCO controversy has shown — have striking potential to disrupt adoption of FOSS. Our practice brings together the world’s leading expertise in all the fields of law that FOSS projects may encounter or be affected by. We can counsel our clients on the big picture, beyond today’s specific problems, helping projects reach their long-term goals safely and efficiently so hackers can concentrate on making great software.
Taken from the Software Freedom Law Center homepage.
The International Free and Open Source Software Law Review (IFOSS L. Rev.) is a collaborative legal publication aiming to increase knowledge and understanding among lawyers about Free and Open Source Software issues. Topics covered include copyright, licence implementation, licence interpretation, software patents, open standards, case law and statutory changes.
-- taken from the International Free and Open Source Software Law Review homepage
The New Zealand Open Source Society is an incorporated society set up to protect, advocate and advance the use of Open Source Software in New Zealand.
-- taken from the NZOSS website
Open Advice is a knowledge collection from a wide variety of Free Software projects. It answers the question what 42 prominent contributors would have liked to know when they started so you can get a head-start no matter how and where you contribute.
-- taken from the Open Advice homepage