The Second Workshop of Melbourne-China Big Data Research Network

Date:     30 May 2016
09:00 - 12:00 FIT 1-315, Department of Computer Science and Technology, Tsinghua University, Beijing, China (清华大学计算机系FIT 1-315)
14:00 - 15:30 10-103 East Main Building, Department of Computer Science and Technology, Tsinghua University, Beijing, China (清华大学计算机系东主楼10-103)


Schedule | Activity | Speaker | Affiliation
9:00-9:10 | Arriving | |
9:10-9:20 | Workshop Introduction | Prof. Rui Zhang | University of Melbourne
9:20-9:35 | Introduction of Department of Computing and Information Systems of University of Melbourne | Prof. Justin Zobel, Department Head | University of Melbourne
9:35-9:50 | Introduction of Department of Computer Science and Technology of Tsinghua University | Prof. Wenwu Zhu, Deputy Department Head | Tsinghua University
9:50-10:10 | Signing Joint Agreement, and break | | Tsinghua University and University of Melbourne

Research Talks

Schedule | Talk Title | Speaker | Affiliation
10:10-12:00 | A Story of Strings | Prof. Justin Zobel | University of Melbourne
            | Large Scale Metric Learning using Locality Sensitive Hashing | Prof. Rao Kotagiri | University of Melbourne
            | The Power of Representation in Relation Classification | Prof. Tim Baldwin | University of Melbourne
            | Similarity Analytics with Advanced Relationships on Big Data | Prof. Rui Zhang | University of Melbourne
            | Pattern Recognition in Biological Data | Dr. Isaam Saeed | University of Melbourne
            | Research on Spatio-temporal Data Management and Mining | Dr. Jianzhong Qi | University of Melbourne
12:00-14:00 | Lunch break | |
14:00-14:25 | Crowdsourced Data Management | Associate Prof. Guoliang Li | Tsinghua University
14:25-14:50 | Recommendation with Big Data on Mobiles | Dr. Ruihua Song | Microsoft
14:50-15:15 | Embedding Trajectory Data | Assistant Prof. Xin Zhao | Renmin University

Talk 1: A Story of Strings

String processing is a fundamental challenge of computing that has been the subject of research since the discipline’s inception in the 1950s. Algorithms for efficient string processing continue to develop, despite the long history, with innovations such as new forms of trie, suffix array, string graphs, and succinct structures emerging over the last two decades. At the same time, these algorithms continue to find new applications. Three examples are information retrieval, compression, and bioinformatics. Professor Zobel will briefly review the area and explain some of the common elements underlying these very different applications.

Speaker: Professor Justin Zobel

Bio: Professor Justin Zobel is Head of the Department of Computing & Information Systems. He received his PhD from the University of Melbourne and for many years was based in the School of CS&IT at RMIT University, where he led the Search Engine group. In 2007-8 he was a Principal Senior Researcher in NICTA, leading the Computing in Health area, and in 2010 was interim Director of the Victorian Life Sciences Computation Initiative. Prof Zobel is an associate editor of the International Journal of Information Retrieval, ACM Transactions on Information Systems, and Information Processing & Management, and in 2008-9 was President of the CORE association of Australasian Departments & Schools of Computer Science. In the research community, he is best known for his role in the development of algorithms for efficient text retrieval. He is the author of "Writing for Computer Science", second edition, and co-author of "How to Write a Better Thesis", third edition and "How to Write a Better Minor Thesis". His interests include search, bioinformatics, fundamental data structures, and research methods.

Talk 2: Large Scale Metric Learning using Locality Sensitive Hashing

Metric learning tries to discover a mapping of features such that objects belonging to a particular class are close to each other in the new space. However, current methods for discovering such metric mappings are computationally infeasible when the data set is huge and has a large number of features. My talk will describe the state-of-the-art algorithms for metric learning. I will present our recent work on an efficient approach for discovering metric learning based mappings using Locality Sensitive Hashing (LSH). Our generic approach can accelerate state-of-the-art metric learning while achieving competitive classification accuracy, expanding feasibility by an order of magnitude. Our approach can accelerate Large Margin Nearest Neighbour (LMNN) to learn metrics on 1,000,000 samples in 3.6 minutes, reduced from 5.8 hours.
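
For readers unfamiliar with the hashing idea mentioned in the abstract, the following is a minimal sketch of one common LSH family (random hyperplane hashing), in which vectors pointing in similar directions tend to land in the same bucket. It is an illustration only and is not the speaker's algorithm; the function names make_hyperplanes, lsh_signature and build_buckets are hypothetical and chosen for this example.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplane normals (illustrative helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if vec lies on the non-negative side, else 0."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )


def build_buckets(points, hyperplanes):
    """Group point indices by signature; neighbour candidates share a bucket."""
    buckets = {}
    for idx, point in enumerate(points):
        buckets.setdefault(lsh_signature(point, hyperplanes), []).append(idx)
    return buckets


if __name__ == "__main__":
    planes = make_hyperplanes(num_bits=8, dim=3, seed=1)
    points = [[0.5, -1.25, 2.0], [1.0, -2.5, 4.0], [-0.5, 1.25, -2.0]]
    # The first two points have the same direction, so they share a bucket.
    print(build_buckets(points, planes))
```

Restricting neighbour searches to points in the same bucket is what avoids comparing every pair of samples, which is the general reason hashing of this kind can help at large scale.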

Speaker: Professor Ramamohanarao (Rao) Kotagiri

Bio: Professor Rao Kotagiri received his PhD from Monash University. He was awarded the Alexander von Humboldt Fellowship in 1983. He has been at the University of Melbourne since 1980 and was appointed as a professor in computer science in 1989. Rao has held several senior positions including Head of Computer Science and Software Engineering, Head of the School of Electrical Engineering and Computer Science at the University of Melbourne and Research Director for the Cooperative Research Centre for Intelligent Decision Systems. He has served or is serving on the Editorial Boards of the Computer Journal, Universal Computer Science, TKDE, VLDBJ and International Journal on Data Privacy. He was the program Co-Chair for VLDB, PAKDD, DASFAA and DOOD. He is a steering committee member of ICDM and PAKDD. He received a distinguished contribution award for Data Mining from PAKDD. Rao is a Fellow of the Institute of Engineers Australia, a Fellow of the Australian Academy of Technological Sciences and Engineering and a Fellow of the Australian Academy of Science. He was awarded the Distinguished Contribution Award in 2009 by the Computing Research and Education Association of Australasia. He has published more than 350 articles and supervised 48 PhD completions. He was the chair of ICDE 2013 and a co-chair of SIGMOD 2014.

Talk 3: The Power of Representation in Relation Classification

I will present research scrutinising recent work on word embeddings, which has shown that simple vector subtraction over pre-trained embeddings can capture different lexical relations. Prior work has evaluated this intriguing result using a word analogy prediction formulation and hand-selected relations, but the generality of the finding over a broader range of lexical relation types and different learning settings has not been evaluated. I will discuss a supervised learning experiment over a broad range of lexical relation types, and show that word embeddings capture a surprising amount of information, and that, under suitable supervised training, vector subtraction generalises well to a broad range of relations, including over unseen lexical items.
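
As a small illustration of what "vector subtraction over embeddings" means, the sketch below computes the offset between two word vectors and labels a new word pair by its most similar labelled offset. The embeddings are made-up toy values, and the helper names offset, cosine and classify_pair are hypothetical; the nearest-offset rule is a stand-in chosen for brevity, not the supervised setup used in the talk.

```python
import math

# Toy 4-dimensional embeddings, invented for this example only.
TOY_EMBEDDINGS = {
    "man": [0.0, 0.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0, 0.0],
    "king": [1.0, 0.0, 0.0, 0.0],
    "queen": [1.0, 1.0, 0.0, 0.0],
    "cat": [0.0, 0.0, 0.0, 1.0],
    "cats": [0.0, 0.0, 1.0, 1.0],
}


def offset(emb, first, second):
    """Difference vector emb[second] - emb[first] for a word pair."""
    return [b - a for a, b in zip(emb[first], emb[second])]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def classify_pair(emb, pair, labelled_pairs):
    """Return the label of the labelled pair whose offset is most similar."""
    query = offset(emb, *pair)
    best_pair, best_label = max(
        labelled_pairs, key=lambda item: cosine(query, offset(emb, *item[0]))
    )
    return best_label


if __name__ == "__main__":
    training = [(("man", "woman"), "gender"), (("cat", "cats"), "plural")]
    print(classify_pair(TOY_EMBEDDINGS, ("king", "queen"), training))
```

In a supervised experiment, the difference vectors would typically be fed to a trained classifier rather than matched to the nearest example, but the feature being used is the same offset.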

Speaker: Professor Tim Baldwin

Bio: Tim Baldwin is a Professor in the Department of Computing and Information Systems, The University of Melbourne, and an Australian Research Council Future Fellow. He has previously held visiting positions at Cambridge University, University of Washington, University of Tokyo, Saarland University, NTT Communication Science Laboratories, and National Institute of Informatics. His research interests include text mining of social media, computational lexical semantics, information extraction, and web mining. Current projects include web user forum mining, monitoring and text mining of Twitter, and text analytics for the creative industries. Tim completed a BSc(CS/Maths) and BA(Linguistics/Japanese) at The University of Melbourne in 1995, and an MEng(CS) and PhD(CS) at the Tokyo Institute of Technology in 1998 and 2001, respectively. Prior to joining The University of Melbourne in 2004, he was a Senior Research Engineer at the Center for the Study of Language and Information, Stanford University (2001-2004).

Talk 4: Similarity Analytics with Advanced Relationships on Big Data

Similarity analytics that examine relationships between records of high dimensions and complex structures are fundamental operations in data mining and machine learning tasks. We investigate sophisticated similarity analytics that involve high computational cost and are difficult to scale to large datasets. In this talk, we discuss two algorithms of this kind, Earth Mover's Distance based similarity join algorithms and hypergraph learning algorithms, both exploiting distributed processing on Hadoop and Spark. We propose novel techniques in the aspects of data representation, computation pruning, reusable interfaces, and workload partitioning. A comprehensive experimental study shows that our algorithms achieve an order of magnitude efficiency improvement over existing algorithms.

Speaker: Professor Rui Zhang

Bio: Rui Zhang is a Professor and Reader, and leader of the Big Data and Knowledge Research Theme at the Department of Computing and Information Systems of the University of Melbourne. He was awarded the Future Fellowship by the Australian Research Council in 2012. His inventions have been adopted by major IT companies such as AT&T and Microsoft. In 2015, Dr Zhang received the Chris Wallace Award for Outstanding Research in recognition of his significant contributions to the management and mining of spatiotemporal and multidimensional data. He obtained his Bachelor's degree from Tsinghua University in 2001 and his PhD from National University of Singapore in 2006. He has been a visiting scholar at AT&T Labs-Research and Microsoft Research and is now a regular visiting researcher at Microsoft Research Asia in Beijing. He has authored 90 publications in prestigious conferences and journals. His research interests are spatial and temporal data analytics, as well as general database and mining techniques including indexing, moving object management, data streams and sequence databases. He regularly serves as a PC member of top conferences in data management and mining such as SIGMOD, VLDB, ICDE and KDD. He is an associate editor of Distributed and Parallel Databases.

Talk 5: Pattern Recognition in Biological Data

I plan to give an overview of our group, briefly discuss our recent work in metagenomics, metabolomics, cancer genomics, neural engineering and introduce some of our other projects (smart grids and business analytics).

Speaker: Dr. Isaam Saeed

Bio: Dr Isaam Saeed is a postdoctoral research fellow at the Halgamuge Lab (Melbourne School of Engineering). His research interests revolve around unsupervised learning algorithms to analyse complex biological data. He is an investigator on two Australian Research Council grants, and has published his work in journals such as Nucleic Acids Research, ISMEJ and Bioinformatics. He has also spent several years in industry as a co-founder of startups in the B2B legal-tech and edu-tech space.

Talk 6: Research on Spatio-temporal Data Management and Mining

In this talk, Jianzhong will give a brief introduction to his research, which focuses on spatio-temporal data management and mining. Spatio-temporal data refer to data with location and time attributes, such as GPS points or trajectories of pedestrians. Jianzhong will present two of his recent studies, one on moving k nearest neighbour query processing and the other on reverse nearest neighbour heat map generation, as examples of spatio-temporal data management and spatio-temporal data mining, respectively.

Speaker: Dr. Jianzhong Qi

Bio: Jianzhong Qi is a lecturer in the Department of Computing and Information Systems at the University of Melbourne. He obtained his PhD degree from the University of Melbourne in 2014. He was an intern at Toshiba China R&D Centre in 2009 and at Microsoft Redmond in 2013. He is a nominee of CORE’s Australasian Distinguished Doctoral Dissertation Award 2015 and a recipient of the University of Melbourne Early Career Researcher Grant 2016. His research interests include spatio-temporal databases and location based social networks.

Talk 7: Crowdsourced Data Management

Many important data management and analytics tasks cannot be completely addressed by automated processes. These tasks, such as entity resolution, sentiment analysis, and image recognition, can be enhanced through the use of human cognitive ability. Crowdsourcing platforms are an effective way to harness the capabilities of people (i.e., the crowd) to apply human computation for such tasks. Thus, crowdsourced data management has become an area of increasing interest in research and industry. We identify three important problems in crowdsourced data management. (1) Quality Control: Workers may return noisy or incorrect results, so effective techniques are required to achieve high quality; (2) Cost Control: The crowd is not free, and cost control aims to reduce the monetary cost; (3) Latency Control: The human workers can be slow, particularly compared to automated computing time scales, so latency-control techniques are required. There has been significant work addressing these three factors for designing crowdsourced tasks, developing crowdsourced data manipulation operators, and optimizing plans consisting of multiple operators. In this talk, I introduce a wide spectrum of existing studies on crowdsourced data management. Based on this analysis, I then outline key factors that need to be considered to improve crowdsourced data management.

Speaker: Associate Professor Guoliang Li

Bio: Guoliang Li is an Associate Professor in the Department of Computer Science and Technology, Tsinghua University, Beijing, China. He received his PhD degree in computer science from Tsinghua University in 2009, and his Bachelor's degree in Computer Science from Harbin Institute of Technology in 2004. His research interests include data cleaning and integration. He has published more than 80 papers in premier conferences and journals, such as SIGMOD, VLDB, ICDE, SIGKDD, SIGIR, TODS, VLDB Journal, and TKDE. He is a PC co-chair of The 14th International Conference on Web-Age Information Management (WAIM 2014) and The 17th International Workshop on the Web and Databases (WebDB 2014). He has served on the program committees of many premier conferences, such as SIGMOD, VLDB, KDD, ICDE, and IJCAI. His papers have been cited more than 3000 times. He received the IEEE TCDE Early Career Award, NSFC Excellent Young Scholars Award, New Century Excellent Talents in University Award, Beijing Excellent Doctoral Dissertation Award, Nomination Award of National Excellent Doctoral Dissertation, and SCOPUS National Youth Science Star Award.

Talk 8: Recommendation with Big Data on Mobiles

With the prevalence of mobile devices, location-based services are widely adopted. In this talk, I will introduce our recent work on recommendation and mobiles. First, we aim at recommending each user a list of restaurants for their next meal based on users’ implicit dining feedback (restaurant visits via check-ins), explicit feedback (restaurant reviews), and some metadata (e.g., location, user demographics, restaurant attributes). Second, we address the problem of query recommendation on mobile devices by modelling the user-location-query relations with a tensor representation. Unlike previous studies based on tensor decomposition, we study this problem via tensor function learning. That is, we learn the tensor function from the side information of users, locations and queries, and then predict users’ search intent. Third, we address the difficulty of clipping articles from mobile apps. We propose a service called UniClip that allows a user to save the full content of an article by snapping a screenshot of part of it. This is useful for learning user interests for recommendation.

Speaker: Dr. Ruihua Song

Bio: Ruihua Song is a Lead Researcher at Microsoft Research Asia. She received her B.S. and M.S. from Tsinghua University in 2000 and 2003. She joined Microsoft Research Asia in 2003. Later she received a PhD from Shanghai Jiao Tong University in 2011. Her research interests include Web information retrieval, information extraction, mobile and social. She serves as a PC member for SIGIR, SIGKDD, VLDB, IJCAI, and ECIR. Her work has been transferred into many Microsoft products, such as ranking in Bing and reading view in IE11 and Edge.

Talk 9: Embedding Trajectory Data

The proliferation of location-based social networks, such as Foursquare and Facebook Places, offers a variety of ways to record human mobility, including user generated geo-tagged contents, check-in services and mobile apps. Although trajectory data is of great value to many applications, it is challenging to analyze and mine trajectory data due to the complex characteristics reflected in human mobility, which is affected by multiple kinds of contextual information. In this talk, we propose a Multi-Context Trajectory Embedding Model, called MC-TEM, to explore contexts in a systematic way. MC-TEM is developed in the distributed representation learning framework, and it is flexible enough to characterize various kinds of useful contexts for different applications. To the best of our knowledge, this is the first time that distributed representation learning methods have been applied to trajectory data. We formally incorporate multiple kinds of context information of trajectory data into the proposed model, including user-level, trajectory-level, location-level and temporal contexts. All the contextual information is represented in the same embedding space. We apply MC-TEM to two challenging tasks, namely location recommendation and social link prediction. We conduct extensive experiments on three real-world datasets. The experimental results demonstrate the superiority of our MC-TEM model over several state-of-the-art methods.

Speaker: Assistant Professor Xin Zhao

Bio: Wayne Xin Zhao is currently an assistant professor at the School of Information, Renmin University of China. He received his Ph.D. degree from Peking University in 2014. His research interests are web text mining and natural language processing. He has published several refereed papers in international conferences and journals such as ACL, EMNLP, COLING, ECIR, CIKM, SIGIR, SIGKDD, AAAI, IJCAI, ACM TOIS, ACM TIST, IEEE TKDE, KAIS and WWWJ.




  • Melbourne-China Big Data Research Network, University of Melbourne International Research and Research Training Fund (IRRTF)
  • Melbourne-Chindia Cloud Computing (MC3) Research Network, University of Melbourne International Research and Research Training Fund (IRRTF)
  • Tsinghua University
  • Renmin University
  • Microsoft

  • Main contact and coordinator: Professor Rui Zhang