The need for data-driven decision-making primarily motivates interest in analysing Big Data in higher education. Although there has been considerable research on the value of Big Data in higher education, its application to address critical issues within the sector is still limited. This systematic review, conducted in December 2021 and encompassing 75 papers, analysed the applications of Big Data and analytics in higher education. The focus was on their usage in supporting learning, teaching and administration as reported in papers indexed in SCOPUS, Web of Science and IEEE Xplore. The key findings from the review revealed that Big Data and analytics are predominantly used to support learning and, to a lesser extent, to guide teaching and inform administrative decision-making processes. The review also identified a set of studies focused on supporting student well-being. Further, we extend the use of Big Data in higher education to include the well-being of students and staff. This paper contributes to the growing debate on the practical use of Big Data and analytics to provide valuable insights for solving systemic challenges facing higher education in the twenty-first century.
Higher education institutions generate large and exponentially increasing amounts of ‘Big Data’ (Monino & Sedkaoui, 2016), either via interaction between different stakeholders or via stakeholders interacting with learning management systems and student records systems. Although there is no single definition of Big Data (Arbia, 2021), it is generally described using 3 Vs (Erevelles et al., 2016; Grover et al., 2018; Ristevski & Chen, 2018): volume (a large amount of data), variety (heterogeneity of data, ranging from structured to unstructured) and velocity (the speed with which the data can be accessed). Some researchers added veracity (data quality, e.g., Jin et al., 2015), value (the worth of the generated insights, e.g., Lycett, 2013; Naeem et al., 2022) or variability (the presence of inconsistency and noise, e.g., Jo, 2019). Regardless of any particular conceptualisation, the increasing amount of data generated in the higher education sector provides opportunities for extracting valuable, actionable insights, similar to other sectors. For example, in the healthcare sector (Singh et al., 2021), Big Data techniques are used in medical image processing to detect or predict disease progression (Rehman et al., 2022). In cybersecurity (Alani, 2021), Big Data and analytics are being used for ransomware (Huang et al., 2018) or phishing detection (Gutierrez et al., 2018). In addition, Big Data has been used for crop yield prediction (Abbas et al., 2020), digital marketing (Kushwaha et al., 2021) and search engine optimisation (Drivas et al., 2020).
With the increasing use of digital technologies to support learning and teaching, a significant amount of data is being generated, primarily by engaging students and faculty in learning management systems (LMS). This data can be harvested, processed and used to address critical challenges higher education institutions face. Drawing on the successes of using Big Data in various sectors, higher education can likewise seize the opportunity to apply Big Data techniques to gain valuable insight for decision-making. Early research noted that higher education is a sector yet to be penetrated by Big Data and analytics (Attaran et al., 2018), referring to Big Data as an ‘untapped opportunity’ in higher education (Chaurasia & Frieda Rosin, 2017). Several early articles discussed the potential of harnessing Big Data in higher education (e.g., Attaran et al., 2018; Daniel, 2017; Tasmin et al., 2020). For example, Daniel (2015) proposed three scenarios in which Big Data can support learning, teaching and administration. These scenarios were developed from a critical analysis of early work on applying Big Data in education. Chaurasia and Frieda Rosin (2017) further proposed four potential uses of Big Data in education, including reporting and compliance (which could be subsumed under supporting administration in Daniel's (2015) framework), analysis and visualisation (classified under supporting the teaching and learning process), security and risk mitigation (supporting administration) and predictive analytics (supporting the teaching and learning process). Big Data can identify at-risk students, provide individualised learning experiences, or improve student assessment (Ray & Saeed, 2018). However, it is less clear how much the value of Big Data in higher education has been fully realised over the 2011–2021 period.
A literature review indicated that each of these uses had been realised. As a case in point, Waheed et al. (2020) focused on using Big Data to support learning. They demonstrated that a neural network model built on data from the virtual learning environment could predict students at risk of failing a course. On the other hand, Cooper et al. (2016) used Big Data to identify whether courses have accessibility issues, demonstrating its use to support administration. Individual papers may provide insights into a particular use within each of the proposed scenarios, but not a comprehensive overview, as each is limited to demonstrating an application in a single case study. Therefore, systematic reviews are beneficial for obtaining an overview of the application of Big Data in higher education.
Notably, current systematic reviews of Big Data in higher education focus on how learning analytics supports study success (Ifenthaler & Yau, 2020), the effectiveness of interventions on student outcomes such as retention, engagement and performance (Foster & Francis, 2020) and the effectiveness of learning analytics in addressing student dropout rates (De Oliveira et al., 2021). This relatively narrow focus is a drawback, as each systematic review addresses a single use of Big Data in higher education. Two possible exceptions are Baig et al. (2020), who examined the trends in 40 published papers on Big Data in education as well as the research themes addressed in this domain, and Alkhalil et al. (2021), who conducted a systematic mapping study on the use of Big Data in higher education. However, Baig et al.’s (2020) review focused on general education, without considering the specific characteristics of higher education settings. Similarly, Alkhalil et al.’s (2021) mapping article does not provide a summary and synthesis of the application of Big Data in higher education. As a result, there is limited research providing a broader overview of the available literature on the role of Big Data and associated analytics in higher education.
Further, it seems that studies focusing on Big Data are predominantly concerned with system performance, such as the development of predictive algorithms (Ifenthaler & Yau, 2020) or student satisfaction with using a dashboard (Ramaswami et al., 2019), and less with the application of those algorithms or the use of the dashboard to improve students’ learning outcomes. Moreover, the available reviews predominantly focus on the potential benefits for learning while neglecting the benefits for teaching and administration. This leaves open the question of whether the latter applications are lacking or whether those conducting the reviews have simply paid more attention to the benefits for learning.
Furthermore, it remains unclear what challenges are faced when Big Data and analytics are used in higher education. For example, technical issues, ethical considerations, and practical limitations have posed significant obstacles to the widespread adoption of Big Data in higher education (Daniel, 2019; Klein et al., 2019).
In this article, we report on the outcome of a systematic review of the application of Big Data in higher education that covers published work from 2011 to 2022, providing a comprehensive picture, identifying gaps and suggesting directions for future research. Our work opens up a valuable dialogue for policymakers interested in incorporating Big Data analytics into their operational and strategic initiatives.
The systematic review examines the literature on using Big Data in higher education in 2011–2022. We were particularly interested in research that showcased some benefits, improvements or otherwise contributed to decision-making or better student outcomes. The following questions guided the systematic review:
Prior to conducting the systematic review (a type of descriptive research), we performed a broad search ("big data" AND "Education") to familiarise ourselves with the scope of the literature and core keywords and to develop the exclusion criteria. Upon identifying the main keywords, we constructed our query (( ( "learning analytics" OR "big data" OR "data mining" OR "dashboard" OR "academic analytics") AND ( "higher education" OR "tertiary education" OR "HEI" OR "University" OR "College" OR "Faculty") AND ( "intervention" OR "implementation" OR "case study" OR "application") AND ( "improve" OR "enhance" OR "decision making"))). We searched the title, abstract and keywords in SCOPUS, Web of Science and IEEE Xplore. Web of Science was chosen for its multidisciplinarity and because it is the leading citation search database (Li et al., 2018), SCOPUS because it is the largest abstract database (Schotten et al., 2017), and IEEE Xplore because it contains computer science papers.
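The structure of the query above (four keyword groups, joined by OR within a group and by AND across groups) can be illustrated with a minimal Python sketch. The keyword lists are copied from the query; the function names `or_block` and `build_query` are illustrative assumptions, and the output is a generic Boolean string rather than the field-specific syntax of any particular database.

```python
# Keyword groups copied from the reported query. The helper names below are
# hypothetical and do not reproduce any database's own search syntax.
GROUPS = [
    ["learning analytics", "big data", "data mining", "dashboard", "academic analytics"],
    ["higher education", "tertiary education", "HEI", "University", "College", "Faculty"],
    ["intervention", "implementation", "case study", "application"],
    ["improve", "enhance", "decision making"],
]


def or_block(terms):
    """Join quoted terms with OR and wrap the group in parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(groups):
    """Join the OR-groups with AND and wrap the whole query in parentheses."""
    return "(" + " AND ".join(or_block(group) for group in groups) + ")"


query = build_query(GROUPS)
print(query)
```

Keeping the groups as data makes it straightforward to adapt the same query to the title, abstract and keyword fields of each database, whose field tags differ.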
The search (December 2021) resulted in 1,072 entries; after removing the duplicates, 851 entries remained. Articles not in English, review articles and editorials were excluded from the search. Also, we limited the search to the following categories: "Education", "Educational research", "Computer Science OR Information Systems", "Computer Science Artificial Intelligence", "Computer Science Interdisciplinary applications", "Education Scientific disciplines", and "Telecommunications" (in Web of Science), as well as to "Computer Science" and "Social Sciences" (in Scopus). The screening of the articles had two stages. In the initial stage, the first author read the titles of the articles to determine whether they were relevant to the study research questions. In the second stage, the abstracts were read to determine whether the articles would be retained for analysis. If no decision could be made based on the abstract, the full text was screened too. The first author read and summarised all included studies. We excluded papers (1) whose context was not higher education and (2) that focused on attitudes, beliefs or opinions about Big Data. Our exclusions also extended to (3) discussion/conceptual/review papers, (4) papers where Big Data was not the main focus, (5) papers that dealt with the teaching of Big Data, (6) papers merely focused on the performance of algorithms, (7) papers dealing with architecture, (8) papers with incomprehensible abstracts, (9) papers whose full text was not available, and (10) articles retracted because of nonsensical content. The search procedure and retention of articles for the final review are depicted in Fig. 1. The summary of the reviewed studies is given in Table 1 in the online supplementary materials.
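The duplicate-removal step can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the review: it assumes each exported record is a dictionary with a `title` field, treats two records as duplicates when their normalised titles match, and uses made-up records.

```python
import re


def normalise_title(title):
    """Lowercase the title and collapse non-alphanumeric runs to one space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def deduplicate(records):
    """Keep the first record seen for each normalised title."""
    seen = set()
    unique = []
    for record in records:
        key = normalise_title(record["title"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical records, as they might be exported from three databases.
records = [
    {"title": "Big Data in Higher Education", "source": "Scopus"},
    {"title": "Big data in higher education.", "source": "Web of Science"},
    {"title": "Learning Analytics Dashboards", "source": "IEEE Xplore"},
]
unique = deduplicate(records)
print(len(records), "records before,", len(unique), "after")
```

Title matching alone can miss duplicates whose titles differ between databases, so a real workflow would typically also compare identifiers such as DOIs and check borderline cases by hand.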
In this review, we did not perform a formal quality or risk of bias assessment as our focus was primarily on mapping the types and areas of Big Data applications in higher education, aligning with this study's descriptive and exploratory nature. The review was not registered. Online first (ahead of print) articles were also included in the review.
In presenting the findings, we were guided by the tripartite approach (Daniel & Harland, 2017), which suggests systematic review studies should be described, synthesised and critically evaluated to provide new insights for further research.
We used descriptive statistics to summarise the findings of the review. Figure 2 presents the number of published articles per year. As the figure shows, the number of articles published per year follows an upward trend, suggesting that interest in research into the role of Big Data in higher education continues to grow globally.
The largest share of the papers came from the US (f = 14), followed by papers co-authored by scholars affiliated with universities in different countries (f = 8), Spain (f = 7), the UK (f = 6), China (f = 5), Indonesia, Finland and Australia (f = 3 each), Türkiye, Singapore, Japan, Italy, India and Ecuador (f = 2 each), and Vietnam, Sweden, South Korea, the Philippines, Ireland, Iran, Greece, Czechia, Chile, Canada and Belgium (f = 1 each).
In synthesising the findings, we grouped the publications into three main use categories: Big Data's role in supporting learning, teaching, and administration (Daniel, 2015). As will be seen later, some studies could not be classified into any of these, so we included additional categories (see Fig. 3).
At the time of the review, most published studies focused on supporting students' learning or various aspects of it. One such aspect is the prediction of student performance. For example, Castells et al. (2020) and Perez and Gonzalez (2016) presented a tool that predicts students' performance, while Prieto et al. (2020) presented two case studies related to the development of a data visualisation tool that analyses student performance and facilitates conversations between students and counsellors. Furthermore, Gutiérrez et al. (2020) presented a learning analytics dashboard that uses a multilevel clustering algorithm to predict a student's chance of success in an academic programme and then depicts that chance along with information on the quality of the prediction. Other studies reported using learning analytics to generate student dashboards (Ramaswami et al., 2019); such dashboards display an individual's activity or performance and compare it to the class average, or use an explainable ML algorithm (Afzaal et al., 2021). Researchers have also used students’ data to identify the factors behind the predicted performance in quizzes and assignments and to provide information on how the prediction could be improved by listing activities to be performed (e.g. watching a video) (Azcona et al., 2017).
In addition to the development of dashboards, researchers (Lonn et al., 2015) also endeavoured to discover the relationship between advisors’ use of a learning analytics early warning system, which used information from a learning management system to provide advisors with weekly updates about student engagement and performance, and their students’ performance during a summer bridge program. Also, researchers used on-campus geolocation data to identify study groups and examine the performance among group members (Azcona et al., 2017) or examined the link between learning input (the frequency of playing videos) and performance (Ji & Han, 2019). Similarly, others discussed the associations between performance, submission time and the number of submissions, how work at night or at weekends impacts performance, and how different group work patterns affect performance (Apiola et al., 2019).
Other uses of Big Data included predicting whether students will graduate within four years (He et al., 2018), examining the relationship between performance and digital footprint, identifying engagement patterns as an early predictor of performance, and correlating early-stage performance with overall performance (Summers et al., 2021). Research also examined the effect of an intervention in the form of guidance and recommendations (prepared by the teacher and based on learning analytics) on students' academic self-efficacy and problem-solving skills (Karaoglan Yilmaz, 2022). Researchers also developed tools (Broos et al., 2018) that provide feedback about performance on a positioning test (ability to solve math problems). Other studies reported the effect of a learning analytics intervention on student performance in a blended course (Gong et al., 2018; Zhang et al., 2020) or applied uplift modelling to demonstrate that offering tutorials to the students most likely to be retained as a result of those tutorials boosts the effect of such retention efforts (Olaya et al., 2020).
Researchers were interested in students dropping out both late and early. For example, Salazar-Fernandez et al. (2019) analysed the educational trajectories of 414 students in courses with high failure rates to identify similarities and differences between students. In particular, they looked at factors such as gender, income and entry math skills that could explain the different trajectories and failure rates. On the other hand, Dodge et al. (2015), Figueroa-Cañas and Sancho-Vinuesa (2021), and Linden and Webster (2019) trialled an intervention aimed at minimising the number of students who are unlikely to succeed early on in their academic programme. Similarly, Herodotou et al. (2020) demonstrated how PLA (predictive learning analytics) could inform the practice of the student support team, who contacted the students identified as having a low probability of completing their studies. To identify causes of academic failure, Nkhoma et al. (2020) used natural language processing to analyse 968 letters written by students in a business school (at the university in question, students who satisfy the criteria to be classified as "at risk" are asked to explain their situation in a letter and work with an advisor on a study plan). Frequencies of the most common words and word pairs were extracted, enabling the researchers to identify the most common reasons. To gain more insights from the data and provide context for the keywords by modelling the relationship between them, the researchers used visualisation based on semantic network analysis and topic modelling to validate the findings. Five significant reasons were identified: learning skills, assessment, time management, courses and family issues. Géryk and Popelínský (2014) presented an interactive visual analytics tool, EDAIME, that explores academic analytics and examines whether changes in the fields of study are related to retention.
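Counting the most common words and word pairs, as described for the letter analysis above, can be sketched in a few lines of Python. This is an illustrative sketch only and not the pipeline used by Nkhoma et al. (2020): the two example letters, the stop-word list and the function names are assumptions made for the example.

```python
import re
from collections import Counter


def tokens(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_and_pair_counts(documents, stopwords=frozenset()):
    """Count single words and adjacent word pairs across all documents."""
    words, pairs = Counter(), Counter()
    for document in documents:
        # Stop words are dropped first, so a pair may span a removed word.
        kept = [t for t in tokens(document) if t not in stopwords]
        words.update(kept)
        pairs.update(zip(kept, kept[1:]))
    return words, pairs


# Made-up letters, for illustration only.
letters = [
    "Poor time management affected my assessment results.",
    "Family issues and time management were my main problems.",
]
stop = frozenset({"my", "and", "were", "the"})
words, pairs = word_and_pair_counts(letters, stop)
print(words.most_common(2))
print(pairs.most_common(1))
```

Frequency counts of this kind only surface candidate themes; interpreting them as reasons for academic failure still requires the contextual steps (semantic network analysis, topic modelling) that the study reports.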
Most studies focused on student engagement reported either a pattern of engagement with the learning resources (Nkomo & Nat, 2021) or the results of an intervention aimed at increasing student engagement (e.g., Lawrence et al., 2019; Lu et al., 2017). For example, Karaoglan Yilmaz and Yilmaz (2022) provided students in the experimental group (N = 33) with personalised metacognitive feedback (based on learning analytics containing info on the weekly learning management system use) and personalised recommendations and compared their engagement to that of the control group. Similarly, Cobos and Ruiz‐Garcia (2021) provided intervention in the form of feedback about students' progress, as well as suggestions for improvement of performance, to examine whether an intervention can change the engagement of students enrolled in a MOOC (massive open online course) as well as the perception about their persistence, intention to complete and pass the course, and performance. Álvarez-Méndez et al. (2020) extracted Moodle log files (N = 33,776) for 121 students and analysed them to examine their interactions with the LMS resources.
Another, more unusual use of Big Data in understanding engagement was reported by Cheong et al. (2018). They presented the results of piloting an integrated Telegram application and web-based forum that uses natural language processing and text mining to provide thoughtfulness scores on students' questions and answers as they write them, with the idea that students might gain insight into how the document is developed and that their contribution to it may foster metacognitive skills. McNely et al. (2012) presented a visualisation tool (Uatu) that provides real-time engagement metrics on co-authorship and collaborative writing. Wang et al. (2021) were interested in what features could be derived from the logged interaction data of the problem-solving process during a simulation and whether the extracted features could predict success or failure in problem-solving.
Other uses of Big Data for supporting student learning can be seen in the work of Althbiti et al. (2021), who introduced PAARS (Personalised Academic Advisory Recommender System), an automated recommender system that helps students with course selection. Students can input their research area or learning objectives, and the system provides a list of recommended courses based on content-based filtering algorithms and ensemble learning algorithms. Another example of miscellaneous use is Park and Jo (2015), who developed a learning analytics dashboard called LAPA (Learning Analytics for Prediction & Action), which supports students' learning by informing them of their online behaviour.
Dawson and Hubball (2014) developed and implemented a curriculum analytics tool that visualises the connections between courses in a curriculum network. The tool can analyse individual student learning pathways and identify dominant student pathways or curriculum pathways that impede/promote timely completion. Similarly, Hilliger et al. (2020) present a tool that generates reports of attained competencies at the course and program levels. In a different paper, Barb and Kilicay-Ergin (2020) evaluate the curriculum coherence of the Information Science programme by identifying academic overlaps and gaps using ontologies and natural language processing.
Other uses include examining whether students engage with the learning materials as intended. For example, Nagi (2019), Alachiotis et al. (2019), Ayub et al. (2017), and Llopis-Albert and Rubio (2021) extracted learning analytics from the learning management system to examine whether students were engaging with the assigned hands-on activities (e.g., participation in quizzes, visits to the platform), while Harindranathan and Folkestad (2019) examined whether student quiz-taking behaviour was aligned with the intention of the instructional design by extracting and analysing Canvas quiz log data. Similarly, Baralis et al. (2017) set out to find out whether the planned objectives of the educational video service, such as appreciation (number of accessed courses), effectiveness (as reflected in the correlation between use and performance) and flexibility, were reflected in the users’ behaviour. In addition to examining the use of video recordings, Sarsfield and Conway (2018) also looked at potential differences in usage between subgroups of students and modules. To examine whether learners differ in their use of practical vs theoretical resources, Braccini et al. (2021) looked at 2,000,000 records, while Hernández-García et al. (2016) extracted and analysed data from the LMS (~ 114,756 records) to review students' visible interactions (i.e. how many times a user replied to a post by a different user) and invisible interactions (i.e. how many times a user read a message by another user).
Examples of miscellaneous use include Beasley et al. (2021), who analysed peer review text from two visualisation courses (~ 4,687 reviews in total) using sentiment analysis and Gottipati et al. (2017), who present a learning analytics tool that analyses qualitative data by extracting the sentiments of the feedback that students leave at the end of the course.
To capture the effectiveness of peer instruction compared to traditional teaching, Kuromiya et al. (2020) chose the number of accesses to Moodle content, Moodle quizzes, the Moodle forum and Moodle resources as engagement indicators. They compared these indicators between periods when three teachers held traditional lectures and periods of peer instruction (the intervention period).
Regarding supporting administration, one common use of Big Data is the examination of curriculum analytics. Armatas et al. (2022) developed a programme review tool to conduct learning analytics associated with the performance of a programme. The tool’s forms of analysis include network analysis, grades analysis (e.g., information about how complex a subject is), and prediction of award GPA (grade point average). Similarly, Cooper et al. (2016) used Big Data to identify accessibility deficits in courses. They recorded the percentage of students (2009–2013) who did and did not declare a disability and compared the odds ratio for completing each module, which allowed them to pinpoint several modules that may have accessibility issues.
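The odds-ratio comparison described above can be made concrete with a small worked example. The sketch below is illustrative only: the counts are invented, the function name is an assumption, and it shows the standard odds-ratio calculation for a 2 × 2 table rather than the exact analysis of Cooper et al. (2016).

```python
def odds_ratio(completed_a, not_completed_a, completed_b, not_completed_b):
    """Odds of completion in group A divided by the odds in group B."""
    if 0 in (not_completed_a, completed_b, not_completed_b):
        raise ValueError("odds ratio is undefined when a denominator count is zero")
    odds_a = completed_a / not_completed_a
    odds_b = completed_b / not_completed_b
    return odds_a / odds_b


# Invented counts for a single module (for illustration only).
# Group A: students who declared a disability; group B: students who did not.
ratio = odds_ratio(completed_a=30, not_completed_a=20,
                   completed_b=400, not_completed_b=100)
print(f"odds ratio for completion: {ratio:.3f}")
```

With these invented counts, the odds of completion are 1.5 in group A and 4.0 in group B, so a ratio well below 1 would flag the module for a closer look at possible accessibility barriers. A real analysis would also report a confidence interval, since small groups give unstable ratios.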
The use of Big Data for tracking admission is reported in the work of Khudzaeva et al. (2018), who used clustering techniques to group high schools based on the GPA of their students so that admission quotas could be changed according to the results (e.g. increasing quotas for schools that produce students with high GPAs and reducing them for schools that produce students with low GPAs). Another example is Burkhardt et al. (2016), who conducted a retrospective analysis of admission and enrolment data. They produced a dashboard using Visual Basic and Excel that allows decision-makers to input student factors (e.g., financial aid offers) and obtain predicted enrolment as output.
Big Data was also used to study the available resources at an institutional level. For example, Alrehaili et al. (2021) present the Higher Education Activities and Processes Automation Framework (HEAPAF) and a higher education ontology. This framework can be used to extract data from different resources. The authors use it to analyse, find, and rank the right resources for teaching a course. Likewise, to examine the collaborative relationship between authors and determine experts in particular fields through social network analysis, Elisabeth et al. (2019) downloaded the metadata of published articles from Scopus, which contained the authors’ names, keywords, affiliations and funding. They examined author-author links, author-keywords, and author affiliations by creating a graph of the author network, which showed the connectedness between authors and the most central (famous) authors in a network. Anastasios et al. (2011) present a tool that evaluates the research performance of a university and the achievement of a research policy using multiple indicators. The tool provides graphical visualisation (e.g., network analysis) on four indicators: scientific publications, collaboration with other higher education institutions, and collaboration with industry and research sectors. Srinivas and Rajendran (2019) also report on a SWOT (strengths, weaknesses, opportunities and threats) analysis based on text analysis of students' online reviews (N = 24,390) collected from a university review website. The authors used topic modelling (to automatically identify predominantly discussed topics and categorise each sentence into the appropriate topic), sentiment analysis (to detect the affect associated with each sentence), and SWOT analysis using the topic-based opinion summaries to identify strengths and weaknesses. They compared the topic opinion summaries concerning their university to those of other universities to identify threats and opportunities.
Other studies report administrative use of Big Data more broadly on the university campus. For example, Du et al. (2019) uncovered student recreation centre usage patterns by using historical data from swipe cards and user profile data. They developed a web app that predicts visit volume. At the same time, Xia and Liu (2018) used library data of 18,294 students and staff about the books borrowed in 2017. They examined the relationship between readers and book categories to inform decision-making and give recommendations (e.g., books that are frequently borrowed together to be located near each other). Chi et al. (2012) developed SAS (smart alumni system) to connect alumni and students, which incorporates a social networking style mentoring system and uses data mining to discover user relationships. Alumni who have expressed interest in mentoring and guiding students on their career path are matched with students based on interests, occupation or the city where they live. Big Data has also been applied to examine the effect of training on the use of the academic management system and how enhanced usage of the system is reflected in the performance of staff and faculty (Joy & Nambirajan, 2021).
Other uses include Rad et al. (2011), who endeavoured to cluster and rank university majors in Iran. The authors identified 177 university majors from a list solicited from the relevant ministry, and eight main specialisation groups were defined. Then 64 experts were asked to compare the university majors' influence on these eight specialisation groups and the importance of each specialisation group for present-day Iran. Ten clusters were derived using k-means clustering.
Some studies could not be categorised as supporting either learning or teaching alone. Thus, we created a separate category representing the intersection of supporting teaching and learning. A paper that fits into this category is Villamañe et al. (2016), who present RubricVis, a tool that provides visually enriched feedback for rubric assessment. The tool can present information as a radar graph; students can see their weak and strong areas, compare their performance to that of their peers in the group, and follow their progress. Teachers can observe the performance of a student or a group of students, track their progress, or compare the performance of different groups.
A further example is Romero et al. (2013), who analysed the quiz results of 104 students and developed association rules. The quiz was changed, and the course was modified based on the results. The results of the students taking the original quiz were compared to those of two other groups taking the modified quiz. The updates in the quiz resulted in a better score, indicating that improvement in the course also improved performance.
Nguyen et al. (2018) examined the extent to which the students’ timing of engagement and instructors' learning design match, as well as how performance relates to study patterns (i.e. engagement), while Essa and Ayad (2012) present a tool (S3) that offers 1) a synoptic view of students' progress as well as visualisation and identification of students at risk, 2) a comparison with another learner, and 3) a sociogram showing patterns of collaboration.
Taniguchi et al. (2017) aimed to study the impression topics hidden in students' journals by extracting weekly keywords commonly mentioned in the journals and students' impressions regarding those keywords. Students (~ 100) in an "Information Science" course were instructed to write weekly reflective entries after class (total N of entries = 1,664). The weekly topics were frequently mentioned when students wrote about something they found problematic.
To profile students enrolled in a MOOC on an IELTS preparation course, Ocaña et al. (2019) obtained the data of 22,164 students via the EdX platform and applied the k-means algorithm to obtain five clusters: strong starters, weak finishers; more content, less assessment; more assessment, less content; very high engagement, moderate performance; and high engagement, high performance. Laakso et al. (2018) present ViLLE, a tool that automatically assesses exercises and provides insights to the teacher.
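For readers unfamiliar with k-means, the following minimal sketch shows the general technique (Lloyd's algorithm) that underlies this kind of learner profiling. It is not the analysis of Ocaña et al. (2019): the six learners, their two features and the choice of three clusters are invented for illustration, and the initial centroids are passed in explicitly to keep the example deterministic, whereas practical implementations usually initialise them randomly.

```python
import math


def kmeans(points, centroids, iterations=20):
    """Lloyd's algorithm: alternate between assignment and centroid update."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                mean = tuple(sum(dim) / len(members) for dim in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return labels, centroids


# Hypothetical learners described by (engagement, performance), scaled to 0-1.
learners = [(0.9, 0.9), (0.8, 0.85), (0.2, 0.3), (0.1, 0.2), (0.9, 0.5), (0.85, 0.55)]
labels, centres = kmeans(learners, centroids=[learners[0], learners[2], learners[4]])
print(labels)
```

The cluster labels produced by the algorithm are arbitrary integers; descriptive names such as "high engagement, high performance" are assigned afterwards by inspecting each cluster's centroid.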
This section summarises the application of Big Data in higher education, which could not be neatly classified into support for learning, teaching or administration, or the intersection of two or more. McCulloch et al. (2021) created a visualisation tool and web-based visual analytics dashboard for empowering autistic students to communicate their experiences and manage their activities. The visualisation highlights students' physiological status (e.g. stressed vs unstressed) and marks locations related to high stress (using geolocations and Fitbit data), nudging them to consider stress management techniques. The tool also visualises sleep quality (timing and duration of each sleeping phase).
To identify the discourse in the social media footprint left by students on unofficial Facebook pages of 41 private and public higher education institutions in the Ilocos Region in the Philippines, Aviles and Esquivel (2019) used sentiment analysis to determine the polarity of the posts and comments (N = 3,000). A web-based application classified the posts into five categories (academics, social engagement, emotions, finances, policies and health). A word cloud of the most frequent words was produced. The majority of the posts were positive and related to social engagement. Similarly, to analyse student feedback obtained from the Twitter API and web applications using sentiment analysis, Sivakumar and Reddy (2017) extracted tweets on engineering education and then calculated their emotion and polarity.
To examine the affective response evoked by viewing one's learning analytics information, Joseph-Richard et al. (2021) asked 42 students to indicate their emotional reactions using a questionnaire while they viewed their learning analytics. Students reacted to their learning analytics with diverse emotions. Viewing one's own predictive learning analytics (PLA) did not necessarily lead to increased motivation; it also led to fear, scepticism and doubt.
To analyse student agency, Jääskelä et al. (2021) asked 130 computer programming students to complete a questionnaire measuring 11 dimensions of agency and, based on their answers, clustered the students into four profiles.
The review reveals that the current application of Big Data and analytics in higher education is varied and spans the three spheres of supporting learning, teaching and administration. The majority of studies, however, reflected efforts to support learning, with fewer supporting teaching and administration. This trend suggests that the application of Big Data and analytics is recognised as a potentially powerful tool to enhance student learning outcomes, which aligns with previous research on the application of Big Data in education (Ifenthaler & Yau, 2020). Although most studies focused on exploring strategies to support student learning, they predominantly addressed assessing students' performance, predicting dropout rates and proposing intervention strategies. Some studies employed sentiment analysis or addressed the development of metacognitive skills.
In contrast, fewer studies focused on the role of Big Data in addressing administrative issues. The few studies that reported using Big Data in administrative decision-making covered curriculum analysis, equity issues in representation and access to learning, admission processes, resource allocation, collaborative relationships, library usage, SWOT analysis and social network analysis. This dearth of exploration might be due to factors such as the more visible and immediate benefits in learning outcomes or potential skill gaps among administrative staff. It also aligns with Alkhalil et al.'s (2021) observation that current research on Big Data in higher education is still at an immature stage, which suggests potential for further exploration in administrative decision-making. Leveraging Big Data in administrative functions presents immense opportunities for strategic planning, operational efficiency, and promoting equity in higher education institutions. By harnessing the predictive power and insightful analytics of Big Data, administrative decisions can be more data-driven and effective. This further underscores the importance of future research into administrative uses of Big Data. Such research would contribute to our understanding of the scope of Big Data applications in higher education and help identify and address potential barriers to implementation (Daniel, 2015).
Some studies could not be classified because they overlap between supporting teaching and supporting learning. It is worth remembering that, in practice, it is difficult to draw a line between where support for teaching ends and support for learning begins. Ultimately, all teaching activity is aimed at supporting learning.
Our findings are consistent with those of previous studies (e.g., Aytaç & Bilge, 2020; Baig et al., 2020; Ifenthaler & Yau, 2020). However, a unique aspect we noticed is the emergence of Big Data applications to support stakeholder well-being, which has been less emphasised in earlier research. This observation opens up a new perspective on the potential of Big Data in education. Thus, we propose a three-dimensional model consisting of support for the teaching and learning process, support for administration and support for the well-being of actors in higher education. The model proposed by Daniel (2015) could accordingly be refined and updated to reflect these findings.
As discussed above, although we identified studies spanning the three applications, most focused on supporting students' learning. Moreover, even the studies focused on students have more to do with the data aspect than the learning aspect, as also identified by other studies (Ifenthaler & Yau, 2020). Thus, it seems that the potential of Big Data in higher education is underutilised.
As some other review studies noted (e.g., Ifenthaler & Yau, 2020), most publications focus on system performance and are more concerned with analytics than learning. This observation also aligns with other authors' views (Gašević et al., 2015; Roberts et al., 2017). Thus, the potential of Big Data and analytics is not adequately realised. For example, although Ocaña et al. (2019) discuss how profiling can identify at-risk students and improve course design and delivery, they do not take the profiling in their study one step further to implement this potential. Similarly, Ayub et al. (2017) state that the association rules obtained could be used to improve the learning management system but do not elaborate on how this could be achieved. Instead, they offer some generic advice, such as the introduction of gamification in the LMS, but it is unclear how those recommendations follow from the association rules.
Further, as the findings by Joseph-Richard et al. (2021) suggest, the implementation of predictive learning analytics needs to be conducted with students' well-being in mind, as some may experience nudging as nagging (Lawrence et al., 2019), may not want to know their prediction (Afzaal et al., 2021) or may have privacy concerns (Laakso et al., 2018). Furthermore, the predictions may be confusing, especially if students do not know how the predictions are arrived at (Gutiérrez et al., 2020).
In some instances, the quality of evidence for a given claim needs to be more robust (Ferguson & Clow, 2017), as the studies examining intervention effectiveness do not always report the results of statistical tests. For example, Lawrence et al. (2019) report "increases in at-risk students engaging with their courses" (p. 53) but provide no statistics to back this claim. Likewise, in another study, 10 out of 35 students in the cohort with the intervention took the final exam, compared with 4 out of 36 in the cohort without it (Figueroa-Cañas & Sancho-Vinuesa, 2021), yet no formal statistical tests were reported. Hence, the effect remains unclear: the descriptive differences may be due to chance or may represent a meaningful effect.
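To illustrate the kind of formal test that could accompany such counts, the following minimal sketch applies a two-sided Fisher's exact test to the 10 of 35 versus 4 of 36 figures quoted above. The choice of Fisher's exact test and the function name `fisher_exact_two_sided` are our assumptions; this is not an analysis reported by the original authors, and other tests (e.g., a chi-square test) may be equally defensible.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that x of the col1 first-column cases fall in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # The small tolerance ensures tables tied with the observed one are counted.
    return sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-9)
    )

# Rows: with / without intervention; columns: took / did not take the exam.
p_value = fisher_exact_two_sided(10, 25, 4, 32)
print(f"two-sided p = {p_value:.3f}")
```

Reporting such a p-value alongside an effect size (e.g., the difference in proportions with a confidence interval) would let readers judge whether the descriptive difference is distinguishable from chance.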
Similarly, Cobos and Ruiz‐Garcia (2021) report a "considerable imbalance" in the success rate between the control and experimental groups but do not provide the appropriate statistics. Even when statistical tests are performed, the conclusions drawn and the evidence to back them can remain unclear because the applied statistical test is inappropriate for answering the research question. For example, Afzaal et al. (2021) conducted two separate paired samples t-tests to test whether the students who used the dashboard between quizzes 1 and 2 performed better at time two than those who did not. However, in this situation, a more appropriate approach would have been a mixed ANOVA, with performance at time one and time two as a within-subjects factor and group as a between-subjects factor, because the increase in one group might not differ significantly from the increase in the other group.
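With only two time points and two groups, the group-by-time interaction of a mixed ANOVA is equivalent to an independent samples t-test on gain scores (the interaction F equals the squared t). The sketch below, assuming equal variances and using invented quiz scores, shows how the two groups' improvements can be compared directly rather than through two separate paired tests. The function name `gain_score_t` and all data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def gain_score_t(pre_a, post_a, pre_b, post_b):
    """Student's t on gain scores (post - pre) of two independent groups.

    For a 2 (group) x 2 (time) design, t squared equals the F statistic
    of the group x time interaction. Returns (t, degrees_of_freedom).
    """
    gains_a = [post - pre for pre, post in zip(pre_a, post_a)]
    gains_b = [post - pre for pre, post in zip(pre_b, post_b)]
    n_a, n_b = len(gains_a), len(gains_b)
    df = n_a + n_b - 2
    # Pooled variance of the gain scores (equal-variance assumption).
    pooled_var = (
        (n_a - 1) * variance(gains_a) + (n_b - 1) * variance(gains_b)
    ) / df
    se = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(gains_a) - mean(gains_b)) / se, df

# Hypothetical quiz scores, invented for illustration only.
dash_q1, dash_q2 = [60, 55, 70, 65, 50], [68, 60, 75, 74, 58]
ctrl_q1, ctrl_q2 = [62, 58, 66, 71, 49], [66, 63, 69, 77, 55]
t_stat, df = gain_score_t(dash_q1, dash_q2, ctrl_q1, ctrl_q2)
```

The resulting t statistic would then be compared against a t distribution with the returned degrees of freedom; both groups can improve significantly in separate paired tests while this between-group comparison remains non-significant.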
Sample size considerations or post hoc discussions of achieved power are rarely mentioned. Still, in some instances, the sample size, and thus the subsequent power to detect an effect, is relatively small. For example, Zhang et al. (2020) had 49 participants divided into experimental and control groups. Similarly, Gong et al. (2018) examined the effectiveness of an intervention using 31 participants in total.
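As a rough illustration of why such samples may be underpowered, the sketch below uses the normal approximation for a two-sided comparison of two group means. The standardised effect size of 0.5, the even split into groups of 24 and the function names are assumptions made for illustration; the reviewed studies do not report these values.

```python
from math import ceil, sqrt
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution

def required_n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison of means."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    z_power = _Z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

def approximate_power(d, n_per_group, alpha=0.05):
    """Approximate power, ignoring the negligible opposite-tail rejection region."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    return _Z.cdf(d * sqrt(n_per_group / 2) - z_alpha)

# Assumed medium effect (d = 0.5) and an assumed split into two groups of 24.
n_needed = required_n_per_group(0.5)
power_small_study = approximate_power(0.5, 24)
```

Because the normal approximation slightly understates the sample size required by an exact t-based calculation, these figures are best read as lower bounds on the number of participants needed.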
The data used in the predictive models, or for data mining, are limited to what is available in the learning management system, and other "offline" variables are not considered (Nguyen et al., 2018). Further, multiple tests are conducted on the same data set without adjusting the significance level to deal with Type I error inflation; in some cases (e.g., Álvarez-Méndez et al., 2020), such an adjustment could render the reported findings non-significant. The overreliance on p-values, disregarding effect size, may be further misleading (Nuzzo, 2014). Pre-processing the data is time-consuming (Harindranathan & Folkestad, 2019). In some cases, the results do not provide insights beyond the specific context (Sarsfield & Conway, 2018; Nkomo & Nat, 2021; Ji & Han, 2019). The need for some tools is unclear, or users do not see their attractiveness (e.g., Cheong et al., 2018). The diagrams produced by some tools can be overwhelming (Dawson & Hubball, 2014) and complex (McCulloch et al., 2021), making them difficult to understand, and the interventions based on learning analytics may have limited or no impact (Dodge et al., 2015; Park & Jo, 2015). Some of these challenges are overcome in those cases where the dashboards offer an interpretation or call to action. Still, even in those cases, we do not understand the long-term impact of "negative" predictions on students' well-being.
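The point about multiple testing can be made concrete with a minimal sketch of Holm's step-down adjustment, one common way of controlling the family-wise Type I error rate. The choice of Holm's method, the function name `holm_adjust` and the raw p-values are assumptions for illustration; the reviewed studies do not name a specific correction.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Scale the (rank + 1)-th smallest p-value by the hypotheses still in play.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease as the raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical raw p-values from four tests on the same data set.
raw = [0.04, 0.01, 0.03, 0.20]
adjusted = holm_adjust(raw)
significant = [p < 0.05 for p in adjusted]
```

In this invented example, three raw p-values fall below 0.05 but only one remains below that threshold after adjustment, which is the pattern the paragraph above warns about.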
The papers implementing an intervention often lack scientific rigour, and the long-term outcomes of such interventions regarding motivation or well-being are unknown. Although the ultimate aim of Big Data and analytics is to inform decision-making (Lonn et al., 2012), a limited number of the articles reviewed achieved this aim.
Big Data and analytics are arguably two critical research paradigms that have emerged in an era where society rapidly generates data in large volumes. As an emerging paradigm, working with Big Data and analytics requires the knowledge of the 'fourth tradition' (Daniel, 2017). According to Daniel (2017), this fourth tradition is an empirical data-intensive scientific approach underpinned by the principles of knowledge discovery through data mining and visualisation. The fourth tradition necessitates the development of predictive and actionable analytics to solve complex societal problems. For over a decade, predictive and actionable analytics have been viewed as promising mechanisms for addressing challenges the higher education sector faces in the twenty-first century. The systematic review presented in this article closely examined published literature on the role of Big Data and associated analytics. As our goal was to gain a broad overview of the application of Big Data in higher education in the stated review period, we did not control for the quality of the reviewed papers.
Furthermore, the search for articles was limited to only three databases. There is always the possibility that including or excluding particular keywords would have led to a different set of papers. Nonetheless, we believe the included keywords sufficiently reflect the breadth of uses of Big Data in higher education.
Despite these limitations, this systematic review offers a broad overview of the uses of Big Data in higher education. The review findings revealed that Big Data and analytics are predominantly used to support learning and, to a lesser extent, teaching and administration, and this is broadly consistent with other studies (Aytaç & Bilge, 2020; Baig et al., 2020). However, we observed some varied uses in supporting the latter. We also identified a set of studies focused on supporting student well-being. Thus, we proposed a refinement of the model postulated by Daniel (2015) to include supporting the teaching and learning process, administration, and the well-being of students and staff. Future studies could expand on the well-being aspect. Further, studies examining the effect of interventions based on Big Data should use more rigorous statistical tests to make a convincing argument for their effectiveness. In addition, studies utilising Big Data should be based on a theoretical perspective and be targeted towards solving an existing problem in practice.
Open Access funding enabled and organized by CAUL and its Member Institutions