Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We are using the Yelp Review Dataset as the streaming data source for the DataCI example. We have processed the Yelp review dataset into a daily-based dataset by its `date`. In this dataset, we will only use the data from 2020-09-01 to 2020-11-30 to simulate the streaming data scenario. We are downloading two versions of the training and validation datasets:
According to our latest research, the global streaming analytics market size reached USD 19.7 billion in 2024, reflecting robust adoption across industries driven by the demand for real-time data insights. The market is projected to expand at a CAGR of 21.6% from 2025 to 2033, reaching a forecasted value of USD 134.2 billion by 2033. This impressive growth trajectory is primarily fueled by the accelerated digital transformation initiatives, increasing volumes of streaming data, and the critical need for real-time decision-making capabilities in diverse sectors such as BFSI, IT and telecommunications, retail and e-commerce, healthcare, and manufacturing.
One of the primary growth factors for the streaming analytics market is the exponential increase in data generated from various sources, including IoT devices, social media platforms, mobile applications, and enterprise systems. Organizations are seeking advanced analytics solutions to process, analyze, and extract actionable insights from this continuous data flow. The proliferation of connected devices and the advent of Industry 4.0 have significantly contributed to the adoption of streaming analytics, as businesses strive to gain a competitive edge by leveraging real-time data for operational efficiency, customer engagement, and predictive maintenance. The integration of artificial intelligence and machine learning algorithms into streaming analytics platforms further enhances their capabilities, enabling automated pattern recognition, anomaly detection, and advanced forecasting.
Another significant driver is the increasing emphasis on fraud detection and risk management across industries such as BFSI, healthcare, and retail. Real-time analytics empower organizations to detect suspicious activities, prevent financial losses, and ensure compliance with regulatory requirements. For instance, financial institutions utilize streaming analytics to monitor transactions in real time, identify fraudulent behavior, and mitigate risks effectively. Similarly, healthcare providers leverage these solutions to track patient data, predict potential health risks, and optimize clinical workflows. The ability to process and analyze data as it is generated provides organizations with a substantial advantage in responding to emerging threats and opportunities swiftly.
Furthermore, the shift towards cloud-based deployment models is accelerating market growth by offering scalability, flexibility, and cost-effectiveness. Cloud-based streaming analytics solutions enable organizations to handle large volumes of data without the need for significant upfront infrastructure investments. This democratizes access to advanced analytics capabilities, particularly for small and medium enterprises (SMEs) that may lack the resources for on-premises solutions. The growing ecosystem of cloud service providers, coupled with advancements in data security and privacy, has made cloud adoption a preferred choice for organizations seeking to harness the power of streaming analytics.
Regionally, North America remains the dominant market, accounting for the largest revenue share in 2024, followed by Europe and Asia Pacific. The presence of major technology players, early adoption of advanced analytics, and substantial investments in digital infrastructure contribute to North America's leadership position. However, Asia Pacific is expected to witness the highest growth rate over the forecast period, driven by rapid industrialization, expanding internet penetration, and increasing adoption of IoT technologies. Latin America and the Middle East & Africa are also emerging as lucrative markets, supported by growing digitalization efforts and government initiatives to promote smart cities and digital economies.
The component segment of the streaming analytics market is bifurcated into software and services. The software component holds a substantial share of the market, as organizations across vario
https://opensource.org/licenses/BSD-3-Clause
The datasets in this release support the results presented in the paper
P. Jamshidi, G. Casale, "An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing Systems", accepted for presentation at MASCOTS 2016.
An open-access version of the paper is available at https://arxiv.org/abs/1606.06543
Open-source code is also available at https://github.com/dice-project/DICE-Configuration-BO4CO
The archive contains 10 comma-separated datasets representing performance measurements (throughput and latency) for 3 different stream benchmark applications. These were collected experimentally on 5 different cloud clusters over the course of 3 months (24/7). Each row in a dataset represents a different configuration setting for the application, and the last two columns give the average performance of the application measured over 10 minutes under that configuration setting. The datasets contain full factorial, exhaustive measurements for all possible settings, limited to a predetermined interval for each variable. Each dataset is named in the following format: "benchmark_application-dimensions-cluster_name". For example, "wc-6d-c1" refers to the WordCount benchmark application with 6 dimensions (i.e., we varied 6 configuration parameters), deployed on the c1 cluster (OpenNebula, see Appendix). This resulted in a dataset of size 2880, i.e., collecting the data took 2880 × 10 min = 480 h = 20 days.
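As a rough illustration of the naming format and the collection-time arithmetic described above, the following Python sketch splits a dataset name into its three parts and converts a row count into days of measurement. The function and class names are hypothetical and are not part of the released code; the sketch assumes names always follow the three-part pattern with a trailing "d" on the dimension count.

```python
from typing import NamedTuple


class DatasetName(NamedTuple):
    benchmark: str   # e.g. "wc" for WordCount
    dimensions: int  # number of configuration parameters varied
    cluster: str     # e.g. "c1"


def parse_dataset_name(name: str) -> DatasetName:
    """Split a name such as 'wc-6d-c1' into benchmark, dimensions, cluster."""
    parts = name.split("-")
    if len(parts) != 3 or not parts[1].endswith("d"):
        raise ValueError(f"unexpected dataset name: {name!r}")
    benchmark, dims, cluster = parts
    return DatasetName(benchmark, int(dims[:-1]), cluster)


def collection_time_days(n_rows: int, minutes_per_row: int = 10) -> float:
    """Each row is one configuration measured for `minutes_per_row` minutes."""
    return n_rows * minutes_per_row / 60 / 24


print(parse_dataset_name("wc-6d-c1"))
print(collection_time_days(2880))  # 2880 rows at 10 minutes each
```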
For more information about the data refer to the appendix of the paper: https://arxiv.org/abs/1606.06543.
When referring to the dataset or code please cite the paper above.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
【 Study 1 】 Shared employee management practice is the set of policies and systems that sharing platform enterprises implement for shared employees. The construction of a shared employee management practice system should therefore collect data from two sides: sharing platform enterprises and shared employees. This study collected information on what sharing platform enterprises have done through official apps and websites, and information on what shared employees feel the platform has done through participatory observation and interviews. Specifically, data for the Didi Chuxing platform came mainly from three channels: the passenger side of the official app, the driver side of the official app, and interviews. Data for the Huajiao live streaming platform came mainly from three channels: official apps and websites, participatory observation, and interviews. For convenience of analysis, this study coded the data sources. First, data from the Didi Chuxing platform were coded as "A" and data from the Huajiao platform as "B". Second, data from the official app were coded as "A", and where there were multiple apps they were coded as "A01", "A02", and so on. Third, data from the official website were coded as "OW". Fourth, data from participatory observation were coded as "PO". Fifth, data from interviews were coded as "IM". Data from the same source were numbered consecutively. For example, the first and second codes extracted from the participatory observation data of the Huajiao live streaming platform are "YiPO-01" and "YiPO-02", respectively. 【 Study 2 】 A face-to-face interview questionnaire survey was conducted among 221 ride hailing drivers in Shanghai and Chengdu, with 8 respondents. The effective questionnaire rate was 100%.
The research process is divided into six steps: (1) Design a questionnaire on "Questionnaire Star" and send the link to the researchers (2 people in each group); (2) Train researchers, focusing on explaining the implementation rules and safety hazards during the research process; (3) Make an appointment for the survey subject, book an online ride hailing service through the passenger end of the Didi Chuxing APP, present identification documents to the ride hailing driver, inform them of the research purpose and payment method, and prepare for the survey with the support of the ride hailing driver; (4) Communicate research methods, read out research guidelines, and inform the research process: Firstly, Investigator A reads each question item (including the question stem and options), then asks ride hailing drivers to choose one of the five options from "strongly disagree" to "strongly agree", and finally, Investigator B is responsible for filling out the questionnaire (while A supervises); (5) Conduct research; (6) Pay based on local starting price and duration. 【 Study 3 】 This study conducted a questionnaire survey on 273 ride hailing drivers in Shanghai and Chengdu using face-to-face interviews (Feng Xiaotian, 2009) with 8 respondents. The effective questionnaire rate was 100%. The research process and implementation details are the same as Study 2, and will not be repeated here.
This dataset, termed "GAGES II", an acronym for Geospatial Attributes of Gages for Evaluating Streamflow, version II, provides geospatial data and classifications for 9,322 stream gages maintained by the U.S. Geological Survey (USGS). It is an update to the original GAGES, which was published as a Data Paper on the journal Ecology's website (Falcone and others, 2010b) in 2010. The GAGES II dataset consists of gages which have had either 20+ complete years (not necessarily continuous) of discharge record since 1950, or are currently active, as of water year 2009, and whose watersheds lie within the United States, including Alaska, Hawaii, and Puerto Rico. Reference gages were identified based on indicators that they were the least-disturbed watersheds within the framework of broad regions, based on 12 major ecoregions across the United States. Of the 9,322 total sites, 2,057 are classified as reference, and 7,265 as non-reference. Of the 2,057 reference sites, 1,633 have (through 2009) 20+ years of record since 1950. Some sites have very long flow records: a number of gages have been in continuous service since 1900 (at least), and have 110 years of complete record (1900-2009) to date. The geospatial data include several hundred watershed characteristics compiled from national data sources, including environmental features (e.g. climate – including historical precipitation, geology, soils, topography) and anthropogenic influences (e.g. land use, road density, presence of dams, canals, or power plants). The dataset also includes comments from local USGS Water Science Centers, based on Annual Data Reports, pertinent to hydrologic modifications and influences. The data posted also include watershed boundaries in GIS format. This overall dataset is different in nature to the USGS Hydro-Climatic Data Network (HCDN; Slack and Landwehr 1992), whose data evaluation ended with water year 1988. 
The HCDN identifies stream gages which at some point in their history had periods which represented natural flow, and the years in which those natural flows occurred were identified (i.e. not all HCDN sites were in reference condition even in 1988, for example, 02353500). The HCDN remains a valuable indication of historic natural streamflow data. However, the goal of this dataset was to identify watersheds which currently have near-natural flow conditions, and the 2,057 reference sites identified here were derived independently of the HCDN. A subset, however, noted in the BasinID worksheet as “HCDN-2009”, has been identified as an updated list of 743 sites for potential hydro-climatic study. The HCDN-2009 sites fulfill all of the following criteria: (a) have 20 years of complete and continuous flow record in the last 20 years (water years 1990-2009), and were thus also currently active as of 2009, (b) are identified as being in current reference condition according to the GAGES-II classification, (c) have less than 5 percent imperviousness as measured from the NLCD 2006, and (d) were not eliminated by a review from participating state Water Science Center evaluators. The data posted here consist of the following items:
- This point shapefile, with summary data for the 9,322 gages.
- A zip file containing basin characteristics, variable definitions, and a more detailed report.
- A zip file containing shapefiles of basin boundaries, organized by classification and aggregated ecoregion.
- A zip file containing mainstem stream lines (Arc line coverages) for each gage.
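The four HCDN-2009 criteria (a) to (d) described above amount to a conjunction of simple tests per gage. The Python sketch below shows one way such a check could look. All field names are hypothetical placeholders and are not the actual GAGES-II column names; the values in the usage lines are made up for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Gage:
    # Hypothetical fields; the real dataset uses its own variable names.
    station_id: str
    complete_years_1990_2009: int   # complete, continuous water years (max 20)
    is_reference: bool              # GAGES-II reference classification
    impervious_pct_nlcd2006: float  # percent imperviousness from NLCD 2006
    passed_state_review: bool       # not eliminated by state evaluators


def meets_hcdn_2009(gage: Gage) -> bool:
    """Return True only if all four criteria (a)-(d) hold."""
    return (
        gage.complete_years_1990_2009 == 20       # (a)
        and gage.is_reference                     # (b)
        and gage.impervious_pct_nlcd2006 < 5.0    # (c)
        and gage.passed_state_review              # (d)
    )


example = Gage("example-gage", 20, True, 1.2, True)  # made-up values
print(meets_hcdn_2009(example))
```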
https://spdx.org/licenses/CC0-1.0.html
Changing environments place stresses on ecosystems, and are contributing to widespread losses of biodiversity and ecosystem function. Comparisons of historical and contemporary data offer considerable utility in understanding how ecosystems respond to, adapt to, or recover from changing environments. Stream fishes offer a particularly interesting study system for this topic, as streams are naturally dynamic environments and human needs have placed increasing pressure on aquatic systems. The effects of fine sediments on stream fishes and aquatic ecosystems more broadly have been well studied. Yet studies from fluvial geomorphology have resulted in models of watershed morphological evolution which encompass far broader processes and changes to aquatic systems. Our dataset integrates a fluvial geomorphic approach to characterize stream channel and habitat evolution over a four decade period in the Bayou Pierre, Mississippi, an ecological approach to study related change in stream fish communities in the same watershed, and analyses linking the two. Fluvial geomorphic processes were characterized both from remote sensing data sources for historic and contemporary time periods, and local fish habitat data for contemporary time periods. Historical fish community data were extracted from museum records, and contemporary fish community data were collected via sampling for fishes at the same localities as historic efforts using similar methods.
Methods
Fluvial geomorphic data were collected from remote sensing sources using QGIS (open-source) geospatial software. Measurements were taken from aerial imagery sources (National High Altitude Imagery Program 1982 and National Agricultural Imagery Program 2020) and two digital elevation models (Mississippi Digital Earth Model, data source 1958-1980, and 2015/2016 Lidar-based 3DEP DEM) and exported to shapefile formats.
Spatial sampling locations were determined by creating a 200m chain object using QChainage on the trunk stream and five tributaries via the NHD+ V2 flowline files. At each sample location we digitized channel width line features (1982 and 2020) and extracted channel thalweg elevations (both DEMs). We employed visual identification of knickpoint features using channel profiles generated from DEMs. These measurements were processed in R to create csv files of measurements for further analyses. Historical fish community data were extracted from the University of Southern Mississippi Ichthyological collections, and were filtered to remove non-community samples, and verified by consulting original field notes. Contemporary fish collections were repeated at the same localities as historical fish collections, using similar methods. We seined all available habitats in three subsample plots in proportion to their occurrence at a site. Contemporary habitat data were collected via a point-transect method at each sample site.
According to our latest research, the global Model Feature Store market size reached USD 1.26 billion in 2024, driven by the escalating adoption of AI and machine learning across industries. The market is experiencing robust expansion, with a projected CAGR of 25.8% from 2025 to 2033. By the end of 2033, the Model Feature Store market is projected to attain a value of USD 9.78 billion, underscoring the pivotal role of feature stores in scaling and operationalizing machine learning workflows. This growth is primarily fueled by increasing demand for efficient data management, model reproducibility, and real-time feature serving in AI-powered applications.
A key growth factor propelling the Model Feature Store market is the exponential rise in the deployment of machine learning models in production environments. As organizations increasingly integrate AI into their core business processes, the need for standardized, scalable, and reliable feature management solutions becomes critical. Feature stores streamline the complex process of feature engineering, ensuring consistency, reusability, and governance of features across multiple models and teams. This not only accelerates model development cycles but also enhances model performance and reliability, which is particularly vital in industries such as BFSI, healthcare, and retail where data-driven insights are mission-critical.
Another significant driver is the growing emphasis on data governance and compliance. With stricter regulatory frameworks such as GDPR and CCPA, enterprises are under pressure to ensure transparency, traceability, and accountability in their AI pipelines. Model Feature Stores provide a centralized repository for features, enabling robust lineage tracking, access control, and auditability. This capability is increasingly sought after by enterprises looking to mitigate risks associated with data privacy and regulatory breaches. Furthermore, the integration of feature stores with cloud platforms and MLOps tools is simplifying the orchestration of end-to-end machine learning workflows, further boosting market adoption.
The rapid evolution of AI and the proliferation of real-time, data-intensive applications are fostering innovation in the Model Feature Store market. Enterprises are now leveraging feature stores to support advanced use cases such as fraud detection, personalized recommendations, and predictive maintenance, which require low-latency feature serving and seamless integration with streaming data sources. The rise of edge computing and IoT is also contributing to demand, as organizations seek to manage and serve features at scale across distributed environments. This trend is expected to drive continuous advancements in feature store architectures, including support for hybrid and multi-cloud deployments.
Regionally, North America continues to dominate the Model Feature Store market, accounting for the largest share in 2024. This leadership is attributed to the region’s mature AI ecosystem, significant investments in digital transformation, and the presence of leading technology vendors. Europe and Asia Pacific are also witnessing accelerated growth, fueled by increasing AI adoption in sectors such as manufacturing, healthcare, and financial services. The Asia Pacific region, in particular, is expected to register the highest CAGR during the forecast period, driven by rapid digitalization, government initiatives, and a burgeoning startup ecosystem. Meanwhile, Latin America and the Middle East & Africa are emerging as promising markets, supported by growing awareness and investments in AI infrastructure.
The Component segment of the Model Feature Store market is bifurcated into Platform and Services. The Platform sub-segment comprises the core software infrastructure that enables the storage, management, and serving of features for machine learning models. In 2024, Platforms accounted for the majority revenue share, reflecting ente
A study examining the sources and ages of fine-grained bed material (<0.063 mm) was conducted for 99 sites in the USGS NAWQA Midwest Stream Quality Assessment (MSQA) during the summer of 2014, including 15 suspended sediment and 5 cropland top soil samples. Bed material samples were analyzed for radionuclides (7Be, 210Pbex, 137Cs) and pesticides (bifenthrin and DDE); suspended sediment was analyzed for radionuclides (7Be, 210Pbex, 137Cs); and cropland top soil was analyzed for 210Pbex and 137Cs. The data set includes only those data specifically discussed in the Journal of Environmental Management article entitled "Sources and ages of fine-grained sediment to streams using fallout radionuclides in the Midwestern United States", by Allen C. Gellis, Christopher C. Fuller, and Peter C. Van Metre (in press).
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
The Bitext Synthetic Data consist of pre-built training data for intent detection and are provided for 20 verticals for the Spanish language (see ELRA-L0182 to ELRA-L0201). They cover the most common intents for each vertical and include a large number of example utterances for each intent, with optional entity/slot annotations for each utterance. The Media Streaming domain comprises 24 intents for Spanish. Data is distributed as models or open text files.
This data release includes physical and chemical data for streambed-sediment samples collected at 7 locations in the Little Flatrock Creek basin in Paulding County, Ohio during July 2019. Data include total nitrogen and carbon concentrations, carbon species, total concentrations for 41 metals, and particle-size analysis; not all samples had enough material for all analyses. Source samples were identified as one of four land-cover types: cropland (corn, soybean, hay, and wheat; sites included a mix of conventional and conservation tillage), roads, preserved forest, and streambanks. All reaches had cropland on at least one side of the stream. These data will be used for source attribution in order to quantify the proportional contribution of individual sources to streambed sediment in the basin. The objective of "fingerprinting" these samples was to identify the potential contribution of surrounding land use to what is stored in the channel upstream from the gage.
Perennial and non-perennial streams in the Main Hawaiian Islands. Data accessed from the Hawaii Statewide GIS Program. Arcs were extracted from the 1983 USGS Digital Line Graphs (DLG) hydrography layers based on the State of Hawaii Commission on Water Resources Management (CWRM) Hawaii Stream Assessment (HSA) maps and database, then coded with the HSA stream code, and the HSA stream name (1993). The State of Hawaii Division of Aquatic Resources (DAR) added additional streams from the DLG hydrography layer, and added additional attribute data. Further additions, refinement, and editing were completed by DAR in 2003, 2004. Additional streams were added and coded (NON-PERENNIAL) in 2005. The State of Hawaii Office of Planning (OP) Staff standardized and re-ordered attributes and merged individual island shapefiles into one statewide shapefile in March, 2005. Note: Not all items are populated / assigned for each stream. For example, streams that were not part of the Hawaii Stream Assessment do not have an HSA Code. Update received from CWRM and DAR, 2013 (Data current to March, 2008). Update includes many attribute corrections and additional information (e.g. correction of stream number, stream type, and addition of tributary names). For additional information, please see the following website: http://state.hi.us/dlnr/dar/streams.html
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Streams are important parts of our ecosystem and crucial water sources. Knowing the composition of stream water helps us understand where it came from and how it can be used. Geochemists test stream water samples using different methods to see the spread of elements and ions across the country. 16 different elements and properties were mapped on a 1:100,000 scale (1 cm on the map corresponds to 1 km on the ground). A total of 6836 stream water samples were collected between 2011 and 2017. Samples were taken from evenly spread small or medium-sized streams. The data measures changes in the strength of a number of elements in two ways. The types of elements can point to what rocks the water had contact with; that parent material is called the source rock. Also tested were conductivity, which tells us the amount of dissolved ions in the stream water; pH, which shows how acidic or basic the water is; alkalinity, which tells us how hard the water is; and major ions, which tell us what major dissolved salts are present in the water. This is a raster dataset. Data is presented in the form of an image, and colour scales are used to show the different strengths of the elements. Raster data stores information in a cell-based manner and consists of a matrix of cells (or pixels) organised into rows and columns. The format of the raster is a grid. The grid cell size is 250 m, which means that each cell (pixel) represents an area on the ground that is 250 meters across. The grid also contains location information. The Tellus survey is a national airborne geophysical and ground geochemical mapping project managed by Geological Survey Ireland.
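The map scale and grid cell size mentioned above can be checked with a little arithmetic. The Python sketch below converts a map distance at a given scale into ground distance, and a square cell size into ground area. The helper names are illustrative only and do not come from the Tellus data products.

```python
CM_PER_KM = 100_000  # 1 km = 100,000 cm


def ground_distance_km(map_cm: float, scale_denominator: int) -> float:
    """Ground distance in km for a map distance in cm at scale 1:denominator."""
    return map_cm * scale_denominator / CM_PER_KM


def cell_area_hectares(cell_size_m: float) -> float:
    """Ground area of one square raster cell, in hectares (1 ha = 10,000 m^2)."""
    return cell_size_m ** 2 / 10_000


print(ground_distance_km(1, 100_000))  # 1 cm at 1:100,000
print(cell_area_hectares(250))         # one 250 m x 250 m cell
```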
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of this investigation was to systematically examine the variability associated with temporally-oriented invertebrate data collected by citizen scientists and consider the value of such data for use in stream management. Variability in invertebrate data was estimated for three sources of variation: sampling, within-reach spatial and long-term temporal. Long-term temporal data were also evaluated using ordinations and an Index of Biotic Integrity (IBI). Through two separate investigations over an 11-year study period, participants collected more than 400 within-reach samples during 44 sampling events at three streams in the western United States. Within-reach invertebrate abundance coefficient of variation (CV) ranged from 0.44–0.50 with approximately 62% of the observed variation strictly due to sampling. Long-term temporal CV ranged from 0.31–0.36 with 27–30% of the observed variation in invertebrate abundance related to climate conditions (El Niño strength) and sampling year. Ordinations showed that citizen-generated assemblage data could reliably detect differences between study streams and seasons. IBI scores were significantly different between streams but not seasons. The findings of this study suggest that citizen data would likely detect a change in mean invertebrate density greater than 50% and would also be useful for monitoring changes in assemblage. The information presented here will help stream managers interpret and evaluate changes to the stream invertebrate community detected by citizen-based programs.
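For readers unfamiliar with the coefficient of variation (CV) reported above, it is the standard deviation divided by the mean. The Python sketch below assumes the sample standard deviation is used; the study's exact estimator is not stated here. The abundance counts are made up for illustration and are not data from the study.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(values: Sequence[float]) -> float:
    """CV = sample standard deviation / mean (assumed estimator)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean


# Hypothetical invertebrate counts from three within-reach samples.
counts = [10, 20, 30]
print(coefficient_of_variation(counts))
```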
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains different water quality and quantity data, collected in a small headwater catchment in the north of Hesse, Germany. Grab samples were collected manually at different sites within the stream and analysed for nutrients and stable water isotopes. At site 2, our main monitoring station, we installed probes for high-frequency measurements of discharge, nitrate, and conductivity. Site 3 corresponds to the WWTP outlet. Meteorological data was collected via a meteorological station at the site of the WWTP. We also monitored the combined sewer overflow (CSO) via a modified conductivity logger. CSO_on = 1 indicates that the CSO is discharging. The data was processed and checked for quality, including sensor calibration and visual inspection, to remove obvious errors. Detailed procedures are described by Spill et al. (2023) (see Metadata).
This is the first live data stream on Kaggle providing a simple yet rich source of all soccer matches around the world 24/7 in real-time.
What makes it unique compared to other datasets?
Simply train your algorithm on the first version of the training dataset, which contains approximately 11.5k matches, and predict the matches provided in the following data feed.
The CSV file is updated every 30 minutes at minutes 20’ and 50’ of every hour. I kindly request not to download it more than twice per hour as it incurs additional cost.
You may download the CSV data file from the Amazon S3 server using the following link, after replacing FOLDER_NAME as described below:
https://s3.amazonaws.com/FOLDER_NAME/amasters.csv
*. Substitute FOLDER_NAME with "**analyst-masters**"
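The substitution and the request to download at most twice per hour can be sketched as follows. This is only an illustration: the function names and the 30-minute guard are assumptions added here, and the snippet builds the URL without fetching it.

```python
# Template taken from the link above; FOLDER_NAME becomes a format field.
FEED_URL_TEMPLATE = "https://s3.amazonaws.com/{folder}/amasters.csv"

# Assumed guard: at most two downloads per hour means 30 minutes apart.
MIN_SECONDS_BETWEEN_DOWNLOADS = 30 * 60


def build_feed_url(folder_name="analyst-masters"):
    """Return the feed URL with FOLDER_NAME substituted."""
    return FEED_URL_TEMPLATE.format(folder=folder_name)


def may_download(last_download_ts, now_ts):
    """True if no download has happened yet or 30 minutes have passed."""
    if last_download_ts is None:
        return True
    return now_ts - last_download_ts >= MIN_SECONDS_BETWEEN_DOWNLOADS


print(build_feed_url())
```

Timestamps are plain seconds (for example from `time.time()`), so the guard can be checked before each scheduled fetch.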
Our goal is to identify the outcome of a match as Home, Draw or Away. The variety of sources and the nature of the information provided in this data stream make it a unique database. Currently, five servers are collecting data from soccer matches around the world, communicating with each other and finally aggregating the data based on the dominant features learned from 400,000 matches over 7 years. I describe every column and the data collection below in two categories: Category I (Current situation) and Category II (Head-to-Head History). Hence, we divide the type of data we have from each team into 4 modes.
Below you can find a full illustration of each category.
I. Current situation
Col 1 to 3:
Votes_for_Home Votes_for_Draw Votes_for_Away
The most distinctive parts of the database are these 3 columns. We are releasing the opinions of over 100 professional soccer analysts predicting the outcome of a match. Their votes are the result of every piece of information they receive on players, team line-ups, injuries and a team's need to win a match to stay in the league. They are spread around the world in various time zones and are experts on soccer teams from various regions. Our servers aggregate their opinions to update the CSV file until kickoff. Therefore, even if 40 users predict that Real-Madrid wins against Real-Sociedad in Santiago Bernabeu on January 6th, 2019, but 5 users predict that Real-Sociedad (the away team) will be the winner, you should doubt the home win. Here, the “majority of votes” works in conjunction with other features.
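A minimal sketch of turning the three vote columns into shares, using the 40-versus-5 example above. The function name is hypothetical, and the number of draw votes is assumed to be zero because the text does not state it.

```python
def vote_shares(votes_home, votes_draw, votes_away):
    """Return each outcome's fraction of all analyst votes."""
    total = votes_home + votes_draw + votes_away
    if total == 0:
        # No votes recorded yet: report zero share for every outcome.
        return {"Home": 0.0, "Draw": 0.0, "Away": 0.0}
    return {
        "Home": votes_home / total,
        "Draw": votes_draw / total,
        "Away": votes_away / total,
    }


# 40 votes for the home team, 5 for the away team, draw votes assumed 0.
shares = vote_shares(40, 0, 5)
majority = max(shares, key=shares.get)
print(majority, round(shares["Home"], 3), round(shares["Away"], 3))
```

Keeping the minority share as its own feature, rather than only the majority label, lets a model pick up the dissenting signal the description warns about.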
Col 4 to 9:
Weekday Day Month Year Hour Minute
There are over 60,000 matches during a year, and approximately 400 are usually held per day on weekends. More critical and exciting matches, which are usually less predictable, are held toward the evening in Europe. We currently provide times in Central European Time (CET), equivalent to GMT+01:00.
*. Please note that the 2nd row of the CSV file represents the time at which data values from all servers were saved to the file.
Col 10 to 13:
Total_Bettors Bet_Perc_on_Home Bet_Perc_on_Draw Bet_Perc_on_Away
These data are recorded a few hours before the match, because people tend to place bets emotionally as kickoff approaches. “Total_Bettors” is the overall number of bettors, and the percentage of them betting on each outcome is given in the columns for “Home,” “Draw” and “Away.”
Col 14 to 15:
Team_1 Team_2
The team playing “Home” is “Team_1” and the opponent playing “Away” is “Team_2”.
Col 16 to 36:
League_Rank_1 League_Rank_2 Total_teams Points_1 Points_2 Max_points Min_points Won_1 Draw_1 Lost_1 Won_2 Draw_2 Lost_2 Goals_Scored_1 Goals_Scored_2 Goals_Rec_1 Goal_Rec_2 Goals_Diff_1 Goals_Diff_2
If the match is betw...
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
There has been much controversy regarding the origins of the natural polycyclic aromatic hydrocarbon (PAH) and chemical biomarker background in Prince William Sound (PWS), Alaska, site of the 1989 Exxon Valdez oil spill. Different authors have attributed the sources to various proportions of coal, natural seep oil, shales, and stream sediments. The different probable bioavailabilities of hydrocarbons from these various sources can affect environmental damage assessments from the spill. This study compares two different approaches to source apportionment with the same data (136 PAHs and biomarkers) and investigates whether increasing the number of coal source samples from one to six increases coal attributions. The constrained least-squares (CLS) source allocation method, which fits concentrations, meets geologic and chemical constraints better than partial least-squares (PLS), which predicts variance. The field data set was expanded to include coal samples reported by others, and CLS fits confirm earlier findings of low coal contributions to PWS.
Water quality data centered on the Animal Science Teaching and Research Center, Harford NY, 1974-1995 (referred to as T&R Center)
In many areas of NY the level, well drained gravel outwash valleys are intensively used for a variety of human activities such as industry, housing and farming. The gravel outwash is usually deep, the water is easily accessed by high yielding wells and as a consequence the water in the aquifers is highly valued. Specific references to the northeastern U.S., central NY and the aquifer at the T&R Center are the following:
Randall, Allan D., Deborah Snavelly, Thomas Holecek, and Roger Waller. 1988. Alternate sources of large seasonal ground-water supplies in the head waters of the Susquehanna River basin. U.S. Geological Survey. Water Resources Investigations Report 85-4127. USGS.
Morrissey, Daniel J., Allan D. Randall, and John Williams. 1988. Upland runoff as a major source of recharge to stratified drift in the glaciated northeast. In Randall, Allan D., ed.: Regional Aquifer Systems of the United States: The Northeast Glacial Aquifers. AWRA monograph series no. 11. American Water Resources Association, 5410 Grosvenor Lane, Suite 220, Bethesda MD 20814-2192.
The permeability of the outwash is high, and hence soluble contaminants such as NO3 are leached into aquifers along with the recharge water. As a consequence, the impact of human activities on water quality is a major issue. A mitigating factor is that land use in the surrounding upland areas is much less intense: high-quality water from this part of the landscape drains downslope as surface water but, once it reaches the valley floor, seeps into the outwash and joins (mixes with?) the recharge water from the intensively used valley floor.
In 1974 a water quality monitoring network was established on and near the Cornell Teaching and Research Center near Harford NY. This is an ideal location to study the effect of farming (mostly dairy) on the nitrate and phosphorus in streams and aquifers in a typical outwash valley with its surrounding upland areas. First, a major ground water divide runs through the center of the farm; part draining to Fall Creek and the other to the Susquehanna River. This means that we know where all of the water originates. Secondly, Cornell University owns the land except for some upland areas that so far (2007) are mostly wooded/abandoned agricultural land. This simplifies access, information on usage and in some cases control management.
The objectives were a) to monitor behavior of aquifers in the gravel outwash and b) monitor water quality in the aquifers and surrounding uplands.
In 1974 a cooperative program was developed between Cornell University and the USGS. In early 1974 Allan Randall of the USGS guided the location and installation of 8 monitoring wells. Following the initial 8 wells, additional shallow wells were installed. Stream sampling locations were established. Seeps on the hillsides above the valley floor were also located. Well logs and methods of installation are documented in the following: http://hdl.handle.net/1813/8146.
Beginning in 1974 and continuing through 1994 samples of water in streams, monitoring wells and seeps on the upland slopes above the valley floor were analyzed for the same constituents using the same procedures as the Fall Creek samples reported elsewhere: http://hdl.handle.net/1813/8148.
From Feb 28, 1979 through Jan 25, 1980 the USGS made a detailed study of the behavior of precipitation inputs and their flow through the landscape and aquifers at the T&R Center. During this period about 40% of the recharge to the aquifers was derived from precipitation on the area over the aquifers and 60% from runoff from the uplands. Ground water discharged down-valley as underflow about equaled recharge during this period. Details of the studies are reported in the 2 references listed above.
A description of the farming operations on the center for the period 1972 through 1994 is summarized in the following: Wang, S. J. 1999. Impact of dairy farming on well water nitrate level and soil content of phosphorus and potassium. J Dairy Sci. 82:2164-2169.
Briefly, in 1994 there were 400 milking cows producing exports in milk and meat of about 20 mt of N. Imports of N were 93 mt indicating a large excess of inputs relative to outputs. For comparison, a nitrogen balance for the adjacent Fall Creek watershed in 1974 can be found in an unpublished manuscript (ms10_nbl.doc, online: http://hdl.handle.net/1813/2547). Nitrate N in 5 monitoring wells in the intensively farmed area which received most of the manure varied from 2 to 15 ppm NO3-N with very high variability among years and within years.
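The "large excess of inputs relative to outputs" can be made concrete with simple arithmetic on the two figures quoted above (93 mt of N imported, about 20 mt exported). This is a sketch only; the function name is hypothetical, and the farm-gate balance ignores any terms not mentioned in the text.

```python
def nitrogen_balance(imports_mt, exports_mt):
    """Return (surplus in mt of N, fraction of imported N that is exported)."""
    surplus = imports_mt - exports_mt
    export_fraction = exports_mt / imports_mt
    return surplus, export_fraction


# Figures quoted in the text for 1994; the export value is approximate.
surplus, export_fraction = nitrogen_balance(imports_mt=93, exports_mt=20)
print(f"surplus = {surplus} mt N, exported fraction = {export_fraction:.1%}")
```

With these inputs the surplus is 73 mt of N, and only about one fifth of the imported N leaves the farm in milk and meat.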
Four streams drained watersheds that were without human habitations or farming operations. Two of these drained areas directly above the valley floor at the T&R Center which rec... Visit https://dataone.org/datasets/doi%3A10.5063%2FAA%2Fgss1.18.1 for complete metadata about this dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
S2D is a species occupancy dataset containing presence (1) or absence (0) values for 116 freshwater fish species in Wyoming, Montana, and the surrounding states. It contains data from 40,490 unique sample events (location, month, year). Data was derived from multiple sources (see Table 1 from README files) and limited to fish occurrences in rivers and streams.
According to our latest research, the global Renewable Energy Media Streaming CDN market size reached USD 2.85 billion in 2024, with a robust compound annual growth rate (CAGR) of 17.2% expected through the forecast period. This impressive growth trajectory is projected to propel the market to approximately USD 7.95 billion by 2033. The expansion is primarily driven by the rising demand for sustainable content delivery solutions, the proliferation of media streaming platforms, and the increasing environmental regulations compelling companies to adopt renewable energy-powered infrastructures. As per our latest research, the convergence of green energy with content delivery networks (CDNs) is emerging as a pivotal strategy for reducing carbon footprints and enhancing operational efficiency in the digital media ecosystem.
One of the most significant growth factors in the Renewable Energy Media Streaming CDN market is the escalating global consumption of digital content. As consumers continue to shift from traditional broadcast media to on-demand streaming services, the need for robust, scalable, and environmentally responsible content delivery networks has intensified. Streaming giants and emerging platforms alike are investing in renewable energy-powered CDNs to meet both user demand and corporate sustainability goals. This transition is further accelerated by increasing internet penetration, the rollout of high-speed broadband infrastructure, and the surge in mobile device usage. As video quality expectations rise and latency tolerance drops, CDN providers are compelled to upgrade their infrastructure with energy-efficient, renewable-powered data centers, directly fueling market growth.
Another critical driver for this market is the tightening regulatory landscape around carbon emissions and data center energy consumption. Governments worldwide are implementing stringent policies to curb greenhouse gas emissions, compelling enterprises across industries to adopt renewable energy solutions. Media streaming companies, which are among the largest consumers of data center resources, are under mounting pressure to decarbonize their operations. This has led to a surge in partnerships between CDN providers and renewable energy suppliers, as well as significant investments in green data centers. These initiatives not only align with regulatory requirements but also enhance brand reputation, attract eco-conscious consumers, and provide long-term cost savings through improved energy efficiency.
Technological advancements are also playing a pivotal role in shaping the Renewable Energy Media Streaming CDN market. Innovations in energy storage, grid integration, and intelligent load balancing are enabling CDN providers to optimize the use of renewable energy sources such as solar, wind, and hydroelectric power. The adoption of AI-driven resource management tools allows for real-time monitoring and dynamic allocation of energy resources, ensuring uninterrupted service delivery even during peak demand periods. Moreover, the integration of edge computing with renewable-powered CDNs is reducing latency and bandwidth costs, while simultaneously minimizing the environmental impact of data transmission. These technological developments are expected to further accelerate market growth and foster the emergence of new business models within the sector.
Regionally, North America currently dominates the Renewable Energy Media Streaming CDN market, accounting for the largest share in 2024. This leadership position is attributed to the region’s advanced digital infrastructure, high adoption of streaming services, and proactive sustainability initiatives by major technology firms. Europe follows closely, driven by stringent environmental regulations and ambitious renewable energy targets. The Asia Pacific region is poised for the fastest growth during the forecast period, fueled by rapid digitalization, expanding internet user base, and increasing investments in green technology by both public and private sectors. Latin America and the Middle East & Africa are also witnessing steady growth, albeit from a smaller base, as awareness of sustainable content delivery solutions continues to rise.
https://spdx.org/licenses/CC0-1.0.html
Stream drying is happening globally, with significant ecological and social consequences. Most examples of stream drying come from systems influenced by dam operations or those with highly exploited aquifers. Stream drying is also thought to be happening due to climate change, but examples are surprisingly limited. We explored flow trends from the five Mediterranean-climate regions with a focus on unregulated streams with long-term gauge records. We found consistent evidence of decreasing discharge trends, increasing zero-flow days, and steeper downward discharge trends in smaller basins. Beyond directional trends, many systems recently shifted flow state, including some streams that shifted from perennial to intermittent flow states. Our analyses provide evidence of stream drying consistent with climate change, but also highlight knowledge gaps and challenges in empirically and statistically documenting flow regime shifts. We discuss the myriad consequences of losing flow and propose strategies for improving detection and adapting to flow change.
Methods
To document flow change, we compiled gauge records from five Mediterranean-climate regions of the world, including California (U.S.), Chile, South Africa, Spain, and Western Australia. For each gauge, we downloaded daily discharge records from public sources. First, we limited our analysis to gauges located in Mediterranean-climate zones by retaining the subset of gauges located in Köppen-Geiger climate classes Csa, Csb, Csc (i.e., areas with a dry summer) using maps from Beck et al. 2018. Second, we identified gauges located in minimally disturbed basins. In the US and Australia, we used “reference” gauges identified by the USGS and Bureau of Meteorology, respectively.
In South Africa, Chile, and Spain (where reference gauges have not been designated by agencies) we instead used aerial image analysis of upstream watershed conditions to identify basins with no evidence of significant reservoirs or large water infrastructure projects. We note that our determination of “reference-quality” gauges in Spain [excluding Catalonia] is consistent with Messager et al. 2021. Third, we identified gauges with daily data from 1980-2019 (i.e., the most recent 40 years in common across the five regions) and no more than one year of missing data. Overall, we identified 158 gauges that met our criteria for inclusion (i.e., Mediterranean-climate, reference-quality, 40 years of data from 1980-2019, and no more than one year of missing data; WebPanel 1, WebFigure 1). To reduce noise in zero-flow conditions, we defined “zero flows” as flows < 0.1 cfs. Finally, for our analysis of zero-flow trends, we used a liberal definition of “intermittent” and included the subset of streams with ≥ 1 day/year of zero flow on average, i.e., ≥ 40 days across the 40-year study, following Messager et al. 2021. Using the population of gauges that met our criteria for inclusion, we conducted trend analyses on daily discharge (for each gauge in our population) and on the annual number of zero-flow days (for the subset of intermittent gauges) across the time series by means of non-parametric Mann-Kendall tests. We next explored evidence of flow regime shifts. Specifically, we conducted a breakpoint analysis on the zero-flow days per year using the ‘strucchange’ package in R. We constrained the analysis to test for evidence of a maximum of one breakpoint (indicating a state shift).
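The zero-flow definition and the Mann-Kendall trend test described above can be sketched in plain Python. This is an illustrative sketch, not the authors' code (their analysis was done in R): the function names and the example series are assumptions, the p-value uses the usual normal approximation with a tie correction, and the ‘strucchange’ breakpoint analysis is not reproduced here.

```python
import math

ZERO_FLOW_THRESHOLD_CFS = 0.1  # flows below this count as zero flow


def zero_flow_days(daily_discharge_cfs):
    """Count days whose discharge is below the zero-flow threshold."""
    return sum(1 for q in daily_discharge_cfs if q < ZERO_FLOW_THRESHOLD_CFS)


def is_intermittent(zero_days_per_year):
    """At least 1 zero-flow day per year on average over the record."""
    return sum(zero_days_per_year) >= len(zero_days_per_year)


def mann_kendall(values):
    """Return (S, Z, two-sided p) for a monotonic trend in `values`."""
    n = len(values)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)  # sign of each pairwise difference
    # Variance of S under the null hypothesis, corrected for tied values.
    tie_counts = {}
    for v in values:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    var_s = n * (n - 1) * (2 * n + 5)
    for t in tie_counts.values():
        var_s -= t * (t - 1) * (2 * t + 5)
    var_s /= 18
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return s, z, p


# Hypothetical annual zero-flow day counts (not data from the study).
annual_zero_days = [0, 1, 0, 2, 3, 3, 5, 4, 6, 8]
s, z, p = mann_kendall(annual_zero_days)
print(is_intermittent(annual_zero_days), s, round(z, 2), round(p, 4))
```

A positive S with a small p-value would indicate an increasing number of zero-flow days over the record, which is the kind of trend the study reports for intermittent gauges.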
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We are using the Yelp Review Dataset as the streaming data source for the DataCI example. We have processed the Yelp review dataset into a daily-based dataset by its `date`. In this dataset, we will only use the data from 2020-09-01 to 2020-11-30 to simulate the streaming data scenario. We are downloading two versions of the training and validation datasets: