Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Some say climate change is the biggest threat of our age while others say it’s a myth based on dodgy science. We are turning some of the data over to you so you can form your own view.
Even more than with other data sets that Kaggle has featured, a huge amount of data cleaning and preparation goes into putting together a long-term study of climate trends. Early data was collected by technicians using mercury thermometers, where any variation in the time of the reading affected measurements. In the 1940s, the construction of airports caused many weather stations to be moved. In the 1980s, there was a move to electronic thermometers that are said to have a cooling bias.
Given this complexity, a range of organizations collate climate trends data. The three most cited land and ocean temperature data sets are NOAA’s MLOST, NASA’s GISTEMP, and the UK’s HadCRUT.
We have repackaged the data from a newer compilation put together by Berkeley Earth, which is affiliated with Lawrence Berkeley National Laboratory. The Berkeley Earth Surface Temperature Study combines 1.6 billion temperature reports from 16 pre-existing archives. It is nicely packaged and allows for slicing into interesting subsets (for example by country). They publish the source data and the code for the transformations they applied. They also use methods that allow weather observations from shorter time series to be included, meaning fewer observations need to be thrown away.
In this dataset, we have included several files:
Global Land and Ocean-and-Land Temperatures (GlobalTemperatures.csv):
Other files include:
The raw data comes from the Berkeley Earth data page.
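As a rough illustration of slicing the monthly series, the records in GlobalTemperatures.csv can be aggregated to yearly means with the standard library alone. The sample rows and column names below are assumptions based on the typical Berkeley Earth layout, not values from the file; check the actual header before adapting this.

```python
import csv
import io

# Hypothetical sample mimicking the assumed layout of GlobalTemperatures.csv;
# the column names are assumptions -- check the actual file header.
sample = """dt,LandAverageTemperature,LandAverageTemperatureUncertainty
1850-01-01,0.749,1.105
1850-02-01,3.071,1.275
1850-03-01,4.954,0.955
"""

def yearly_mean(csv_text):
    """Average the monthly land temperatures by year, skipping blanks."""
    by_year = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if not row["LandAverageTemperature"]:
            continue  # early records can have gaps
        by_year.setdefault(row["dt"][:4], []).append(
            float(row["LandAverageTemperature"]))
    return {year: sum(vals) / len(vals) for year, vals in by_year.items()}

print(yearly_mean(sample))
```

The same grouping idea extends to the per-country files, keyed on the country column instead of the year.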
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
IntelligentMonitor: Empowering DevOps Environments With Advanced Monitoring and Observability aims to improve monitoring and observability in complex, distributed DevOps environments by leveraging machine learning and data analytics. This repository contains a sample implementation of the IntelligentMonitor system proposed in the research paper, presented and published as part of the 11th International Conference on Information Technology (ICIT 2023).
If you use this dataset and code or any herein modified part of it in any publication, please cite these papers:
P. Thantharate, "IntelligentMonitor: Empowering DevOps Environments with Advanced Monitoring and Observability," 2023 International Conference on Information Technology (ICIT), Amman, Jordan, 2023, pp. 800-805, doi: 10.1109/ICIT58056.2023.10226123.
For any questions and research queries - please reach out via Email.
Abstract - In the dynamic field of software development, DevOps has become a critical tool for enhancing collaboration, streamlining processes, and accelerating delivery. However, monitoring and observability within DevOps environments pose significant challenges, often leading to delayed issue detection, inefficient troubleshooting, and compromised service quality. These issues stem from DevOps environments' complex and ever-changing nature, where traditional monitoring tools often fall short, creating blind spots that can conceal performance issues or system failures. This research addresses these challenges by proposing an innovative approach to improve monitoring and observability in DevOps environments. Our solution, IntelligentMonitor, leverages real-time data collection, intelligent analytics, and automated anomaly detection powered by advanced technologies such as machine learning and artificial intelligence. The experimental results demonstrate that IntelligentMonitor effectively manages data overload, reduces alert fatigue, and improves system visibility, thereby enhancing performance and reliability. For instance, the average CPU usage across all components showed a decrease of 9.10%, indicating improved CPU efficiency. Similarly, memory utilization and network traffic showed an average increase of 7.33% and 0.49%, respectively, suggesting more efficient use of resources. By providing deep insights into system performance and facilitating rapid issue resolution, this research contributes to the DevOps community by offering a comprehensive solution to one of its most pressing challenges. This fosters more efficient, reliable, and resilient software development and delivery processes.
Components The key components that would need to be implemented are:
Implementation Details The core of the implementation would involve the following: - Setting up the data collection pipelines. - Building and training anomaly detection ML models on historical data. - Developing a real-time data processing pipeline. - Creating an alerting framework that ties into the ML models. - Building visualizations and dashboards.
The code would need to handle scaled-out, distributed execution for production environments.
Proper code documentation, logging, and testing would be added throughout the implementation.
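The anomaly-detection step above can be illustrated with a deliberately simple stand-in: a trailing z-score check over a metric stream. This is not the paper's actual ML model, just a minimal sketch of flagging a point that deviates sharply from its recent history; the CPU values are synthetic.

```python
from statistics import mean, stdev

def zscore_anomalies(series, window=10, threshold=3.0):
    """Flag indices that deviate more than `threshold` standard deviations
    from the trailing window's mean (a simple stand-in for an ML-based
    anomaly detector)."""
    anomalies = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu, sigma = mean(hist), stdev(hist)
        if sigma and abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Synthetic CPU-usage stream (percent) with one injected spike.
cpu = [42.0, 41.5, 43.1, 42.7, 41.9, 42.3, 43.0, 42.1, 41.8, 42.5,
       42.2, 97.0, 42.4]
print(zscore_anomalies(cpu))  # -> [11], the injected spike
```

In a production pipeline this check would run over the real-time stream from the data collection layer and feed the alerting framework.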
Usage Examples Usage examples could include:
References The implementation would follow the details provided in the original research paper: P. Thantharate, "IntelligentMonitor: Empowering DevOps Environments with Advanced Monitoring and Observability," 2023 International Conference on Information Technology (ICIT), Amman, Jordan, 2023, pp. 800-805, doi: 10.1109/ICIT58056.2023.10226123.
Any additional external libraries or sources used would be properly cited.
Tags - DevOps, Software Development, Collaboration, Streamlini...
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Iris Flower Data Set Cleaned’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/larsen0966/iris-flower-data-set-cleaned on 14 February 2022.
--- Dataset description provided by original source is as follows ---
If you find this dataset useful, an upvote is appreciated. The British statistician Ronald Fisher introduced the Iris flower data set in 1936, in a paper that described the use of multiple measurements in taxonomic problems as an example of linear discriminant analysis.
--- Original source retains full ownership of the source dataset ---
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This curated dataset consists of 269,353 patent documents (published patent applications and granted patents) spanning the 1976 to 2016 period and is intended to help identify promising R&D on the horizon in diagnostics, therapeutics, data analytics, and model biological systems.
USPTO Cancer Moonshot Patent Data was generated using USPTO examiner tools to execute a series of queries designed to identify cancer-specific patents and patent applications. This includes drugs, diagnostics, cell lines, mouse models, radiation-based devices, surgical devices, image analytics, data analytics, and genomic-based inventions.
“USPTO Cancer Moonshot Patent Data” by the USPTO, for public use. Frumkin, Jesse and Myers, Amanda F., Cancer Moonshot Patent Data (August, 2016).
Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:uspto_oce_cancer
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘California Housing Data (1990)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/harrywang/housing on 12 November 2021.
--- Dataset description provided by original source is as follows ---
This is the dataset used in this book: https://github.com/ageron/handson-ml/tree/master/datasets/housing to illustrate a sample end-to-end ML project workflow (pipeline). It is a great book - I highly recommend it!
The data is based on the 1990 California census.
"This dataset is a modified version of the California Housing dataset available from Luís Torgo's page (University of Porto). Luís Torgo obtained it from the StatLib repository (which is closed now). The dataset may also be downloaded from StatLib mirrors.
The following is the description from the book author:
This dataset appeared in a 1997 paper titled Sparse Spatial Autoregressions by Pace, R. Kelley and Ronald Barry, published in the Statistics and Probability Letters journal. They built it using the 1990 California census data. It contains one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).
The dataset in this directory is almost identical to the original, with two differences: 207 values were randomly removed from the total_bedrooms column, so we can discuss what to do with missing data. An additional categorical attribute called ocean_proximity was added, indicating (very roughly) whether each block group is near the ocean, near the Bay area, inland or on an island. This allows discussing what to do with categorical data. Note that the block groups are called "districts" in the Jupyter notebooks, simply because in some contexts the name "block group" was confusing."
http://www.dcc.fc.up.pt/%7Eltorgo/Regression/cal_housing.html
This is a dataset obtained from the StatLib repository. Here is the included description:
"We collected information on the variables using all the block groups in California from the 1990 Census. In this sample a block group on average includes 1425.5 individuals living in a geographically compact area. Naturally, the geographical area included varies inversely with the population density. We computed distances among the centroids of each block group as measured in latitude and longitude. We excluded all the block groups reporting zero entries for the independent and dependent variables. The final data contained 20,640 observations on 9 variables. The dependent variable is ln(median house value)."
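The two modifications described above (missing total_bedrooms values and the categorical ocean_proximity attribute) invite two routine preprocessing steps: median imputation and one-hot encoding. A small pure-Python sketch on hand-made rows (the values are illustrative, not from the dataset):

```python
from statistics import median

# Illustrative rows (not actual dataset values) showing the two
# modifications: a missing total_bedrooms entry and the categorical
# ocean_proximity attribute.
rows = [
    {"total_bedrooms": 880.0,  "ocean_proximity": "NEAR BAY"},
    {"total_bedrooms": None,   "ocean_proximity": "INLAND"},
    {"total_bedrooms": 1106.0, "ocean_proximity": "NEAR BAY"},
]

# Step 1: median-impute the missing total_bedrooms values.
fill = median(r["total_bedrooms"] for r in rows
              if r["total_bedrooms"] is not None)
for r in rows:
    if r["total_bedrooms"] is None:
        r["total_bedrooms"] = fill

# Step 2: one-hot encode ocean_proximity into indicator columns.
categories = sorted({r["ocean_proximity"] for r in rows})
for r in rows:
    value = r.pop("ocean_proximity")
    for c in categories:
        r[c] = int(value == c)

print(rows[1])  # the imputed, encoded second row
```

The book itself walks through the same two steps with scikit-learn transformers, which is the approach to use at real scale.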
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘world military power 2020’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mingookkim/world-military-power-2020 on 14 February 2022.
--- Dataset description provided by original source is as follows ---
I found this data on a site called data.world, where it was published as a dataset created by vizzup.
The data shows the 2020 world military rankings along with numerical details on each country’s army, navy, and air force.
Related figures such as population and economic indicators relevant to military power are also included.
It is a good basis for comparing military power across countries.
Original source: globalfirepower.com, retrieved 1 May 2020.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book series. It has 1 row and is filtered where the book is Big data : how the information revolution is transforming lives. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book publisher is Education Data Surveys. It features 7 columns including author, publication date, language, and book publisher.
ODC Public Domain Dedication and Licence (PDDL) v1.0http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
A. SUMMARY The dataset inventory provides a list of data maintained by departments that are candidates for open data publishing or have already been published and is collected in accordance with Chapter 22D of the Administrative Code. The inventory will be used in conjunction with department publishing plans to track progress toward meeting plan goals for each department.
B. HOW THE DATASET IS CREATED This dataset is collated in two ways: 1. Ongoing updates are made throughout the year to reflect new datasets; this involves DataSF staff reconciling publishing records after datasets are published. 2. Annual bulk updates: departments review their inventories, identify changes and updates, and submit those to DataSF for a once-a-year bulk update. Not all departments will have changes, or their changes may already have been captured as ongoing updates over the course of the prior year.
C. UPDATE PROCESS The dataset is synced automatically each day, but the underlying data changes manually throughout the year as needed.
D. HOW TO USE THIS DATASET Interpreting dates in this dataset: this dataset has 2 dates: 1. Date Added - when the dataset was added to the inventory itself. 2. First Published - the date the open data portal automatically captured when the dataset was first created; this is a system-generated date.
Note that in certain cases we may have published a dataset prior to it being added to the inventory. We do our best to have an accurate accounting of when something was added to this inventory and when it was published. In most cases the inventory addition will happen prior to publishing, but in certain cases it will be published and we will have missed updating the inventory as this is a manual process.
First published will give an accounting of when it was actually available on the open data catalog and date added when it was added to this list.
E. RELATED DATASETS
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Hitters Baseball Data’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mathchi/hitters-baseball-data on 30 September 2021.
--- Dataset description provided by original source is as follows ---
Major League Baseball Data from the 1986 and 1987 seasons.
Hitters
A data frame with 322 observations of major league players on the following 20 variables.
AtBat: Number of times at bat in 1986
Hits: Number of hits in 1986
HmRun: Number of home runs in 1986
Runs: Number of runs in 1986
RBI: Number of runs batted in in 1986
Walks: Number of walks in 1986
Years: Number of years in the major leagues
CAtBat: Number of times at bat during his career
CHits: Number of hits during his career
CHmRun: Number of home runs during his career
CRuns: Number of runs during his career
CRBI: Number of runs batted in during his career
CWalks: Number of walks during his career
League: A factor with levels A and N indicating player's league at the end of 1986
Division: A factor with levels E and W indicating player's division at the end of 1986
PutOuts: Number of put outs in 1986
Assists: Number of assists in 1986
Errors: Number of errors in 1986
Salary: 1987 annual salary on opening day in thousands of dollars
NewLeague: A factor with levels A and N indicating player's league at the beginning of 1987
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. This is part of the data that was used in the 1988 ASA Graphics Section Poster Session. The salary data were originally from Sports Illustrated, April 20, 1987. The 1986 and career statistics were obtained from The 1987 Baseball Encyclopedia Update published by Collier Books, Macmillan Publishing Company, New York.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with applications in R, www.StatLearning.com, Springer-Verlag, New York
summary(Hitters)
Dataset imported from https://www.r-project.org.
--- Original source retains full ownership of the source dataset ---
CC0 1.0 Universalhttps://spdx.org/licenses/CC0-1.0.html
Background: Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the "citation benefit". Furthermore, little is known about patterns in data reuse over time and across datasets. Method and Results: Here, we look at citation rates while controlling for many known citation predictors, and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. 
The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties. Conclusion: After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the results from the ODDPub text mining algorithm and the findings from manual analysis. Full-text PDFs of all articles parallel-published by Linköping University in 2022 were extracted from the university's repository, DiVA. These were analyzed using the ODDPub (https://github.com/quest-bih/oddpub) text mining algorithm to determine the extent of data sharing and identify the repositories where the data was shared. In addition to the results from ODDPub, manual analysis was conducted to confirm the presence of data sharing statements, assess data availability, and identify the repositories used.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Loopkevers Grensmaas - Ground beetles near the river Meuse in Flanders, Belgium is a species occurrence dataset published by the Research Institute for Nature and Forest (INBO). The dataset contains over 5,800 beetle occurrences sampled between 1998 and 1999 from 28 locations on the left bank (Belgium) of the river Meuse on the border between Belgium and the Netherlands. The dataset includes over 100 ground beetles species (Carabidae) and some non-target species. The data were used to assess the dynamics of the Grensmaas area and to help river management. Issues with the dataset can be reported at https://github.com/LifeWatchINBO/data-publication/tree/master/datasets/kevers-grensmaas-occurrences
To allow anyone to use this dataset, we have released the data to the public domain under a Creative Commons Zero waiver (http://creativecommons.org/publicdomain/zero/1.0/). We would appreciate it, however, if you read and follow these norms for data use (http://www.inbo.be/en/norms-for-data-use) and provide a link to the original dataset (https://doi.org/10.15468/hy3pzl) whenever possible. If you use these data for a scientific paper, please cite the dataset following the applicable citation norms and/or consider us for co-authorship. We are always interested to know how you have used or visualized the data, or to provide more information, so please contact us via the contact information provided in the metadata, opendata@inbo.be or https://twitter.com/LifeWatchINBO.
PAH method development and sample collection. This dataset is associated with the following publication: Wallace, M., J. Pleil, D. Whitaker, and K. Oliver. Dataset of polycyclic aromatic hydrocarbon recoveries from a selection of sorbent tubes for thermal desorption-gas chromatography/mass spectrometry analysis. Data in Brief. Elsevier B.V., Amsterdam, NETHERLANDS, 29: 105252, (2020).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Applied missing data analysis. It features 7 columns including author, publication date, language, and book publisher.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 2,597 rows and is filtered where the book publisher is Harper Collins e-books. It features 2 columns including publication date.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Home Price Index’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/PythonforSASUsers/hpindex on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The Federal Housing Finance Agency House Price Index (HPI) is a broad measure of the movement of single-family house prices. The HPI is a weighted, repeat-sales index, meaning that it measures average price changes in repeat sales or refinancings on the same properties. The technical methodology for constructing the index and for collecting and publishing the data is described at: http://www.fhfa.gov/PolicyProgramsResearch/Research/PaperDocuments/1996-03_HPI_TechDescription_N508.pdf
Contains monthly and quarterly time series from January 1991 to August 2016 for the U.S., state, and MSA categories. Analysis variables are the aggregate non-seasonally adjusted value and seasonally adjusted index values. The index value is 100 beginning January 1991.
This data is found on Data.gov
Can this data be combined with the corresponding census growth projections either at the state or MSA level to forecast 24 months out the highest and lowest home index values?
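Since the index is rebased to 100 at January 1991, converting any index value into cumulative appreciation since the base period is a one-line calculation. A minimal sketch (the index value below is hypothetical, not an actual FHFA figure):

```python
def appreciation_since_base(index_value, base=100.0):
    """Cumulative percent change implied by an HPI value relative to
    the January 1991 base of 100."""
    return (index_value / base - 1.0) * 100.0

# Hypothetical index value, not an actual FHFA figure.
print(appreciation_since_base(234.5))  # percent gain since Jan 1991
```

For forecasting exercises like the one suggested above, the same rebasing makes series from different states or MSAs directly comparable.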
--- Original source retains full ownership of the source dataset ---
This Chicago Park District dataset includes information about event permits requested through the Chicago Park District, including the name of applicant, the name of the event and a brief description, contact information, time of event including set-up and tear-down times, the name of the Park and location, and estimated number of event attendees. Additional information may be included depending on the type of the event, including proof of insurance, route maps for all races and runs, security plans and medical services and required city documents. Permit levels issued by the Department of Revenue include picnic levels, athletic levels, corporate levels, media levels, promotions levels, and festivals/performances levels. For more information, visit http://www.chicagoparkdistrict.com/permits-and-rentals/.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book publisher is Victor Rumanyika Publishing. It features 2 columns including publication date.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset belonging to the report: Open access potential and uptake in the context of Plan S - a partial gap analysis
On the report:
The analysis presented in the report, carried out by Utrecht University Library, aims to provide cOAlition S, an international group of research funding organizations, with initial quantitative and descriptive data on the availability and usage of various open access options in different fields and subdisciplines, and, as far as possible, their compliance with Plan S requirements.
Plan S, launched in September 2018, aims to accelerate a transition to full and immediate Open Access. In the guidance to implementation, released in November 2018 and updated in May 2019, a gap analysis of Open Access journals/platforms was announced. Its goal was to inform Coalition S funders on the Open Access options per field and identify fields where there is a need to increase the share of Open Access journals/platforms.
The report should be seen as a first step: an exploration in methodology as much as in results. Subsequent interpretation (e.g. on fields where funder investment/action is needed) and decisions on next steps (e.g. on more complete and longitudinal monitoring of Plan S-compliant venues) is intentionally left to cOAlition S and its members.
This work was commissioned on behalf of cOAlition S by the Dutch Research Council (NWO), a member of cOAlition S. Bianca Kramer and Jeroen Bosman of Utrecht University Library were appointed to lead the project.