Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Some say climate change is the biggest threat of our age while others say it’s a myth based on dodgy science. We are turning some of the data over to you so you can form your own view.
Even more than with other data sets that Kaggle has featured, a huge amount of data cleaning and preparation goes into putting together a long-term study of climate trends. Early data was collected by technicians using mercury thermometers, where any variation in the time of the reading affected measurements. In the 1940s, the construction of airports caused many weather stations to be moved. In the 1980s, there was a move to electronic thermometers that are said to have a cooling bias.
Given this complexity, a range of organizations collate climate trends data. The three most cited land and ocean temperature data sets are NOAA’s MLOST, NASA’s GISTEMP, and the UK’s HadCRUT.
We have repackaged the data from a newer compilation put together by Berkeley Earth, which is affiliated with Lawrence Berkeley National Laboratory. The Berkeley Earth Surface Temperature Study combines 1.6 billion temperature reports from 16 pre-existing archives. It is nicely packaged and allows for slicing into interesting subsets (for example by country). They publish the source data and the code for the transformations they applied. They also use methods that allow weather observations from shorter time series to be included, meaning fewer observations need to be thrown away.
In this dataset, we have included several files:
Global Land and Ocean-and-Land Temperatures (GlobalTemperatures.csv):
Other files include:
The raw data comes from the Berkeley Earth data page.
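As a rough illustration of slicing the monthly series, the records in GlobalTemperatures.csv can be aggregated to yearly means with the standard library alone. The sample rows and column names below are assumptions based on the typical Berkeley Earth layout, not values from the file; check the actual header before adapting this.

```python
import csv
import io

# Hypothetical sample mimicking the assumed layout of GlobalTemperatures.csv;
# the column names are assumptions -- check the actual file header.
sample = """dt,LandAverageTemperature,LandAverageTemperatureUncertainty
1850-01-01,0.749,1.105
1850-02-01,3.071,1.275
1850-03-01,4.954,0.955
"""

def yearly_mean(csv_text):
    """Average the monthly land temperatures by year, skipping blanks."""
    by_year = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if not row["LandAverageTemperature"]:
            continue  # early records can have gaps
        by_year.setdefault(row["dt"][:4], []).append(
            float(row["LandAverageTemperature"]))
    return {year: sum(vals) / len(vals) for year, vals in by_year.items()}

print(yearly_mean(sample))
```

The same grouping idea extends to the per-country files, keyed on the country column instead of the year.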
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
IntelligentMonitor: Empowering DevOps Environments With Advanced Monitoring and Observability aims to improve monitoring and observability in complex, distributed DevOps environments by leveraging machine learning and data analytics. This repository contains a sample implementation of the IntelligentMonitor system proposed in the research paper, presented and published as part of the 11th International Conference on Information Technology (ICIT 2023).
If you use this dataset and code or any herein modified part of it in any publication, please cite these papers:
P. Thantharate, "IntelligentMonitor: Empowering DevOps Environments with Advanced Monitoring and Observability," 2023 International Conference on Information Technology (ICIT), Amman, Jordan, 2023, pp. 800-805, doi: 10.1109/ICIT58056.2023.10226123.
For any questions and research queries - please reach out via Email.
Abstract - In the dynamic field of software development, DevOps has become a critical tool for enhancing collaboration, streamlining processes, and accelerating delivery. However, monitoring and observability within DevOps environments pose significant challenges, often leading to delayed issue detection, inefficient troubleshooting, and compromised service quality. These issues stem from DevOps environments' complex and ever-changing nature, where traditional monitoring tools often fall short, creating blind spots that can conceal performance issues or system failures. This research addresses these challenges by proposing an innovative approach to improve monitoring and observability in DevOps environments. Our solution, IntelligentMonitor, leverages real-time data collection, intelligent analytics, and automated anomaly detection powered by advanced technologies such as machine learning and artificial intelligence. The experimental results demonstrate that IntelligentMonitor effectively manages data overload, reduces alert fatigue, and improves system visibility, thereby enhancing performance and reliability. For instance, the average CPU usage across all components showed a decrease of 9.10%, indicating improved CPU efficiency. Similarly, memory utilization and network traffic showed an average increase of 7.33% and 0.49%, respectively, suggesting more efficient use of resources. By providing deep insights into system performance and facilitating rapid issue resolution, this research contributes to the DevOps community by offering a comprehensive solution to one of its most pressing challenges. This fosters more efficient, reliable, and resilient software development and delivery processes.
Components The key components that would need to be implemented are:
Implementation Details The core of the implementation would involve the following: - Setting up the data collection pipelines. - Building and training anomaly detection ML models on historical data. - Developing a real-time data processing pipeline. - Creating an alerting framework that ties into the ML models. - Building visualizations and dashboards.
The code would need to handle scaled-out, distributed execution for production environments.
Proper code documentation, logging, and testing would be added throughout the implementation.
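The anomaly-detection step above can be illustrated with a deliberately simple stand-in: a trailing z-score check over a metric stream. This is not the paper's actual ML model, just a minimal sketch of flagging a point that deviates sharply from its recent history; the CPU values are synthetic.

```python
from statistics import mean, stdev

def zscore_anomalies(series, window=10, threshold=3.0):
    """Flag indices that deviate more than `threshold` standard deviations
    from the trailing window's mean (a simple stand-in for an ML-based
    anomaly detector)."""
    anomalies = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu, sigma = mean(hist), stdev(hist)
        if sigma and abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Synthetic CPU-usage stream (percent) with one injected spike.
cpu = [42.0, 41.5, 43.1, 42.7, 41.9, 42.3, 43.0, 42.1, 41.8, 42.5,
       42.2, 97.0, 42.4]
print(zscore_anomalies(cpu))  # -> [11], the injected spike
```

In a production pipeline this check would run over the real-time stream from the data collection layer and feed the alerting framework.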
Usage Examples Usage examples could include:
References The implementation would follow the details provided in the original research paper: P. Thantharate, "IntelligentMonitor: Empowering DevOps Environments with Advanced Monitoring and Observability," 2023 International Conference on Information Technology (ICIT), Amman, Jordan, 2023, pp. 800-805, doi: 10.1109/ICIT58056.2023.10226123.
Any additional external libraries or sources used would be properly cited.
Tags - DevOps, Software Development, Collaboration, Streamlini...
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Iris Flower Data Set Cleaned’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/larsen0966/iris-flower-data-set-cleaned on 14 February 2022.
--- Dataset description provided by original source is as follows ---
If you find this dataset useful, an upvote is appreciated. The British statistician Ronald Fisher introduced the Iris flower data set in 1936, in a paper that described the use of multiple measurements in taxonomic problems as an example of linear discriminant analysis.
--- Original source retains full ownership of the source dataset ---
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This curated dataset consists of 269,353 patent documents (published patent applications and granted patents) spanning the 1976 to 2016 period and is intended to help identify promising R&D on the horizon in diagnostics, therapeutics, data analytics, and model biological systems.
USPTO Cancer Moonshot Patent Data was generated using USPTO examiner tools to execute a series of queries designed to identify cancer-specific patents and patent applications. This includes drugs, diagnostics, cell lines, mouse models, radiation-based devices, surgical devices, image analytics, data analytics, and genomic-based inventions.
“USPTO Cancer Moonshot Patent Data” by the USPTO, for public use. Frumkin, Jesse and Myers, Amanda F., Cancer Moonshot Patent Data (August, 2016).
Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:uspto_oce_cancer
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘California Housing Data (1990)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/harrywang/housing on 12 November 2021.
--- Dataset description provided by original source is as follows ---
This is the dataset used in this book: https://github.com/ageron/handson-ml/tree/master/datasets/housing to illustrate a sample end-to-end ML project workflow (pipeline). It is a great book - I highly recommend it!
The data is based on the 1990 California census.
"This dataset is a modified version of the California Housing dataset available from Luís Torgo's page (University of Porto). Luís Torgo obtained it from the StatLib repository (which is closed now). The dataset may also be downloaded from StatLib mirrors.
The following is the description from the book author:
This dataset appeared in a 1997 paper titled Sparse Spatial Autoregressions by Pace, R. Kelley and Ronald Barry, published in the Statistics and Probability Letters journal. They built it using the 1990 California census data. It contains one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).
The dataset in this directory is almost identical to the original, with two differences: 207 values were randomly removed from the total_bedrooms column, so we can discuss what to do with missing data. An additional categorical attribute called ocean_proximity was added, indicating (very roughly) whether each block group is near the ocean, near the Bay area, inland or on an island. This allows discussing what to do with categorical data. Note that the block groups are called "districts" in the Jupyter notebooks, simply because in some contexts the name "block group" was confusing."
http://www.dcc.fc.up.pt/%7Eltorgo/Regression/cal_housing.html
This is a dataset obtained from the StatLib repository. Here is the included description:
"We collected information on the variables using all the block groups in California from the 1990 Census. In this sample a block group on average includes 1425.5 individuals living in a geographically compact area. Naturally, the geographical area included varies inversely with the population density. We computed distances among the centroids of each block group as measured in latitude and longitude. We excluded all the block groups reporting zero entries for the independent and dependent variables. The final data contained 20,640 observations on 9 variables. The dependent variable is ln(median house value)."
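The two modifications described above (missing total_bedrooms values and the categorical ocean_proximity attribute) invite two routine preprocessing steps: median imputation and one-hot encoding. A small pure-Python sketch on hand-made rows (the values are illustrative, not from the dataset):

```python
from statistics import median

# Illustrative rows (not actual dataset values) showing the two
# modifications: a missing total_bedrooms entry and the categorical
# ocean_proximity attribute.
rows = [
    {"total_bedrooms": 880.0,  "ocean_proximity": "NEAR BAY"},
    {"total_bedrooms": None,   "ocean_proximity": "INLAND"},
    {"total_bedrooms": 1106.0, "ocean_proximity": "NEAR BAY"},
]

# Step 1: median-impute the missing total_bedrooms values.
fill = median(r["total_bedrooms"] for r in rows
              if r["total_bedrooms"] is not None)
for r in rows:
    if r["total_bedrooms"] is None:
        r["total_bedrooms"] = fill

# Step 2: one-hot encode ocean_proximity into indicator columns.
categories = sorted({r["ocean_proximity"] for r in rows})
for r in rows:
    value = r.pop("ocean_proximity")
    for c in categories:
        r[c] = int(value == c)

print(rows[1])  # the imputed, encoded second row
```

The book itself walks through the same two steps with scikit-learn transformers, which is the approach to use at real scale.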
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘world military power 2020’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mingookkim/world-military-power-2020 on 14 February 2022.
--- Dataset description provided by original source is as follows ---
I found this data on a site called data.world, where it was published as a dataset created by vizzup.
The data shows the 2020 world military rankings along with numerical details on each country’s army, navy, and air force.
Related figures such as population and economic indicators relevant to military power are also included.
It is a good basis for comparing military power across countries.
Original source: globalfirepower.com, retrieved 1 May 2020.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book series. It has 1 row and is filtered where the book is Big data : how the information revolution is transforming lives. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book publisher is Education Data Surveys. It features 7 columns including author, publication date, language, and book publisher.
ODC Public Domain Dedication and Licence (PDDL) v1.0http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
A. SUMMARY The dataset inventory provides a list of data maintained by departments that are candidates for open data publishing or have already been published and is collected in accordance with Chapter 22D of the Administrative Code. The inventory will be used in conjunction with department publishing plans to track progress toward meeting plan goals for each department.
B. HOW THE DATASET IS CREATED This dataset is collated in two ways: 1. Ongoing updates are made throughout the year to reflect new datasets; this involves DataSF staff reconciling publishing records after datasets are published. 2. Annual bulk updates: departments review their inventories, identify changes and updates, and submit those to DataSF for a once-a-year bulk update. Not all departments will have changes, or their changes may already have been captured as ongoing updates over the course of the prior year.
C. UPDATE PROCESS The dataset is synced automatically each day, but the underlying data changes manually throughout the year as needed.
D. HOW TO USE THIS DATASET Interpreting dates in this dataset: this dataset has 2 dates: 1. Date Added - when the dataset was added to the inventory itself. 2. First Published - the date the open data portal automatically captured when the dataset was first created; this is a system-generated date.
Note that in certain cases we may have published a dataset prior to it being added to the inventory. We do our best to have an accurate accounting of when something was added to this inventory and when it was published. In most cases the inventory addition will happen prior to publishing, but in certain cases it will be published and we will have missed updating the inventory as this is a manual process.
First published will give an accounting of when it was actually available on the open data catalog and date added when it was added to this list.
E. RELATED DATASETS
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Hitters Baseball Data’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mathchi/hitters-baseball-data on 30 September 2021.
--- Dataset description provided by original source is as follows ---
Major League Baseball Data from the 1986 and 1987 seasons.
Hitters
A data frame with 322 observations of major league players on the following 20 variables.
AtBat: Number of times at bat in 1986
Hits: Number of hits in 1986
HmRun: Number of home runs in 1986
Runs: Number of runs in 1986
RBI: Number of runs batted in in 1986
Walks: Number of walks in 1986
Years: Number of years in the major leagues
CAtBat: Number of times at bat during his career
CHits: Number of hits during his career
CHmRun: Number of home runs during his career
CRuns: Number of runs during his career
CRBI: Number of runs batted in during his career
CWalks: Number of walks during his career
League: A factor with levels A and N indicating player's league at the end of 1986
Division: A factor with levels E and W indicating player's division at the end of 1986
PutOuts: Number of put outs in 1986
Assists: Number of assists in 1986
Errors: Number of errors in 1986
Salary: 1987 annual salary on opening day in thousands of dollars
NewLeague: A factor with levels A and N indicating player's league at the beginning of 1987
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. This is part of the data that was used in the 1988 ASA Graphics Section Poster Session. The salary data were originally from Sports Illustrated, April 20, 1987. The 1986 and career statistics were obtained from The 1987 Baseball Encyclopedia Update published by Collier Books, Macmillan Publishing Company, New York.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with applications in R, www.StatLearning.com, Springer-Verlag, New York
summary(Hitters)
Dataset imported from https://www.r-project.org.
--- Original source retains full ownership of the source dataset ---
CC0 1.0 Universalhttps://spdx.org/licenses/CC0-1.0.html
Background: Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the "citation benefit". Furthermore, little is known about patterns in data reuse over time and across datasets. Method and Results: Here, we look at citation rates while controlling for many known citation predictors, and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. 
The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties. Conclusion: After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the results from the ODDPub text mining algorithm and the findings from manual analysis. Full-text PDFs of all articles parallel-published by Linköping University in 2022 were extracted from the university's repository, DiVA. These were analyzed using the ODDPub (https://github.com/quest-bih/oddpub) text mining algorithm to determine the extent of data sharing and identify the repositories where the data was shared. In addition to the results from ODDPub, manual analysis was conducted to confirm the presence of data sharing statements, assess data availability, and identify the repositories used.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Loopkevers Grensmaas - Ground beetles near the river Meuse in Flanders, Belgium is a species occurrence dataset published by the Research Institute for Nature and Forest (INBO). The dataset contains over 5,800 beetle occurrences sampled between 1998 and 1999 from 28 locations on the left bank (Belgium) of the river Meuse on the border between Belgium and the Netherlands. The dataset includes over 100 ground beetles species (Carabidae) and some non-target species. The data were used to assess the dynamics of the Grensmaas area and to help river management. Issues with the dataset can be reported at https://github.com/LifeWatchINBO/data-publication/tree/master/datasets/kevers-grensmaas-occurrences
To allow anyone to use this dataset, we have released the data to the public domain under a Creative Commons Zero waiver (http://creativecommons.org/publicdomain/zero/1.0/). We would appreciate it, however, if you read and follow these norms for data use (http://www.inbo.be/en/norms-for-data-use) and provide a link to the original dataset (https://doi.org/10.15468/hy3pzl) whenever possible. If you use these data for a scientific paper, please cite the dataset following the applicable citation norms and/or consider us for co-authorship. We are always interested to know how you have used or visualized the data, or to provide more information, so please contact us via the contact information provided in the metadata, opendata@inbo.be or https://twitter.com/LifeWatchINBO.
PAH method development and sample collection. This dataset is associated with the following publication: Wallace, M., J. Pleil, D. Whitaker, and K. Oliver. Dataset of polycyclic aromatic hydrocarbon recoveries from a selection of sorbent tubes for thermal desorption-gas chromatography/mass spectrometry analysis. Data in Brief. Elsevier B.V., Amsterdam, NETHERLANDS, 29: 105252, (2020).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Applied missing data analysis. It features 7 columns including author, publication date, language, and book publisher.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 2,597 rows and is filtered where the book publisher is Harper Collins e-books. It features 2 columns including publication date.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Home Price Index’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/PythonforSASUsers/hpindex on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The Federal Housing Finance Agency House Price Index (HPI) is a broad measure of the movement of single-family house prices. The HPI is a weighted, repeat-sales index, meaning that it measures average price changes in repeat sales or refinancings on the same properties. The technical methodology for constructing the index and for collecting and publishing the data is described at: http://www.fhfa.gov/PolicyProgramsResearch/Research/PaperDocuments/1996-03_HPI_TechDescription_N508.pdf
Contains monthly and quarterly time series from January 1991 to August 2016 for the U.S., state, and MSA categories. Analysis variables are the aggregate non-seasonally adjusted value and seasonally adjusted index values. The index value is 100 beginning January 1991.
This data is found on Data.gov
Can this data be combined with the corresponding census growth projections either at the state or MSA level to forecast 24 months out the highest and lowest home index values?
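Since the index is rebased to 100 at January 1991, converting any index value into cumulative appreciation since the base period is a one-line calculation. A minimal sketch (the index value below is hypothetical, not an actual FHFA figure):

```python
def appreciation_since_base(index_value, base=100.0):
    """Cumulative percent change implied by an HPI value relative to
    the January 1991 base of 100."""
    return (index_value / base - 1.0) * 100.0

# Hypothetical index value, not an actual FHFA figure.
print(appreciation_since_base(234.5))  # percent gain since Jan 1991
```

For forecasting exercises like the one suggested above, the same rebasing makes series from different states or MSAs directly comparable.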
--- Original source retains full ownership of the source dataset ---
This Chicago Park District dataset includes information about event permits requested through the Chicago Park District, including the name of applicant, the name of the event and a brief description, contact information, time of event including set-up and tear-down times, the name of the Park and location, and estimated number of event attendees. Additional information may be included depending on the type of the event, including proof of insurance, route maps for all races and runs, security plans and medical services and required city documents. Permit levels issued by the Department of Revenue include picnic levels, athletic levels, corporate levels, media levels, promotions levels, and festivals/performances levels. For more information, visit http://www.chicagoparkdistrict.com/permits-and-rentals/.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book publisher is Victor Rumanyika Publishing. It features 2 columns including publication date.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset belonging to the report: Open access potential and uptake in the context of Plan S - a partial gap analysis
On the report:
The analysis presented in the report, carried out by Utrecht University Library, aims to provide cOAlition S, an international group of research funding organizations, with initial quantitative and descriptive data on the availability and usage of various open access options in different fields and subdisciplines, and, as far as possible, their compliance with Plan S requirements.
Plan S, launched in September 2018, aims to accelerate a transition to full and immediate Open Access. In the guidance to implementation, released in November 2018 and updated in May 2019, a gap analysis of Open Access journals/platforms was announced. Its goal was to inform Coalition S funders on the Open Access options per field and identify fields where there is a need to increase the share of Open Access journals/platforms.
The report should be seen as a first step: an exploration in methodology as much as in results. Subsequent interpretation (e.g. on fields where funder investment/action is needed) and decisions on next steps (e.g. on more complete and longitudinal monitoring of Plan S-compliant venues) is intentionally left to cOAlition S and its members.
This work was commissioned on behalf of cOAlition S by the Dutch Research Council (NWO), a member of cOAlition S. Bianca Kramer and Jeroen Bosman of Utrecht University Library were appointed to lead the project.