100+ datasets found
  1. Data from: Non-dominated Sorting Genetic Algorithm-II

    • catalog.data.gov
    • gimi9.com
    • +1more
    Updated Apr 21, 2025
    + more versions
    Cite
    Agricultural Research Service (2025). Non-dominated Sorting Genetic Algorithm-II [Dataset]. https://catalog.data.gov/dataset/non-dominated-sorting-genetic-algorithm-ii-099d0
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    This code implements the non-dominated sorting genetic algorithm (NSGA-II) in the R statistical programming language. The function is theoretically applicable to any number of objectives without modification. The function automatically detects the number of objectives from the population matrix used in the function call. NSGA-II has been applied in ARS research for automatic calibration of hydrologic models (whittaker link) and economic optimization (whittaker link). Resources in this dataset: Resource Title: Non-dominated Sorting Genetic Algorithm-II. File Name: Web Page, url: https://www.ars.usda.gov/research/software/download/?softwareid=393&modecode=20-72-05-00 (download page)
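
    The ARS download page provides the actual function. As a rough illustration only of what non-dominated sorting on a population matrix looks like in R (the function name and interface below are assumptions, not the ARS code), here is a front-ranking sketch that detects the number of objectives from the matrix columns:

    ```r
    # Minimal sketch of non-dominated sorting on a population matrix, assuming
    # every objective is minimized; not the ARS implementation.
    nondominated_rank <- function(pop) {
      n_obj <- ncol(pop)          # number of objectives detected from the matrix
      n     <- nrow(pop)
      rank  <- integer(n)
      dominates <- function(a, b) all(a <= b) && any(a < b)
      remaining <- seq_len(n)
      front <- 1L
      while (length(remaining) > 0) {
        # a solution is in the current front if no remaining solution dominates it
        in_front <- vapply(remaining, function(i) {
          !any(vapply(remaining,
                      function(j) j != i && dominates(pop[j, ], pop[i, ]),
                      logical(1)))
        }, logical(1))
        rank[remaining[in_front]] <- front
        remaining <- remaining[!in_front]
        front <- front + 1L
      }
      rank
    }

    # Example: 30 candidate solutions evaluated on 2 objectives
    pop <- cbind(runif(30), runif(30))
    nondominated_rank(pop)
    ```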

  2. Using Descriptive Statistics to Analyse Data in R

    • kaggle.com
    zip
    Updated May 9, 2024
    Cite
    Enrico68 (2024). Using Descriptive Statistics to Analyse Data in R [Dataset]. https://www.kaggle.com/datasets/enrico68/using-descriptive-statistics-to-analyse-data-in-r
    Explore at:
    Available download formats: zip (105561 bytes)
    Dataset updated
    May 9, 2024
    Authors
    Enrico68
    Description

    • Load and view a real-world dataset in RStudio

    • Calculate “Measure of Frequency” metrics

    • Calculate “Measure of Central Tendency” metrics

    • Calculate “Measure of Dispersion” metrics

    • Use R’s in-built functions for additional data quality metrics

    • Create a custom R function to calculate descriptive statistics on any given dataset
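
    A minimal sketch of the kind of custom function the last task asks for, written against any numeric data frame. The column choices and output layout are illustrative assumptions, not the course solution:

    ```r
    # Descriptive statistics for every numeric column of an arbitrary data frame.
    describe_numeric <- function(df) {
      num <- df[vapply(df, is.numeric, logical(1))]
      do.call(rbind, lapply(names(num), function(v) {
        x <- num[[v]]
        data.frame(
          variable  = v,
          n         = sum(!is.na(x)),         # measure of frequency
          mean      = mean(x, na.rm = TRUE),  # central tendency
          median    = median(x, na.rm = TRUE),
          sd        = sd(x, na.rm = TRUE),    # dispersion
          iqr       = IQR(x, na.rm = TRUE),
          n_missing = sum(is.na(x))           # simple data quality check
        )
      }))
    }

    describe_numeric(mtcars)   # works on any built-in or loaded dataset
    ```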

  3. R Package History on CRAN

    • kaggle.com
    zip
    Updated Jul 18, 2022
    Cite
    Heads or Tails (2022). R Package History on CRAN [Dataset]. https://www.kaggle.com/datasets/headsortails/r-package-history-on-cran/code
    Explore at:
    Available download formats: zip (5637913 bytes)
    Dataset updated
    Jul 18, 2022
    Authors
    Heads or Tails
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The Comprehensive R Archive Network (CRAN) is the central repository for software packages in the powerful R programming language for statistical computing. It describes itself as "a network of ftp and web servers around the world that store identical, up-to-date, versions of code and documentation for R." If you're installing an R package in the standard way then it is provided by one of the CRAN mirrors.

    The ecosystem of R packages continues to grow at an accelerated pace, covering a multitude of aspects of statistics, machine learning, data visualisation, and many other areas. This dataset provides monthly updates of all the packages available through CRAN, as well as their release histories. Explore the evolution of the R multiverse and all of its facets through this comprehensive data.

    Content

    I'm providing 2 csv tables that describe the current set of R packages on CRAN, as well as the version history of these packages. To derive the data, I made use of the fantastic functionality of the tools package, via the CRAN_package_db function, and the equally wonderful packageRank package and its packageHistory function. The results from those functions were slightly adjusted and formatted. I might add further related tables over time.
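
    A rough sketch of that derivation step; the author's exact post-processing is not reproduced here, only the two documented function calls:

    ```r
    library(tools)
    library(packageRank)   # assumed to be installed from CRAN

    # One row per package currently on CRAN (basis for cran_package_overview.csv)
    overview_raw <- tools::CRAN_package_db()

    # Release history for a single package (basis for cran_package_history.csv);
    # in practice this would be looped over every package name in overview_raw
    history_ggplot2 <- packageRank::packageHistory("ggplot2")
    head(history_ggplot2)
    ```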

    See the associated blog post for how the data was derived, and for some ideas on how to explore this dataset.

    These are the tables contained in this dataset:

    • cran_package_overview.csv: all R packages currently available through CRAN, with (usually) 1 row per package. (At the time of the creation of this Kaggle dataset there were a few packages with 2 entries and different dependencies. Feel free to contribute some EDA investigating those.) Packages are listed in alphabetical order according to their names.

    • cran_package_history.csv: version history of virtually all packages in the previous table. This table has one row for each combination of package name and version number, which in most cases leads to multiple rows per package. Packages are listed in alphabetical order according to their names.

    I will update this dataset on a roughly monthly cadence by checking which packages have a newer version in the overview table, and then replacing

    Column Description

    Table cran_package_overview.csv: I decided to simplify the large number of columns provided by CRAN and tools::CRAN_package_db into a smaller set of more focused features. All columns are formatted as strings, except for the boolean feature needs_compilation, but the date_published can be read as a ymd date:

    • package: package name following the official spelling and capitalisation. Table is sorted alphabetically according to this column.
    • version: current version.
    • depends: package depends on which other packages.
    • imports: package imports which other packages.
    • licence: the licence under which the package is distributed (e.g. GPL versions)
    • needs_compilation: boolean feature describing whether the package needs to be compiled.
    • author: package author.
    • bug_reports: where to send bugs.
    • url: where to read more.
    • date_published: when the current version of the package was published. Note: this is not the date of the initial package release. See the package history table for that.
    • description: relatively detailed description of what the package is doing.
    • title: the title and tagline of the package.

    Table cran_package_history.csv: The output of packageRank::packageHistory for each package from the overview table. Almost all of them have a match in this table, and can be matched by package and version. All columns are strings, and the date can again be parsed as a ymd date:

    • package: package name. Joins to the feature of the same name in the overview table. Table is sorted alphabetically according to this column.
    • version: historical or current package version. Also joins. Secondary sorting column within each package name.
    • date: when this version was published. Should sort in the same way as the version does.
    • repository: on CRAN or in the Archive.
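
    A small sketch of loading and joining the two tables as described above, assuming the csv files sit in the working directory under their documented names:

    ```r
    library(dplyr)
    library(lubridate)

    overview <- read.csv("cran_package_overview.csv", stringsAsFactors = FALSE) |>
      mutate(date_published = ymd(date_published))
    history  <- read.csv("cran_package_history.csv", stringsAsFactors = FALSE) |>
      mutate(date = ymd(date))

    # One row per historical release, with the current overview fields attached
    releases <- history |>
      left_join(overview, by = c("package", "version"), suffix = c("_hist", "_now"))

    # e.g. how many releases each package has had
    count(history, package, sort = TRUE) |> head()
    ```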

    Acknowledgements

    All data is being made publicly available by the Comprehensive R Archive Network (CRAN). I'm grateful to the authors and maintainers of the packages tools and packageRank for providing the functionality to query CRAN packages smoothly and easily.

    The vignette photo is the official logo for the R language © 2016 The R Foundation. You can distribute the logo under the terms of the Creative Commons Attribution-ShareAlike 4.0 International license...

  4. RT-Sort parameters.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Dec 5, 2024
    Cite
    Tovar, Kenneth R.; Petzold, Linda R.; Kosik, Kenneth S.; van der Molen, Tjitse; Hansma, Paul K.; Bartram, Julian; Haussler, David; Hierlemann, Andreas; Parks, David F.; Robbins, Ash; Lim, Max; Cheng, Zhuowei (2024). RT-Sort parameters. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001364083
    Explore at:
    Dataset updated
    Dec 5, 2024
    Authors
    Tovar, Kenneth R.; Petzold, Linda R.; Kosik, Kenneth S.; van der Molen, Tjitse; Hansma, Paul K.; Bartram, Julian; Haussler, David; Hierlemann, Andreas; Parks, David F.; Robbins, Ash; Lim, Max; Cheng, Zhuowei
    Description

    With the use of high-density multi-electrode recording devices, electrophysiological signals resulting from action potentials of individual neurons can now be reliably detected on multiple adjacent recording electrodes. Spike sorting assigns these signals to putative neural sources. However, until now, spike sorting can only be performed after completion of the recording, preventing true real time usage of spike sorting algorithms. Utilizing the unique propagation patterns of action potentials along axons detected as high-fidelity sequential activations on adjacent electrodes, together with a convolutional neural network-based spike detection algorithm, we introduce RT-Sort (Real Time Sorting), a spike sorting algorithm that enables the sorted detection of action potentials within 7.5ms±1.5ms (mean±STD) after the waveform trough while the recording remains ongoing. RT-Sort’s true real-time spike sorting capabilities enable closed loop experiments with latencies comparable to synaptic delay times. We show RT-Sort’s performance on both Multi-Electrode Arrays as well as Neuropixels probes to exemplify RT-Sort’s functionality on different types of recording hardware and electrode configurations.

  5. sort

    • data.cityofchicago.org
    csv, xlsx, xml
    Updated Dec 1, 2025
    + more versions
    Cite
    Chicago Police Department (2025). sort [Dataset]. https://data.cityofchicago.org/Public-Safety/sort/bnsx-zzcw
    Explore at:
    Available download formats: xml, xlsx, csv
    Dataset updated
    Dec 1, 2025
    Authors
    Chicago Police Department
    Description

    This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system. In order to protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified. Should you have questions about this dataset, you may contact the Research & Development Division of the Chicago Police Department at 312.745.6071 or RandD@chicagopolice.org.

    Disclaimer: These crimes may be based upon preliminary information supplied to the Police Department by the reporting parties that have not been verified. The preliminary crime classifications may be changed at a later date based upon additional investigation and there is always the possibility of mechanical or human error. Therefore, the Chicago Police Department does not guarantee (either expressed or implied) the accuracy, completeness, timeliness, or correct sequencing of the information and the information should not be used for comparison purposes over time. The Chicago Police Department will not be responsible for any error or omission, or for the use of, or the results obtained from the use of this information. All data visualizations on maps should be considered approximate and attempts to derive specific addresses are strictly prohibited.

    The Chicago Police Department is not responsible for the content of any off-site pages that are referenced by or that reference this web page other than an official City of Chicago or Chicago Police Department web page. The user specifically acknowledges that the Chicago Police Department is not responsible for any defamatory, offensive, misleading, or illegal conduct of other users, links, or third parties and that the risk of injury from the foregoing rests entirely with the user. The unauthorized use of the words "Chicago Police Department," "Chicago Police," or any colorable imitation of these words or the unauthorized use of the Chicago Police Department logo is unlawful. This web page does not, in any way, authorize such use.

    Data is updated daily Tuesday through Sunday. The dataset contains more than 65,000 records/rows of data and cannot be viewed in full in Microsoft Excel. Therefore, when downloading the file, select CSV from the Export menu. Open the file in an ASCII text editor, such as Wordpad, to view and search.

    To access a list of Chicago Police Department - Illinois Uniform Crime Reporting (IUCR) codes, go to http://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e

  6. The Generation R Study. (2024). Columbia Card Task (CCT) [Data set]. Erasmus...

    • data.individualdevelopment.nl
    Updated Oct 17, 2024
    Cite
    (2024). The Generation R Study. (2024). Columbia Card Task (CCT) [Data set]. Erasmus MC. https://doi.org/10.60641/frzn-7a42 [Dataset]. https://data.individualdevelopment.nl/dataset/7c9076381404f3582ab2eb697a6e7860
    Explore at:
    Dataset updated
    Oct 17, 2024
    Description

    The Columbia Card Task (CCT) is a psychological test that measures cognitive functions related to executive functioning, such as planning, set shifting, decision-making, and inhibitory control. During the CCT, participants are presented with a deck of cards and are required to sort the cards based on different categories, with the rules for sorting changing over time.

  7. Reddit: /r/science

    • kaggle.com
    zip
    Updated Dec 17, 2022
    Cite
    The Devastator (2022). Reddit: /r/science [Dataset]. https://www.kaggle.com/datasets/thedevastator/exploring-reddit-r-science-subreddit-interaction
    Explore at:
    Available download formats: zip (205948 bytes)
    Dataset updated
    Dec 17, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reddit: /r/science

    Investigating Social Media Interactions and Popularity Metrics

    By Reddit [source]

    About this dataset

    The Reddit Subreddit Science dataset offers an in-depth exploration of the science-related conversations and content taking place on the popular website Reddit. This dataset provides valuable insights into user interactions, sentiment analysis and popularity trends across various types of science topics ranging from astrophysics to neuroscience. The data comprises key features such as post titles, post scores, comment counts, creation times and post URLs, which will help us to understand the dynamics and sentiments of the scientific discussions within this popular forum. Utilizing this data set can empower us to analyze how a certain topic has changed over time in terms of relevance, or what kind of posts are most successful at gaining attention from users. Ultimately we can leverage this analysis to better comprehend shifts in public opinion towards various aspects of current scientific knowledge.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    Research Ideas

    • Analyzing the topic trends within the subreddit over time, in order to understand which topics are most popular with readers.
    • Identifying relationships between levels of interaction (comments and upvotes) and sentiment (through text analysis), to track how users react to certain topics.
    • Tracking post and user metrics over time (such as average post length or number of comments per post), in order to monitor changes in outlook on the subreddit community as a whole

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: science.csv

    | Column name | Description |
    |:------------|:------------|
    | title | The title of the post. (String) |
    | score | The number of upvotes the post has received. (Integer) |
    | url | The URL associated with the post. (String) |
    | comms_num | The number of comments associated with the post. (Integer) |
    | created | The date and time the post was created. (DateTime) |
    | body | The content of the post. (String) |
    | timestamp | The timestamp of the post. (DateTime) |
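
    A minimal sketch of reading the file and relating post score to comment volume in R; the exact encoding of the two time columns is an assumption and should be checked against the file itself:

    ```r
    reddit <- read.csv("science.csv", stringsAsFactors = FALSE)

    # relationship between upvotes and number of comments
    cor(reddit$score, reddit$comms_num, use = "complete.obs")

    # posts per day, assuming `timestamp` parses as "YYYY-MM-DD HH:MM:SS"
    reddit$day <- as.Date(reddit$timestamp)
    head(sort(table(reddit$day), decreasing = TRUE))
    ```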

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Reddit.

  8. Film Circulation dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 12, 2024
    Cite
    Loist, Skadi; Samoilova, Evgenia (Zhenya) (2024). Film Circulation dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7887671
    Explore at:
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Film University Babelsberg KONRAD WOLF
    Authors
    Loist, Skadi; Samoilova, Evgenia (Zhenya)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.

    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.

    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.

    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
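
    The project's own scripts are included in the dataset; as an illustration only, both methods named above are available in the R stringdist package (whether the scripts use that particular package is an assumption):

    ```r
    library(stringdist)

    a <- "The Grand Budapest Hotel"
    b <- "Grand Budapest Hotel, The"   # reordered title
    c <- "The Grand Budapest Hotle"    # typo

    # cosine distance on character q-grams is insensitive to word order
    stringdist(a, b, method = "cosine", q = 3)
    stringdist(a, c, method = "cosine", q = 3)

    # OSA (optimal string alignment) counts edits, so it tolerates small typos
    # but penalizes reordering heavily
    stringdist(a, b, method = "osa")
    stringdist(a, c, method = "osa")
    ```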

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does that for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.

    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories, units of measurement, data sources and coding and missing data.

    The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset appears in wide format, all information for each festival is listed in one row. This

  9. ACNC 2019 Annual Information Statement Data

    • researchdata.edu.au
    • gimi9.com
    • +1more
    Updated May 10, 2021
    + more versions
    Cite
    Australian Charities and Not-for-profits Commission (ACNC) (2021). ACNC 2019 Annual Information Statement Data [Dataset]. https://researchdata.edu.au/acnc-2019-annual-statement-data/2975980
    Explore at:
    Dataset updated
    May 10, 2021
    Dataset provided by
    Data.gov (https://data.gov/)
    Authors
    Australian Charities and Not-for-profits Commission (ACNC)
    License

    Attribution 2.5 (CC BY 2.5): https://creativecommons.org/licenses/by/2.5/
    License information was derived automatically

    Description

    This dataset is updated weekly. Please ensure that you use the most up-to-date version.

    The Australian Charities and Not-for-profits Commission (ACNC) is Australia's national regulator of charities.

    Since 3 December 2012, charities wanting to access Commonwealth charity tax concessions (and other benefits) need to register with the ACNC. Although many charities choose to register, registration with the ACNC is voluntary.

    Each year, registered charities are required to lodge an Annual Information Statement (AIS) with the ACNC. Charities are required to submit their AIS within six months of the end of their reporting period.

    Registered charities can apply to the ACNC to have some or all of the information they provide withheld from the ACNC Register. However, there are only limited circumstances when the ACNC can agree to withhold information. If a charity has applied to have their data withheld, the AIS data relating to that charity has been excluded from this dataset.

    This dataset can be used to find the AIS information lodged by multiple charities. It can also be used to filter and sort by different variables across all AIS information. AIS information for individual charities can be viewed via the ACNC Charity Register.

    The AIS collects information about charity finances, and financial information provides a basis for understanding the charity and its activities in greater detail. We have published explanatory notes to help you understand this dataset.

    When comparing charities' financial information it is important to consider each charity's unique situation. This is particularly true for small charities, which are not compelled to provide financial reports (reports that often contain more details about their financial position and activities) as part of their AIS.

    For more information on interpreting financial information, please refer to the ACNC website.

    The ACNC also publishes other datasets on data.gov.au as part of our commitment to open data and transparent regulation. Please click here to view them.

    NOTE: It is possible that some information in this dataset might be subject to a future request from a charity to have their information withheld. If this occurs, this information will still appear in the dataset until the next update. Please consider this risk when using this dataset.
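
    A hypothetical illustration of filtering and sorting the AIS data in R; the file and column names below are placeholders and should be taken from the downloaded dataset and the ACNC explanatory notes:

    ```r
    library(dplyr)

    # assumed file name; replace with the actual downloaded csv
    ais <- read.csv("acnc-2019-annual-information-statement.csv")

    ais |>
      filter(charity_size == "Small") |>        # hypothetical column name
      arrange(desc(total_gross_income)) |>      # hypothetical column name
      head()
    ```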

  10. Data from: Raster Dataset Model of Oil Shale Resources in the Piceance...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 12, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Raster Dataset Model of Oil Shale Resources in the Piceance Basin, Colorado [Dataset]. https://catalog.data.gov/dataset/raster-dataset-model-of-oil-shale-resources-in-the-piceance-basin-colorado
    Explore at:
    Dataset updated
    Nov 12, 2025
    Dataset provided by
    U.S. Geological Survey
    Area covered
    Colorado
    Description

    ESRI GRID raster datasets were created to display and quantify oil shale resources for seventeen zones in the Piceance Basin, Colorado as part of a 2009 National Oil Shale Assessment. The oil shale zones in descending order are: Bed 44, A Groove, Mahogany Zone, B Groove, R-6, L-5, R-5, L-4, R-4, L-3, R-3, L-2, R-2, L-1, R-1, L-0, and R-0. Each raster cell represents a one-acre square of the land surface and contains values for either oil yield in barrels per acre, gallons per ton, or isopach thickness, in feet, as defined by the grid name suffix: _b (barrels per acre), _g (gallons per ton), and _i (isopach thickness), where the suffix is appended to the name of the oil shale zone.
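
    A sketch of reading one of the zone grids in R with the terra package; the grid name below follows the naming pattern described above but is a hypothetical example, not a file name confirmed by this record:

    ```r
    library(terra)

    oil_yield <- rast("mahogany_b")      # ESRI GRID folder, barrels per acre
    summary(oil_yield)

    # total resource estimate: each cell is one acre, so cell values sum directly
    global(oil_yield, "sum", na.rm = TRUE)
    ```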

  11. Cleaned NHANES 1988-2018

    • figshare.com
    txt
    Updated Feb 18, 2025
    Cite
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). Cleaned NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v9
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential to study the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey demographics (281 variables), dietary consumption (324 variables), physiological functions (1,040 variables), occupation (61 variables), questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood), medications (29 variables), mortality information linked from the National Death Index (15 variables), survey weights (857 variables), environmental exposure biomarker measurements (598 variables), and chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).

    csv Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 excel file. The curated NHANES datasets involve 20 .csv formatted files, two for each module, with one as the uncleaned version and the other as the cleaned version. The modules are labeled as the following: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments. "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES. "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables. "dictionary_drug_codes.csv" contains the dictionary of descriptors for the drug codes. "nhanes_inconsistencies_documentation.xlsx" is an excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.

    R Data Record: For researchers who want to conduct their analysis in the R programming language, the cleaned NHANES modules and the data dictionaries can be downloaded as a .zip file which includes an .RData file and an .R file. "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data. "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e. our pipeline) to curate the original NHANES data.

    Example starter code: The set of starter code to help users conduct exposome analysis consists of four R markdown files (.Rmd). We recommend going through the tutorials in order. "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together. "example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model. "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and multiple variables with and without accounting for the NHANES sampling design. "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
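
    A generic sketch of getting started in R, assuming the cleaned .RData file has been downloaded; the object and design-variable names below are placeholders, and the bundled example .Rmd tutorials show the authors' intended workflow:

    ```r
    load("w - nhanes_1988_2018.RData")   # loads the curated modules as R objects
    ls()                                 # inspect what was loaded

    # Survey-weighted analysis uses the survey package. SDMVPSU / SDMVSTRA /
    # WTMEC2YR are standard NHANES design variables; the curated files may
    # rename them, and `nhanes_merged` is a placeholder for a merged
    # demographics + weights + response data frame.
    library(survey)
    design <- svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR,
                        nest = TRUE, data = nhanes_merged)
    fit <- svyglm(outcome ~ exposure + age, design = design)
    summary(fit)
    ```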

  12. ACRA Information on Corporate Entities ('R')

    • data.gov.sg
    Updated Nov 18, 2025
    + more versions
    Cite
    Accounting and Corporate Regulatory Authority (2025). ACRA Information on Corporate Entities ('R') [Dataset]. https://data.gov.sg/datasets?sort=updatedAt&resultId=d_2b8c54b2a490d2fa36b925289e5d9572
    Explore at:
    Dataset updated
    Nov 18, 2025
    Dataset authored and provided by
    Accounting and Corporate Regulatory Authority (http://www.acra.gov.sg/)
    License

    https://data.gov.sg/open-data-licence

    Time period covered
    Jan 1970 - Nov 2025
    Description

    Dataset from Accounting and Corporate Regulatory Authority. For more information, visit https://data.gov.sg/datasets/d_2b8c54b2a490d2fa36b925289e5d9572/view

  13. Dataset and R script associated to the publication of Billaud and co-authors...

    • entrepot.recherche.data.gouv.fr
    text/x-r-notebook +2
    Updated Aug 16, 2024
    Cite
    VÉRONIQUE LEFEBVRE; BENOÎT MOURY; JUDITH HIRSCH; WILLIAM BILLAUD; LUCIE TAMISIER; Félicie LOPEZ-LAURI; ANNE MASSIRE; VALENTIN RIBAUT; MARION SZADKOWSKI (2024). Dataset and R script associated to the publication of Billaud and co-authors (2024) "Unveiling Pepper Immunity’s Robustness to Temperature Shifts: Insights for Empowering Future Crops" [Dataset]. http://doi.org/10.57745/3MN2BU
    Explore at:
    Available download formats: tsv (157308), tsv (8204), tsv (152785), text/x-r-notebook (129140), txt (7871)
    Dataset updated
    Aug 16, 2024
    Dataset provided by
    Recherche Data Gouv
    Authors
    VÉRONIQUE LEFEBVRE; BENOÎT MOURY; JUDITH HIRSCH; WILLIAM BILLAUD; LUCIE TAMISIER; Félicie LOPEZ-LAURI; ANNE MASSIRE; VALENTIN RIBAUT; MARION SZADKOWSKI
    License

    https://spdx.org/licenses/etalab-2.0.html

    Dataset funded by
    Horizon Europe Program
    INRAE BAP
    SFR TERSYS
    INRAE SPE
    Description

    The dataset contains raw data and the script associated to the publication of Billaud et al. (2024) "Unveiling Pepper Immunity’s Robustness to Temperature Shifts: Insights for Empowering Future Crops" published by William BILLAUD, Judith HIRSCH, Valentin RIBAUT, Lucie TAMISIER, Anne MASSIRE, Marion SZADKOWSKI, Félicie LOPEZ-LAURI, Benoît MOURY, Véronique LEFEBVRE. This study aimed at proposing estimators of the robustness of immunity in pepper (Capsicum annuum L.). We examined robustness of the mean of immunity and robustness of the variation of immunity, as well as the deviation from the orthogonal regression (ODR_Si) estimator, delivering a total of nine quantitative robustness estimators. We characterized the immunity of an INRAE core collection of accessions representative of pepper (Capsicum annuum L.) natural diversity, to two major pathogens, the oomycete Phytophthora capsici Leon. and potato virus Y. For each pathogen, the immunity of accessions was measured in two environments contrasted for temperature. The results showed that for each type of robustness and each pathogen, the impact of temperature change on immunity varied between accessions. The robustness estimators proved to be complementary and differed in terms of their heritability and ability to discriminate between accessions. A positive and significant correlation was observed between immunity and robustness. There was no significant relationship between the robustness of immunity to the two pathogens, but some accessions showed both high immunity and high robustness against both pathogens. These results justified the need to consider both immunity and its robustness to environmental variations in order to select varieties adapted to current and future climate change. Robustness is also an important component of the value of sustainable cultivation and use, and should be considered when registering future varieties.

  14. Data from: Datasets for lot sizing and scheduling problems in the...

    • data.mendeley.com
    • narcis.nl
    Updated Jan 19, 2021
    Cite
    Juan Piñeros (2021). Datasets for lot sizing and scheduling problems in the fruit-based beverage production process [Dataset]. http://doi.org/10.17632/j2x3gbskfw.1
    Explore at:
    Dataset updated
    Jan 19, 2021
    Authors
    Juan Piñeros
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The datasets presented here were partially used in “Formulation and MIP-heuristics for the lot sizing and scheduling problem with temporal cleanings” (Toscano, A., Ferreira, D., Morabito, R., Computers & Chemical Engineering) [1], in “A decomposition heuristic to solve the two-stage lot sizing and scheduling problem with temporal cleaning” (Toscano, A., Ferreira, D., Morabito, R., Flexible Services and Manufacturing Journal) [2], and in “A heuristic approach to optimize the production scheduling of fruit-based beverages” (Toscano et al., Gestão & Produção, 2020) [3]. In fruit-based production processes, there are two production stages: preparation tanks and production lines. This production process has some process-specific characteristics, such as temporal cleanings and synchrony between the two production stages, which make optimized production planning and scheduling even more difficult. In this sense, some papers in the literature have proposed different methods to solve this problem. To the best of our knowledge, there are no standard datasets used by researchers in the literature in order to verify the accuracy and performance of proposed methods or to be a benchmark for other researchers considering this problem. The authors have been using small data sets that do not satisfactorily represent different scenarios of production. Since the demand in the beverage sector is seasonal, a wide range of scenarios enables us to evaluate the effectiveness of the proposed methods in the scientific literature in solving real scenarios of the problem. The datasets presented here include data based on real data collected from five beverage companies. We presented four datasets that are specifically constructed assuming a scenario of restricted capacity and balanced costs. These datasets are supplementary data for the paper submitted to Data in Brief [4]. [1] Toscano, A., Ferreira, D., Morabito, R., Formulation and MIP-heuristics for the lot sizing and scheduling problem with temporal cleanings, Computers & Chemical Engineering. 142 (2020) 107038. Doi: 10.1016/j.compchemeng.2020.107038. [2] Toscano, A., Ferreira, D., Morabito, R., A decomposition heuristic to solve the two-stage lot sizing and scheduling problem with temporal cleaning, Flexible Services and Manufacturing Journal. 31 (2019) 142-173. Doi: 10.1007/s10696-017-9303-9. [3] Toscano, A., Ferreira, D., Morabito, R., Trassi, M. V. C., A heuristic approach to optimize the production scheduling of fruit-based beverages. Gestão & Produção, 27(4), e4869, 2020. https://doi.org/10.1590/0104-530X4869-20. [4] Piñeros, J., Toscano, A., Ferreira, D., Morabito, R., Datasets for lot sizing and scheduling problems in the fruit-based beverage production process. Data in Brief (2021).

  15. Essential data of the public order - Ville de Roubaix | gimi9.com

    • gimi9.com
    Updated Jan 12, 2024
    Cite
    (2024). Essential data of the public order - Ville de Roubaix | gimi9.com [Dataset]. https://gimi9.com/dataset/eu_https-opendata-roubaix-fr-explore-dataset-donneesessentielles-1-/
    Explore at:
    Dataset updated
    Jan 12, 2024
    Area covered
    Roubaix
    Description

    Essential data on the public procurement of the city of Roubaix. Articles R. 2196-1 and R. 3131-1 of the Public Procurement Code provide that the buyer or the granting authority must offer, on its buyer profile, free, direct and complete access to the essential data of public contracts and concession contracts, with the exception of information the disclosure of which would be contrary to public policy. These key data relate to the procurement procedure, the content of the contract and its execution, in order to create an ecosystem of public procurement data. Annex 15 to the Code, relating to the essential data of public procurement, specifies the lists of data to be published on buyer profiles and the arrangements for their publication: in particular, it lays down the formats, standards and nomenclatures in which the data are to be published. Data published by the City on the portal https://www.marches-publics.info/. The dataset will be updated once every 2 months.

  16. Reddit: /r/confession

    • kaggle.com
    zip
    Updated Dec 19, 2022
    Cite
    The Devastator (2022). Reddit: /r/confession [Dataset]. https://www.kaggle.com/datasets/thedevastator/uncovering-insights-into-human-behavior-through
    Explore at:
    Available download formats: zip (538733 bytes)
    Dataset updated
    Dec 19, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reddit: /r/confession

    Analyzing Upvotes and Open-ended Responses

    By Reddit [source]

    About this dataset

    This dataset provides an insight into human behaviour by exploring the Reddit confessions submitted between January 2018 and January 2019. From analyzing the upvotes on confessions to gaining insights from open-ended responses, this dataset allows us to delve deeper into how individuals around the world feel, think and act. It contains the title, score (upvotes), id, url, number of comments, and creation date/time of each confession, along with its full body and timestamp. With such a trove of valuable information at hand, this dataset unlocks endless possibilities for researchers in human behaviour and beyond.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    Research Ideas

    • Using the collected Reddit confessions to study factors that lead to higher number of upvotes.
    • Analyzing the open-ended responses in reddit confessions to understand human behavior and motivations.
    • Comparing Reddit confession trends from different years or even months, such as topics and upvote scores, in order to predict future trends on Reddit confessions

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: confession.csv

    | Column name | Description |
    |:------------|:------------|
    | title | The title of the confession post. (String) |
    | score | The number of upvotes the confession post has received. (Integer) |
    | url | The URL of the confession post. (String) |
    | comms_num | The number of comments the confession post has received. (Integer) |
    | created | The date the confession post was created. (Date) |
    | body | The body of the confession post. (String) |
    | timestamp | The timestamp of when the confession post was created. (Integer) |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Reddit.

  17. GERDA datasets including NGS and SGA data

    • data.mendeley.com
    Updated Apr 26, 2023
    + more versions
    Cite
    Fabian Otte (2023). GERDA datasets including NGS and SGA data [Dataset]. http://doi.org/10.17632/8c4zbxfvwk.3
    Explore at:
    Dataset updated
    Apr 26, 2023
    Authors
    Fabian Otte
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets linked to the publication "Revealing viral and cellular dynamics of HIV-1 at the single-cell level during early treatment periods", Otte et al. 2023, published in Cell Reports Methods. Pre-ART (antiretroviral therapy) cryo-conserved and whole blood specimens were sampled for HIV-1 virus reservoir determination in HIV-1 positive individuals from the Swiss HIV Study Cohort. Patients were monitored for proviral DNA, poly-A transcripts (RNA), late protein translation (Gag and Envelope reactivation co-detection assay, GERDA) and intact viruses (gold standard: viral outgrowth assay, VOA). In this dataset we deposited the pipeline for the multidimensional data analysis of our newly established GERDA method, using DBSCAN and tSNE. For further comprehension, NGS and Sanger sequencing data were attached as processed and raw data (GenBank).

    Resubmitted to Cell Reports Methods (Jan-2023), accepted in principle (Mar-2023)

    GERDA is a new detection method to decipher the HIV-1 cellular reservoir in blood (tissue or any other specimen). It integrates HIV-1 Gag and Env co-detection along with cellular surface markers to reveal 1) which cells still contain HIV-1 translation-competent virus and 2) which markers the respective infected cells express. The phenotypic marker repertoire of the cells allows predictions on potential homing and an assessment of the HIV-1 (tissue) reservoir. All FACS data were acquired on a BD LSRFortessa FACS machine (markers: CCR7, CD45RA, CD28, CD4, CD25, PD1, IntegrinB7, CLA, HIV-1 Env, HIV-1 Gag). Raw FACS data (pre-gated CD4CD3+ T-cells) were arcsin transformed and dimensionally reduced using optsne. Data were further clustered using DBSCAN, and either individual clusters were further analyzed for individual marker expression or expression profiles of all relevant clusters were analyzed by heatmaps. Sequences before/after therapy initiation and during viral outgrowth cultures were monitored for individuals P01-46 and P04-56 by next-generation sequencing (NGS of the HIV-1 Envelope V3 loop only) and by Sanger sequencing (single genome amplification, SGA).

    Files included:

    • data normalization code (by Julian Spagnuolo)
    • FACS normalized data as CSV (XXX_arcsin.csv)
    • OMIQ conText file (_OMIQ-context_XXX)
    • arcsin normalized FACS data after optsne dimension reduction with OMIQ.ai as CSV file (XXXarcsin.csv.csv)
    • R pipeline with codes (XXX_commented.R)
    • P01_46-NGS and Sanger sequences
    • P04_56-NGS and Sanger sequences
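
    A rough sketch of the DBSCAN clustering step only, assuming an exported csv of opt-SNE coordinates; the coordinate column names and the eps/minPts values are illustrative assumptions, not the authors' settings:

    ```r
    library(dbscan)

    facs <- read.csv("XXX_arcsin.csv")              # arcsin-normalized FACS export
    emb  <- facs[, c("opt_SNE_1", "opt_SNE_2")]     # hypothetical coordinate columns

    cl <- dbscan(emb, eps = 0.5, minPts = 25)       # density-based clustering
    table(cl$cluster)                               # cluster sizes (0 = noise)

    plot(emb, col = cl$cluster + 1L, pch = 16, cex = 0.3,
         main = "DBSCAN clusters on the opt-SNE embedding")
    ```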

  18. madelon

    • openml.org
    Updated May 22, 2015
    Cite
    (2015). madelon [Dataset]. https://www.openml.org/d/1485
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 22, 2015
    Description

    Author: Isabelle Guyon
    Source: UCI
    Please cite: Isabelle Guyon, Steve R. Gunn, Asa Ben-Hur, Gideon Dror, 2004. Result analysis of the NIPS 2003 feature selection challenge.

    Abstract:

    MADELON is an artificial dataset, which was part of the NIPS 2003 feature selection challenge. This is a two-class classification problem with continuous input variables. The difficulty is that the problem is multivariate and highly non-linear.

    Source:

    Isabelle Guyon, Clopinet, 955 Creston Road, Berkeley, CA 90708, isabelle '@' clopinet.com

    Data Set Information:

    MADELON is an artificial dataset containing data points grouped in 32 clusters placed on the vertices of a five-dimensional hypercube and randomly labeled +1 or -1. The five dimensions constitute 5 informative features. 15 linear combinations of those features were added to form a set of 20 (redundant) informative features. Based on those 20 features one must separate the examples into the 2 classes (corresponding to the +-1 labels). A number of distractor features called 'probes', having no predictive power, were also added. The order of the features and patterns was randomized.

    This dataset is one of five datasets used in the NIPS 2003 feature selection challenge. The original data was split into training, validation and test sets. Target values are provided only for the first two sets (not for the test set). So, this dataset version contains all the examples from the training and validation partitions.
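
    One way to pull this dataset version into R, assuming the OpenML package is installed and configured; the data id 1485 comes from the record URL above:

    ```r
    library(OpenML)

    madelon <- getOMLDataSet(data.id = 1485L)
    dat <- madelon$data
    dim(dat)                          # examples from training + validation, plus the target

    target <- madelon$target.features # name of the class column as registered on OpenML
    table(dat[[target]])              # the two classes of the binary target
    ```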

    There is no attribute information provided to avoid biasing the feature selection process.

    Relevant Papers:

    The best challenge entrants wrote papers collected in the book: Isabelle Guyon, Steve Gunn, Masoud Nikravesh, Lotfi Zadeh (Eds.), Feature Extraction, Foundations and Applications. Studies in Fuzziness and Soft Computing. Physica-Verlag, Springer.

    Isabelle Guyon, et al, 2007. Competitive baseline methods set new standards for the NIPS 2003 feature selection benchmark. Pattern Recognition Letters 28 (2007) 1438–1444.

    Isabelle Guyon, et al. 2006. Feature selection with the CLOP package. Technical Report.

  19. Reddit r/AskScience Flair Dataset

    • data.mendeley.com
    Updated May 23, 2022
    + more versions
    Cite
    Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
    Explore at:
    Dataset updated
    May 23, 2022
    Authors
    Sumit Mishra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reddit is a social news, content rating and discussion website. It's one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit has different subreddits; here we'll use the r/AskScience subreddit.

    The dataset is extracted from the subreddit /r/AskScience on Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 datapoints and 25 columns. The dataset contains information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data was extracted using Python and Pushshift's API, and a little bit of cleaning was done using NumPy and pandas as well (see the descriptions of individual columns below).

    The dataset contains the following columns and descriptions:

    • author - Redditor name
    • author_fullname - Redditor full name
    • contest_mode - Contest mode [implements obscured scores and randomized sorting].
    • created_utc - Time the submission was created, represented in Unix time.
    • domain - Domain of submission.
    • edited - If the post is edited or not.
    • full_link - Link of the post on the subreddit.
    • id - ID of the submission.
    • is_self - Whether or not the submission is a self post (text-only).
    • link_flair_css_class - CSS class used to identify the flair.
    • link_flair_text - Flair on the post or the link flair's text content.
    • locked - Whether or not the submission has been locked.
    • num_comments - The number of comments on the submission.
    • over_18 - Whether or not the submission has been marked as NSFW.
    • permalink - A permalink for the submission.
    • retrieved_on - Time ingested.
    • score - The number of upvotes for the submission.
    • description - Description of the submission.
    • spoiler - Whether or not the submission has been marked as a spoiler.
    • stickied - Whether or not the submission is stickied.
    • thumbnail - Thumbnail of the submission.
    • question - Question asked in the submission.
    • url - The URL the submission links to, or the permalink if a self post.
    • year - Year of the submission.
    • banned - Banned by the moderator or not.

    This dataset can be used for Flair Prediction, NSFW Classification, and different Text Mining/NLP tasks. Exploratory Data Analysis can also be done to get the insights and see the trend and patterns over the years.

  20. The Dynamics of Collective Action Corpus

    • zenodo.org
    bin
    Updated Mar 13, 2024
    Cite
    Dustin S. Stoltz; Marshall A. Taylor; Jennifer S.K. Dudley (2024). The Dynamics of Collective Action Corpus [Dataset]. http://doi.org/10.5281/zenodo.8415049
    Explore at:
    Available download formats: bin
    Dataset updated
    Mar 13, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dustin S. Stoltz; Marshall A. Taylor; Jennifer S.K. Dudley
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository includes two datasets, a Document-Term Matrix and associated metadata, for 17,493 New York Times articles covering protest events, both saved as single R objects.

    These datasets are based on the original Dynamics of Collective Action (DoCA) dataset (Wang and Soule 2012; Earl, Soule, and McCarthy). The original DoCA dataset contains variables for protest events referenced in roughly 19,676 New York Times articles reporting on collective action events occurring in the US between 1960 and 1995. Data were collected as part of the Dynamics of Collective Action Project at Stanford University. Research assistants read every page of all daily issues of the New York Times to find descriptions of 23,624 distinct protest events. The text of the news articles was not included in the original DoCA data.

    We attempted to recollect the raw text in a semi-supervised fashion by matching article titles to create the Dynamics of Collective Action Corpus. In addition to hand-checking random samples and hand-collecting some articles (specifically, in the case of false positives), we also used some automated matching processes to ensure the recollected article titles matched their respective titles in the DoCA dataset. The final number of recollected and matched articles is 17,493.

    We then subset the original DoCA dataset to include only rows that match a recollected article. The file "20231006_dca_metadata_subset.Rdata" contains all of the metadata variables from the original DoCA dataset (see Codebook), with the addition of "pdf_file" (used to link to original article pdfs) and "pub_title" (which is the title of the recollected article and may differ from the "title" variable in the original dataset), for a total of 106 variables and 21,126 rows (noting that a row is a distinct protest event and one article may cover more than one protest event).

    Once collected, we prepared these texts using typical preprocessing procedures (and some less typical procedures, which were necessary given that these were OCRed texts). We followed these steps in this order: We removed headers and footers that were consistent across all digitized stories and any web links or HTML; added a single space before an uppercase letter when it was flush against a lowercase letter to its right (e.g., turning "JohnKennedy'' into "John Kennedy''); removed excess whitespace; converted all characters to the broadest range of Latin characters and then transliterated to "Basic Latin'' ASCII characters; replaced curly quotes with their ASCII counterparts; replaced contractions (e.g., turned "it's'' into "it is''); removed punctuation; removed capitalization; removed numbers; fixed word kerning; applied a final extra round of whitespace removal.

    We then tokenized them by following the rule that each word is a character string surrounded by a single space. At this step, each document is then a list of tokens. We count each unique token to create a document-term matrix (DTM), where each row is an article, each column is a unique token (occurring at least once in the corpus as a whole), and each cell is the number of times each token occurred in each article. Finally, we removed words (i.e., columns in the DTM) that occurred less than four times in the corpus as a whole or were only a single character in length (likely orphaned characters from the OCRing process). The final DTM has 66,552 unique words, 10,134,304 total tokens and 17,493 documents. The "20231006_dca_dtm.Rdata" is a sparse matrix class object from the Matrix R package.

    In R, use the load() function to load the objects `dca_dtm` and `dca_meta`. To associate `dca_meta` with `dca_dtm`, match the "pdf_file" variable in `dca_meta` to the rownames of `dca_dtm`.
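
    Following those loading instructions, a minimal sketch (file names taken from this record):

    ```r
    library(Matrix)   # class of dca_dtm (sparse matrix)

    load("20231006_dca_dtm.Rdata")              # loads `dca_dtm`
    load("20231006_dca_metadata_subset.Rdata")  # loads `dca_meta`

    # line up metadata rows with DTM rows via the shared pdf_file key
    idx <- match(dca_meta$pdf_file, rownames(dca_dtm))

    # e.g. total token count of the article behind the first protest event
    sum(dca_dtm[idx[1], ])
    ```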
