84 datasets found

Federal Court Cases: Integrated Database Series
catalog.data.gov
Updated Mar 12, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bureau of Justice Statistics (2025). Federal Court Cases: Integrated Database Series [Dataset]. https://catalog.data.gov/dataset/federal-court-cases-integrated-database-series-34e8a
Explore at:
Dataset updated
Mar 12, 2025
Dataset provided by
Bureau of Justice Statisticshttp://bjs.ojp.gov/
Description
Investigator(s): Federal Judicial Center The purpose of this data collection is to provide an official public record of the business of the federal courts. The data originate from 100 court offices throughout the United States. Information was obtained at two points in the life of a case: filing and termination. The termination data contain information on both filing and terminations, while the pending data contain only filing information. For the appellate and civil data, the unit of analysis is a single case. The unit of analysis for the criminal data is a single defendant.Years Produced: Updated bi-annually with annual data.
f
Data_Sheet_1_“R” U ready?: a case study using R to analyze changes in gene...
frontiersin.figshare.com
docx
Updated Mar 22, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder (2024). Data_Sheet_1_“R” U ready?: a case study using R to analyze changes in gene expression during evolution.docx [Dataset]. http://doi.org/10.3389/feduc.2024.1379910.s001
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.3389/feduc.2024.1379910.s001
Dataset updated
Mar 22, 2024
Dataset provided by
Frontiers
Authors
Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.
g
Coronavirus (Covid-19) Data in the United States
github.com
openicpsr.org
+3more
csv
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
New York Times, Coronavirus (Covid-19) Data in the United States [Dataset]. https://github.com/nytimes/covid-19-data
Explore at:
csvAvailable download formats
Dataset provided by
New York Times
License
https://github.com/nytimes/covid-19-data/blob/master/LICENSEhttps://github.com/nytimes/covid-19-data/blob/master/LICENSE
Description
The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.
Since the first reported coronavirus case in Washington State on Jan. 21, 2020, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.
We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.
The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET
zenodo.org
data.niaid.nih.gov
pdf
Updated Jul 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nafiz Sadman; Nishat Anjum; Kishor Datta Gupta; Kishor Datta Gupta; Nafiz Sadman; Nishat Anjum (2024). INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET [Dataset]. http://doi.org/10.5281/zenodo.4047648
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4047648
Dataset updated
Jul 19, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Nafiz Sadman; Nishat Anjum; Kishor Datta Gupta; Kishor Datta Gupta; Nafiz Sadman; Nishat Anjum
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Bangladesh, United States
Description
Introduction

There are several works based on Natural Language Processing on newspaper reports. Mining opinions from headlines [ 1 ] using Standford NLP and SVM by Rameshbhaiet. Al.compared several algorithms on a small and large dataset. Rubinet. al., in their paper [ 2 ], created a mechanism to differentiate fake news from real ones by building a set of characteristics of news according to their types. The purpose was to contribute to the low resource data available for training machine learning algorithms. Doumitet. al.in [ 3 ] have implemented LDA, a topic modeling approach to study bias present in online news media.

However, there are not many NLP research invested in studying COVID-19. Most applications include classification of chest X-rays and CT-scans to detect presence of pneumonia in lungs [ 4 ], a consequence of the virus. Other research areas include studying the genome sequence of the virus[ 5 ][ 6 ][ 7 ] and replicating its structure to fight and find a vaccine. This research is crucial in battling the pandemic. The few NLP based research publications are sentiment classification of online tweets by Samuel et el [ 8 ] to understand fear persisting in people due to the virus. Similar work has been done using the LSTM network to classify sentiments from online discussion forums by Jelodaret. al.[ 9 ]. NKK dataset is the first study on a comparatively larger dataset of a newspaper report on COVID-19, which contributed to the virus’s awareness to the best of our knowledge.

2 Data-set Introduction

2.1 Data Collection

We accumulated 1000 online newspaper report from United States of America (USA) on COVID-19. The newspaper includes The Washington Post (USA) and StarTribune (USA). We have named it as “Covid-News-USA-NNK”. We also accumulated 50 online newspaper report from Bangladesh on the issue and named it “Covid-News-BD-NNK”. The newspaper includes The Daily Star (BD) and Prothom Alo (BD). All these newspapers are from the top provider and top read in the respective countries. The collection was done manually by 10 human data-collectors of age group 23- with university degrees. This approach was suitable compared to automation to ensure the news were highly relevant to the subject. The newspaper online sites had dynamic content with advertisements in no particular order. Therefore there were high chances of online scrappers to collect inaccurate news reports. One of the challenges while collecting the data is the requirement of subscription. Each newspaper required $1 per subscriptions. Some criteria in collecting the news reports provided as guideline to the human data-collectors were as follows:

The headline must have one or more words directly or indirectly related to COVID-19.

The content of each news must have 5 or more keywords directly or indirectly related to COVID-19.

The genre of the news can be anything as long as it is relevant to the topic. Political, social, economical genres are to be more prioritized.

Avoid taking duplicate reports.

Maintain a time frame for the above mentioned newspapers.

To collect these data we used a google form for USA and BD. We have two human editor to go through each entry to check any spam or troll entry.

2.2 Data Pre-processing and Statistics

Some pre-processing steps performed on the newspaper report dataset are as follows:

Remove hyperlinks.

Remove non-English alphanumeric characters.

Remove stop words.

Lemmatize text.

While more pre-processing could have been applied, we tried to keep the data as much unchanged as possible since changing sentence structures could result us in valuable information loss. While this was done with help of a script, we also assigned same human collectors to cross check for any presence of the above mentioned criteria.

The primary data statistics of the two dataset are shown in Table 1 and 2.

Table 1: Covid-News-USA-NNK data statistics

No of words per headline

7 to 20

No of words per body content

150 to 2100

Table 2: Covid-News-BD-NNK data statistics No of words per headline

10 to 20

No of words per body content

100 to 1500

2.3 Dataset Repository

We used GitHub as our primary data repository in account name NKK^1. Here, we created two repositories USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON format. We are regularly updating the CSV files and regenerating JSON using a py script. We provided a python script file for essential operation. We welcome all outside collaboration to enrich the dataset.

3 Literature Review

Natural Language Processing (NLP) deals with text (also known as categorical) data in computer science, utilizing numerous diverse methods like one-hot encoding, word embedding, etc., that transform text to machine language, which can be fed to multiple machine learning and deep learning algorithms.

Some well-known applications of NLP includes fraud detection on online media sites[ 10 ], using authorship attribution in fallback authentication systems[ 11 ], intelligent conversational agents or chatbots[ 12 ] and machine translations used by Google Translate[ 13 ]. While these are all downstream tasks, several exciting developments have been made in the algorithm solely for Natural Language Processing tasks. The two most trending ones are BERT[ 14 ], which uses bidirectional encoder-decoder architecture to create the transformer model, that can do near-perfect classification tasks and next-word predictions for next generations, and GPT-3 models released by OpenAI[ 15 ] that can generate texts almost human-like. However, these are all pre-trained models since they carry huge computation cost. Information Extraction is a generalized concept of retrieving information from a dataset. Information extraction from an image could be retrieving vital feature spaces or targeted portions of an image; information extraction from speech could be retrieving information about names, places, etc[ 16 ]. Information extraction in texts could be identifying named entities and locations or essential data. Topic modeling is a sub-task of NLP and also a process of information extraction. It clusters words and phrases of the same context together into groups. Topic modeling is an unsupervised learning method that gives us a brief idea about a set of text. One commonly used topic modeling is Latent Dirichlet Allocation or LDA[17].

Keyword extraction is a process of information extraction and sub-task of NLP to extract essential words and phrases from a text. TextRank [ 18 ] is an efficient keyword extraction technique that uses graphs to calculate the weight of each word and pick the words with more weight to it.

Word clouds are a great visualization technique to understand the overall ’talk of the topic’. The clustered words give us a quick understanding of the content.

4 Our experiments and Result analysis

We used the wordcloud library^4 to create the word clouds. Figure 1 and 3 presents the word cloud of Covid-News-USA- NNK dataset by month from February to May. From the figures 1,2,3, we can point few information:

In February, both the news paper have talked about China and source of the outbreak.

StarTribune emphasized on Minnesota as the most concerned state. In April, it seemed to have been concerned more.

Both the newspaper talked about the virus impacting the economy, i.e, bank, elections, administrations, markets.

Washington Post discussed global issues more than StarTribune.

StarTribune in February mentioned the first precautionary measurement: wearing masks, and the uncontrollable spread of the virus throughout the nation.

While both the newspaper mentioned the outbreak in China in February, the weight of the spread in the United States are more highlighted through out March till May, displaying the critical impact caused by the virus.

We used a script to extract all numbers related to certain keywords like ’Deaths’, ’Infected’, ’Died’ , ’Infections’, ’Quarantined’, Lock-down’, ’Diagnosed’ etc from the news reports and created a number of cases for both the newspaper. Figure 4 shows the statistics of this series. From this extraction technique, we can observe that April was the peak month for the covid cases as it gradually rose from February. Both the newspaper clearly shows us that the rise in covid cases from February to March was slower than the rise from March to April. This is an important indicator of possible recklessness in preparations to battle the virus. However, the steep fall from April to May also shows the positive response against the attack. We used Vader Sentiment Analysis to extract sentiment of the headlines and the body. On average, the sentiments were from -0.5 to -0.9. Vader Sentiment scale ranges from -1(highly negative to 1(highly positive). There were some cases

where the sentiment scores of the headline and body contradicted each other,i.e., the sentiment of the headline was negative but the sentiment of the body was slightly positive. Overall, sentiment analysis can assist us sort the most concerning (most negative) news from the positive ones, from which we can learn more about the indicators related to COVID-19 and the serious impact caused by it. Moreover, sentiment analysis can also provide us information about how a state or country is reacting to the pandemic. We used PageRank algorithm to extract
United States Supreme Court Judicial Database Terms Series
catalog.data.gov
datasets.ai
Updated Mar 12, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bureau of Justice Statistics (2025). United States Supreme Court Judicial Database Terms Series [Dataset]. https://catalog.data.gov/dataset/united-states-supreme-court-judicial-database-terms-series-aea2b
Explore at:
Dataset updated
Mar 12, 2025
Dataset provided by
Bureau of Justice Statisticshttp://bjs.ojp.gov/
Area covered
United States
Description
Investigator(s): Harold J. Spaeth, James L. Gibson, Michigan State University This data collection encompasses all aspects of United States Supreme Court decision-making from the beginning of the Warren Court in 1953 up to the completion of the 1995 term of the Rehnquist Court on July 1, 1996, including any decisions made afterward but before the start of the 1996 term on October 7, 1996. In this collection, distinct aspects of the court's decisions are covered by six types of variables: (1) identification variables including case citation, docket number, unit of analysis, and number of records per unit of analysis, (2) background variables offering information on origin of case, source of case, reason for granting cert, parties to the case, direction of the lower court's decision, and manner in which the Court takes jurisdiction, (3) chronological variables covering date of term of court, chief justice, and natural court, (4) substantive variables including multiple legal provisions, authority for decision, issue, issue areas, and direction of decision, (5) outcome variables supplying information on form of decision, disposition of case, winning party, declaration of unconstitutionality, and multiple memorandum decisions, and (6) voting and opinion variables pertaining to the vote in the case and to the direction of the individual justices' votes.Years Produced: Annually
d
The 101st Fair Trade Commission's case statistics - classified according to...
data.gov.tw
csv
Updated May 17, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fair Trade Commission, EY (2024). The 101st Fair Trade Commission's case statistics - classified according to the type of illegal behavior [Dataset]. https://data.gov.tw/en/datasets/6605
Explore at:
csvAvailable download formats
Dataset updated
May 17, 2024
Dataset authored and provided by
Fair Trade Commission, EY
License
https://data.gov.tw/licensehttps://data.gov.tw/license
Description
This dataset mainly provides statistics on reported cases and fair trade cases investigated by the commission according to its authority, as well as statistics on the patterns of conduct as stated in the disposition.
d
Johns Hopkins COVID-19 Case Tracker
data.world
csv, zip
Updated Mar 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Associated Press (2025). Johns Hopkins COVID-19 Case Tracker [Dataset]. https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker
Explore at:
zip, csvAvailable download formats
Dataset updated
Mar 25, 2025
Authors
The Associated Press
Time period covered
Jan 22, 2020 - Mar 9, 2023
Area covered
Description
Updates

Notice of data discontinuation: Since the start of the pandemic, AP has reported case and death counts from data provided by Johns Hopkins University. Johns Hopkins University has announced that they will stop their daily data collection efforts after March 10. As Johns Hopkins stops providing data, the AP will also stop collecting daily numbers for COVID cases and deaths. The HHS and CDC now collect and visualize key metrics for the pandemic. AP advises using those resources when reporting on the pandemic going forward.

CDC Weekly case and death counts (national and state level)

CDC County level cases and deaths

HHS New hospital admissions

CDC NowCast COVID variant proportions (national and regional level)

April 9, 2020

The population estimate data for New York County, NY has been updated to include all five New York City counties (Kings County, Queens County, Bronx County, Richmond County and New York County). This has been done to match the Johns Hopkins COVID-19 data, which aggregates counts for the five New York City counties to New York County.

April 20, 2020

Johns Hopkins death totals in the US now include confirmed and probable deaths in accordance with CDC guidelines as of April 14. One significant result of this change was an increase of more than 3,700 deaths in the New York City count. This change will likely result in increases for death counts elsewhere as well. The AP does not alter the Johns Hopkins source data, so probable deaths are included in this dataset as well.

April 29, 2020

The AP is now providing timeseries data for counts of COVID-19 cases and deaths. The raw counts are provided here unaltered, along with a population column with Census ACS-5 estimates and calculated daily case and death rates per 100,000 people. Please read the updated caveats section for more information.

September 1st, 2020

Johns Hopkins is now providing counts for the five New York City counties individually.

February 12, 2021

The Ohio Department of Health recently announced that as many as 4,000 COVID-19 deaths may have been underreported through the state’s reporting system, and that the "daily reported death counts will be high for a two to three-day period."

Because deaths data will be anomalous for consecutive days, we have chosen to freeze Ohio's rolling average for daily deaths at the last valid measure until Johns Hopkins is able to back-distribute the data. The raw daily death counts, as reported by Johns Hopkins and including the backlogged death data, will still be present in the new_deaths column.

February 16, 2021

- Johns Hopkins has reconciled Ohio's historical deaths data with the state.

Overview

The AP is using data collected by the Johns Hopkins University Center for Systems Science and Engineering as our source for outbreak caseloads and death counts for the United States and globally.

The Hopkins data is available at the county level in the United States. The AP has paired this data with population figures and county rural/urban designations, and has calculated caseload and death rates per 100,000 people. Be aware that caseloads may reflect the availability of tests -- and the ability to turn around test results quickly -- rather than actual disease spread or true infection rates.

This data is from the Hopkins dashboard that is updated regularly throughout the day. Like all organizations dealing with data, Hopkins is constantly refining and cleaning up their feed, so there may be brief moments where data does not appear correctly. At this link, you’ll find the Hopkins daily data reports, and a clean version of their feed.

The AP is updating this dataset hourly at 45 minutes past the hour.

To learn more about AP's data journalism capabilities for publishers, corporations and financial institutions, go here or email kromano@ap.org.

Queries

Use AP's queries to filter the data or to join to other datasets we've made available to help cover the coronavirus pandemic

Filter cases by state here

Rank states by their status as current hotspots. Calculates the 7-day rolling average of new cases per capita in each state: https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker/workspace/query?queryid=481e82a4-1b2f-41c2-9ea1-d91aa4b3b1ac

Find recent hotspots within your state by running a query to calculate the 7-day rolling average of new cases by capita in each county: https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker/workspace/query?queryid=b566f1db-3231-40fe-8099-311909b7b687&showTemplatePreview=true

Join county-level case data to an earlier dataset released by AP on local hospital capacity here. To find out more about the hospital capacity dataset, see the full details.

Pull the 100 counties with the highest per-capita confirmed cases here

Rank all the counties by the highest per-capita rate of new cases in the past 7 days here. Be aware that because this ranks per-capita caseloads, very small counties may rise to the very top, so take into account raw caseload figures as well.

Interactive

The AP has designed an interactive map to track COVID-19 cases reported by Johns Hopkins.

@(https://datawrapper.dwcdn.net/nRyaf/15/)

Interactive Embed Code

<iframe title="USA counties (2018) choropleth map Mapping COVID-19 cases by county" aria-describedby="" id="datawrapper-chart-nRyaf" src="https://datawrapper.dwcdn.net/nRyaf/10/" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important;" height="400"></iframe><script type="text/javascript">(function() {'use strict';window.addEventListener('message', function(event) {if (typeof event.data['datawrapper-height'] !== 'undefined') {for (var chartId in event.data['datawrapper-height']) {var iframe = document.getElementById('datawrapper-chart-' + chartId) || document.querySelector("iframe[src*='" + chartId + "']");if (!iframe) {continue;}iframe.style.height = event.data['datawrapper-height'][chartId] + 'px';}}});})();</script>

Caveats

This data represents the number of cases and deaths reported by each state and has been collected by Johns Hopkins from a number of sources cited on their website.

In some cases, deaths or cases of people who've crossed state lines -- either to receive treatment or because they became sick and couldn't return home while traveling -- are reported in a state they aren't currently in, because of state reporting rules.

In some states, there are a number of cases not assigned to a specific county -- for those cases, the county name is "unassigned to a single county"

This data should be credited to Johns Hopkins University's COVID-19 tracking project. The AP is simply making it available here for ease of use for reporters and members.

Caseloads may reflect the availability of tests -- and the ability to turn around test results quickly -- rather than actual disease spread or true infection rates.

Population estimates at the county level are drawn from 2014-18 5-year estimates from the American Community Survey.

The Urban/Rural classification scheme is from the Center for Disease Control and Preventions's National Center for Health Statistics. It puts each county into one of six categories -- from Large Central Metro to Non-Core -- according to population and other characteristics. More details about the classifications can be found here.

Johns Hopkins timeseries data - Johns Hopkins pulls data regularly to update their dashboard. Once a day, around 8pm EDT, Johns Hopkins adds the counts for all areas they cover to the timeseries file. These counts are snapshots of the latest cumulative counts provided by the source on that day. This can lead to inconsistencies if a source updates their historical data for accuracy, either increasing or decreasing the latest cumulative count. - Johns Hopkins periodically edits their historical timeseries data for accuracy. They provide a file documenting all errors in their timeseries files that they have identified and fixed here

Attribution

This data should be credited to Johns Hopkins University COVID-19 tracking project
COVID-19 Case Surveillance Public Use Data
healthdata.gov
opendatalab.com
+6more
application/rdfxml +5
Updated Feb 25, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
data.cdc.gov (2021). COVID-19 Case Surveillance Public Use Data [Dataset]. https://healthdata.gov/dataset/COVID-19-Case-Surveillance-Public-Use-Data/knt4-7efa
Explore at:
csv, application/rssxml, application/rdfxml, json, xml, tsvAvailable download formats
Dataset updated
Feb 25, 2021
Dataset provided by
data.cdc.gov
Description
Note: Reporting of new COVID-19 Case Surveillance data will be discontinued July 1, 2024, to align with the process of removing SARS-CoV-2 infections (COVID-19 cases) from the list of nationally notifiable diseases. Although these data will continue to be publicly available, the dataset will no longer be updated.

Authorizations to collect certain public health data expired at the end of the U.S. public health emergency declaration on May 11, 2023. The following jurisdictions discontinued COVID-19 case notifications to CDC: Iowa (11/8/21), Kansas (5/12/23), Kentucky (1/1/24), Louisiana (10/31/23), New Hampshire (5/23/23), and Oklahoma (5/2/23). Please note that these jurisdictions will not routinely send new case data after the dates indicated. As of 7/13/23, case notifications from Oregon will only include pediatric cases resulting in death.

This case surveillance public use dataset has 12 elements for all COVID-19 cases shared with CDC and includes demographics, any exposure history, disease severity indicators and outcomes, presence of any underlying medical conditions and risk behaviors, and no geographic data.

CDC has three COVID-19 case surveillance datasets:
COVID-19 Case Surveillance Public Use Data with Geography: Public use, patient-level dataset with clinical data (including symptoms), demographics, and county and state of residence. (19 data elements)
COVID-19 Case Surveillance Public Use Data: Public use, patient-level dataset with clinical and symptom data and demographics, with no geographic data. (12 data elements)
COVID-19 Case Surveillance Restricted Access Detailed Data: Restricted access, patient-level dataset with clinical and symptom data, demographics, and state and county of residence. Access requires a registration process and a data use agreement. (33 data elements)
The following apply to all three datasets:
Data elements can be found on the COVID-19 case report form located at www.cdc.gov/coronavirus/2019-ncov/downloads/pui-form.pdf.
Data are considered provisional by CDC and are subject to change until the data are reconciled and verified with the state and territorial data providers.
Some data cells are suppressed to protect individual privacy.
The datasets will include all cases with the earliest date available in each record (date received by CDC or date related to illness/specimen collection) at least 14 days prior to the creation of the current datasets. This 14-day lag allows case reporting to be stabilized and ensures that time-dependent outcome data are accurately captured.
Datasets are updated monthly.
Datasets are created using CDC’s Policy on Public Health Research and Nonresearch Data Management and Access and include protections designed to protect individual privacy.
For more information about data collection and reporting, please see https://www.cdc.gov/coronavirus/2019-ncov/covid-data/about-us-cases-deaths.html.
For more information about the COVID-19 case surveillance data, please see https://www.cdc.gov/coronavirus/2019-ncov/covid-data/faq-surveillance.html

Overview

The COVID-19 case surveillance database includes individual-level data reported to U.S. states and aut
Federal Justice Statistics Program: Defendants in Federal Criminal Cases --...
catalog.data.gov
icpsr.umich.edu
Updated Mar 12, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bureau of Justice Statistics (2025). Federal Justice Statistics Program: Defendants in Federal Criminal Cases -- Terminated, 2008 [United States] [Dataset]. https://catalog.data.gov/dataset/federal-justice-statistics-program-defendants-in-federal-criminal-cases-terminated-2008-un
Explore at:
Dataset updated
Mar 12, 2025
Dataset provided by
Bureau of Justice Statisticshttp://bjs.ojp.gov/
Area covered
United States
Description
The data contain records of defendants in federal criminal cases terminated in United States District Court during fiscal year 2008. The data were constructed from the Executive Office for United States Attorneys (EOUSA) Central System file. According to the EOUSA, the United States attorneys conduct approximately 95 percent of the prosecutions handled by the Department of Justice. The Central System data contain variables from the original EOUSA files as well as additional analysis variables, or "SAF" variables, that denote subsets of the data. These SAF variables are related to statistics reported in the Compendium of Federal Justice Statistics. Variables containing identifying information (e.g., name, Social Security Number) were replaced with blanks, and the day portions of date fields were also sanitized in order to protect the identities of individuals. These data are part of a series designed by the Urban Institute (Washington, DC) and the Bureau of Justice Statistics. Data and documentation were prepared by the Urban Institute.
d
Food poisoning cases statistics of eating places cases
data.gov.tw
csv, json, xml
Updated May 20, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Food and Drug Administration (2024). Food poisoning cases statistics of eating places cases [Dataset]. https://data.gov.tw/en/datasets/9838
Explore at:
csv, json, xmlAvailable download formats
Dataset updated
May 20, 2024
Dataset authored and provided by
Food and Drug Administration
License
https://data.gov.tw/licensehttps://data.gov.tw/license
Description
This dataset provides statistics on the number of food poisoning cases at food consumption locations after 1981, for the use of the general public, businesses, academic institutions, etc.
Data from: Water-quality data imputation with a high percentage of missing...
zenodo.org
data.niaid.nih.gov
csv
Updated Jun 8, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rafael Rodríguez; Rafael Rodríguez; Marcos Pastorini; Marcos Pastorini; Lorena Etcheverry; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Alberto Castro; Angela Gorgoglione; Angela Gorgoglione; Christian Chreties; Mónica Fossati (2021). Water-quality data imputation with a high percentage of missing values: a machine learning approach [Dataset]. http://doi.org/10.5281/zenodo.4731169
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4731169
Dataset updated
Jun 8, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Rafael Rodríguez; Rafael Rodríguez; Marcos Pastorini; Marcos Pastorini; Lorena Etcheverry; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Alberto Castro; Angela Gorgoglione; Angela Gorgoglione; Christian Chreties; Mónica Fossati
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.

This work focuses on surface-water-quality data from Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raises significant challenges.

To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)).

IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.

In this dataset, we include the original and imputed values for the following variables:

Water temperature (Tw)

Dissolved oxygen (DO)

Electrical conductivity (EC)

pH

Turbidity (Turb)

Nitrite (NO2-)

Nitrate (NO3-)

Total Nitrogen (TN)

Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].

More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.

If you use this dataset in your work, please cite our paper:
Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318
d
Chiayi City Statistical Database - Theft Case Statistics
data.gov.tw
csv
Updated Feb 20, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Chiayi City Government (2017). Chiayi City Statistical Database - Theft Case Statistics [Dataset]. https://data.gov.tw/en/datasets/52332
Explore at:
csvAvailable download formats
Dataset updated
Feb 20, 2017
Dataset authored and provided by
Chiayi City Government
License
https://data.gov.tw/licensehttps://data.gov.tw/license
Area covered
Chiayi City
Description
This is statistical data on theft cases from the "Chiayi City Statistics Database" query system of the Directorate-General of Budget, Accounting and Statistics, with monthly statistical indicators starting from 2014.
Simulated dataset for I = 0.48% and RR = 6
figshare.com
zip
Updated Jan 19, 2016
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aline Guttmann (2016). Simulated dataset for I = 0.48% and RR = 6 [Dataset]. http://doi.org/10.6084/m9.figshare.1308517.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.1308517.v1
Dataset updated
Jan 19, 2016
Dataset provided by
figshare
Authors
Aline Guttmann
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A collection of 221 datasets in R format (rda), each corresponding to 1000 simulations of one cluster with a relative risk of 6 for a base incidence of 0.48 % births per year. Each dataset is a table of 221 000 rows and 6 columns.The rows contain: -the coordinates (longitude and latitude) of a SU, the observed number of cases, -the size of the at-risk population (i.e., the number of live births), -the expected number of cases in the specified SU assuming an inhomogeneous Poisson process for the cases distribution and -an indicator for the simulation ranging from 1 to 1000.
Federal Justice Statistics Program: Defendants in Federal Criminal Cases in...
catalog.data.gov
datadiscoverystudio.org
+2more
Updated Mar 12, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bureau of Justice Statistics (2025). Federal Justice Statistics Program: Defendants in Federal Criminal Cases in District Court -- Pending, 1996 [United States] [Dataset]. https://catalog.data.gov/dataset/federal-justice-statistics-program-defendants-in-federal-criminal-cases-in-district-court--2d168
Explore at:
Dataset updated
Mar 12, 2025
Dataset provided by
Bureau of Justice Statisticshttp://bjs.ojp.gov/
Area covered
United States
Description
The data contain records of defendants in criminal cases filed in United States District Court before or during fiscal year 1996 and still pending as of year-end. The data were constructed from the Administrative Office of the United States District Courts' (AOUSC) criminal file. Defendants in criminal cases may be either individuals or corporations. There is one record for each defendant in each case filed. Included in the records are data from court proceedings and offense codes for up to five offenses charged at the time the case was filed. (The most serious charge at termination may differ from the most serious charge at case filing, due to plea bargaining or action of the judge or jury.) In a case with multiple charges against the defendant, a "most serious" offense charge is determined by a hierarchy of offenses based on statutory maximum penalties associated with the charges. The data file contains variables from the original AOUSC files as well as additional analysis variables, or "SAF" variables, that denote subsets of the data. These SAF variables are related to statistics reported in the Compendium of Federal Justice Statistics, Tables 4.1-4.5 and 5.1-5.6. Variables containing identifying information (e.g., name, Social Security number) were replaced with blanks, and the day portions of date fields were also sanitized in order to protect the identities of individuals. These data are part of a series designed by the Urban Institute (Washington, DC) and the Bureau of Justice Statistics. Data and documentation were prepared by the Urban Institute.
Z
Statistics Corona
data.niaid.nih.gov
explore.openaire.eu
+1more
Updated Apr 13, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statistics Corona [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4682076
Explore at:
Dataset updated
Apr 13, 2021
Dataset authored and provided by
Daniel Orbegoso
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In this dataset we can find information related to the population of all the countries listed in the website Worldometers. The dataset is composed, among others, with information like Country, Total Cases, New Cases or TotalDeaths. The dataset was created with the idea to implement it in any project where this information could help to fight against Covid-19.
s
COVID-19 cases in Pacific Island Countries and Territories
pacific-data.sprep.org
pacificdata.org
+1more
Updated Mar 26, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
SPC (2025). COVID-19 cases in Pacific Island Countries and Territories [Dataset]. https://pacific-data.sprep.org/dataset/covid-19-cases-pacific-island-countries-and-territories
Explore at:
application/vnd.sdmx.data+csv; labels=name; version=2; charset=utf-8Available download formats
Dataset updated
Mar 26, 2025
Dataset provided by
Pacific Data Hub
Authors
SPC
License
Public Domain Mark 1.0https://creativecommons.org/publicdomain/mark/1.0/
License information was derived automatically
Area covered
[145.2148808223921, 15.726687346843647], [135.55075917519616, [210.56388888886144, [165.50902632210722, -19.123740708466755], [154.5786207439524, 0.647280195324868], -7.496644800517004], -10.111897222222069], Tokelau, Palau, Niue, Marshall Islands, Micronesia, Nauru, Samoa, Kiribati
Description
Statistics from SPC's Public Health Division (PHD) on the number of cases of COVID-19 and the number of deaths attributed to COVID-19 in Pacific Island Countries and Territories.

Find more Pacific data on PDH.stat.
A
Federal Justice Statistics Program: Defendants in Federal Criminal Cases...
data.amerigeoss.org
icpsr.umich.edu
+3more
html
Updated Jul 28, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
United States[old] (2019). Federal Justice Statistics Program: Defendants in Federal Criminal Cases Filed in District Court, 2005 [United States] [Dataset]. https://data.amerigeoss.org/th/dataset/federal-justice-statistics-program-defendants-in-federal-criminal-cases-filed-in-district-4227d
Explore at:
htmlAvailable download formats
Dataset updated
Jul 28, 2019
Dataset provided by
United States[old]
Area covered
United States
Description
The data contain records of defendants in criminal cases filed in United States District Court during fiscal year 2005. The data were constructed from the Administrative Office of the United States District Courts' (AOUSC) criminal file. Defendants in cri
Federal Justice Statistics Program: Criminal Appeals Cases Filed in Courts...
catalog.data.gov
icpsr.umich.edu
+1more
Updated Mar 12, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bureau of Justice Statistics (2025). Federal Justice Statistics Program: Criminal Appeals Cases Filed in Courts of Appeals, 1997 [United States] [Dataset]. https://catalog.data.gov/dataset/federal-justice-statistics-program-criminal-appeals-cases-filed-in-courts-of-appeals-1997-
Explore at:
Dataset updated
Mar 12, 2025
Dataset provided by
Bureau of Justice Statisticshttp://bjs.ojp.gov/
Area covered
United States
Description
The data contain records of criminal appeals cases filed in United States Courts of Appeals during fiscal year 1997. The data were constructed from the Administrative Office of the United States Courts' (AOUSC) Court of Appeals file. These contain variables on the nature of the criminal appeal, the underlying offense, and the disposition of the appeal. An appeal can be filed by the government or the offender, and the appellant can appeal the sentence, the verdict, or both sentence and verdict. The data file contains variables from the original AOUSC files as well as additional analysis variables, or "SAF" variables, that denote subsets of the data. These SAF variables are related to statistics reported in the Compendium of Federal Justice Statistics, Tables 6.1-6.5. Variables containing information (e.g., name, Social Security number) were replaced with blanks, and the day portions of date fields were also sanitized in order to protect the identities of individuals. These data are part of a series designed by the Urban Institute (Washington, DC) and the Bureau of Justice Statistics. Data and documentation were prepared by the Urban Institute.
Salary Prediction: Based on years of experience
kaggle.com
Updated Jan 31, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Adil Shamim (2025). Salary Prediction: Based on years of experience [Dataset]. http://doi.org/10.34740/kaggle/dsv/10626597
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/10626597
Dataset updated
Jan 31, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Adil Shamim
Description
About the Dataset

Dataset Name: Salary Prediction Dataset

File Format: CSV

Rows: 100,000

Columns: 4

Overview

This dataset provides salary data based on years of experience, education level, and job role. It can be used for salary prediction models, regression analysis, and workforce analytics. The dataset includes realistic salary variations based on industry trends.

Columns Description

YearsExperience (float) – Number of years of experience (0 to 40 years).

Education Level (string) – The highest level of education attained. Categories include:

High School

Associate Degree

Bachelor's

Master's

PhD

Job Role (string) – Common job titles in the industry:

Software Engineer

Data Scientist

Product Manager

Marketing Specialist

Business Analyst

Salary (float) – The estimated annual salary in USD. The salary is influenced by experience, education level, and job role.

Potential Use Cases

Salary Prediction Models – Train regression models to predict salaries based on experience and qualifications.

Data Science & Machine Learning – Use this dataset for exploratory data analysis and feature engineering.

Workforce Analysis – Analyze salary trends across job roles and experience levels.

How the Data Was Generated

The dataset was synthetically generated using a linear regression-based formula with added randomness and scaling factors based on job roles and education levels. While not real-world data, it closely mimics actual salary distributions in the tech and business industries.

Acknowledgments

This dataset is designed for research, learning, and data science practice. It is not collected from real-world surveys but follows statistical patterns observed in salary data.
C
Public Health Statistics - Tuberculosis cases and average annual incidence...
data.cityofchicago.org
gimi9.com
+1more
application/rdfxml +5
Updated Apr 11, 2014
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Public Health (2014). Public Health Statistics - Tuberculosis cases and average annual incidence rate, Chicago, 2007-2011 - Historical [Dataset]. https://data.cityofchicago.org/w/ndk3-zftj/3q3f-6823?cur=X4S2RFBze7A&from=pkiRxBk-dLf
Explore at:
csv, xml, application/rssxml, tsv, application/rdfxml, jsonAvailable download formats
Dataset updated
Apr 11, 2014
Dataset authored and provided by
Public Health
Area covered
Chicago
Description
Note: This dataset is historical only and there are not corresponding datasets for more recent time periods. For that more-recent information, please visit the Chicago Health Atlas at https://chicagohealthatlas.org.

The annual number of new cases of tuberculosis and average annual tuberculosis incidence rate (new cases per 100,000 residents) with corresponding 95% confidence intervals, by Chicago community area, for the years 2007 – 2011. See the full description at https://data.cityofchicago.org/api/assets/E0205898-C378-4299-97C1-F9F89AAF603C.

Facebook

Twitter

Click to copy link

Link copied

Cite

Bureau of Justice Statistics (2025). Federal Court Cases: Integrated Database Series [Dataset]. https://catalog.data.gov/dataset/federal-court-cases-integrated-database-series-34e8a

Federal Court Cases: Integrated Database Series

Explore at:

16 scholarly articles cite this dataset (View in Google Scholar)

Dataset updated

Mar 12, 2025

Dataset provided by

Bureau of Justice Statisticshttp://bjs.ojp.gov/

Description

Investigator(s): Federal Judicial Center The purpose of this data collection is to provide an official public record of the business of the federal courts. The data originate from 100 court offices throughout the United States. Information was obtained at two points in the life of a case: filing and termination. The termination data contain information on both filing and terminations, while the pending data contain only filing information. For the appellate and civil data, the unit of analysis is a single case. The unit of analysis for the criminal data is a single defendant.Years Produced: Updated bi-annually with annual data.

Clear search

Close search

Google apps

Main menu

Federal Court Cases: Integrated Database Series

Data_Sheet_1_“R” U ready?: a case study using R to analyze changes in gene...

Coronavirus (Covid-19) Data in the United States

INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET

United States Supreme Court Judicial Database Terms Series

The 101st Fair Trade Commission's case statistics - classified according to...

Johns Hopkins COVID-19 Case Tracker

Updates

- Johns Hopkins has reconciled Ohio's historical deaths data with the state.

Overview

Queries

Interactive

Interactive Embed Code

Caveats

Attribution

COVID-19 Case Surveillance Public Use Data

CDC has three COVID-19 case surveillance datasets:

Overview

Federal Justice Statistics Program: Defendants in Federal Criminal Cases --...

Food poisoning cases statistics of eating places cases

Data from: Water-quality data imputation with a high percentage of missing...

Chiayi City Statistical Database - Theft Case Statistics

Simulated dataset for I = 0.48% and RR = 6

Federal Justice Statistics Program: Defendants in Federal Criminal Cases in...

Statistics Corona

COVID-19 cases in Pacific Island Countries and Territories

Federal Justice Statistics Program: Defendants in Federal Criminal Cases...

Federal Justice Statistics Program: Criminal Appeals Cases Filed in Courts...

Salary Prediction: Based on years of experience

About the Dataset

Dataset Name: Salary Prediction Dataset

File Format: CSV

Rows: 100,000

Columns: 4

Overview

Columns Description

Potential Use Cases

How the Data Was Generated

Acknowledgments

Public Health Statistics - Tuberculosis cases and average annual incidence...

Federal Court Cases: Integrated Database SeriesSee More Versions

Federal Court Cases: Integrated Database Series