import React from 'react';
import { Bullets } from '@calico/calico-ui-editorial';
import { LinkExternal } from '@calico/calico-ui-kit';

const BASE_URL = 'https://storage.googleapis.com/calico-website-pin-public-bucket/datasets';

export default [
  {
    format: 'Zip',
    name: 'Gene expression data in wide format',
    size: '6.6 MB',
    src: `${BASE_URL}/idea_wide_format_data.zip`,
    title: 'Wide data',
    content: (
      <div className="max-line-length">
        <p>Each row contains a single gene, and each column is a timepoint of a specific perturbation experiment. We provide the data as a single text file and as a CDT file compatible with <LinkExternal to="http://jtreeview.sourceforge.net/" title="Java TreeView" /> for visualization. In Java TreeView, gene names can be linked to the <LinkExternal to="https://www.yeastgenome.org/" title="Saccharomyces Genome Database (SGD)" /> by changing the following setting:</p>
        <ol>
          <li>Choose <strong>Settings &gt; URL settings</strong></li>
          <li style={{ wordBreak: 'break-word' }}>Use http://www.yeastgenome.org/locus/HEADER/overview</li>
        </ol>
        <p>Includes the following files:</p>
        <Bullets
          bullets={[
            {
              title: <strong>idea_wide_format_data.txt</strong>,
              content: 'Gene expression data in text format.',
            },
            {
              title: <strong>idea_wide_format_data.cdt</strong>,
              content: 'Gene expression data for TreeView.',
            },
            {
              title: <strong>idea_wide_format_data.gtr</strong>,
              content: 'Clustergram for TreeView.',
            },
          ]}
          className="grey"
        />
      </div>
    ),
  },
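/*
The URL setting described in the "Wide data" entry can be sketched as a simple template substitution. This is only an illustration and assumes that Java TreeView replaces the HEADER placeholder with the gene name of the selected row; `SGD_URL_TEMPLATE`, `sgdUrlForGene`, and the gene name `YFG1` are hypothetical names introduced here.

```javascript
// URL template from the Java TreeView setting described above.
const SGD_URL_TEMPLATE = 'http://www.yeastgenome.org/locus/HEADER/overview';

// Substitute a gene name for the HEADER placeholder (assumed behavior).
function sgdUrlForGene(geneName) {
  return SGD_URL_TEMPLATE.replace('HEADER', encodeURIComponent(geneName));
}

// 'YFG1' is a placeholder gene name used only for illustration.
const exampleUrl = sgdUrlForGene('YFG1');
```
*/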
  {
    format: 'Zip of tab-separated values (TSV)',
    name: 'Raw & processed gene expression data',
    size: '351 MB',
    src: `${BASE_URL}/idea_tall_expression_data.zip`,
    title: 'Gene expression data',
    content: (
      <div className="max-line-length">
        <p>For each of 10.4M gene expression observations, raw and progressively processed forms of the data are provided. The data are presented as a single tall-format table, with one observation per row.</p>
        <p>Variables are:</p>
        <ul>
          <li><span className="monospace">TF</span> - induced transcriptional regulator</li>
          <li><span className="monospace">strain</span> - strain name</li>
          <li><span className="monospace">date</span> - date the experiment was performed</li>
          <li><span className="monospace">restriction</span> - nutrient limitation</li>
          <li><span className="monospace">mechanism</span> - induction system (GEV or ZEV)</li>
          <li><span className="monospace">time</span> - time point (minutes)</li>
          <li><span className="monospace">GeneName</span> - gene name</li>
          <li><span className="monospace">green_median</span> - median of green (reference) channel fluorescence</li>
          <li><span className="monospace">red_median</span> - median of red (experimental) channel fluorescence</li>
          <li><span className="monospace">log2_ratio</span> - log2(red / green), with the value at time zero subtracted</li>
          <li><span className="monospace">log2_cleaned_ratio</span> - log2 ratio with the non-specific stress response and prominent outliers removed</li>
          <li><span className="monospace">log2_noise_model</span> - estimated noise standard deviation</li>
          <li><span className="monospace">log2_cleaned_ratio_zth2d</span> - cleaned timecourses hard-thresholded based on multiple observations (or last observation) passing the noise model</li>
          <li><span className="monospace">log2_selected_timecourses</span> - cleaned timecourses hard-thresholded based on single observations passing noise model and impulse evaluation of biological feasibility</li>
          <li><span className="monospace">log2_shrunken_timecourses</span> - selected timecourses with observation-level shrinkage based on local FDR (false discovery rate). <strong>Most users of the data will want to use this column.</strong></li>
        </ul>

      </div>
    ),
  },
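/*
As a rough sketch of how the tall-format table in the "Gene expression data" entry might be consumed, the snippet below parses tab-separated text and extracts one gene's `log2_shrunken_timecourses` values for one induced regulator. The helper names (`parseTsv`, `shrunkenTimecourse`) and the sample rows are hypothetical and are not values from the dataset. A real analysis would probably also group by `strain`, `date`, `restriction`, and `mechanism`, and would stream the file rather than hold all 10.4M rows in memory.

```javascript
// Parse tab-separated text with a header row into an array of row objects.
function parseTsv(text) {
  const lines = text.split(/\r?\n/).filter((line) => line.length > 0);
  const header = lines[0].split('\t');
  return lines.slice(1).map((line) => {
    const cells = line.split('\t');
    return Object.fromEntries(header.map((name, i) => [name, cells[i]]));
  });
}

// Collect one gene's shrunken timecourse for one induced regulator,
// sorted by time point (minutes).
function shrunkenTimecourse(rows, tf, gene) {
  return rows
    .filter((row) => row.TF === tf && row.GeneName === gene)
    .map((row) => ({
      time: Number(row.time),
      value: Number(row.log2_shrunken_timecourses),
    }))
    .sort((a, b) => a.time - b.time);
}

// Made-up placeholder rows for illustration only, NOT real dataset values.
const sampleTsv = [
  'TF\tGeneName\ttime\tlog2_shrunken_timecourses',
  'TF_A\tGENE_1\t15\t0.5',
  'TF_A\tGENE_1\t0\t0',
  'TF_A\tGENE_2\t15\t-1.25',
].join('\n');

const sampleRows = parseTsv(sampleTsv);
const sampleCourse = shrunkenTimecourse(sampleRows, 'TF_A', 'GENE_1');
```
*/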
  {
    format: 'Zip of tab-separated values (TSV)',
    name: 'Kinetic fitting',
    size: '3.4 MB',
    src: `${BASE_URL}/idea_kinetics.zip`,
    title: 'Kinetic fitting data',
    content: (
      <div className="max-line-length">
        <p>Kinetic parameters from sigmoidal and impulse-like parametric fits, as described in the paper and on <LinkExternal to="https://www.github.com/calico/impulse" title="GitHub" />. The best-fitting parametric model is reported for each of 100,036 timecourses with timecourse-level signal. Sigmoidal models are defined by:</p>
        <ul>
          <li><span className="monospace">t_rise</span> - half-max time</li>
          <li><span className="monospace">v_inter</span> - asymptote</li>
          <li><span className="monospace">rate</span> - rate (steepness)</li>
        </ul>
        <p>Impulse models additionally have a second sigmoidal response defined by:</p>
        <ul>
          <li><span className="monospace">t_fall</span> - half-max time</li>
          <li><span className="monospace">v_final</span> - asymptote</li>
        </ul>
        <p>These parameters are shown in this schematic:</p>
        <img
          alt="Schematic of the sigmoid and impulse kinetic parameters"
          className="figure max-line-length"
          src="./content/kinetics_schema.png"
        />
      </div>
    ),
  },
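/*
The parameters listed in the "Kinetic fitting data" entry can be illustrated with a minimal sketch of the two curve shapes. The functional form below is an assumption: a common double-sigmoid parameterization in which the rise and the fall share the same `rate`. The paper and the impulse repository remain the reference for the exact definition. The function names and the numeric parameter values are arbitrary and chosen only for illustration; parameter names mirror the column names.

```javascript
// Standard logistic function.
function logistic(x) {
  return 1 / (1 + Math.exp(-x));
}

// Sigmoid: rises from 0 to v_inter, reaching half of v_inter at t_rise.
function sigmoid(t, { t_rise, v_inter, rate }) {
  return v_inter * logistic(rate * (t - t_rise));
}

// Impulse: a rise toward v_inter followed by a second sigmoidal
// transition toward v_final, centered at t_fall (assumed form).
function impulse(t, { t_rise, t_fall, v_inter, v_final, rate }) {
  const rise = logistic(rate * (t - t_rise));
  const fall = logistic(-rate * (t - t_fall));
  return rise * (v_final + (v_inter - v_final) * fall);
}

// Arbitrary illustrative parameters, not fitted values from the dataset.
const exampleParams = { t_rise: 10, t_fall: 60, v_inter: 2, v_final: 0.5, rate: 1 };
const halfMax = sigmoid(exampleParams.t_rise, exampleParams);
const plateau = impulse(35, exampleParams);
const settled = impulse(200, exampleParams);
```

With these values, `halfMax` is half of `v_inter`, `plateau` is close to `v_inter`, and `settled` is close to `v_final`.
*/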
  {
    format: 'Zip',
    name: 'Motifs',
    size: '14 KB',
    src: `${BASE_URL}/idea_motif_data.zip`,
    title: 'Motif data',
    content: (
      <div className="max-line-length">
        <p>We used DREME to search for short motifs that are enriched in the promoters of genes differentially expressed in response to a given TF, relative to invariant genes. Motifs were matched to previously identified transcription factor motifs from YEASTRACT using TOMTOM. Regression was used to assess whether motifs are selectively enriched by timing, magnitude, or direction of effect. Each kinetic coefficient was regressed on three gene-level motif summaries:</p>
        <ul>
          <li>a binary indicator of whether the motif is present</li>
          <li>the number of motif occurrences found</li>
          <li>the PWM score of the best match</li>
        </ul>
        <p>Statistical summaries of these three terms are provided, with effect sizes for significant results. Significance is coded as:</p>
        <ul>
          <li>q &gt; 0.1 n.s. (not significant)</li>
          <li>q &lt; 0.1 *</li>
          <li>q &lt; 0.01 **</li>
          <li>q &lt; 0.001 ***</li>
        </ul>
        <p>Includes the following files:</p>
        <Bullets
          bullets={[
            {
              title: <strong>idea_motif_summary_table.tsv</strong>,
              content: 'Motifs associated with kinetic parameters.',
            },
            {
              title: <strong>idea_kegg_nucleotide_codes.txt</strong>,
              content: 'Nucleotide readme file for interpreting motifs.',
            },
          ]}
        />
      </div>
    ),
  },
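/*
The significance coding listed in the "Motif data" entry can be expressed as a small lookup. This is a sketch only: `significanceLabel` is a hypothetical helper, and treating a q-value of exactly 0.1 as not significant is an assumption, since the listed thresholds leave that boundary unspecified.

```javascript
// Map a q-value to the significance code used in the motif summary.
function significanceLabel(q) {
  if (q < 0.001) return '***';
  if (q < 0.01) return '**';
  if (q < 0.1) return '*';
  // Assumption: q equal to 0.1 is grouped with the non-significant results.
  return 'n.s.';
}

const exampleLabels = [0.5, 0.05, 0.005, 0.0005].map(significanceLabel);
```
*/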
  {
    format: 'Zip',
    name: 'Dynamical systems model',
    size: '4.3 MB',
    src: `${BASE_URL}/idea_model_data.zip`,
    title: 'Model data',
    content: (
      <div className="max-line-length">
        <p>Whole-cell modeling summaries. Coefficient estimation is described in the <LinkExternal to="https://github.com/google-research/google-research/tree/master/yeast_transcription_network" title="Google Research GitHub" /> repository. Contains the following files:</p>
        <Bullets
          bullets={[
            {
              title: <strong>idea_ode_coefficients.tsv</strong>,
              content: 'Cause-effect regression coefficients derived from the whole-cell model.',
            },
            {
              title: <strong>idea_attributed_drivers.tsv</strong>,
              content: 'For individual regulatory responses (rises and falls), the transitions that are fit well by the ODE model are separated into the marginal contributions of each regulator.',
            },
          ]}
        />
      </div>
    ),
  },
  {
    format: 'Cytoscape',
    name: 'Regulatory interaction network visualization for Cytoscape',
    size: '<1 MB',
    src: `${BASE_URL}/idea_regulatory_interaction_network.cys`,
    title: 'Regulatory interaction network',
    content: (
      <div className="max-line-length">
        <p>Synthesis of predicted regulatory interaction networks, including known regulators, predicted regulators, and GO terms. Direct regulation between genes is defined based on causal attribution analysis, and indirect regulation of an induced gene is defined whenever a gene is differentially expressed, regardless of whether attribution analysis indicated a direct regulatory relationship.</p>
        <p>Edges for both genes with induction experiments and predicted regulators were formed from regulatory interactions predicted in individual experiments. Predicted regulators are linked to GO categories when their predicted targets overlap significantly with those categories. Similarly, genes with an induction experiment in our dataset are linked to GO categories based on the overlap of either their direct or indirect targets with those categories.</p>
      </div>
    ),
  },
];
