datamodules.DivaHisDB package

Subpackages

Submodules

datamodules.DivaHisDB.datamodule_cropped module

class DivaHisDBDataModuleCropped(data_dir: str, data_folder_name: str, gt_folder_name: str, train_folder_name: str = 'train', val_folder_name: str = 'val', test_folder_name: str = 'test', selection_train: Optional[Union[int, List[str]]] = None, selection_val: Optional[Union[int, List[str]]] = None, selection_test: Optional[Union[int, List[str]]] = None, crop_size: int = 256, num_workers: int = 4, batch_size: int = 8, shuffle: bool = True, drop_last: bool = True)[source]

Bases: AbstractDatamodule

DataModule for the `DivaHisDB dataset <https://ieeexplore.ieee.org/abstract/document/7814109>`_ or a similar dataset with the same folder structure and ground truth encoding.

The ground truth encoding is as follows:

Red = 0 everywhere (except boundaries); Green = 0 everywhere.

Blue channel (one bit flag per class):

  • Blue = 0b00…1000 = 0x000008: main text body

  • Blue = 0b00…0100 = 0x000004: decoration

  • Blue = 0b00…0010 = 0x000002: comment

  • Blue = 0b00…0001 = 0x000001: background (out of page)

Overlapping classes are encoded by OR-ing the flags:

  • Blue = 0b…1000 | 0b…0010 = 0b…1010 = 0x00000A: main text body + comment

  • Blue = 0b…1000 | 0b…0100 = 0b…1100 = 0x00000C: main text body + decoration

  • Blue = 0b…0010 | 0b…0100 = 0b…0110 = 0x000006: comment + decoration
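
Because the classes are stored as bit flags in the blue channel, individual class masks can be recovered with bitwise AND operations. The following is a minimal sketch (not part of the datamodule API; the function name and NumPy usage are illustrative) that decodes a ground-truth image according to the encoding above:

import numpy as np

# Bit flags of the blue channel (values from the encoding above)
BACKGROUND = 0x1   # out of page
COMMENT = 0x2
DECORATION = 0x4
MAIN_TEXT = 0x8

def decode_gt(gt_rgb: np.ndarray) -> dict:
    """Return one boolean mask per class from an RGB ground-truth image (H x W x 3)."""
    blue = gt_rgb[..., 2].astype(np.uint8)
    return {
        "background": (blue & BACKGROUND) > 0,
        "comment": (blue & COMMENT) > 0,
        "decoration": (blue & DECORATION) > 0,
        "main_text": (blue & MAIN_TEXT) > 0,
    }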

The structure of the folder should be as follows:

data_dir
├── train_folder_name
│   ├── data_folder_name
│   │   ├── original_image_name_1
│   │   │   ├── image_crop_1.png
│   │   │   ├── ...
│   │   │   └── image_crop_N.png
│   │   └── original_image_name_N
│   │       ├── image_crop_1.png
│   │       ├── ...
│   │       └── image_crop_N.png
│   └── gt_folder_name
│       ├── original_image_name_1
│       │   ├── image_crop_1.png
│       │   ├── ...
│       │   └── image_crop_N.png
│       └── original_image_name_N
│           ├── image_crop_1.png
│           ├── ...
│           └── image_crop_N.png
├── val_folder_name
│   ├── data_folder_name
│   │   ├── original_image_name_1
│   │   │   ├── image_crop_1.png
│   │   │   ├── ...
│   │   │   └── image_crop_N.png
│   │   └── original_image_name_N
│   │       ├── image_crop_1.png
│   │       ├── ...
│   │       └── image_crop_N.png
│   └── gt_folder_name
│       ├── original_image_name_1
│       │   ├── image_crop_1.png
│       │   ├── ...
│       │   └── image_crop_N.png
│       └── original_image_name_N
│           ├── image_crop_1.png
│           ├── ...
│           └── image_crop_N.png
└── test_folder_name
    ├── data_folder_name
    │   ├── original_image_name_1
    │   │   ├── image_crop_1.png
    │   │   ├── ...
    │   │   └── image_crop_N.png
    │   └── original_image_name_N
    │       ├── image_crop_1.png
    │       ├── ...
    │       └── image_crop_N.png
    └── gt_folder_name
        ├── original_image_name_1
        │   ├── image_crop_1.png
        │   ├── ...
        │   └── image_crop_N.png
        └── original_image_name_N
            ├── image_crop_1.png
            ├── ...
            └── image_crop_N.png
Parameters:
  • data_dir (str) – path to the data directory

  • data_folder_name (str) – name of the folder containing the images

  • gt_folder_name (str) – name of the folder containing the ground truth

  • train_folder_name (str) – name of the folder containing the training data

  • val_folder_name (str) – name of the folder containing the validation data

  • test_folder_name (str) – name of the folder containing the test data

  • selection_train (Union[int, List[str], None]) – optional subset of the training data, given either as a number of images or as a list of image names

  • selection_val (Union[int, List[str], None]) – optional subset of the validation data, given either as a number of images or as a list of image names

  • selection_test (Union[int, List[str], None]) – optional subset of the test data, given either as a number of images or as a list of image names

  • crop_size (int) – size of the crops

  • num_workers (int) – number of workers for the dataloaders

  • batch_size (int) – batch size

  • shuffle (bool) – whether to shuffle the training data

  • drop_last (bool) – drop the last batch if it is smaller than the batch size
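
A minimal usage sketch, assuming the cropped dataset lives under /data/HisDB-CB55 and that the image and ground-truth crops are stored in folders named 'data' and 'gt' (all paths and folder names below are placeholders for your own setup):

dm = DivaHisDBDataModuleCropped(
    data_dir="/data/HisDB-CB55",   # root folder containing train/val/test
    data_folder_name="data",       # folder with the image crops
    gt_folder_name="gt",           # folder with the ground-truth crops
    crop_size=256,
    batch_size=8,
    num_workers=4,
)
dm.setup(stage="fit")              # normally called by the Trainer
train_loader = dm.train_dataloader()

In a typical run the Lightning Trainer calls setup() and requests the dataloaders itself; calling them manually as above is only for illustration.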

get_img_name_coordinates(index) → Tuple[Path, Path, str, str, Tuple[int, int]][source]

Returns the original filename of the crop and its coordinates based on the index. Use this only during testing.

Parameters:

index (int) – index of the crop

Returns:

filename, x, y
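
A hedged sketch of how this can be used during testing, e.g. to reassemble full-page predictions from crops. The unpacking below assumes the tuple elements follow the return annotation above (image path, ground-truth path, image name, crop name, (x, y) coordinates); dm stands for an already constructed DivaHisDBDataModuleCropped:

dm.setup(stage="test")
# index 0 is just an example crop index from the test set
img_path, gt_path, img_name, crop_name, (x, y) = dm.get_img_name_coordinates(0)
print(f"crop {crop_name} of page {img_name} starts at ({x}, {y})")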

setup(stage: Optional[str] = None) → None[source]

Called at the beginning of fit (train + validate), validate, test, or predict. This is a good hook when you need to build models dynamically or adjust something about them. This hook is called on every process when using DDP.

Parameters:

stage – either 'fit', 'validate', 'test', or 'predict'

Example:

class LitModel(...):
    def __init__(self):
        self.l1 = None

    def prepare_data(self):
        download_data()
        tokenize()

        # don't do this (don't assign state in prepare_data)
        self.something = load_data(...)

    def setup(self, stage):
        data = load_data(...)
        self.l1 = nn.Linear(28, data.num_classes)

test_dataloader(*args, **kwargs) → Union[DataLoader, List[DataLoader]][source]

Implement one or multiple PyTorch DataLoaders for testing.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

Note

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Returns:

A torch.utils.data.DataLoader or a sequence of them specifying testing samples.

Example:

def test_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=False, transform=transform,
                    download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=False
    )

    return loader

# can also return multiple dataloaders
def test_dataloader(self):
    return [loader_a, loader_b, ..., loader_n]

Note

If you don’t need a test dataset and a test_step(), you don’t need to implement this method.

Note

In the case where you return multiple test dataloaders, the test_step() will have an argument dataloader_idx which matches the order here.

train_dataloader(*args, **kwargs) → DataLoader[source]

Implement one or more PyTorch DataLoaders for training.

Returns:

A collection of torch.utils.data.DataLoader specifying training samples. In the case of multiple dataloaders, see the section on multiple dataloaders in the PyTorch Lightning documentation.

The dataloader you return will not be reloaded unless you set :paramref:`~pytorch_lightning.trainer.Trainer.reload_dataloaders_every_n_epochs` to a positive integer.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

Note

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Example:

# single dataloader
def train_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=True, transform=transform,
                    download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=True
    )
    return loader

# multiple dataloaders, return as list
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a list of tensors: [batch_mnist, batch_cifar]
    return [mnist_loader, cifar_loader]

# multiple dataloader, return as dict
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a dict of tensors: {'mnist': batch_mnist, 'cifar': batch_cifar}
    return {'mnist': mnist_loader, 'cifar': cifar_loader}

val_dataloader(*args, **kwargs) → Union[DataLoader, List[DataLoader]][source]

Implement one or multiple PyTorch DataLoaders for validation.

The dataloader you return will not be reloaded unless you set :paramref:`~pytorch_lightning.trainer.Trainer.reload_dataloaders_every_n_epochs` to a positive integer.

It’s recommended that all data downloads and preparation happen in prepare_data().

This dataloader is requested by the following Trainer entry points, after prepare_data() and setup() have run:

  • fit()

  • validate()

Note

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Returns:

A torch.utils.data.DataLoader or a sequence of them specifying validation samples.

Examples:

def val_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=False,
                    transform=transform, download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=False
    )

    return loader

# can also return multiple dataloaders
def val_dataloader(self):
    return [loader_a, loader_b, ..., loader_n]

Note

If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.

Note

In the case where you return multiple validation dataloaders, the validation_step() will have an argument dataloader_idx which matches the order here.

Module contents