Part 1: Step-by-Step Dataset Creation - Unstructured to Structured
in collaboration with Nitin Tiwari (ML GDE)
This blog is part 1 of a three-part series on a project called SciGemma. SciGemma is a Gemma 2B-IT model fine-tuned on a science textbook dataset, which lets students ask it questions while revising for their exams, offline and undisturbed by the constant notifications that pour in whenever internet access is on.
End-to-End Pipeline
In this blog, we will cover the first part of the pipeline, i.e. data preparation. The initial task is to generate a structured dataset from a science textbook (unstructured data).
Step 1: Dealing with Unstructured Data
Rewind to your 9th-grade science book and visualize those colorfully printed molecular structures, atomic bonds, plant structures and more. That’s your unstructured data. Long and short paragraphs, images in between, question-answer exercises: all of this constitutes our unstructured data. Here is a glimpse for you.
Let’s work with this data. We downloaded the 9th-grade science textbook from here. Then, using a PDF reader, we extract the text from the textbook. Here, we used the pypdf library (we import its PdfReader class).
import torch
from pdfminer.high_level import extract_text
from transformers import pipeline
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from pypdf import PdfReader
text = ""
reader = PdfReader(pdf_path)
# Iterate through each page in the PDF
for page in reader.pages:
# Extract text from the current page
page_text = page.extract_text()
# Append the extracted text to the all_text variable
text += page_text + "\n"
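To confirm the extraction worked before moving on, a quick sanity check such as the sketch below can be handy (it only uses the reader and text variables defined above):

# Sanity check: how many pages were read and how much text was extracted
print(f"Pages read: {len(reader.pages)}")
print(f"Characters extracted: {len(text)}")
print(text[:500])  # preview the first 500 characters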
Step 2: Question Answer Pipeline
Now, we initialize the environment for a question-answering task using a pre-trained BERT model. We set the model name to a version of BERT fine-tuned on the SQuAD dataset, load the matching tokenizer and model, and create a question-answering pipeline.
The qa_pairs list will store the question-answer pairs generated with the model and tokenizer; the pipeline simplifies prediction by abstracting away the manual tokenization and model-inference steps.
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad" # Or another suitable model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)
qa_pairs = []
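Before generating pairs at scale, it can help to run a quick smoke test of the pipeline. The context and question below are made up purely for illustration:

# Quick smoke test of the QA pipeline (sample context and question are illustrative)
sample_context = "An atom is the smallest unit of matter that retains the properties of an element."
result = qa_pipeline(question="What is an atom?", context=sample_context)
print(result)  # a dict with 'answer', 'score', 'start' and 'end'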
Step 3: Generating Q and A pairs
Next, starting from the extracted text, we split it into manageable chunks, generate a question from each chunk, and then obtain an answer to that question using the pre-trained BERT model. The process involves the following steps:
- Splitting the text: The text is divided into 500-character chunks for processing, as BERT models have a maximum token limit.
- Generating inputs: Each chunk is tokenized and converted into model inputs.
- Model inference: The pre-trained BERT model predicts start and end logits representing the most likely positions in the text for the answer to a question.
- Extracting a candidate question: The positions with the highest start and end logits define a span of input tokens, which is decoded back into text and treated as the question. If no question is produced, the loop continues to the next chunk.
- Answering questions: The question and its context (the chunk) are passed through a question-answering pipeline to generate an answer.
# Split the text into chunks (adjust chunk size as needed)
for i in range(0, len(text), 500):
    chunk = text[i:i+500]

    # Use the model to predict possible questions and answers
    inputs = tokenizer(chunk, return_tensors="pt", truncation=True)
    outputs = model(**inputs)
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits

    # Get the most likely span and decode it (this is treated as the question)
    start_index = torch.argmax(start_logits)
    end_index = torch.argmax(end_logits)
    question = tokenizer.decode(inputs["input_ids"][0][start_index:end_index+1],
                                skip_special_tokens=True)
    if not question:
        continue

    # Now use the qa_pipeline to get the answer for the generated question
    answer = qa_pipeline(question=question, context=chunk)['answer']
    qa_pairs.append({"question": question,
                     "answer": answer,
                     "context": chunk})
Step 4: Dataset is ready!
Once we get the output from the previous step, we store the question-answer-context triplets in the form of a CSV file. Now our dataset is ready for the next part of the pipeline.
In order to use the dataset for Gemma model fine-tuning, we need to convert it to JSONL format and upload it to the Hugging Face Hub, from where we can easily access it in our notebook.
# Create a datasets.Dataset object and save it as a CSV
output_csv_path = "qa_dataset.csv"
qa_dataset = Dataset.from_list(qa_pairs)
qa_dataset.to_csv(output_csv_path, index=False)

# CSV to JSONL conversion
import pandas as pd

# Load the CSV file
csv_file_path = '/content/Science_data.csv'  # Update this to your CSV file path
df = pd.read_csv(csv_file_path)

# Convert the DataFrame to JSON Lines and save it
jsonl_file_path = 'science_dataset_class9.jsonl'  # Update this to your desired output file path
df.to_json(jsonl_file_path, orient='records', lines=True)
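The upload step itself is not shown above; a minimal sketch of it could look like the following, where you log in with your own Hugging Face access token and the repository id is a placeholder to replace with your own:

from datasets import load_dataset
from huggingface_hub import login

login()  # paste your Hugging Face access token when prompted

# Load the JSONL file we just wrote and push it to the Hub
# ("your-username/science-dataset-class9" is a placeholder repo id)
dataset = load_dataset("json", data_files="science_dataset_class9.jsonl")
dataset.push_to_hub("your-username/science-dataset-class9")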
Complete code
import torch
import pandas as pd
from pdfminer.high_level import extract_text
from transformers import pipeline
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from pypdf import PdfReader


def pdf_to_qa_dataset(pdf_path, output_csv_path):
    """
    Converts a PDF to a Q&A dataset and saves it as a CSV file.

    Args:
        pdf_path: Path to the PDF file.
        output_csv_path: Path to save the CSV file.
    """
    text = ""

    # Load the PDF file
    reader = PdfReader(pdf_path)

    # Iterate through each page in the PDF
    for page in reader.pages:
        # Extract text from the current page
        page_text = page.extract_text()
        # Append the extracted text to the text variable
        text += page_text + "\n"

    # text = extract_text(pdf_path)  # original way to read the PDF with pdfminer

    # Load a pre-trained question answering model and tokenizer
    model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"  # Or another suitable model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)
    qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

    qa_pairs = []

    # Split the text into chunks (adjust chunk size as needed)
    for i in range(0, len(text), 500):
        chunk = text[i:i+500]

        # Use the model to predict possible questions and answers
        inputs = tokenizer(chunk, return_tensors="pt", truncation=True)
        outputs = model(**inputs)
        start_logits = outputs.start_logits
        end_logits = outputs.end_logits

        # Get the most likely span and decode it (this is treated as the question)
        start_index = torch.argmax(start_logits)
        end_index = torch.argmax(end_logits)
        question = tokenizer.decode(inputs["input_ids"][0][start_index:end_index+1],
                                    skip_special_tokens=True)
        if not question:
            continue

        # Now use the qa_pipeline to get the answer for the generated question
        answer = qa_pipeline(question=question, context=chunk)['answer']
        qa_pairs.append({"question": question,
                         "answer": answer,
                         "context": chunk})

    # Create a datasets.Dataset object and save it as a CSV
    qa_dataset = Dataset.from_list(qa_pairs)
    qa_dataset.to_csv(output_csv_path, index=False)

    return qa_dataset


# Example usage
pdf_path = "/content/iesc101.pdf"
output_csv_path = "qa_dataset.csv"  # Choose your desired filename
pdf_to_qa_dataset(pdf_path, output_csv_path)

# CSV to JSONL conversion
# Load the CSV file
csv_file_path = '/content/Science_data.csv'  # Update this to your CSV file path
df = pd.read_csv(csv_file_path)

# Convert the DataFrame to JSON Lines and save it
jsonl_file_path = 'science_dataset_class9.jsonl'  # Update this to your desired output file path
df.to_json(jsonl_file_path, orient='records', lines=True)
Your dataset should look something like this:
Now that our dataset is ready, we can move on to the next part of the pipeline, i.e. fine-tuning the Gemma model on this dataset. Hop on to our next blog here.
One final thing!
If you want to use this dataset directly, you can find it on 🤗 here.
If you have any queries, feel free to connect with me or Nitin on LinkedIn.
Acknowledgment
This project was developed during Google’s ML Developer Programs Gemma sprint. We thank the MLDP team for the opportunity.