C Language: Dictionary Spell Checker


At the end of the fifth week of the CS50 Harvard course, we studied hash tables, which can be used to solve the problem of hashing words from text to check their spelling in a dictionary (speller.c). A dictionary is given in which words are listed on each new line. In a separate file, the text is given, which the program must check – which words in the text are spelled incorrectly. It is believed that all words that are not in the dictionary are wrong words.

It is necessary not only to create a program code that checks words from the text with a dictionary, but also to minimize the speed of this program. In order for the program to work quickly, you need to come up with your own hash function and choose the optimal number of buckets in the hash table (26 buckets are initially suggested). The task is located in the dictionary file (dictionary.c) and consists of four functions: dictionary loading (load), hash function (hash), word check (check) and memory release (unload).

Hash function

To better understand what a hash function is and why it is needed, let’s look at an example: suppose you have an object and you want to assign a key to it to make it easier to find. To solve such a problem, you can use a simple array and store data pairs there (an object and its key). If the keys are long and cannot be used directly as an index, hashing should be used.

In hashing, long keys are converted to short keys using hash functions. Their values ​​are then stored in a data structure called a hash table. Why does hashing make lookups fast? The fact is that during hashing, all records of pairs (object / key) are distributed evenly over the array and access to them goes through the calculated hash code (integer). To calculate the hash code for an object, an algorithm (hash function) is used, it calculates an index that shows where the entry can be found or inserted.

How to make a good hash function?

This function needs to be done independently so that the program works as quickly as possible. There are different approaches to how to write fast hash functions, here you need to do your research and choose the appropriate option. You can focus on letters in a word or on mathematical operations with letters and other available parameters. By testing the finished version, you can choose the optimal hash function and the number of “buckets” (cells).

There is an approach in which a hash function is created that always gives a unique result. In our task, there is no such need, since we are making “chains”. Moreover, even in a good hash function there can be collisions (crossings). Therefore, it is possible to create a hash function that will be fast and minimize collisions. The easiest way is to add all the letters and multiply by some number (for example, strlen) or add a fixed value (for example, the length of a word), and then take the modulo of the number of cells from it.

Dictionary download function

Operates with the file open function FILE *x = fopen(dictionary, "r"), through which we open the dictionary file. Next, we read each word from the file line by line using the fscanf(x, "%s", word) function. For each word, dynamically allocate memory via malloc(sizeof(node)), apply the hash function, and copy the word to the created node. Then we insert the node in the right place in the hash table. In the same function, you can count the number of words loaded from the dictionary.

Check function

The basis of this block is the strcasecmp() function, which compares arguments without case sensitivity. You need to return true if the words are the same (that is, the word from the text is in the dictionary, therefore, it is correct)! At this point, I got confused and for a long time could not understand why the function does not compare words correctly. It turned out that I returned false where it was necessary to return true.

Memory release function

To free memory for each node, you need to create a temporary variable that points to the beginning of the linked list (cursor) and another variable (tmp) that initially points nowhere. Then we give tmp the value of cursor, and the cursor itself is moved one step forward. And free tmp memory. Continuing to free memory in this way, we will not lose a single connection.

Code of program by igroglaz

// Implements a dictionary's functionality

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>
#include <strings.h>

#include "dictionary.h"

// Represents a node in a hash table
typedef struct node
{
    char word[LENGTH + 1];
    struct node *next;
}
node;

// Choose number of buckets in hash table
const unsigned int N = 4001; // 26 (alphabet) *  45 (length) = 1170

// Hash table
node *table[N];

// global variable for size_count()
unsigned int counter = 0;

// global flag for check_hash recursion search
bool flag;

// go through dictionary branch with hash-index and check is it exist there
bool check_hash(node *tabb, const char *word)
{
    if (tabb == NULL) // base case
        return false;

    check_hash(tabb->next, word); // recursion

    // compare strings and if word found - flag it up
    if (strcasecmp(tabb->word, word) == 0)
        flag = true;

    return flag;
}


// Returns true if word is in dictionary, else false
bool check(const char *word)
{
    // hash word and get the
    unsigned int x = hash(word);
    // reset flag for check_hash
    flag = false;

    if (table[x] == NULL) // paranoia: such hash doesn't exist in the dictionary
        return false;
    else    // hash exist, so lets find is the word exist in the hash's branch
        return check_hash(table[x], word);
}

// Hashes word to a number
unsigned int hash(const char *word)
{
    unsigned int hash = 0;

    for (int h = 0; h < LENGTH; h++) // compare with number of letters to take
    {
        if (word[h] == '\0')
            break;
        if (isalpha(word[h]))
            hash += toupper(word[h]) - 'A';
    }

    hash *= strlen(word);

    hash %= N;

    // paranoia
    if (hash < 0)
        hash = 0;
    else if (hash > N)
        hash = N;

    return hash;
}

// Loads dictionary into memory, returning true if successful, else false
bool load(const char *dictionary)
{
    char word[LENGTH+1];
    unsigned int i = 0; // will contain hash number

    FILE *file = fopen(dictionary, "rb");
    if (file == NULL)
        return false;

    while (fscanf(file,   "%s",   word) != EOF) // return EOF if it reach end of file.. while?
//filepointer from fopen  format  character array where we put the word
    {
        node *n = (node *)malloc(sizeof(node));
        if (n == NULL)
            return false;

        strcpy(n->word, word); // copy from buffer to node
        n->next = NULL; // point to NULL
        i = hash(word); // hash word

        // hash is array of linked list... table[26]
        // 1) add new node to a linked list..

        // start new branch of linked list
        if (table[i] == NULL)
        {
            table[i] = n;
            counter++;
        }
        // join to existing branch
        else
        {
            // 1st connection to existing node
            if (table[i]->next == NULL)
            {
                table[i]->next = n;
                counter++;
            }
            // node already got children
            else
            {
                n->next = table[i]->next;
                table[i]->next = n;
                counter++;
            }
        }
    }

    fclose(file);

    return true;
}

// Returns number of words in dictionary if loaded, else 0 if not yet loaded
unsigned int size(void)
{
    return counter;
}

// Unloads dictionary from memory, returning true if successful, else false
bool unload(void)
{
    node *cursor = NULL; // points to node which we look into
    node *tmp = NULL; // so we won't loose child address

    for (int i = 0; i < N; i++) // starting from 0..
    {
        if (table[i] != NULL)
        {
            cursor = table[i];
            tmp = table[i];

            while (cursor->next != NULL)
            {
                    cursor = tmp->next;
                    free(tmp);
                    tmp = cursor;
            }
            free(cursor);
        }
    }

    return true;
}

Result by command ./speller texts/lalaland.txt:

... < list of misspelled words >..

WORDS MISSPELLED:     955
WORDS IN DICTIONARY:  143091
WORDS IN TEXT:        17756
TIME IN load:         0.03
TIME IN check:        0.18
TIME IN size:         0.00
TIME IN unload:       0.01
TIME IN TOTAL:        0.21

Code of program by shtukensia

// Implements a dictionary's functionality

#include <ctype.h>
#include <stdbool.h>
#include <string.h>
#include <strings.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#include "dictionary.h"

int size_words = 0; // Keep track of the number of words added

// Represents a node in a hash table
typedef struct node
{
    char word[LENGTH + 1];
    struct node *next;
}
node;

// TODO: Choose number of buckets in hash table
const unsigned int N = 10000;

// Hash table
node *table[N];

// Returns true if word is in dictionary, else false
bool check(const char *word)
{
    // TODO
// Convert word to lowercase

    // Hash the word
    int hash_index = hash(word);
    int result;

    // Create variable that points to the start of linked list
    node *cursor = table[hash_index];

    while (cursor != NULL)
    {
        result = strcasecmp(word, cursor->word); // compare without case sensitivity

        if (result == 0)
        {
            return true;
        }

        cursor = cursor->next; // check next word in linked list
    }

    return false;
}

// Hashes word to a number
unsigned int hash(const char *word)
{
    // TODO: Improve this hash function
    // The output should be between 0 and N-1
    unsigned int result = 0;
    unsigned int word_lenth = strlen(word);


    for (int i = 0; i <= word_lenth; i++)
    {
        result += tolower(word[i]) + word_lenth / 2;
    }

    return result % N;
}

// Loads dictionary into memory, returning true if successful, else false
bool load(const char *dictionary)
{
    // TODO
    // Open the dictionary file
    FILE *dict = fopen(dictionary, "r");
    if (dict == NULL)
    {
        printf("Could not open dictionary.\n");
        return false;
    }

    // Read strings from file one at a time

    char word[LENGTH + 1];

    while (fscanf(dict, "%s", word) != EOF)
    {
        // Create a new node for each word
        node *n = malloc(sizeof(node));
        if (n == NULL)
        {
            printf("Not enough memory.\n");
            fclose(dict);
            return false;
        }

        // Hash word and get hash-value
        int hash_index = hash(word);

        strcpy(n->word, word); // Copy word

        // Insert node into hash table at that location
        node *head = table[hash_index]; // Create variable that points to the start of linked list

        if (head != NULL)
        {
            n->next = table[hash_index];
        }
        else
        {
            n->next = NULL;
        }

        table[hash_index] = n;
        size_words ++;
    }

    fclose(dict);
    return true;
}

// Returns number of words in dictionary if loaded, else 0 if not yet loaded
unsigned int size(void)
{
    // TODO
    return size_words;
}

// Unloads dictionary from memory, returning true if successful, else false
bool unload(void)
{
    // TODO
    for (int i = 0; i < N ; i++)
    {
        node *cursor = table[i]; // Create variable that points to the start of linked list
        node *tmp = NULL;

        while (cursor != NULL)
        {
            tmp = cursor;
            cursor = cursor->next;
            free(tmp);
        }

    }

    return true;
}

The result of the program on the example ./speller texts/lalaland.txt:

For comparison, you can use the solution of the CS50 team:


This entry was posted in C language (en). Bookmark the permalink.

Leave a Reply

🇬🇧 Attention! Comments with URLs/email are not allowed.
🇷🇺 Комментарии со ссылками/email удаляются автоматически.